A reboot of the CogUmbracoExamineMediaIndexer package.
When working with Examine, you will eventually have a customer requirement for indexing non-HTML content, i.e. files in the media section. The default Examine offering is the PDF indexer which, as the name suggests, can be used to index media files provided they are in PDF format.
However, you may also want to index other file types like Word, PowerPoint or Excel, so what do you do in that case? To address this issue, around 2012 I authored the CogUmbracoExamineMediaIndexer, which has had 1,600 downloads to date. The package targeted Umbraco v6.
I thought that the package could do with a refresh, and so The Cogworks is proud to present the reboot: Cogworks.ExamineFileIndexer.
This custom Examine indexer utilises Apache Tika, a library used for document type detection and content extraction from various file formats. Internally, Tika uses various existing document parsers and document type detection techniques to detect and extract data.
Using Tika, one can develop a universal type detector and content extractor that extracts both structured text and metadata from different types of documents, such as spreadsheets, text documents, images, PDFs and, to a certain extent, even multimedia input formats.
Tika provides a single generic API for parsing different file formats. It uses existing specialised parser libraries for each document type.
Inspecting the index in Luke shows that Tika extracts more than just the file content; it also extracts quite a lot of useful metadata.
Tika internally also has a language detector class, so it can determine the language the document is written in. You can even do crazy things like detect phone numbers in the text.
More examples can be found on the Tika samples page (we may incorporate some of this other cool stuff in a later version), but if you really want to get into Tika, there is a great book called Tika in Action.
The package can be installed via NuGet:
Install-Package Cogworks.ExamineFileIndexer
Upon installation, the ExamineIndex.config and ExamineSettings.config files will be updated. The following entries will be added:
ExamineIndex.config
<IndexSet SetName="MediaIndexSet" IndexPath="~/App_Data/TEMP/ExamineIndexes/MediaIndexSet">
  <IndexAttributeFields>
    <add Name="id" />
    <add Name="nodeName" />
    <add Name="updateDate" />
    <add Name="writerName" />
    <add Name="path" />
    <add Name="nodeTypeAlias" />
    <add Name="parentID" />
  </IndexAttributeFields>
  <IncludeNodeTypes>
    <add Name="File" />
  </IncludeNodeTypes>
</IndexSet>
And
ExamineSettings.config
Under ExamineIndexProviders/providers:
<add name="MediaIndexer" type="Cogworks.ExamineFileIndexer.UmbracoMediaFileIndexer, Cogworks.ExamineFileIndexer"
     extensions=".pdf,.docx"
     umbracoFileProperty="umbracoFile" />
Under ExamineSearchProviders/providers:
<add name="MediaSearcher" type="UmbracoExamine.UmbracoExamineSearcher, UmbracoExamine" indexSet="MediaIndexSet"
     analyzer="Lucene.Net.Analysis.Standard.StandardAnalyzer, Lucene.Net" />
By default, the following file types will be indexed: pdf and docx. To index other file types, update ExamineSettings.config and add each extension to the comma-separated extensions attribute of the MediaIndexer entry:
<add name="MediaIndexer" type="Cogworks.ExamineFileIndexer.UmbracoMediaFileIndexer, Cogworks.ExamineFileIndexer"
     extensions=".pdf,.docx"
     umbracoFileProperty="umbracoFile" />
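As an illustration, an entry that also covers older Word documents plus Excel and PowerPoint files might look like the sketch below. The extra extensions (.doc, .xlsx, .pptx) are examples chosen here, not package defaults, and the sketch assumes Tika can parse those formats in your installation. Media that was uploaded before the change will most likely only appear in results once the index has been rebuilt.

```xml
<!-- Sketch only: .doc, .xlsx and .pptx are illustrative additions, not package defaults -->
<add name="MediaIndexer" type="Cogworks.ExamineFileIndexer.UmbracoMediaFileIndexer, Cogworks.ExamineFileIndexer"
     extensions=".pdf,.docx,.doc,.xlsx,.pptx"
     umbracoFileProperty="umbracoFile" />
```

Each extension keeps its leading dot and the list stays comma-separated, matching the format of the default value.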
The package also supports virtual path providers. So, if you are storing media files in the cloud (e.g. Azure, using the package UmbracoFileSystemProviders.Azure), they will still be indexed.
You can install via NuGet, which is the preferred way, or you can install the Umbraco package from our.umbraco.org. The source code is available on GitHub, so feel free to log any issues, or fork and improve via pull request. Contributions are welcome: the VPP functionality was a contribution from Crumpled Dog, made by Hendy Racher.
So go forth and index those files!
Did you know that media indexing is covered on the Umbraco Searching and Indexing course? To find an event near you take a look at the Umbraco course schedule page.