I'm looking to index a very large number of relatively small HTML and XML documents. I would, of course, like to query them with full-text search and receive highlights back in the search results. From what I can gather, the best practice is this:
Ingest Process
Get original document (HTML or XML)
Strip tags/markup using Apache Tika
Index the content (sans tags) and a reference to the original document (URL, etc.) in ES
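A minimal sketch of that ingest shape in Python. In practice Apache Tika would do the markup stripping; the stdlib `HTMLParser` stripper here just stands in for it, and the `content` / `source_url` field names are my own assumptions, not anything mandated by ES.

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collects only text nodes, discarding all markup."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def text(self):
        # Collapse whitespace left behind by removed tags
        return " ".join(" ".join(self.chunks).split())

def build_es_doc(raw_html, source_url):
    """Return the document body to index: cleaned text plus a
    reference back to the original, which lives outside ES."""
    stripper = TagStripper()
    stripper.feed(raw_html)
    return {"content": stripper.text(), "source_url": source_url}

doc = build_es_doc(
    "<html><body><p>Hello <b>world</b></p></body></html>",
    "https://example.com/doc1.html",
)
# doc["content"] holds only the text; the blob itself is never stored in ES
```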
Search Process
Submit a query string for full-text search over the document body
ES searches the cleaned documents (sans tags)
ES returns highlights from the cleaned content (no HTML fragments)
If the user wants to see the original document, they follow the reference link (stored in step 3 of ingest).
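The search side of that flow can be sketched as a standard ES query DSL body: a `match` query plus a `highlight` section over the same field. The `content` and `source_url` field names are assumptions for illustration.

```python
def build_search_request(query_text):
    """Full-text query over the cleaned `content` field, asking ES to
    return highlighted fragments from that same field."""
    return {
        "query": {"match": {"content": query_text}},
        "highlight": {"fields": {"content": {}}},
        # Only the reference comes back in _source, so the UI can
        # link out to the original document
        "_source": ["source_url"],
    }

req = build_search_request("hello world")
```

Because highlighting runs against the already-cleaned `content` field, the fragments ES returns contain no HTML from the original document.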
Does this seem reasonable?
Notes:
The reason I don't want to use the Ingest Attachment plugin (which uses Tika under the hood) is that it requires you to essentially double your storage by saving a base64 blob of each document.
A good practice is to also use a remove processor in your ingest pipeline to drop the field you don't want to keep. (And I agree that in most cases it's useless to keep it around.)
Maybe we should have a "remove" option available out of the box, by the way.
I opened:
Thanks for your response. My outline above is now in action and seems to be working well. My only remaining problem is that the documents I'm indexing each have a master document. I need to figure out how to model that, so that when a search result matches a child document it can show a link to its master. I'll post a new question, since that's a modeling topic.