Pattern for Indexing HTML Documents

I'm looking to index a very large number of relatively small HTML and XML documents. I would, of course, like to query them with full-text search and receive highlights back in the search results. From what I can gather, this is the best practice (I've included rough sketches of both flows below):

Ingest Process

  1. Get original document (HTML or XML)
  2. Strip tags/markup using Apache Tika
  3. Index the content (sans tags) plus a reference to the original document (URL, etc.) in ES
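
Concretely, the ingest side I have in mind looks something like the sketch below. This is just an illustration: I'm assuming the tika-python wrapper and the 8.x Elasticsearch Python client, and the `docs` index and `content`/`source_url` fields are placeholder names of my own, not anything prescribed.

```python
from elasticsearch import Elasticsearch
from tika import parser  # tika-python wrapper around Apache Tika

es = Elasticsearch("http://localhost:9200")

def ingest(path, source_url):
    # Steps 1-2: parse the original HTML/XML and let Tika strip the markup
    parsed = parser.from_file(path)
    clean_text = (parsed.get("content") or "").strip()

    # Step 3: index the cleaned text plus a reference back to the original
    es.index(
        index="docs",
        document={
            "content": clean_text,     # tag-free body used for search/highlighting
            "source_url": source_url,  # link back to the original HTML/XML
        },
    )

ingest("page-0001.html", "https://example.com/page-0001.html")
```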

Search Process

  1. User submits a query string for full-text search of the document body
  2. ES searches the cleaned documents (sans tags)
  3. ES returns highlights from the cleaned documents (no HTML fragments)
  4. If the user wants to see the original document, they click the reference link (from step 3 of ingest).
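
And the search side would look roughly like this, again with the same placeholder index and field names, asking ES to highlight the cleaned `content` field:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def search(query_string):
    # Steps 1-3: full-text query against the cleaned "content" field,
    # asking ES to highlight matches in that same tag-free field
    resp = es.search(
        index="docs",
        query={"match": {"content": query_string}},
        highlight={"fields": {"content": {}}},
    )
    for hit in resp["hits"]["hits"]:
        for fragment in hit.get("highlight", {}).get("content", []):
            # snippet with <em> highlight markers, but no markup from the source doc
            print(fragment)
        # Step 4: the stored reference lets the UI link to the original document
        print(hit["_source"]["source_url"])

search("example query")
```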

Does this seem reasonable?

Notes:

The reason I don't want to use the Ingest Attachment plugin (which uses Tika in the background) is that it requires you to essentially double your storage by saving a base64 blob of each document.

A good practice is to also use a remove processor in your ingest pipeline to drop the field you don't want to keep. (And I agree that in most cases it's useless to keep it around.)
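
For example, something along these lines (the pipeline id and field names are just illustrative; it assumes the ingest-attachment plugin is installed and a recent Elasticsearch Python client):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# The attachment processor extracts text from the base64 "data" field,
# then the remove processor drops "data" so the blob is never stored.
es.ingest.put_pipeline(
    id="attachment-then-remove",
    description="Extract text with the attachment (Tika) processor, then drop the raw blob",
    processors=[
        {"attachment": {"field": "data"}},
        {"remove": {"field": "data"}},
    ],
)
```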

Maybe we should have a "remove" option available OOTB, BTW.
I opened:

Thanks for your response. My outline above is now in action and it seems to be working well. My only remaining problem is that the documents I am indexing all belong to a master document, and I need to figure out how to model that so that, when a search result is a child document, it can show a link to its master. I suppose I will post a new question, since that is a modeling topic.
