Pattern for Indexing HTML Documents

Alex_Egg · June 26, 2017, 10:49pm

I'm looking to index a very large number of relatively small HTML and XML documents. I would, of course, like to run full-text search over them and get highlights back in the search results. From what I can gather, this is the best practice:

Ingest Process

1. Get the original document (HTML or XML)
2. Strip tags/markup using Apache Tika
3. Index the content (sans tags) and a reference to the original document (URL, etc.) in ES
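
As a rough sketch of the ingest steps above: the snippet below strips markup and builds the document that would be sent to ES. It uses Python's standard-library `html.parser` purely as a stand-in for Apache Tika so that the example is self-contained; the field names `content` and `url` and the helper names are assumptions for illustration, not something this thread prescribes.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join text nodes with spaces, then collapse the whitespace
        # left behind by the removed markup.
        return " ".join(" ".join(self._chunks).split())


def strip_markup(html):
    """Stand-in for the Tika step: return the text with tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def build_index_doc(html, url):
    # Only the cleaned text and a pointer to the original are indexed;
    # the original markup stays wherever it already lives.
    # "content" and "url" are hypothetical field names.
    return {"content": strip_markup(html), "url": url}
```

The resulting dictionary is what would be sent as the document body in an index request. Because text nodes are joined with spaces, a word split by an inline tag would come out as two tokens, which a real extractor such as Tika may handle differently.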

Search Process

1. Query string for full-text search in the document body
2. ES searches the cleaned documents (sans tags)
3. ES returns highlights from the cleaned document (no HTML fragments)
4. If the user wants to see the original document, they click the reference link (step 3 in ingest).
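
The search steps above might look like the following request body: a `match` query on the cleaned text with highlighting enabled on the same field. This is a minimal sketch that assumes the cleaned text is indexed in a field named `content` and the link to the original in a field named `url`; both names and both helper functions are hypothetical.

```python
import json


def build_search_body(query_text, field="content"):
    """Build a search request body with highlighting on the text field."""
    return {
        # Only the reference to the original document is needed back.
        "_source": ["url"],
        "query": {"match": {field: query_text}},
        # An empty object requests the default highlighter settings.
        "highlight": {"fields": {field: {}}},
    }


def summarize_hits(response, field="content"):
    """Pair each hit's original-document link with its highlight fragments."""
    results = []
    for hit in response.get("hits", {}).get("hits", []):
        results.append({
            "url": hit["_source"]["url"],
            "fragments": hit.get("highlight", {}).get(field, []),
        })
    return results


if __name__ == "__main__":
    print(json.dumps(build_search_body("ingest pipeline"), indent=2))
```

Since the indexed field holds only the stripped text, the fragments contain no markup from the source document, only the highlighter's own wrapping tags.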

Does this seem reasonable?

Notes:

The reason I don't want to use the Ingest Attachment plugin (which uses Tika in the background) is that it requires you to essentially double your storage by saving a base64 blob of your document.
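
To put a rough number on that concern: base64 encodes every 3 input bytes as 4 output characters, so the encoded blob alone is about a third larger than the raw document, before counting the extracted text stored next to it. The helper below is a hypothetical illustration of that arithmetic only; it says nothing about how ES compresses stored fields on disk.

```python
import base64


def base64_overhead(raw: bytes) -> float:
    """Return the ratio of base64-encoded size to raw size."""
    if not raw:
        raise ValueError("raw must not be empty")
    return len(base64.b64encode(raw)) / len(raw)


if __name__ == "__main__":
    sample = b"<p>hello</p>" * 300  # 3600 bytes of sample markup
    print(base64_overhead(sample))
```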

dadoonet · June 28, 2017, 9:05am

It's also good practice to add a remove processor to your ingest pipeline to drop the field you don't want to keep. (And I agree that in most cases it's useless to keep it around.)

Maybe we should have a "remove" option available OOTB, BTW.
I opened:

github.com/elastic/elasticsearch

Add remove field to all processors

opened 09:12AM - 28 Jun 17 UTC

closed 01:07PM - 30 Jun 17 UTC

dadoonet

discuss :Data Management/Ingest Node

In a similar way as we have in Logstash, I think we should be able to always provide a `remove` attribute in processors so people can simply remove a field without having to write "complex" pipelines to remove something that has been parsed.

For example instead of writing:

```json
{
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": ["%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}"]
      }
    },
    {
      "remove": {
        "field": "message"
      }
    }
  ]
}
```

We can write:

```json
{
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": ["%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}"],
        "remove": ["message"]
      }
    }
  ]
}
```

Same applies to ingest-attachment where normally you don't want to keep the original BASE64 content in your docs.

```json
{
  "processors": [
    {
      "attachment": {
        "field": "data"
      }
    },
    {
      "remove": {
        "field": "data"
      }
    }
  ]
}
```

to

```json
{
  "processors": [
    {
      "attachment": {
        "field": "data",
        "remove": ["data"]
      }
    }
  ]
}
```

Alex_Egg · June 28, 2017, 5:24pm

Thanks for your response. My outline above is now in action and it seems to be working well. Now my only problem is that I have a collection of documents that I am indexing that all have a master. Now I need to figure out how to model that so that in my search results if I find a child document, it can show a link to the master. I suppose I will post a new question since this is about modeling.

system · July 26, 2017, 5:24pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
