Ingest attachment plugin not analysing some html files

Hi

I'm using the Elasticsearch ingest attachment plugin to ingest HTML files into Elasticsearch 6.1.3. When I look at the contents of my attachments, some HTML files have been parsed (HTML tags removed and so on), but other HTML files are present exactly as they are, tags and all.

I want to know why it is doing that.
I tried changing the mappings and adding different analyzers, but the results are the same. :frowning:

This is the attachment mapping I'm using right now.

"attachment": {
			"properties": {
				"content": {
					"type": "text",
					"analyzer": "some_anaylzer",
					"term_vector": "with_positions_offsets",
					"index_options": "offsets",
					"index": true
				}
			}
},

Do you want to keep the HTML content intact?
If so, don't use ingest-attachment; just send the HTML content in a JSON field.

Thanks for your reply.
Can you point me to how I would store the content in a JSON field? Do I need to convert the HTML file to a string for the field? Is there any Java library method which would help me do that?

Something like:

{
  "foo":"<html>bar</html>"
}

So if it's a big HTML document I am ingesting, I just need to convert it into a string and store it, is that right? (I'm using the Elasticsearch client to ingest data from Java code.)

Yes.

But the main question is: what is your use case?

I have a bunch of HTML files to ingest into Elasticsearch. Then I want to make search by keyword available within those documents.

Keyword?

Do you mean the keywords you can have in an HTML header? Or something else?

A keyword would be text directly from the HTML file, not from the headers. The text is present as part of a paragraph or inside a table in the HTML document.

But why do you want to keep the HTML markup then?

I guess that if you want to index <p>Foo</p>, you probably want to be able to search for Foo right?

In which case I don't understand what was wrong in the first place with what ingest-attachment is producing.

I guess that if you want to index <p>Foo</p>, you probably want to be able to search for Foo right?

Yes.

I wanted to keep the HTML documents in their original format on the UI side. If the analyzer analyses an HTML file, then in the UI the HTML file's text gets scrambled. So it doesn't look right in the UI when scrolling through a scrambled HTML file.

Previously, ingest-attachment was analysing some HTML files and not others. So some HTML files were presented in the UI without any change to their original format/style, while in others the HTML tags were removed.

So let's sum up your real needs:

  • You need something to search for text inside an HTML page
  • You need something to render the page itself

2 needs here. Different needs. Different ways to solve that.

The 1st one: search.

You need to index the text, as that will be the most efficient way for the search engine to search the data. ingest-attachment will do that by just extracting the text. That's what you need here.
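
As a reminder, ingest-attachment runs as an ingest pipeline built around the attachment processor. A minimal sketch, assuming a pipeline named html_attachment and a BASE64 source field named text (both names are just placeholders):

PUT _ingest/pipeline/html_attachment
{
  "description": "Extract text from a BASE64-encoded document",
  "processors": [
    {
      "attachment": {
        "field": "text"
      }
    }
  ]
}

Documents indexed with ?pipeline=html_attachment then get the extracted text under attachment.content, which is the field you actually search on.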

The 2nd one: render a webpage

Well. Let's say you fetch a webpage from http://foo.bar/page.html. You can just index a document like:

{
  "text": "BASE64 here",
  "url": "http://foo.bar/page.html"
}

Rendering the page is "just" opening this URL.

Another way to do it:

{
  "text": "BASE64 here"
}

How to render this? Just decode the BASE64 on the fly on the client side and you should be OK.

Another way to do it:

{
  "text": "BASE64 here",
  "html": "<html>foo</html>"
}

You need to send 2 different things here: the BASE64 version, so ingest-attachment will be able to parse it, and the HTML content itself.
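
For illustration, and reusing the pipeline sketched above with placeholder index/type names, indexing such a document could look like:

PUT my_index/doc/1?pipeline=html_attachment
{
  "text": "BASE64 here",
  "html": "<html>foo</html>"
}

ingest-attachment parses the text field into attachment.content for searching, while the html field is stored untouched for rendering.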

And a 3rd way to do both:

Send something like:

{
  "html": "<html>foo</html>"
}

But change the analyzer for the html field to use an html_strip character filter: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-htmlstrip-charfilter.html

You should be good I think. I'd first try this 3rd solution and see how it goes.
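
A rough sketch of what that 3rd solution could look like, with index, type, analyzer, and field names as placeholders:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "html_analyzer": {
          "tokenizer": "standard",
          "char_filter": ["html_strip"]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "html": {
          "type": "text",
          "analyzer": "html_analyzer"
        }
      }
    }
  }
}

The html_strip character filter only removes the tags at analysis time, so the _source keeps the original markup for the UI.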

OK. Let me look into these solutions.
Thanks.

I am trying the 3rd approach and seeing this error:

com.vistalytics.utils.HtmlFilesReader  - Error while reading text files Document contains at least one immense term in field="document" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[10, 10, 10, 49, 48, 45, 81, 10, 10, 49, 10, 10, 100, 56, 51, 53, 53, 51, 51, 100, 49, 48, 113, 46, 104, 116, 109, 10, 10, 49]...', original message: bytes can be at most 32766 in length; got 226027

This is my analyzer

	 "analysis": {
	  "analyzer": {
	    "custom_analyzer": {
	      "tokenizer": "keyword",
	      "char_filter": ["html_strip"]
	    }
	  }
	}

and my mappings look like this:

	"properties": {
		"document": {
			"type": "text",
			"analyzer": "custom_analyzer",
			"fielddata": true
		},
		"documentName": {
			"type": "text",
			"fielddata": true
		},

What am I missing here? Working on this now.

Why do you want to use a keyword analyzer on such a big text?
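
For context: the keyword tokenizer emits the entire input as a single term, so a large HTML document becomes one token that exceeds Lucene's 32766-byte limit on a single term, which is exactly the "immense term" error above. Assuming you still want the html_strip character filter, one way out is to switch to the standard tokenizer, for example:

"analysis": {
  "analyzer": {
    "custom_analyzer": {
      "tokenizer": "standard",
      "char_filter": ["html_strip"]
    }
  }
}

Each word then becomes its own term, so no single term can hit the length limit.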
