Ingest attachment plugin not analysing some html files

Hi

I'm using the Elasticsearch ingest attachment plugin to ingest HTML files into Elasticsearch 6.1.3. When I look at the contents of my attachments, some HTML files have been parsed (HTML tags removed and so on), but other HTML files are present exactly as they are, tags and all.

I want to know why it is doing that.
I tried changing the mappings and adding different analyzers, but the results are the same. :frowning:

This is the attachment mapping I'm using right now.

"attachment": {
			"properties": {
				"content": {
					"type": "text",
					"analyzer": "some_anaylzer",
					"term_vector": "with_positions_offsets",
					"index_options": "offsets",
					"index": true
				}
			}
},

Do you want to keep the HTML content intact?
If so, don't use ingest-attachment; just send the HTML content in a JSON field.

Thanks for your reply.
Can you point me to how I would store the content in a JSON field? Do I need to convert the HTML file to a string for the field? Is there any Java library method which would help me do that?

Something like:

{
  "foo":"<html>bar</html>"
}

So if it's a big HTML document I am ingesting, I just need to convert it into a string and store it, is that right? (I'm using the Elasticsearch client to ingest data from Java code.)

Yes.

But the main question is: what is your use case?

I have a bunch of HTML files to ingest into Elasticsearch. Then I want to make search by keyword available within those documents.

Keyword?

Do you mean the keywords you can have in an HTML header? Or something else?

A keyword would be text directly from the HTML file, not from the headers. The text is present as part of a paragraph or inside a table in the HTML document.

But why do you want to keep the HTML markup then?

I guess that if you want to index <p>Foo</p>, you probably want to be able to search for Foo right?

In which case I don't understand what was wrong in the first place with what ingest-attachment is producing.

I guess that if you want to index <p>Foo</p>, you probably want to be able to search for Foo right?

Yes.

I wanted to keep the HTML documents in their original format on the UI side. If the analyzer analyses an HTML file, then in the UI the HTML file's text gets scrambled. So it doesn't look right in the UI when scrolling through a scrambled HTML file.

Previously, ingest-attachment was analysing some HTML files and not others. So some HTML files were presented in the UI without any change to their original format/style, while in others the HTML tags were removed.

So let's sum up your real needs:

  • You need something to search for text inside an HTML page
  • You need something to render the page itself

2 needs here. Different needs. Different ways to solve that.

The 1st one: search.

You need to index the text, as that will be the most efficient way for the search engine to search the data. ingest-attachment will do that by just extracting the text. That's what you need here.
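
As a reminder, ingest-attachment runs as an ingest pipeline built around the attachment processor. A minimal sketch, assuming a pipeline named html_attachment and a BASE64 source field named text (both names are just placeholders):

PUT _ingest/pipeline/html_attachment
{
  "description": "Extract text from a BASE64-encoded document",
  "processors": [
    {
      "attachment": {
        "field": "text"
      }
    }
  ]
}

Documents indexed with ?pipeline=html_attachment then get the extracted text under attachment.content, which is the field you actually search on.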

The 2nd one: render a webpage

Well. Let's say you fetch a webpage from http://foo.bar/page.html. You can just index a document like:

{
  "text": "BASE64 here",
  "url": "http://foo.bar/page.html"
}

Rendering the page is "just" opening this URL.

Another way to do it:

{
  "text": "BASE64 here"
}

How to render this? Just decode the BASE64 on the fly on the client side and you should be OK.

Another way to do it:

{
  "text": "BASE64 here",
  "html": "<html>foo</html>"
}

You need to send 2 different things here: the BASE64 version, so ingest-attachment will be able to parse it, and the HTML content itself.
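
For illustration, and reusing the pipeline sketched above with placeholder index/type names, indexing such a document could look like:

PUT my_index/doc/1?pipeline=html_attachment
{
  "text": "BASE64 here",
  "html": "<html>foo</html>"
}

ingest-attachment parses the text field into attachment.content for searching, while the html field is stored untouched for rendering.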

And a 3rd way to do both:

Send something like:

{
  "html": "<html>foo</html>"
}

But change the analyzer for the html field to use an html_strip character filter: https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-htmlstrip-charfilter.html

You should be good I think. I'd first try this 3rd solution and see how it goes.
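
A rough sketch of what that 3rd solution could look like, with index, type, analyzer, and field names as placeholders:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "html_analyzer": {
          "tokenizer": "standard",
          "char_filter": ["html_strip"]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "html": {
          "type": "text",
          "analyzer": "html_analyzer"
        }
      }
    }
  }
}

The html_strip character filter only removes the tags at analysis time, so the _source keeps the original markup for the UI.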

OK. Let me look into these solutions.
Thanks.

I am trying the 3rd approach and seeing this error:

com.vistalytics.utils.HtmlFilesReader  - Error while reading text files Document contains at least one immense term in field="document" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped.  Please correct the analyzer to not produce such terms.  The prefix of the first immense term is: '[10, 10, 10, 49, 48, 45, 81, 10, 10, 49, 10, 10, 100, 56, 51, 53, 53, 51, 51, 100, 49, 48, 113, 46, 104, 116, 109, 10, 10, 49]...', original message: bytes can be at most 32766 in length; got 226027

This is my analyzer

	 "analysis": {
	  "analyzer": {
	    "custom_analyzer": {
	      "tokenizer": "keyword",
	      "char_filter": ["html_strip"]
	    }
	  }
	}

and my mappings look like this:

	"properties": {
		"document": {
			"type": "text",
			"analyzer": "custom_analyzer",
			"fielddata": true
		},
		"documentName": {
			"type": "text",
			"fielddata": true
		},

What am I missing here? Working on this now.

Why do you want to use a keyword analyzer on such a big text?
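
For context: the keyword tokenizer emits the entire input as a single term, so a large HTML document becomes one token that exceeds Lucene's 32766-byte limit on a single term, which is exactly the "immense term" error above. Assuming you still want the html_strip character filter, one way out is to switch to the standard tokenizer, for example:

"analysis": {
  "analyzer": {
    "custom_analyzer": {
      "tokenizer": "standard",
      "char_filter": ["html_strip"]
    }
  }
}

Each word then becomes its own term, so no single term can hit the length limit.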
