Ingesting an HTML file into Elasticsearch


(s) #1

Hi,

Is there any way to ingest an HTML file into Elasticsearch? So far I have only seen commands for ingesting JSON files.

I have a file that contains HTML content with various HTML tags. The requirement is to store some of the content as a plain string and to keep some tags as they are in Elasticsearch.


(David Pilato) #2

You can use the html_strip char filter.
You can also look at the ingest-attachment plugin.
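
As a rough illustration of the html_strip suggestion, the index settings could look something like the sketch below, written here as a Python dict that would be sent as the body of the index creation request. The analyzer name `html_text` is a made-up name for this example, and the tokenizer and token filter choices are assumptions. Note that a char filter only changes what gets indexed for search; the stored `_source` still contains whatever was sent.

```python
import json

# Index settings with a custom analyzer that strips HTML markup before
# tokenizing. "html_text" is a hypothetical analyzer name chosen for this
# sketch; "html_strip", "standard" and "lowercase" are built-in components.
index_settings = {
    "settings": {
        "analysis": {
            "analyzer": {
                "html_text": {
                    "type": "custom",
                    "char_filter": ["html_strip"],
                    "tokenizer": "standard",
                    "filter": ["lowercase"],
                }
            }
        }
    }
}

# JSON body for the index creation request.
settings_body = json.dumps(index_settings, indent=2)
print(settings_body)
```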


(s) #3

Thanks @dadoonet
I read the plugin documentation and installed it, but I could not understand how the file should be provided to that API.

In the documentation it is shown as:
PUT my_index/my_type/my_id?pipeline=attachment
{
"data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}

How can I specify my file? Do I need to specify a file path somewhere?


(David Pilato) #4

You need to encode the content of the file in BASE64 and add it to the JSON document.

You can also look at the FSCrawler project if you wish. It might help.
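
For the BASE64 step, a minimal Python sketch might look like the following. The helper name `build_attachment_doc` and the sample HTML are invented for illustration; only the `data` field name comes from the request shown earlier in the thread. The resulting JSON string would be sent as the body of the `PUT ...?pipeline=attachment` request.

```python
import base64
import json


def build_attachment_doc(file_bytes: bytes) -> dict:
    """Return a document whose "data" field holds the Base64-encoded file."""
    encoded = base64.b64encode(file_bytes).decode("ascii")
    return {"data": encoded}


# Sample content used for this sketch. For a real file, the bytes could be
# read with open(path, "rb").read(), where path is your own file path.
html_bytes = b"<html><body><p>Hello</p></body></html>"

doc = build_attachment_doc(html_bytes)

# JSON body for the indexing request that uses the attachment pipeline.
request_body = json.dumps(doc)
print(request_body)
```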


(s) #5

Thanks @dadoonet: encoding with BASE64 worked. It stores the actual text in the index. Is there any way to store it with the actual HTML tags? I want to keep a few tags, such as the table tag, as they are in the index, while the rest of the data can be stored with the HTML tags removed.


(David Pilato) #6

Then you also need to send the HTML content in another field, which does not need to be indexed.
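
A sketch of that idea in Python, assuming a second field called `raw_html` (a hypothetical name) that carries the untouched markup next to the Base64 `data` field. The mapping fragment marks it as not indexed with `"index": False`; the exact mapping syntax depends on the Elasticsearch version, so treat it as an assumption to check against your version's documentation.

```python
import base64
import json


def build_doc_with_raw_html(html_text: str) -> dict:
    """Build a document with the Base64 payload plus the original HTML."""
    return {
        # Consumed by the attachment pipeline, which extracts the plain text.
        "data": base64.b64encode(html_text.encode("utf-8")).decode("ascii"),
        # Hypothetical extra field that keeps the HTML, tags included.
        "raw_html": html_text,
    }


# Hypothetical mapping fragment: keep raw_html in _source without indexing it.
raw_html_mapping = {"raw_html": {"type": "text", "index": False}}

html_text = "<table><tr><td>42</td></tr></table><p>Some text</p>"
doc = build_doc_with_raw_html(html_text)
print(json.dumps(doc, indent=2))
```

To keep only some tags, such as the tables, the same field could hold just those fragments instead of the full page; extracting them is left to whatever HTML parser you already use.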

