Java code to strip HTML and do Unique?

billnbell · April 11, 2017, 9:27pm

We have fields that are created just for indexing, and we would like to strip the HTML and deduplicate the tokens before we put the value into the field. I was thinking we could use the tokenizer and filters directly from Java to do this, but I couldn't make heads or tails of how to use that API.

Thanks!

billnbell · April 11, 2017, 9:29pm

So for example:

<p>Hello there</p> -> Hello there

This is a repeat repeat repeat -> This is a repeat

We want to store the cleaned values in the _source field that way and save a ton of disk space. Basically, we are fine with storing the text in _source, but we want to remove the HTML, reduce it to unique tokens, join those back into a single string, and send that in the normal PUT request.
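To make the goal concrete, here is a rough sketch of the client-side preprocessing I have in mind, using only the JDK. It is not the Elasticsearch analysis API: the class and method names (`HtmlUniqueSketch`, `stripHtml`, `uniqueTokens`, `stripAndUnique`) are made up for illustration, the regex is a naive stand-in for a real HTML stripper (it does not decode entities such as `&amp;` and assumes reasonably well-formed tags), and "unique" here means exact, case-sensitive matching on whitespace-separated tokens.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Elasticsearch or Lucene.
class HtmlUniqueSketch {
    // Naive tag matcher: anything between '<' and the next '>'.
    // Assumption: input is simple markup; this is not a real HTML parser.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Replace each tag with a space so adjacent words do not run together.
    static String stripHtml(String html) {
        return TAG.matcher(html).replaceAll(" ");
    }

    // Keep the first occurrence of each token, preserving original order.
    static String uniqueTokens(String text) {
        Set<String> seen = new LinkedHashSet<>();
        for (String token : WHITESPACE.split(text.trim())) {
            if (!token.isEmpty()) {
                seen.add(token);
            }
        }
        return String.join(" ", seen);
    }

    // Strip tags first, then deduplicate the remaining tokens.
    static String stripAndUnique(String html) {
        return uniqueTokens(stripHtml(html));
    }

    public static void main(String[] args) {
        System.out.println(stripAndUnique("<p>Hello there</p>"));
        System.out.println(stripAndUnique("This is a repeat repeat repeat"));
    }
}
```

The string returned by `stripAndUnique` would then be placed in the document body before sending the PUT, so that _source only contains the reduced text. A proper HTML-aware stripper would be a better choice than the regex for messy markup.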

