Java code to strip HTML and keep only unique tokens?

We have fields that are created just for indexing, and we would like to strip the HTML and remove duplicate tokens before we put the text into the field. I was thinking maybe we could use the tokenizer and filters directly from Java to do this, but I couldn't make heads or tails of how to use that API.


So for example:

<p>Hello there</p> -> Hello there

This is a repeat repeat repeat -> This is a repeat

We want to store the values in the _source field that way and save a ton of disk space. Basically, we are fine with storing the text in _source, but we want to strip the HTML, reduce it to unique tokens, join those back into a single string, and send that in the normal PUT.
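For what it's worth, here is a rough sketch of doing this client-side in plain Java before the document is sent. It does not use the Elasticsearch or Lucene analysis API at all: the tag stripping is a naive regex (an assumption that the markup is simple, with no `>` inside attribute values, and no entity decoding), and "unique" is taken to mean case-sensitive, whitespace-separated tokens with the first occurrence kept. The class and method names (`SourceFieldCleaner`, `stripHtml`, `uniqueTokens`, `clean`) are made up for illustration.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Pattern;

final class SourceFieldCleaner {

    // Naive tag matcher: everything from '<' to the next '>'.
    // Assumption: simple markup only; this is not a real HTML parser.
    private static final Pattern TAGS = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Replace tags with a space, then collapse runs of whitespace.
    static String stripHtml(String input) {
        String noTags = TAGS.matcher(input).replaceAll(" ");
        return WHITESPACE.matcher(noTags).replaceAll(" ").trim();
    }

    // Keep the first occurrence of each whitespace-separated token, in order.
    // Comparison is case-sensitive, so "There" and "there" are both kept.
    static String uniqueTokens(String text) {
        Set<String> seen = new LinkedHashSet<>();
        for (String token : WHITESPACE.split(text.trim())) {
            if (!token.isEmpty()) {
                seen.add(token);
            }
        }
        return String.join(" ", seen);
    }

    // Strip the markup first, then deduplicate what is left.
    static String clean(String raw) {
        return uniqueTokens(stripHtml(raw));
    }
}
```

With this sketch, `SourceFieldCleaner.clean("<p>Hello there</p>")` gives `Hello there` and `SourceFieldCleaner.clean("This is a repeat repeat repeat")` gives `This is a repeat`. The result is what you would put in the field of the document you PUT. Real-world HTML would probably need a proper parser or Lucene's `HTMLStripCharFilter` instead of the regex.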
