We have fields that are created only for indexing, and we would like to strip the HTML and reduce the text to unique tokens before we put it into the field. I was thinking we could call the tokenizer and token filters directly from Java to do this, but I couldn't make heads or tails of how to use that API.
Thanks!
So for example:
< p > Hello there < /p > -> Hello there
This is a repeat repeat repeat -> This is a repeat
We want to store the cleaned text in _source that way and save a ton of disk space. Basically, we are fine with storing it in _source, but we want to remove the HTML, reduce the text to its unique tokens, join them back into a single string, and send that in the normal PUT.
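To make the goal concrete, here is a rough sketch of the client-side cleanup in plain Java, run before the document is sent. It does not use the Lucene or Elasticsearch analysis API at all: the regex tag stripping and whitespace splitting are simplified stand-ins for what the html_strip char filter and unique token filter do at index time. The class name SourceCleaner and the method clean are made up for illustration.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Elasticsearch or Lucene.
final class SourceCleaner {
    // Assumption: anything between '<' and '>' is a tag. This also covers
    // spaced-out forms such as "< p >", but it does not decode entities
    // like &amp; and it will eat text such as "a < b and c > d".
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Strips tags, splits on whitespace, and keeps the first occurrence
    // of each token in its original order.
    static String clean(String html) {
        String text = TAG.matcher(html).replaceAll(" ").trim();
        if (text.isEmpty()) {
            return "";
        }
        // LinkedHashSet drops repeats while preserving insertion order.
        Set<String> unique = new LinkedHashSet<>();
        for (String token : WHITESPACE.split(text)) {
            unique.add(token);
        }
        return String.join(" ", unique);
    }

    public static void main(String[] args) {
        System.out.println(clean("< p > Hello there < /p >"));
        System.out.println(clean("This is a repeat repeat repeat"));
    }
}
```

The comparison is case-sensitive, so "Repeat" and "repeat" would both be kept. The string returned by clean is what would go into the field in the normal PUT body.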