We have fields that exist only for indexing, and we would like to strip the HTML and deduplicate the tokens before putting the text into the field. I was thinking we could call the tokenizer and token filters directly from Java to do this, but I couldn't make heads or tails of how to use that API.
For example: This is a repeat repeat repeat -> This is a repeat
We want to store the values in _source in that reduced form, which should save a lot of disk space. Basically, we are fine with storing them in _source, but we want to remove the HTML, tokenize, drop duplicate tokens, join the result back into a single string, and send that in the normal PUT.
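
To make the goal concrete, here is a rough sketch of the transformation in plain Java, run on the client before the PUT. It does not use the Lucene or Elasticsearch analysis API. The class and method names (DedupeText, stripAndDedupe) are made up for illustration, and a regex plus whitespace splitting is only a crude stand-in for a real HTML stripper and tokenizer. It does not decode entities such as &amp;, and the comparison is case-sensitive.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper, not part of Lucene or Elasticsearch.
class DedupeText {

    // Strips tags, splits on whitespace, and keeps the first occurrence
    // of each token in its original order.
    static String stripAndDedupe(String html) {
        // Crude tag removal: anything that looks like <...> becomes a space.
        String text = html.replaceAll("<[^>]*>", " ");

        // LinkedHashSet drops duplicates while preserving insertion order.
        Set<String> seen = new LinkedHashSet<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                seen.add(token);
            }
        }
        return String.join(" ", seen);
    }

    public static void main(String[] args) {
        // Prints: This is a repeat
        System.out.println(stripAndDedupe("<p>This is a <b>repeat</b> repeat repeat</p>"));
    }
}
```

The returned string would then be the value we put into the field in the document body we send.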