I have a custom mapping that seems like it should be discarding the text, based on the standard tokenizer documentation here (https://www.elastic.co/guide/en/elasticsearch/reference/2.2/analysis-standard-tokenizer.html):
max_token_length
The maximum token length. If a token is seen that exceeds this length then it is discarded. Defaults to 255.
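For reference, here's a minimal reproduction of what I'm observing (the index, tokenizer, and analyzer names are just placeholders I made up for this example):

```
PUT /split_demo
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "short_std": {
          "type": "standard",
          "max_token_length": 5
        }
      },
      "analyzer": {
        "short_std_analyzer": {
          "type": "custom",
          "tokenizer": "short_std"
        }
      }
    }
  }
}

GET /split_demo/_analyze?analyzer=short_std_analyzer&text=abcdefghijkl
```

The 12-character input comes back as the tokens abcde, fghij, and kl instead of being dropped.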
This shows that the text actually gets split; however, the standard tokenizer documentation says it should be discarded. It seems like it's actually doing what the standard analyzer documentation says it does with large tokens (https://www.elastic.co/guide/en/elasticsearch/reference/2.2/analysis-standard-analyzer.html):
max_token_length
The maximum token length. If a token is seen that exceeds this length then it is split at max_token_length intervals. Defaults to 255.
Well, that makes more sense. @johtani, is there a way to replicate the older discard behavior with the current version, perhaps with some combination of filters?
I see that I can specify an older Lucene version on the tokenizer/analyzer, but I wonder what other ramifications that has. Really, what we want is to discard very large tokens. Any suggestions? Thanks!
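For what it's worth, the combination of filters I had in mind is something like the sketch below: raise the standard tokenizer's max_token_length so oversized tokens survive tokenization intact, then throw them away with a length token filter. The index, tokenizer, filter, and analyzer names here are placeholders, so I may well be missing something:

```
PUT /my_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "long_standard": {
          "type": "standard",
          "max_token_length": 10000
        }
      },
      "filter": {
        "drop_long_tokens": {
          "type": "length",
          "max": 255
        }
      },
      "analyzer": {
        "discard_long_tokens": {
          "type": "custom",
          "tokenizer": "long_standard",
          "filter": ["lowercase", "drop_long_tokens"]
        }
      }
    }
  }
}
```

If I'm reading the docs right, anything longer than 10000 characters would still be split by the tokenizer before the filter runs, so this only approximates the old discard behavior, but for our data that limit is effectively never hit. Does that look reasonable, or is there a cleaner way?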