The fields are long UTF-8 text fields. What will be the performance and other differences between these two field types, assuming the contents of the two fields are the same size? Which operations are possible or not possible with each?
Why would you need a single token to be up to 10922 characters in length?
- Users are unlikely to perform exact-match searches by typing in a 10k-character string.
- Users are unlikely to select a 10k-character string from a drop-down box in a structured form.
- Users are unlikely to want to perform aggregations where a bar on a bar chart is labelled with a 10k-character string.
The only scenario I can conceive of is where something like a long URL is stored in a system and you want to perform some analysis on URL usage. Even then, you're likely to end up with a cripplingly large number of unique strings in your index. Hashes may be a better approach in these circumstances.
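As a rough sketch of that hashing approach (the index, field, and pipeline names here are made up for illustration, and this assumes a recent Elasticsearch version with the `fingerprint` ingest processor), you could hash the long URL at ingest time and aggregate on the short keyword hash instead of the raw string:

```
PUT _ingest/pipeline/url-fingerprint
{
  "description": "Hash long URLs so exact matching and aggregations use a short keyword",
  "processors": [
    {
      "fingerprint": {
        "fields": ["url"],
        "target_field": "url_hash",
        "method": "SHA-256"
      }
    }
  ]
}

PUT urls
{
  "mappings": {
    "properties": {
      "url":      { "type": "text" },
      "url_hash": { "type": "keyword" }
    }
  }
}

PUT urls/_doc/1?pipeline=url-fingerprint
{ "url": "https://example.com/a/very/long/path?with=many&query=params" }
```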
My intention was just to search inside those long text fields: not for the entire text, but for keywords within it. In that case, is it OK to go with the below config?
That's perhaps the misinterpretation then. It's "keyword" singular, not plural. A keyword string is treated as a single untokenized keyword, like "science fiction", as opposed to a text string, which is tokenized into the words "science" and "fiction".
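You can see the difference with the `_analyze` API, which shows what tokens each analyzer produces (a quick sketch you can run against any cluster):

```
POST _analyze
{
  "analyzer": "standard",
  "text": "science fiction"
}
# returns two tokens: "science" and "fiction"

POST _analyze
{
  "analyzer": "keyword",
  "text": "science fiction"
}
# returns a single token: "science fiction"
```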
Nope. It dictates the maximum length of a string that will be stored as an untokenized keyword.
"Keyword" does not mean substring. The process of creating substrings for search is tokenization to chop string fields of the type "text" into individual tokens.
Strings of the type keyword are not tokenized.
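As a concrete sketch (the index and field names below are made up, not the config from the earlier post), a "text" field with a "keyword" sub-field gives you full-text search over the long content, while the keyword copy is only kept for values short enough to be useful for exact matching:

```
PUT my-index
{
  "mappings": {
    "properties": {
      "description": {
        "type": "text",
        "fields": {
          "raw": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}
```

With this, match queries on `description` find individual keywords inside the long text, while `description.raw` is only indexed for strings up to 256 characters; longer values are simply skipped in the keyword sub-field.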