My analysis config is as follows:
index :
  analysis :
    analyzer :
      default_index :
        type : custom
        tokenizer : whitespace
        filter : [ word_delimiter, snowball, lowercase ]
      default_search :
        type : custom
        tokenizer : whitespace
        filter : [ word_delimiter, snowball, lowercase ]
    filter :
      word_delimiter :
        type : word_delimiter
        preserve_original : true
        split_on_numerics : true
        stem_english_possessive : false
My input text is 793 characters long and contains "1675333000000088066" (character positions 754-773).
I want to search for the import id 1675333000000088066, but it is not found. When I paste my whole input string into the kopf analysis plugin, the entire string is processed in 256-character chunks, so 1675333000000088066 is split across two regions (the 3rd and 4th chunks):
1st region 0 - 255
2nd region 255 - 510
3rd region 510 - 765 (contains 16753330000)
4th region 765 - 793 (contains 00088066)
But this is with the keyword tokenizer. How can I tell it to take the entire string instead of 256-character chunks when tokenizing with the whitespace tokenizer? Or any other suggestions?
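For what it's worth, the keyword tokenizer exposes a buffer_size setting, which defaults to 256 and matches the chunk size observed above. A minimal sketch in the same YAML style, with a hypothetical tokenizer name keyword_full; whether raising it changes the chunking you saw may depend on the Elasticsearch version:

index :
  analysis :
    tokenizer :
      keyword_full :          # hypothetical name, for illustration only
        type : keyword
        buffer_size : 1024    # default is 256, matching the chunk size observed above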
We replaced the whitespace tokenizer with a pattern tokenizer (pattern set to whitespace).
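The exact pattern-tokenizer settings aren't shown in the thread, but the configuration was presumably something along these lines (the tokenizer name and pattern are assumptions):

index :
  analysis :
    tokenizer :
      whitespace_pattern :    # assumed name
        type : pattern
        pattern : "\\s+"      # split on runs of whitespace
    analyzer :
      default_index :
        type : custom
        tokenizer : whitespace_pattern
        filter : [ word_delimiter, snowball, lowercase ]

Note that, unlike the whitespace tokenizer (which caps tokens at 255 characters), the pattern tokenizer has no built-in token-length limit, which is likely why a long input with no whitespace now produces a single immense term, as in the exception below.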
Now, while indexing, we are hitting the max term size problem; we got the following exception in the logs:
IllegalArgumentException[Document contains at least one immense term in field="message" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[100, 111, 98, 106, 61, 61, 61, 61, 62, 123, 34, 108, 111, 99, 34, 58, 91, 123, 34, 100, 97, 116, 97, 34, 58, 123, 34, 117, 114, 108]...', original message: bytes can be at most 32766 in length; got 149542]; nested: MaxBytesLengthExceededException[bytes can be at most 32766 in length; got 149542];
Any suggestions, please? FYI: we got nearly 50,000 of these exceptions.
@jprante, any suggestions? Also, the data nodes were severely affected by memory issues: jstat -gcutil showed memory constantly above 95%, with GC running continuously. We therefore reverted to the whitespace tokenizer.
There is a setting on the string type that will throw out tokens larger than some size. That is your best bet to contain the exceptions, but you are accepting that those terms won't be in the index. I expect no one will find them anyway because they are too long. I bet you could truncate them with a token filter too, but I don't know that offhand.
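The field-level setting referred to here is presumably ignore_above (set in the string field's mapping, in characters), and the token filter could be the truncate filter, which caps each token at a given length, or the length filter, which drops tokens outside a size range. A sketch in the same YAML style, with illustrative filter names and sizes:

index :
  analysis :
    filter :
      long_token_truncate :    # illustrative name; truncate filter keeps only the first `length` characters of each token
        type : truncate
        length : 1024
      long_token_drop :        # illustrative name; length filter drops tokens longer than `max` characters
        type : length
        min : 1
        max : 8192
    analyzer :
      default_index :
        type : custom
        tokenizer : whitespace
        filter : [ word_delimiter, snowball, lowercase, long_token_truncate ]

Either approach keeps term sizes well under Lucene's 32766-byte limit; the difference is that the truncate filter keeps a prefix of each oversized token, while the length filter (and ignore_above) discards the value entirely.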