WhiteSpaceTokenizer buffer_size

My analysis configuration is as follows:
index :
  analysis :
    analyzer :
      default_index :
        type : custom
        tokenizer : whitespace
        filter : [ word_delimiter, snowball, lowercase]
      default_search :
        type : custom
        tokenizer : whitespace
        filter : [ word_delimiter, snowball, lowercase]
    filter :
      word_delimiter :
        type : word_delimiter
        preserve_original : true
        split_on_numerics : true
        stem_english_possessive : false

My input text, 793 characters long, contains "1675333000000088066"
(character positions 754-773).

I want to search for the import ID 1675333000000088066, but it is not found. I then pasted my whole input string into the kopf analysis plugin, which showed that the entire string is processed in chunks of 256 characters. As a result, 1675333000000088066 is split across two regions (the 3rd and 4th chunks):

1st region 0 - 255
2nd region 255 - 510
3rd region 510 - 765 contains 16753330000
4th region 765-793 contains 00088066

After googling, I found this link

But that is for the keyword tokenizer. How can I tell the whitespace tokenizer to take the entire string instead of 256-character chunks? Or are there any other suggestions?

Lucene's char-based tokenizers impose a hard-wired 256 length limit for words, see also https://issues.apache.org/jira/browse/LUCENE-5785

See MAX_WORD_LEN in org.apache.lucene.analysis.util.CharTokenizer

As a workaround, use the pattern tokenizer https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern-tokenizer.html with a whitespace character pattern. Note that it reads the whole input of the document into RAM for tokenizing.
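
A minimal sketch of what that workaround could look like in the configuration from the question. The tokenizer name "whitespace_pattern" is a hypothetical label chosen here, and '\s+' is an assumed pattern for "one or more whitespace characters"; the word_delimiter filter section from the original configuration is assumed to stay unchanged and is omitted.

```yaml
index:
  analysis:
    tokenizer:
      # hypothetical name; any unused name works
      whitespace_pattern:
        type: pattern
        # split on runs of whitespace (assumed pattern, not from the thread)
        pattern: '\s+'
    analyzer:
      default_index:
        type: custom
        tokenizer: whitespace_pattern
        filter: [word_delimiter, snowball, lowercase]
      default_search:
        type: custom
        tokenizer: whitespace_pattern
        filter: [word_delimiter, snowball, lowercase]
```

Because this tokenizer does not go through CharTokenizer, the per-word length limit should no longer apply, at the cost of holding the whole field value in memory during analysis.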


Thanks for your reply.


We replaced the whitespace tokenizer with a pattern tokenizer (using whitespace as the pattern).
Now, while indexing, we are hitting the maximum term size problem and get the following exception in the logs:

IllegalArgumentException[Document contains at least one immense term in field="message" (whose UTF8 encoding is longer than the max length 32766), all of which were skipped. Please correct the analyzer to not produce such terms. The prefix of the first immense term is: '[100, 111, 98, 106, 61, 61, 61, 61, 62, 123, 34, 108, 111, 99, 34, 58, 91, 123, 34, 100, 97, 116, 97, 34, 58, 123, 34, 117, 114, 108]...', original message: bytes can be at most 32766 in length; got 149542]; nested: MaxBytesLengthExceededException[bytes can be at most 32766 in length; got 149542];

Any suggestions, please? FYI: we got nearly 50000 exceptions.

@jprante, any suggestions? Also, the data nodes were severely affected by memory issues: jstat -gcutil constantly shows more than 95% memory usage, with GC running continuously. We have therefore reverted to the whitespace tokenizer.

There is a setting on the string type that will throw out tokens larger
than some size. That is your best bet to contain the exceptions, but you are
accepting that they won't be in the index. I expect no one will find them
anyway because they are too long. I bet you could truncate them with a
token filter too, but I don't know that offhand.
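
As a rough sketch of the token filter route: Elasticsearch ships a "length" token filter (drops tokens outside a min/max length) and a "truncate" token filter (cuts tokens to a fixed length). The filter names "drop_long_tokens" and "truncate_long_tokens" and the value 1000 are assumptions for illustration only. The limits are counted in characters, while the 32766 limit in the exception is in UTF-8 bytes, so the value should be chosen well below it.

```yaml
index:
  analysis:
    filter:
      # hypothetical name: drops any token longer than max characters
      drop_long_tokens:
        type: length
        max: 1000        # assumed example value
      # hypothetical name: alternative that keeps a shortened prefix instead
      truncate_long_tokens:
        type: truncate
        length: 1000     # assumed example value
    analyzer:
      default_index:
        type: custom
        tokenizer: whitespace   # or the pattern-based replacement discussed above
        # placed first so oversized tokens are removed before further processing
        filter: [drop_long_tokens, word_delimiter, snowball, lowercase]
```

This would only address the immense-term exception. It does not change how much memory the tokenizer itself needs, so it is not expected to help with the GC pressure reported above.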