NGram Index implications

Hi,

I'd like to use nGrams for my indexes as part of an autocomplete analyzer.
My only concern is the impact on index RAM/disk size.
Is there a rule of thumb for how much the index consumes with:

  1. "min_gram": 1,
    "max_gram": 5

  2. "min_gram": 1,
    "max_gram": 10

  3. "min_gram": 1,
    "max_gram": 15

  4. "min_gram": 1,
    "max_gram": 20

Does the index size grow as a simple multiple of max_gram, or by something else?
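
For context, the kind of settings I have in mind look roughly like this (index, tokenizer, and analyzer names are just placeholders; on newer Elasticsearch versions a min/max gap this wide may also require raising index.max_ngram_diff):

```
PUT /my_test_index
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "autocomplete_tokenizer": {
          "type": "ngram",
          "min_gram": 1,
          "max_gram": 5
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "autocomplete_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```

The relevant text fields would then be mapped with "analyzer": "autocomplete".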

I'd suggest the edge ngram tokenizer, as it produces fewer tokens and is generally good enough for autocomplete. As for size with the plain ngram tokenizer: it doesn't double going from max_gram 5 to 10, since most words are not 10 characters long. The min_gram typically has a greater impact on the number of tokens than the max. Take this text, for example:

This is a test.

T
Th
Thi
This
h
hi
his
i
is
s
...

As you can see, the number of tokens explodes with a min_gram of 1: the word "This" alone yields 4 + 3 + 2 + 1 = 10 ngrams. Using the edge ngram tokenizer instead gives you

T
Th
Thi
This

The downside is that you won't match inside words, but typically that's not a huge deal for autocomplete. There is a site plugin (inquisitor) that lets you play with tokenizers and see the results of tokenization.
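
For example, an edge-ngram version of an autocomplete analyzer might look roughly like this (names are placeholders; older Elasticsearch versions spell the tokenizer type "edgeNGram", newer ones "edge_ngram"), and on recent versions you can inspect the resulting tokens with the _analyze API instead of a plugin:

```
PUT /my_test_index_edge
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "autocomplete_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 10
        }
      },
      "analyzer": {
        "autocomplete": {
          "type": "custom",
          "tokenizer": "autocomplete_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  }
}

GET /my_test_index_edge/_analyze
{
  "analyzer": "autocomplete",
  "text": "This is a test."
}
```

(On older versions _analyze takes the analyzer and text as query-string parameters rather than a request body.)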

Thanks for the quick response, Brice, and for the heads-up about edge ngram; I'll check it out.
Now the same question for edge ngram: what does it do to index size?

It's not nearly as bad, because the number of tokens doesn't increase anywhere near as much. With min_gram=1 and max_gram=10 you could see at most a 10x increase in tokens for each field indexed that way, but that may not correspond to a 10x increase in index size. That's because, at least for English, many words are shorter than 10 characters, and it's likely you wouldn't use this tokenizer for every field in your index. I can't give exact numbers without testing with your data.
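
One way to get real numbers with your own data: create two small test indices that differ only in the tokenizer (for example the ngram and edge-ngram setups sketched above; the index names are just placeholders), bulk index the same sample documents into both, and compare the on-disk sizes reported by the stats or cat APIs:

```
GET /my_test_index,my_test_index_edge/_stats/store

GET /_cat/indices/my_test_index*?v&h=index,docs.count,store.size
```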

Thanks.
I guess I'll have to run some benchmarks and see for myself :slight_smile: