Hello searchers,
I think one of the best tokenizer algorithms (and maybe the most used) is the NGram tokenizer (the equivalent of the SQL: where table.column like '%search_text%').
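Here is a minimal sketch of the kind of setup I mean, assuming Elasticsearch and its built-in ngram tokenizer (the index, analyzer, and field names are just placeholders):

```
PUT substring_demo
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_ngram": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 3,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "my_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "my_ngram",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "my_ngram_analyzer"
      }
    }
  }
}
```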
Indeed, given an input text, it can generate a very large number of tokens, so for any input search text it is likely to find a matching token. But to me it is the easy solution, one that does not take the performance cost on the other side into account: how long does it take to update this kind of index? How much weight does it add to searching in general? This NGram algorithm can require a lot of disk space, far too much when min_gram is low.
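To make this concrete, here is a quick test with the _analyze API (the gram sizes are only example values):

```
POST _analyze
{
  "tokenizer": {
    "type": "ngram",
    "min_gram": 2,
    "max_gram": 3
  },
  "text": "elasticsearch"
}
```

For this single 13-character word it already returns 23 tokens (twelve 2-grams plus eleven 3-grams), and every one of them ends up in the inverted index; lowering min_gram only makes it worse.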
Any performance feedback? Thx.
EDIT: is it possible to get metrics that say, "for this full-text field of this index, we have X generated tokens; among all these tokens, you can see how often each one has been a candidate and so matched a given input text"?
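To clarify what I mean: something like the term vectors API output, but aggregated per field and enriched with query-time hit counts. Here is a sketch of the term vectors call I am thinking of (the index name, document id, and field are placeholders):

```
GET my_index/_termvectors/1
{
  "fields": ["content"],
  "term_statistics": true,
  "field_statistics": true
}
```

That shows each generated token with its doc_freq and total term frequency, but as far as I can tell nothing reports how often each token was actually a candidate for a query.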