Hello searchers,
I think one of the best tokenizer algorithms (and maybe the most used) is the NGram tokenizer (the equivalent of the SQL: where table.column like '%search_text%').
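Here is a minimal sketch of the kind of setup I mean, assuming Elasticsearch and its built-in ngram tokenizer (the index, analyzer, and field names are just placeholders):

```
PUT substring_demo
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "my_ngram": {
          "type": "ngram",
          "min_gram": 2,
          "max_gram": 3,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "my_ngram_analyzer": {
          "type": "custom",
          "tokenizer": "my_ngram",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "my_ngram_analyzer"
      }
    }
  }
}
```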
Indeed, given an input text, it can generate a very large number of tokens, so for any input search text it is likely to find a matching token. But to me it is the easy solution, one that does not take the performance cost on the other side into account: how long does it take to update this kind of index? How much weight does it add to searching in general? This NGram algorithm can require a lot of disk space, far too much when min_gram is low.
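To make this concrete, here is a quick test with the _analyze API (the gram sizes are only example values):

```
POST _analyze
{
  "tokenizer": {
    "type": "ngram",
    "min_gram": 2,
    "max_gram": 3
  },
  "text": "elasticsearch"
}
```

For this single 13-character word it already returns 23 tokens (twelve 2-grams plus eleven 3-grams), and every one of them ends up in the inverted index; lowering min_gram only makes it worse.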
Any performance feedback? Thx.
EDIT: is it possible to get metrics that say, "for this full-text field of this index, we have X generated tokens; among all these tokens, you can see how often each one has been a candidate and so matched a given input text"?
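To clarify what I mean: something like the term vectors API output, but aggregated per field and enriched with query-time hit counts. Here is a sketch of the term vectors call I am thinking of (the index name, document id, and field are placeholders):

```
GET my_index/_termvectors/1
{
  "fields": ["content"],
  "term_statistics": true,
  "field_statistics": true
}
```

That shows each generated token with its doc_freq and total term frequency, but as far as I can tell nothing reports how often each token was actually a candidate for a query.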