Long Words Matching

Mats_Stijlaart · June 4, 2012, 9:34am

I'm creating an index that contains both short and long words (more than 18
characters).

Currently I am using an ngram token filter with a minimum of 1 and a maximum
of 20.
Words that have more than 20 characters will not match because they are not
fully indexed.
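
To make the limitation concrete, here is a minimal Python sketch of what a 1..20 ngram filter would emit. This is not Elasticsearch code: the `ngrams` helper and the sample word are assumptions for illustration, and the sketch assumes the search term is looked up as one whole token.

```python
def ngrams(word, min_n, max_n):
    """Return the set of character n-grams of `word` with min_n <= n <= max_n."""
    word = word.lower()
    return {
        word[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(word) - n + 1)
    }


# Hypothetical 24-character word, indexed with grams of length 1..20.
long_word = "pneumonoultramicroscopic"
indexed = ngrams(long_word, 1, 20)

# The longest stored gram is 20 characters, so the full word is never a token.
longest_gram = max(len(gram) for gram in indexed)
full_word_found = long_word in indexed

# Any substring of up to 20 characters is present as a token.
prefix_20_found = long_word[:20] in indexed
prefix_21_found = long_word[:21] in indexed
```

Under these assumptions, only search terms of 20 characters or fewer can match a stored token, which is the behaviour described above.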

It is possible to index those words 'untouched', but then I still cannot
get matches for substrings with a length between 20 and the full-word
length.

I've considered increasing the ngram maximum to 50, but I think that will
hurt performance, and in theory it does not solve the problem (I cannot
think of any, but words may have more than 50 characters).

I'm looking for the 'best' solution to this problem: one that keeps my
performance intact and allows me to find any long word by any arbitrary
substring.

Thanks.

AEvar_Arnfjord_Bjarm · June 4, 2012, 11:18am

I'm not sure how exactly the scoring ends up working out, but you
don't need an nGram tokenizer of length N to match a word of length N,
although your scoring might suffer slightly.

Let's say you split up "Stijlaart" with an ngram 2..3 tokenizer. Then
you'll get:

St ti ij jl ..
Sti tij ijl jla ..

But you can still match "Stijlaart", because you'll match "St" AND "ti"
AND "ij" etc.
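
The idea can be sketched in plain Python as follows. This is an illustration only, not Elasticsearch code: `ngrams` and `matches` are hypothetical helpers, the sketch assumes the search text is run through the same 2..3 ngram analysis as the indexed word, and it assumes lowercasing and an AND over all resulting grams.

```python
def ngrams(word, min_n, max_n):
    """Return the set of character n-grams of `word` with min_n <= n <= max_n."""
    word = word.lower()
    return {
        word[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(word) - n + 1)
    }


def matches(query, indexed_word, min_n=2, max_n=3):
    """True if every gram of the query is also a gram of the indexed word."""
    query_grams = ngrams(query, min_n, max_n)
    if not query_grams:
        # A query shorter than min_n produces no grams and cannot match.
        return False
    return query_grams <= ngrams(indexed_word, min_n, max_n)


full_word = matches("Stijlaart", "Stijlaart")    # whole 9-character word
substring = matches("ijlaa", "Stijlaart")        # inner substring
wrong_word = matches("Stijlxart", "Stijlaart")   # "lx" is not an indexed gram
false_positive = matches("nanana", "banana")     # grams match, substring does not
```

The word matches even though it is far longer than the largest gram. The last case shows the trade-off: "nanana" is not a substring of "banana", yet all of its grams are present, so short grams combined with AND can produce false positives.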

I'm not sure how the scoring for that will compare to having an nGram
tokenizer that's long enough to fully include the term you're
searching for, but for getting a match you don't need to do what you
suggested.

I have a combination of trigram and 2..3-gram nGram filters to do fuzzy
matching on arbitrary text that usually includes words longer than
that.
