Long Words Matching


(Mats Stijlaart) #1

I'm creating an index that contains both short and long words (more than 18
characters).

Currently I am using an ngram token filter with a minimum of 1 and a maximum
of 20.
Words that have more than 20 characters will not match because they are not
fully indexed.
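
To illustrate the limitation, here is a minimal Python sketch. It is not Elasticsearch code: the `ngrams` helper is hypothetical and only mimics what an ngram token filter with these bounds would presumably emit for a single word. It assumes the search term is looked up as one token rather than being split into grams itself.

```python
def ngrams(word, min_gram, max_gram):
    """Return the set of all substrings of word with length min_gram..max_gram."""
    word = word.lower()
    return {
        word[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(word) - n + 1)
    }

word = "abcdefghijklmnopqrstuvwxy"  # 25 characters
indexed = ngrams(word, 1, 20)

print(word[:20] in indexed)           # is a 20-character substring indexed?
print(word in indexed)                # is the full 25-character word indexed?
print(max(len(g) for g in indexed))   # length of the longest indexed gram
```

Under these assumptions, no gram longer than 20 characters exists in the index, so the full word (or any substring of 21 to 25 characters) has nothing to match against as a single term.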

It is possible to index those words 'untouched', but then I still cannot
get matches for substrings with a length between 20 and the full-word
length.

I've considered increasing the ngram maximum to 50, but I think that will
hurt performance and, in theory, does not solve the problem (I
cannot think of any, but words may have more than 50 characters).
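
As a rough way to reason about the cost, the number of grams emitted per word can be counted directly. This is a back-of-the-envelope sketch, not a benchmark: `gram_count` is a hypothetical helper that counts gram positions for one word, and real index size depends on many other factors.

```python
def gram_count(word_length, min_gram, max_gram):
    """Count the grams of length min_gram..max_gram in a word of word_length.

    A word of length L contains L - n + 1 grams of length n (none if n > L).
    """
    return sum(
        max(0, word_length - n + 1)
        for n in range(min_gram, max_gram + 1)
    )

for max_gram in (3, 20, 50):
    print(max_gram, gram_count(50, 1, max_gram))
```

Under this count, a 50-character word yields 810 grams with a maximum of 20 and 1275 with a maximum of 50, and a word of 51 characters would still not be indexed in full.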

I'm looking for the 'best' solution to tackle this problem that keeps my
performance intact and allows me to find any long word with any random
substring.

Thanks.


(Ævar Arnfjörð Bjarmason) #2

I'm not sure how exactly the scoring ends up working out, but you
don't need an nGram tokenizer of length N to match a word of length N,
although your scoring might suffer slightly.

Let's say you split up "Stijlaart" with an ngram 2..3 tokenizer. Then
you'll get:

St ti ij jl ..
Sti tij ijl jla ..

But you can still match "Stijlaart", because you'll match "st" AND "ti"
AND "ij" etc.
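
The idea can be sketched in plain Python. This is only an illustration, assuming the query is split with the same 2..3 gram settings as the indexed word and that every query gram is required; `ngrams` and `matches` are hypothetical helpers, not Elasticsearch APIs.

```python
def ngrams(text, min_gram=2, max_gram=3):
    """All substrings of text with length min_gram..max_gram, lowercased."""
    text = text.lower()
    return {
        text[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(text) - n + 1)
    }

def matches(query, word):
    """True if every gram of the query is also a gram of the indexed word."""
    query_grams = ngrams(query)
    # A query shorter than min_gram yields no grams and cannot match.
    return bool(query_grams) and query_grams <= ngrams(word)

print(matches("Stijlaart", "Stijlaart"))  # whole word, longer than max_gram
print(matches("ijlaa", "Stijlaart"))      # inner substring
print(matches("Stijlbart", "Stijlaart"))  # "lb" is not an indexed gram
print(matches("cabc", "abcab"))           # all grams present, substring is not
```

The last call shows the trade-off: because only the presence of grams is checked, not their order, "cabc" matches "abcab" even though it is not a substring of it. Short grams keep the index small and cover words of any length, at the price of occasional loose matches like this one.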

I'm not sure how the scoring for that will compare to having an nGram
tokenizer that's long enough to fully include the term you're
searching for, but for getting a match you don't need to do what you
suggested.

I have a combination of trigram and 2..3gram nGram filters to do fuzzy
matching on arbitrary text that usually includes words longer than
that.


(system) #3