Fuzzy search and term length


(Michael Scholl) #1

Hi there,

We are building an analysis tool for citations.
While importing the citations we do parent/child matching.
(Right now we run a flt (fuzzy_like_this) search on the title with the min_similarity option set to 0.9 and take the first hit with a score above 0.8.)
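
The lookup described above could be sketched roughly as follows. This is only an illustration under assumptions: the field name `title` and the helper names `build_flt_query` and `pick_parent` are hypothetical, and the response is assumed to have the usual `hits.hits[]._score` shape. Only the 0.9 and 0.8 values come from the setup described here.

```python
# Sketch of the flt lookup described above. The field name "title" and the
# helper names are assumptions for illustration, not the real importer code.

MIN_SIMILARITY = 0.9  # flt min_similarity option mentioned in the post
SCORE_CUTOFF = 0.8    # the first hit is accepted only above this score


def build_flt_query(title):
    """Build a fuzzy_like_this (flt) request body for a citation title."""
    return {
        "query": {
            "fuzzy_like_this": {
                "fields": ["title"],
                "like_text": title,
                "min_similarity": MIN_SIMILARITY,
            }
        }
    }


def pick_parent(response):
    """Return the first hit if its score is above the cutoff, else None."""
    hits = response.get("hits", {}).get("hits", [])
    if hits and hits[0]["_score"] > SCORE_CUTOFF:
        return hits[0]
    return None
```

The body from `build_flt_query` would be sent to the search endpoint of the citation index, and `pick_parent` applied to the parsed JSON response.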

For most citations this matching works great and the results are perfect (it brings together titles with small typing differences).
The only problem is short terms: a title like "test" matches longer titles like "this is a test".
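
To make the short-term problem concrete, here is a rough sketch of a client-side length guard that could be applied to a hit after the search. It is not something flt offers; the helper names and the 0.75 threshold are hypothetical, and counting whitespace-separated tokens is just one possible measure of length.

```python
# Hypothetical post-filter: reject a hit whose title length differs too much
# from the query title. The threshold is an assumption, not a tuned value.


def length_ratio(query_title, hit_title):
    """Ratio of the smaller token count to the larger one (0.0 to 1.0)."""
    q_len = len(query_title.split())
    h_len = len(hit_title.split())
    if q_len == 0 or h_len == 0:
        return 0.0
    return min(q_len, h_len) / max(q_len, h_len)


def accept_match(query_title, hit_title, min_ratio=0.75):
    """Accept the hit only if both titles have a similar number of tokens."""
    return length_ratio(query_title, hit_title) >= min_ratio
```

With this guard, "test" against "this is a test" has a ratio of 0.25 and would be rejected, while two four-word titles that differ only by a typo would still pass.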

When we asked about this problem on IRC, it was suggested that we use another analyzer, such as phonetic, or combine the scores of different analyzers
(http://www.elasticsearch.org/guide/reference/index-modules/analysis/phonetic-tokenfilter.html).

The same problem stopped us from using flt for Author/Publisher matching, so we switched back to term-based matching there.
Single letters were a huge problem in those fields.

clintongormley also mentioned that an option for flt to take the length of the search term into account when scoring could be a future feature.
Is there already an issue for that which we could vote for? :)

Any other ideas for matching are welcome!
(Reading http://zmievski.org/2011/03/duplicates-detection-with-elasticsearch is what led us to fuzzy search.)

Cheers

Mischosch

