Hello,
I was reading this group's posts and there seem to be two schools of thought
on n-gram use:
- index with an n-gram-enabled analyzer but search with an analyzer without
n-grams, so that complete search terms are matched against n-grams
- index with n-grams and search with n-grams
My understanding is:
#1 will require very long n-grams; there will be very few (one?) term
matches per document, and the longer/rarer the matched n-gram, the better
the match. It essentially generates tons of "synonyms" (n-grams) for your
searched field and matches your terms against them. One of the problems is
that the n-gram length essentially has to be longer than the longest word.
That seems to be an issue: while a handful of characters is often enough to
identify the document (think of an auto-complete scenario), providing a
search token longer than the max n-gram length will return no hits.
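To make #1 concrete, here is a minimal Python sketch of the behavior described above. It only mimics what an n-gram analyzer would emit at index time; the function names are hypothetical and this is not Elasticsearch itself:

```python
# Toy illustration of approach #1: the index side emits n-grams of the field,
# while the search side leaves the query token whole, so the whole token must
# equal some indexed n-gram.

def ngrams(text, min_gram, max_gram):
    """All n-grams of text with lengths in [min_gram, max_gram]."""
    return {
        text[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(text) - n + 1)
    }

def matches_whole_token(query, indexed_terms):
    """Approach #1: the unanalyzed query token must match an indexed n-gram."""
    return query in indexed_terms

indexed = ngrams("contract-12345", min_gram=1, max_gram=5)

print(matches_whole_token("1234", indexed))    # short token: hit -> True
print(matches_whole_token("123456", indexed))  # longer than max_gram: False
```

This shows the failure mode mentioned above: any query token longer than `max_gram` can never match, no matter how good the candidate document is.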
#2 will need short n-grams, 3-5 characters at most, and will match the
n-grammed search term against the n-grammed field in the index. The more
matches, the better the score. The precision is probably not as good as #1,
so it would need to be combined with a search on the original field and
maybe a shingled field. But it will potentially handle simple typos.
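And a similar toy sketch of approach #2, n-gramming both sides and scoring by overlap (again, hypothetical helper names, just mimicking the analyzer, not Elasticsearch's actual scoring):

```python
# Toy illustration of approach #2: both the query and the field are broken
# into short n-grams, and the score grows with the number of shared grams.

def trigrams(text):
    """All trigrams of text (approach #2 typically uses short, fixed n)."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

def overlap_score(query, field):
    """Fraction of the query's trigrams also present in the field's trigrams."""
    q = trigrams(query)
    return len(q & trigrams(field)) / len(q) if q else 0.0

# A one-character typo still shares most trigrams with the indexed value,
# so the document still scores well:
print(overlap_score("elasticsaerch", "elasticsearch"))
```

This is why #2 tolerates simple typos: a transposition destroys only the few trigrams that cross it, while the rest still match.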
I have two use cases (both for auto-complete pick lists):
- A long identifier (contract number), 10-30 characters, which needs to be
searchable on any part of it
- A company name, which needs to be searchable on individual words from the
start of each word (could use a phrase prefix query or edgeNGram)
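For the company-name case, a quick sketch of what edge n-grams would emit per word (hypothetical helpers; a real `edge_ngram` filter in Elasticsearch does the equivalent at index time):

```python
# Toy illustration of edge n-grams for the company-name use case: each word
# is expanded into its leading prefixes, so a query can match any word from
# its start, but not from the middle.

def edge_ngrams(word, min_gram=1, max_gram=20):
    """Leading-edge n-grams of a single word, like an edge_ngram filter."""
    return {word[:n] for n in range(min_gram, min(len(word), max_gram) + 1)}

def index_company_name(name):
    """Emit edge n-grams for every word (after a whitespace tokenizer)."""
    terms = set()
    for word in name.lower().split():
        terms |= edge_ngrams(word)
    return terms

indexed = index_company_name("Acme Holdings International")
print("hold" in indexed)   # prefix of a non-leading word: hit
print("oldi" in indexed)   # infix: no hit with edge n-grams
```

The contract-number case is different, since any part of the identifier must match; there, full n-grams (as in the first sketch) or approach #2 would be needed rather than edge n-grams.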
Could you please share your opinion of #1 and #2 (and any other techniques
you have used) and their applicability to my cases?
Thank you,
Alex
--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.