I implemented a custom filter which uses the EdgeNGram tokenizer. The
problem I face is that whether I search for something relevant or total
garbage, I get a large number of hits. I suspect that this is due to the
fact that I'm using an EdgeNGram tokenizer. How can I ensure the quality of
the search results and have ES return no matches for low-quality results?
Does the fact that Elasticsearch returns a large number of hits (~100,000)
mean that there's probably something wrong in how I'm constructing the
query? And does returning this many results affect query time
substantially?
First, no one but you knows the requirements of your application. There
might be applications where it is unacceptable to return no hits at all, so
people would rather return 100k hits than none. Most of the time this is
not the case, though.
Second, those 100k hits might be valid. Maybe you just used a generic
search term. It is hard to tell from the outside whether your search
quality is good or bad. Judging from your comment, however, it sounds like
it is not.
So, what can be done about it? You should try the Analyze API, which
shows you how the content of a field is actually broken down and stored in
the index. This helps you understand why a certain term matches a given
document. The Explain API can also be used for this. See
If you want to use a nice graphical frontend for the analyze API, try the
inquisitor plugin.
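To make the effect of edge n-grams concrete, here is a minimal sketch in plain Python. It does not call Elasticsearch; it only imitates what an edge n-gram analysis chain produces. The function names, the min_gram/max_gram values and the sample strings are assumptions chosen for illustration, and it assumes the same edge n-gram analysis is also applied to the query at search time, which may or may not be true for your mapping.

```python
def edge_ngrams(token, min_gram=1, max_gram=10):
    """Return the leading-edge n-grams of a token (illustrative, not ES code)."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def analyze(text, min_gram=1, max_gram=10):
    """Lowercase, split on whitespace, and collect the edge n-grams of each word."""
    terms = set()
    for word in text.lower().split():
        terms.update(edge_ngrams(word, min_gram, max_gram))
    return terms


# Terms stored in the index for a document containing "Elasticsearch".
indexed = analyze("Elasticsearch")

# Hypothetical garbage query, analyzed with the SAME edge n-gram chain.
garbage_query = analyze("exqzzy")

# Any shared term is enough for a match: here the single letter "e".
overlap = indexed & garbage_query

# The same garbage query kept as one whole term shares nothing with the index.
whole_term_query = {"exqzzy"}
strict_overlap = indexed & whole_term_query
```

If the output of the Analyze API for your query text looks like garbage_query above, a different analyzer at search time than at index time, or a larger minimum gram size, is worth trying.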
In your setup, another strategy might be to run a stricter search first,
and only if that one does not match anything, fall back to the (edge)ngram
one, which is likely to return more hits. Also, I guess there was a reason
why you chose that configuration in the first place. With a couple of
examples, we might be able to explain why certain documents are actually
being returned. A reproducible gist for others might help here.
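The "stricter search first, ngram search as a fallback" idea can be sketched as follows. This is only an outline under assumptions: search stands for whatever function in your application sends a query to Elasticsearch, and the field names "exact" and "edge_ngram" are hypothetical placeholders for a field analyzed without ngrams and one analyzed with them.

```python
def search_with_fallback(search, term):
    """Try the strict field first; use the edge n-gram field only if it finds nothing.

    `search` is a caller-supplied function (field, term) -> list of hits.
    Returns a tuple of (field that was used, hits).
    """
    strict_hits = search("exact", term)
    if strict_hits:
        return "exact", strict_hits
    return "edge_ngram", search("edge_ngram", term)


def fake_search(field, term):
    """Stand-in for a real Elasticsearch call, backed by a tiny in-memory table."""
    hits_by_field = {
        "exact": {"kibana": ["doc1"]},
        "edge_ngram": {"kib": ["doc1", "doc2"], "kibana": ["doc1", "doc2"]},
    }
    return hits_by_field[field].get(term, [])


full_word = search_with_fallback(fake_search, "kibana")
prefix = search_with_fallback(fake_search, "kib")
garbage = search_with_fallback(fake_search, "zzz")
```

The cost of this approach is a second round trip whenever the strict search comes back empty, so it fits best when most queries are expected to match exactly.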
Hope this helps as a start to dive deeper into your issue.