Searching for misspellings


(Nick Hoffman) #1

Hey everyone. I'm trying to configure my index and mapping to return
documents if the user misspells a word. Using Clinton Gormley's nGram
example (https://gist.github.com/961303), I've created an index with a
mapping on one field. Searching for correct spellings works, but not
misspellings. Any idea what I'm doing wrong?

Here's a gist that can be run on the CLI easily: https://gist.github.com/1283380

As far as I understand, when indexing the "optimus prime" document, "optimus
prime" will be split into tokens "o", "op", ..., "opti", etc. Thus,
shouldn't the "o" to "opti" tokens match "optius"?
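For reference, here's roughly the split I have in mind, mimicked in plain Python (this is just my mental model of the edge n-gram filter, not what Elasticsearch actually runs):

```python
def edge_ngrams(text, min_gram=1, max_gram=20):
    """Mimic an edge n-gram token filter: emit prefixes of each token."""
    tokens = text.lower().split()
    return [t[:n] for t in tokens
            for n in range(min_gram, min(max_gram, len(t)) + 1)]

edge_ngrams("Optimus Prime")
# ['o', 'op', 'opt', 'opti', ..., 'optimus', 'p', 'pr', ..., 'prime']
```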

If you have any advice, I'm all ears! Thanks,
Nick


(Karussell) #2

For misspellings this is not the method of choice: "opti" just matches
"opti".

You can create a phonetic analyzer:

Or wait for Lucene 4.0 (or use fuzzy queries now if you don't have too
many docs):
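Fuzzy queries match terms within a small edit distance, which is exactly what a one-letter typo is. A rough illustration of the idea in Python (the Levenshtein metric, not Lucene's actual implementation):

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

edit_distance("optius", "optimus")  # 1 -- the typo just dropped an 'm'
```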

On 13 Oct., 07:01, Nick Hoffman n...@deadorange.com wrote:

Here's a gist that can be run on the CLI easily: https://gist.github.com/1283380


(Jan Fiedler) #3

I think I can at least explain why you don't see the results you expect:
from the gist, it seems you are using an edge n-gram filter at indexing time
but a pretty standard analyzer at query time. Let's look at what will happen
for your example data:

Using the 'ascii_edge_ngram' at indexing time will index 'Optimus Prime'
into something like: [o], [op], [opt], [opti], [optim], ...
You can test this for yourself via the great analyzer API: curl -XGET
'localhost:9200/test_products/_analyze?analyzer=ascii_edge_ngram&pretty=true'
-d 'Optimus Prime'

At search time you are applying your 'ascii_std' analyzer to a misspelled
query like 'optius'. Using the analyzer API, you can see this is broken down
into a single token, [optius]. Notice that there is not a single token in
the indexed content that would match it. Therefore the misspelled query
produces no match.

I think you need to use the same (or at least similar) analyzers at indexing
and search time. You may want to try an n-gram (not edge n-gram) filter and
play with the size of your grams (n-grams of size 1 produce a lot of noise
for this type of query).
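To see why that helps: with full n-grams on both sides, the misspelling still shares several grams with the indexed term. A plain-Python mimic of the filter (the 3-8 sizes here are just an example, not a recommendation):

```python
def ngrams(text, min_gram=3, max_gram=8):
    """Mimic a (non-edge) n-gram filter: all substrings of length min..max."""
    return {t[i:i + n]
            for t in text.lower().split()
            for n in range(min_gram, max_gram + 1)
            for i in range(len(t) - n + 1)}

shared = ngrams("Optimus Prime") & ngrams("optius")
# {'opt', 'pti', 'opti'} -- enough common grams for the typo to still match
```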


(Nick Hoffman) #4

I considered a phonetic analyzer, but many of the words and phrases that my
app will be indexing are non-dictionary words with non-standard
pronunciation.


(Nick Hoffman) #5

Thanks for taking the time to examine the gist and write such a detailed
explanation, Jan. I really appreciate it, and it was very helpful. I changed
the index and search analyzers from the 1-20-character front edgeNGram to a
3-8-character nGram, and my searches now return the results I'd like.

I'm not sure if a 3-8-character nGram is optimal, though. Are there any
recommendations for how to determine the min and max characters for an nGram
filter?
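One thing I can quantify is the tradeoff itself: lowering the minimum gram size inflates the number of grams per term and makes each one less selective (plain-Python mimic of the filter, sizes just for illustration):

```python
def ngrams(token, min_gram, max_gram):
    """All substrings of token with length between min_gram and max_gram."""
    return [token[i:i + n]
            for n in range(min_gram, max_gram + 1)
            for i in range(len(token) - n + 1)]

len(ngrams("optimus", 1, 8))  # 28 grams, 7 of them single letters
len(ngrams("optimus", 3, 8))  # 15 grams, each one far more selective
```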

Here's the latest version of the code. All of the searches return the
expected results!

Thanks again


(system) #6