Hey everyone. I'm trying to configure my index and mapping to return
documents if the user misspells a word. Using Clinton Gormley's nGram
example (https://gist.github.com/961303), I've created an index with a
mapping on one field. Searching for correct spellings works, but not
misspellings. Any idea what I'm doing wrong?
Here's a gist that can be run on the CLI easily:
As far as I understand, when indexing the "optimus prime" document, "optimus
prime" will be split into tokens "o", "op", ..., "opti", etc. Thus,
shouldn't the "o" to "opti" tokens match "optius"?
If you have any advice, I'm all ears! Thanks,
Nick
I think I can at least explain why you do not see the results you expect:
From the gist it seems you are using an edge-ngram filter at indexing time
but a pretty standard analyzer at query time. Let's look at what will happen
for your example data:
Using the 'ascii_edge_ngram' at indexing time will index 'Optimus Prime'
into something like: [o], [op], [opt], [opti], [optim], ...
You can test this for yourself via the great analyzer API:
curl -XGET 'localhost:9200/test_products/_analyze?analyzer=ascii_edge_ngram&pretty=true' -d 'Optimus Prime'
At search time you are applying your 'ascii_std' analyzer to a misspelled
query like 'optius'. Using the analyzer API you can see this is broken down
into a single token [optius]. Notice that no token in the indexed content is
equal to [optius]: the index holds prefixes of 'optimus' and 'prime' only.
Therefore the misspelled query does not match.
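To make the mismatch concrete, here is a rough Python sketch that mimics the two analyzers outside of Elasticsearch. It is only an approximation: `edge_ngrams`, `index_tokens`, and `query_tokens` are hypothetical helpers, and I'm assuming that lowercasing plus whitespace splitting is a close enough stand-in for the real tokenizer and ASCII folding in the gist.

```python
def edge_ngrams(token, min_gram=1, max_gram=20):
    """Front edge n-grams: prefixes of the token, min_gram..max_gram long."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


def index_tokens(text):
    # Assumed stand-in for the index-time analyzer ('ascii_edge_ngram'):
    # lowercase, split on whitespace, then emit edge n-grams per word.
    tokens = set()
    for word in text.lower().split():
        tokens.update(edge_ngrams(word))
    return tokens


def query_tokens(text):
    # Assumed stand-in for the search-time analyzer ('ascii_std'):
    # lowercase and split only, so each word stays a single token.
    return set(text.lower().split())


indexed = index_tokens("Optimus Prime")
print(sorted(indexed))

# Correct spelling: the whole word is one of the indexed prefixes.
print(query_tokens("optimus") & indexed)  # {'optimus'}

# Misspelling: 'optius' is not a prefix of 'optimus', so nothing overlaps.
print(query_tokens("optius") & indexed)   # set()
```

The query side never produces [opti], so the indexed [opti] token has nothing to match against.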
I think you need to use the same (or at least similar) analyzers at indexing
and search time. You may want to try an n-gram (not edge n-gram) and play
with the size of your grams (ngrams of size 1 produce a lot of noise for
this type of query).
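Here is a small sketch of that idea, again in plain Python rather than Elasticsearch. The `ngrams` and `analyze` helpers are hypothetical, the same function is applied to both the document and the query, and the gram sizes (1, 3, 8) are arbitrary values picked for illustration, not a recommendation.

```python
def ngrams(token, min_gram, max_gram):
    """All substrings of the token with length min_gram..max_gram."""
    grams = set()
    for size in range(min_gram, max_gram + 1):
        for start in range(len(token) - size + 1):
            grams.add(token[start:start + size])
    return grams


def analyze(text, min_gram, max_gram):
    # Assumed stand-in for one n-gram analyzer used at index AND search time:
    # lowercase, split on whitespace, then emit n-grams per word.
    grams = set()
    for word in text.lower().split():
        grams |= ngrams(word, min_gram, max_gram)
    return grams


doc = "Optimus Prime"

# Misspelled query, grams of size 3..8 on both sides: some grams overlap.
print(sorted(analyze("optius", 3, 8) & analyze(doc, 3, 8)))
# ['opt', 'opti', 'pti']

# Unrelated query with size-1 grams allowed: single letters still overlap.
print(sorted(analyze("spam", 1, 8) & analyze(doc, 1, 8)))
# ['m', 'p', 's']

# Same unrelated query with a minimum size of 3: no overlap left.
print(sorted(analyze("spam", 3, 8) & analyze(doc, 3, 8)))
# []
```

The middle case is the noise I mean: with grams of size 1, almost any query shares a letter with almost any document.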
I considered a phonetic analyzer, but many of the words and phrases that my
app will be indexing are non-dictionary words with non-standard
pronunciation.
Thanks for taking the time to examine the gist and write a detailed
explanation, Jan. I really appreciate it. It was also very helpful. I
changed the index and search analyzers from the 1-20-character front
edgeNGram to a 3-8-character nGram, and my searches are returning results as
I'd like.
I'm not sure if a 3-8-character nGram is optimal, though. Are there any
recommendations for how to determine the min and max characters for an nGram
filter?
Here's the latest version of the code. All of the searches return the
expected results!