Advice on my approach to this search problem

Nick_Hoffman · October 15, 2011, 3:20pm

Hey guys. I've been using ES for 1-2 weeks now, and love it. Being very new
to it, though, I've been piecing together bits of knowledge as I go along. I
have a semi-working solution to a problem, but I'm sure that it's nowhere
close to an ideal solution. How would you approach this problem?

First, the data: The app has catalogs, products, and items, all of which are
related. When a user performs a search, though, only products need to be
found. For example, if there's a catalog named "Transformers" and the user
searches for "Transformers", all products in that catalog should be
returned. To accomplish this, when indexing products, I'm nesting the
related catalog and item data inside the product data. Eg:
{ name: "Optimus Prime", number: "TFG1S1-1",
catalog: { name: "Series 1", number: "TFG1S1" },
items: [ { name: "Optimus Prime" }, { name: "Instructions" } ]
}
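
A minimal sketch of that denormalization step, in Python, might look like the following. The function name `build_product_doc` and the shape of the input records are assumptions made for illustration; only the layout of the resulting document comes from the example above.

```python
def build_product_doc(product, catalog, items):
    """Nest the related catalog and item data inside the product document."""
    return {
        "name": product["name"],
        "number": product["number"],
        # Catalog fields are copied in so a search on the catalog name finds the product.
        "catalog": {"name": catalog["name"], "number": catalog["number"]},
        # Only the item names are kept, as in the example document.
        "items": [{"name": item["name"]} for item in items],
    }


doc = build_product_doc(
    {"name": "Optimus Prime", "number": "TFG1S1-1"},
    {"name": "Series 1", "number": "TFG1S1"},
    [{"name": "Optimus Prime"}, {"name": "Instructions"}],
)
print(doc)
```

The trade-off of this layout is that a product has to be reindexed whenever its catalog or items change.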

When searching, some users will misspell a word (Eg: "Trasformers", missing
the "n"), and some will provide a partial word (Eg: "Trans"). Despite this,
users expect to receive search results. Eg: Products with a field that
contains "Transformer" or "retransmit" could match.

My solution right now is this:

https://gist.github.com/nickhoffman/0cbe6892b4bc720bda92

1_settings.sh

curl -X PUT 'localhost:9200/development_products?pretty=true' -d '
{
  "settings": {
    "analysis": {
      "filter": {
        "3_8_ngram": {
          "type": "nGram",
          "min_gram": 3,
          "max_gram": 8
        }

(This file has been truncated; see the gist for the full version.)

2_mapping.sh

curl -X PUT 'localhost:9200/development_products/product/_mapping?pretty=true' -d '
{
  "product": {
    "properties": {

      "id": {
        "index": "not_analyzed",
        "type": "string"
      },

(This file has been truncated; see the gist for the full version.)

3_index_2_documents.sh

curl -X POST 'http://localhost:9200/development_products/product/4e6299c3349c301b290002ed?pretty=true' -d '
{
  "name":   "Grimlock",
  "number": "TFG1S2-15",

  "properties": {
    "Package Type": "MISB",
    "Manufacturer": "Hasbro"
  },

(This file has been truncated; see the gist for the full version.)

There are more than three files in the gist.

Do you have any suggestions for improvements? Thanks for your help!
Nick

Nick_Hoffman · October 17, 2011, 2:34pm

Bump!

Karussell1 · October 17, 2011, 2:56pm

spell check can be done via a dictionary

Analysis: Integration with Hunspell · Issue #646 · elastic/elasticsearch · GitHub
(opened 22 Jan 2011 by lukas-vlcek, closed 02 Jan 2013, v0.90.0.Beta1)

Feature added by @uboness: Added basic support for hunspell stemming.
Hunspell dictionaries will be picked up from a dedicated hunspell directory
on the filesystem (defaults to _<path.conf>_/hunspell). Each dictionary is
expected to have its own directory named after its associated locale
(language). This dictionary directory is expected to hold both the *.aff and
*.dic files (all of which will automatically be picked up). For example,
assuming the default hunspell location is used, the following directory
layout will define the _en_US_ dictionary:

```
- conf
    |-- hunspell
    |    |-- en_US
    |    |    |-- en_US.dic
    |    |    |-- en_US.aff
```

The location of the hunspell directory can be configured using the
`indices.analysis.hunspell.dictionary.location` setting in
_elasticsearch.yml_.

Each dictionary can be configured with two settings:

- `ignore_case` - If true, dictionary matching will be case insensitive
  (defaults to `false`)
- `strict_affix_parsing` - Determines whether errors while reading an affix
  rules file will cause an exception or simply be ignored (defaults to `true`)

These settings can be configured globally in `elasticsearch.yml` using
`indices.analysis.hunspell.dictionary.ignore_case` and
`indices.analysis.hunspell.dictionary.strict_affix_parsing`, or for specific
dictionaries: `indices.analysis.hunspell.dictionary.en_US.ignore_case` and
`indices.analysis.hunspell.dictionary.en_US.strict_affix_parsing`.

It is also possible to add a `settings.yml` file under the dictionary
directory which holds these settings (this will override any other settings
defined in `elasticsearch.yml`).

One can use the hunspell stem filter by configuring it in the analysis
settings:

```json
{
    "analysis" : {
        "analyzer" : {
            "en" : {
                "tokenizer" : "standard",
                "filter" : [ "lowercase", "en_US" ]
            }
        },
        "filter" : {
            "en_US" : {
                "type" : "hunspell",
                "locale" : "en_US",
                "dedup" : true
            }
        }
    }
}
```

## Original Request:

Hunspell is a spell checker and morphological analyzer designed for
languages with rich morphology and complex word compounding and character
encoding.

[1] Wikipedia, http://en.wikipedia.org/wiki/Hunspell
[2] Source code, http://hunspell.sourceforge.net/
[3] Hunspell-Lucene integration, http://code.google.com/p/lucene-hunspell/
[4] Presentation by Chris Male, EuroCon 2010,
    http://lucene-eurocon.org/slides/European-Language-Analysis-with-Hunspell_Chris-Male.pdf
    (annotation of his talk can be found here:
    http://lucene-eurocon.org/sessions-track2-day2.html#5)

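,lucene 4

"Did you mean" spellchecking · Issue #911 · elastic/elasticsearch · GitHub

or with a phonetic analyzer

ruby on rails - Elastic Search - implement "Did you Mean" - Stack Overflow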

is that what you were after?

Clinton_Gormley · October 17, 2011, 3:07pm

On Sat, 2011-10-15 at 08:20 -0700, Nick Hoffman wrote:

Hey guys. I've been using ES for 1-2 weeks now, and love it. Being
very new to it, though, I've been piecing together bits of knowledge
as I go along. I have a semi-working solution to a problem, but I'm
sure that it's nowhere close to an ideal solution. How would you
approach this problem?

Hi Nick

The reason your misspellings work is because you are using ngrams for
both your search and index analyzers.

This may, however, give your users weird results, eg the user searches
for "slave" and gets a result for "lavatory" instead.
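
To see why, here is a rough Python sketch of what a 3-to-8 character ngram filter produces. It is a simplified model that only enumerates substrings (the real analysis chain also involves a tokenizer and other filters), and the helper name `ngrams` is made up for this example.

```python
def ngrams(word, min_gram=3, max_gram=8):
    """Return every substring of `word` whose length is between min_gram and max_gram."""
    word = word.lower()
    return {
        word[start:start + size]
        for size in range(min_gram, max_gram + 1)
        for start in range(len(word) - size + 1)
    }


# With ngrams on both the index and search side, two words match
# as soon as they share a single gram.
shared = ngrams("slave") & ngrams("lavatory")
print(sorted(shared))  # ['lav']
```

One shared gram ("lav") is enough for the two words to match each other.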

I would consider making a few changes:

1) use edge ngrams rather than ngrams ie s,sl,sla,slav,slave
2) use the edge ngram analyzers only as your search_analyzer
3) for your misspellings, if you get no results, then retry
   the query using some fuzziness:
   http://www.elasticsearch.org/guide/reference/query-dsl/text-query.html

clint

Nick_Hoffman · October 17, 2011, 10:00pm

On Monday, 17 October 2011 10:56:54 UTC-4, Karussell wrote:

spell check can be done via a dictionary

Analysis: Integration with Hunspell · Issue #646 · elastic/elasticsearch · GitHub

,lucene 4

"Did you mean" spellchecking · Issue #911 · elastic/elasticsearch · GitHub

or with a phonetic analyzer

ruby on rails - Elastic Search - implement "Did you Mean" - Stack Overflow

is that what you were after?

Thanks for the suggestions, mate. Unfortunately, I can't do spell checking
with a dictionary because many of the words are unique names. Eg:
Optimus Prime
Megatron
BE@RBRICK
etc

I was going to try a phonetic analyzer, but a lot of the names in my data
are pronounced strangely, and thus wouldn't match.

Nick_Hoffman · October 17, 2011, 10:07pm

On Monday, 17 October 2011 11:07:33 UTC-4, Clinton Gormley wrote:

The reason your misspellings work is because you are using ngrams for
both your search and index analyzers.

Yeah, I figured as much.

This may, however, give your users weird results, eg the user searches
for "slave" and gets a result for "lavatory" instead.

I would consider making a few changes:

1) use edge ngrams rather than ngrams ie s,sl,sla,slav,slave

Interesting. Why do you recommend that? I understand that it prevents the
slave/lavatory example, which is great. However, it prevents mid-word
matches. But then again, maybe that's a good thing...

2) use the edge ngram analyzers only as your search_analyzer

So don't use them in any of the index analyzers?

3) for your misspellings, if you get no results, then retry
   the query using some fuzziness:
   http://www.elasticsearch.org/guide/reference/query-dsl/text-query.html

A text query with the "fuzziness" option, or a fuzzy query[1]?

Thanks for your advice, Clint, and also for that example nGram gist. Very
helpful!

[1] Elasticsearch Platform — Find real-time answers at scale | Elastic

Clinton_Gormley · October 18, 2011, 7:27am

Hiya

    This may, however, give your users weird results, eg the user
    searches for "slave" and gets a result for "lavatory" instead.

    I would consider making a few changes:
    1) use edge ngrams rather than ngrams ie s,sl,sla,slav,slave

Interesting. Why do you recommend that? I understand that it prevents
the slave/lavatory example, which is great. However, it prevents
mid-word matches. But then again, maybe that's a good thing...

Yes exactly. Full ngrams are useful for some purposes, eg matching
words in a URL, but in general, people start typing at one end of a word
and expect the search results to reflect that.

    2) use the edge ngram analyzers only as your search_analyzer

So don't use them in any of the index analyzers?

Apologies - I meant the other way around. Use them in your index
analyzers, but use your ascii_std analyzer for search analyzers.
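
As a rough illustration of that setup (edge ngrams at index time, whole words at search time), here is a small Python model. It is only a sketch: the `min_gram` of 1 follows the "s,sl,sla,slav,slave" example, plain lowercasing stands in for the ascii_std analyzer (whose definition is not shown in this thread), and all function names are made up.

```python
def edge_ngrams(word, min_gram=1, max_gram=8):
    """Return the prefixes of `word`, e.g. s, sl, sla, slav, slave."""
    word = word.lower()
    longest = min(len(word), max_gram)
    return [word[:size] for size in range(min_gram, longest + 1)]


def index_terms(text):
    """Index-time analysis: split on whitespace, then edge-ngram each word."""
    terms = set()
    for word in text.split():
        terms.update(edge_ngrams(word))
    return terms


def matches(query, indexed_text):
    """Search-time analysis: lowercased whole words only, no ngrams."""
    terms = index_terms(indexed_text)
    return any(word.lower() in terms for word in query.split())


print(edge_ngrams("slave"))               # ['s', 'sl', 'sla', 'slav', 'slave']
print(matches("trans", "Transformers"))   # True: "trans" is a prefix
print(matches("slave", "lavatory"))       # False: no shared prefix
print(matches("former", "Transformers"))  # False: mid-word matches are gone
```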

    3) for your misspellings, if you get no results, then retry
       the query using some fuzziness:  
    
    http://www.elasticsearch.org/guide/reference/query-dsl/text-query.html

A text query with the "fuzziness" option, or a fuzzy query[1]?

text with fuzziness. A fuzzy query is actually a term query - the
search terms are not analyzed. However, a text query with fuzziness
gives you the analysis plus the fuzzy behaviour.
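
The retry idea could be sketched in Python roughly as follows. The request body follows the text query format from the page linked above as I understand it; the fuzziness value of 0.5 is an arbitrary assumption, and `run_search` is a hypothetical callable that sends the body to ES and returns the list of hits.

```python
def text_query(field, text, fuzziness=None):
    """Build a search body for a text query, optionally with fuzziness."""
    options = {"query": text}
    if fuzziness is not None:
        options["fuzziness"] = fuzziness
    return {"query": {"text": {field: options}}}


def search_with_fallback(run_search, field, text, fuzziness=0.5):
    """Run a plain text query first; retry with fuzziness only if nothing matched."""
    hits = run_search(text_query(field, text))
    if hits:
        return hits
    return run_search(text_query(field, text, fuzziness=fuzziness))


# Stand-in for a real search call: it only "finds" something for fuzzy queries.
def fake_search(body):
    options = body["query"]["text"]["name"]
    return ["Transformers"] if "fuzziness" in options else []


print(search_with_fallback(fake_search, "name", "trasformers"))  # ['Transformers']
```

Retrying only on an empty result keeps the common case cheap and avoids mixing fuzzy matches into queries that already found something.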

Thanks for your advice, Clint, and also for that example nGram gist.
Very helpful!

glad to hear it

clint

Nick_Hoffman · October 19, 2011, 3:48am

Apologies - I meant the other way around. Use them in your index
analyzers, but use your ascii_std analyzer for search analyzers.

Thanks, Clint. That's working a lot better. However, I've noticed that I
can't combine certain fields in the same text query. It seems to be related
to nested fields. Any idea why that might be?

For example, ES accepts this:

curl -X GET -s \
  "http://localhost:9200/development_products/product/_search?pretty=true" \
  -d '{ query: { text: { "items.name": "optimus", "catalog.name": "optimus" } } }'

But adding the "name" field to the beginning or end:

curl -X GET -s \
  "http://localhost:9200/development_products/product/_search?pretty=true" \
  -d '{ query: { text: { "items.name": "optimus", "catalog.name": "optimus", "name": "optimus" } } }'

generates an error:

{
"error" : "SearchPhaseExecutionException[Failed to execute phase [query],
total failure; shardFailures
{[V6gkYzvcSg6Gx-Ad9-hbOg][development_products][0]:
SearchParseException[[development_products][0]:
query[items.name:optimus],from[-1],size[-1]: Parse Failure [Failed to parse
source [{ query: { text: { "items.name": "optimus", "catalog.name":
"optimus", "name": "optimus" } } }]]]; nested:
SearchParseException[[development_products][0]:
query[items.name:optimus],from[-1],size[-1]: Parse Failure [No parser for
element [name]]]; }{[V6gkYzvcSg6Gx-Ad9-hbOg][development_products][2]:
SearchParseException[[development_products][2]:
query[items.name:optimus],from[-1],size[-1]: Parse Failure [Failed to parse
source [{ query: { text: { "items.name": "optimus", "catalog.name":
"optimus", "name": "optimus" } } }]]]; nested:
SearchParseException[[development_products][2]:
query[items.name:optimus],from[-1],size[-1]: Parse Failure [No parser for
element [name]]]; }{[V6gkYzvcSg6Gx-Ad9-hbOg][development_products][1]:
SearchParseException[[development_products][1]:
query[items.name:optimus],from[-1],size[-1]: Parse Failure [Failed to parse
source [{ query: { text: { "items.name": "optimus", "catalog.name":
"optimus", "name": "optimus" } } }]]]; nested:
SearchParseException[[development_products][1]:
query[items.name:optimus],from[-1],size[-1]: Parse Failure [No parser for
element [name]]]; }{[V6gkYzvcSg6Gx-Ad9-hbOg][development_products][3]:
SearchParseException[[development_products][3]:
query[items.name:optimus],from[-1],size[-1]: Parse Failure [Failed to parse
source [{ query: { text: { "items.name": "optimus", "catalog.name":
"optimus", "name": "optimus" } } }]]]; nested:
SearchParseException[[development_products][3]:
query[items.name:optimus],from[-1],size[-1]: Parse Failure [No parser for
element [name]]]; }{[V6gkYzvcSg6Gx-Ad9-hbOg][development_products][4]:
SearchParseException[[development_products][4]:
query[items.name:optimus],from[-1],size[-1]: Parse Failure [Failed to parse
source [{ query: { text: { "items.name": "optimus", "catalog.name":
"optimus", "name": "optimus" } } }]]]; nested:
SearchParseException[[development_products][4]:
query[items.name:optimus],from[-1],size[-1]: Parse Failure [No parser for
element [name]]]; }]",
"status" : 500
}

Clinton_Gormley · October 19, 2011, 10:04am

Thanks, Clint. That's working a lot better. However, I've noticed that
I can't combine certain fields in the same text query. It seems to be
related to nested fields. Any idea why that might be?

You can't pass multiple field/search_text pairs to a single text query. ES
needs to know how to combine your various queries, so you need to have
each as a separate 'text' query and combine them using either bool or
dis_max.
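
For example, the request body could be built like this in Python. This is just a sketch: the helper name `combine_text_queries` is made up, and the field names are the ones from the failing query.

```python
import json


def combine_text_queries(fields, text, combiner="dis_max"):
    """Build one text query per field and combine them with dis_max or bool."""
    queries = [{"text": {field: text}} for field in fields]
    if combiner == "dis_max":
        # dis_max scores a document by its best-matching sub-query.
        combined = {"dis_max": {"queries": queries}}
    elif combiner == "bool":
        # bool/should combines the scores of every sub-query that matches.
        combined = {"bool": {"should": queries}}
    else:
        raise ValueError("combiner must be 'dis_max' or 'bool'")
    return {"query": combined}


body = combine_text_queries(["items.name", "catalog.name", "name"], "optimus")
print(json.dumps(body, indent=2))
```

The printed JSON can be passed to curl with -d in place of the single text query.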

clint

Nick_Hoffman · October 19, 2011, 3:23pm

That makes sense. Thanks. With dis_max, though, it looks like the
edge-ngrams aren't being used.

For example, there're 3 documents whose "name" field is "Optimus Primal",
and 35 whose "name" field is "Optimus Prime". I figured that a dis_max-text
query for "primal" would match "Optimus Primal" and "Optimus Prime" docs.
Unfortunately, only the 3 "Optimus Primal" docs matched. Why might that be?

curl -X GET -s \
  "http://localhost:9200/development_products/product/_search?pretty=true" -d '
{
  fields: [ "name" ],
  query: {
    dis_max: {
      queries: [
        { text: { "name" : "primal" } }
      ]
    }
  }
}
'

Clinton_Gormley · October 19, 2011, 3:47pm

For example, there're 3 documents whose "name" field is "Optimus
Primal", and 35 whose "name" field is "Optimus Prime". I figured that
a dis_max-text query for "primal" would match "Optimus Primal" and
"Optimus Prime" docs. Unfortunately, only the 3 "Optimus Primal" docs
matched. Why might that be?

Because you are no longer using ngrams on your search analyzer, we're
essentially doing a search for "primal*".

Try the same thing but search for "prim" instead
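
Worked through on the two names from the question, a rough Python model looks like this. It assumes edge ngrams starting at 1 character and capped at 8 (the actual settings are in the gist), and the helper names are made up. "Prime" is indexed as p, pr, pri, prim, prime, and none of those equals the whole search term "primal".

```python
def edge_ngrams(word, max_gram=8):
    """Prefixes of `word` up to max_gram characters, lowercased."""
    word = word.lower()
    return {word[:size] for size in range(1, min(len(word), max_gram) + 1)}


def name_matches(query, name):
    """True if the un-ngrammed query term equals an indexed prefix of any word in `name`."""
    return any(query.lower() in edge_ngrams(word) for word in name.split())


names = ["Optimus Primal", "Optimus Prime"]
for query in ("primal", "prim"):
    print(query, [name for name in names if name_matches(query, name)])
# primal ['Optimus Primal']
# prim ['Optimus Primal', 'Optimus Prime']
```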

clint
