Advice on my approach to this search problem


(Nick Hoffman) #1

Hey guys. I've been using ES for 1-2 weeks now, and love it. Being very new
to it, though, I've been piecing together bits of knowledge as I go along. I
have a semi-working solution to a problem, but I'm sure that it's nowhere
close to an ideal solution. How would you approach this problem?

First, the data: The app has catalogs, products, and items, all of which are
related. When a user performs a search, though, only products need to be
found. For example, if there's a catalog named "Transformers" and the user
searches for "Transformers", all products in that catalog should be
returned. To accomplish this, when indexing products, I'm nesting the
related catalog and item data inside the product data. Eg:
{
  "name": "Optimus Prime",
  "number": "TFG1S1-1",
  "catalog": { "name": "Series 1", "number": "TFG1S1" },
  "items": [ { "name": "Optimus Prime" }, { "name": "Instructions" } ]
}

When searching, some users will misspell a word (Eg: "Trasformers", missing
the "n"), and some will provide a partial word (Eg: "Trans"). Despite this,
users expect to receive search results. Eg: Products with a field that
contains "Transformer" or "retransmit" could match.

My solution right now is this:

Do you have any suggestions for improvements? Thanks for your help!
Nick


(Nick Hoffman) #2

Bump!


(Karussell) #3

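Spell check can be done via a dictionary
(https://github.com/elasticsearch/elasticsearch/issues/646), Lucene 4
(https://github.com/elasticsearch/elasticsearch/issues/911), or with a
phonetic analyzer
(http://stackoverflow.com/questions/6936256/elastic-search-implement-did-you-mean).

Is that what you were after?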

On 17 Oct., 16:34, Nick Hoffman n...@deadorange.com wrote:

Bump!


(Clinton Gormley) #4

On Sat, 2011-10-15 at 08:20 -0700, Nick Hoffman wrote:

Hey guys. I've been using ES for 1-2 weeks now, and love it. Being
very new to it, though, I've been piecing together bits of knowledge
as I go along. I have a semi-working solution to a problem, but I'm
sure that it's nowhere close to an ideal solution. How would you
approach this problem?

Hi Nick

The reason your misspellings work is because you are using ngrams for
both your search and index analyzers.

This may, however, give your users weird results, eg the user searches
for "slave" and gets a result for "lavatory" instead.

I would consider making a few changes:

  1. use edge ngrams rather than ngrams ie s,sl,sla,slav,slave
  2. use the edge ngram analyzers only as your search_analyzer
  3. for your misspellings, if you get no results, then retry
    the query using some fuzziness:
    http://www.elasticsearch.org/guide/reference/query-dsl/text-query.html

clint


(Nick Hoffman) #5

On Monday, 17 October 2011 10:56:54 UTC-4, Karussell wrote:

spell check can be done via a dictionary

https://github.com/elasticsearch/elasticsearch/issues/646

,lucene 4

https://github.com/elasticsearch/elasticsearch/issues/911

or with a phonetic analyzer

http://stackoverflow.com/questions/6936256/elastic-search-implement-did-you-mean

is that what you were after?

Thanks for the suggestions, mate. Unfortunately, I can't do spell checking
with a dictionary because many of the words are unique names. Eg:
Optimus Prime
Megatron
BE@RBRICK
etc

I was going to try a phonetic analyzer, but a lot of the names in my data
are pronounced strangely, and thus wouldn't match.


(Nick Hoffman) #6

On Monday, 17 October 2011 11:07:33 UTC-4, Clinton Gormley wrote:

The reason your misspellings work is because you are using ngrams for
both your search and index analyzers.

Yeah, I figured as much.

This may, however, give your users weird results, eg the user searches
for "slave" and gets a result for "lavatory" instead.

I would consider making a few changes:

  1. use edge ngrams rather than ngrams ie s,sl,sla,slav,slave

Interesting. Why do you recommend that? I understand that it prevents the
slave/lavatory example, which is great. However, it prevents mid-word
matches. But then again, maybe that's a good thing...

  2. use the edge ngram analyzers only as your search_analyzer

So don't use them in any of the index analyzers?

  3. for your misspellings, if you get no results, then retry
    the query using some fuzziness:
    http://www.elasticsearch.org/guide/reference/query-dsl/text-query.html

A text query with the "fuzziness" option, or a fuzzy query[1]?

Thanks for your advice, Clint, and also for that example nGram gist. Very
helpful!

[1] http://www.elasticsearch.org/guide/reference/query-dsl/fuzzy-query.html


(Clinton Gormley) #7

Hiya

    This may, however, give your users weird results, eg the user
    searches
    for "slave" and gets a result for "lavatory" instead.
    
    I would consider making a few changes:
    1) use edge ngrams rather than ngrams ie s,sl,sla,slav,slave

Interesting. Why do you recommend that? I understand that it prevents
the slave/lavatory example, which is great. However, it prevents
mid-word matches. But then again, maybe that's a good thing...

Yes exactly. Full ngrams are useful for some purposes, eg matching
words in a URL, but in general, people start typing at one end of a word
and expect the search results to reflect that.
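
To make the slave/lavatory example concrete, here is a small Python sketch.
It is not how the Lucene token filters are implemented; the function names
and the gram sizes are assumptions chosen purely for illustration.

```python
def ngrams(word, min_gram=3, max_gram=5):
    """Return every substring of word with length min_gram..max_gram."""
    grams = set()
    for size in range(min_gram, max_gram + 1):
        for start in range(len(word) - size + 1):
            grams.add(word[start:start + size])
    return grams


def edge_ngrams(word, min_gram=1, max_gram=5):
    """Return only the prefixes of word, e.g. s, sl, sla, slav, slave."""
    longest = min(max_gram, len(word))
    return {word[:size] for size in range(min_gram, longest + 1)}


# Full ngrams on both sides: the two words share the gram "lav".
shared_full = ngrams("slave") & ngrams("lavatory")

# Edge ngrams: the two prefix sets have nothing in common.
shared_edge = edge_ngrams("slave") & edge_ngrams("lavatory")

print(sorted(shared_full))
print(sorted(shared_edge))
```

With full ngrams the shared gram is enough for "slave" to hit "lavatory";
with edge ngrams the two words only overlap if they start the same way.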

    2) use the edge ngram analyzers only as your search_analyzer

So don't use them in any of the index analyzers?

Apologies - I meant the other way around. Use them in your index
analyzers, but use your ascii_std analyzer for search analyzers.
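
As a rough sketch of that setup, the index settings and mapping might look
something like the following, built here as Python dictionaries and printed
as JSON. The definition of ascii_std is not shown in this thread, so the one
below (standard tokenizer, lowercase and asciifolding filters) is an
assumption, as are the names edge_ngram_analyzer and edge_ngram_filter and
the gram sizes. The index_analyzer and search_analyzer mapping keys are the
ones from the ES releases of this era; check the docs for your version.

```python
import json

settings = {
    "analysis": {
        "filter": {
            # Hypothetical filter name; emits prefixes such as p, pr, pri, prim.
            "edge_ngram_filter": {
                "type": "edgeNGram",
                "min_gram": 1,
                "max_gram": 20,
                "side": "front",
            }
        },
        "analyzer": {
            # Assumed definition of ascii_std; the original is not in the thread.
            "ascii_std": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "asciifolding"],
            },
            # Same chain plus the edge ngram filter, used at index time only.
            "edge_ngram_analyzer": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "asciifolding", "edge_ngram_filter"],
            },
        },
    }
}

name_field = {
    "type": "string",
    "index_analyzer": "edge_ngram_analyzer",  # ngrams when indexing
    "search_analyzer": "ascii_std",           # plain tokens when searching
}

mappings = {"product": {"properties": {"name": name_field}}}

print(json.dumps({"settings": settings, "mappings": mappings}, indent=2))
```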

    3) for your misspellings, if you get no results, then retry
       the query using some fuzziness:  
    
    http://www.elasticsearch.org/guide/reference/query-dsl/text-query.html

A text query with the "fuzziness" option, or a fuzzy query[1]?

text with fuzziness. A fuzzy query is actually a term query - the
search terms are not analyzed. However, a text query with fuzziness
gives you the analysis plus the fuzzy behaviour.
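
The "retry with fuzziness if nothing matched" idea could be sketched like
this in Python. The helper names, the run_search callable and the fuzziness
value of 0.5 are all assumptions for illustration; run_search stands in for
whatever client code actually sends the request body to ES.

```python
def text_query(field, text, fuzziness=None):
    """Build a text query body; add the fuzziness option only when requested."""
    options = {"query": text}
    if fuzziness is not None:
        options["fuzziness"] = fuzziness
    return {"query": {"text": {field: options}}}


def search_with_fuzzy_retry(run_search, field, text, fuzziness=0.5):
    """Run a plain text query first; retry with fuzziness if nothing matched.

    run_search is a hypothetical callable that takes a request body and
    returns a list of hits.
    """
    hits = run_search(text_query(field, text))
    if hits:
        return hits
    return run_search(text_query(field, text, fuzziness=fuzziness))


def fake_search(body):
    """Stand-in for a real search: only the fuzzy query finds anything."""
    options = body["query"]["text"]["name"]
    return ["Transformers"] if "fuzziness" in options else []


print(search_with_fuzzy_retry(fake_search, "name", "trasformers"))
```

Keeping the fuzzy query as a fallback means exact matches are never diluted
by fuzzy ones; the cost is a second round trip for misspelled searches.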

Thanks for your advice, Clint, and also for that example nGram gist.
Very helpful!

glad to hear it :)

clint


(Nick Hoffman) #8

Apologies - I meant the other way around. Use them in your index
analyzers, but use your ascii_std analyzer for search analyzers.

Thanks, Clint. That's working a lot better. However, I've noticed that I
can't combine certain fields in the same text query. It seems to be related
to nested fields. Any idea why that might be?

For example, ES accepts this:

curl -X GET -s \
  "http://localhost:9200/development_products/product/_search?pretty=true" \
  -d '{ query: { text: { "items.name": "optimus", "catalog.name": "optimus" } } }'

But adding the "name" field to the beginning or end:

curl -X GET -s \
  "http://localhost:9200/development_products/product/_search?pretty=true" \
  -d '{ query: { text: { "items.name": "optimus", "catalog.name": "optimus", "name": "optimus" } } }'

generates an error:

{
"error" : "SearchPhaseExecutionException[Failed to execute phase [query],
total failure; shardFailures
{[V6gkYzvcSg6Gx-Ad9-hbOg][development_products][0]:
SearchParseException[[development_products][0]:
query[items.name:optimus],from[-1],size[-1]: Parse Failure [Failed to parse
source [{ query: { text: { "items.name": "optimus", "catalog.name":
"optimus", "name": "optimus" } } }]]]; nested:
SearchParseException[[development_products][0]:
query[items.name:optimus],from[-1],size[-1]: Parse Failure [No parser for
element [name]]]; }{[V6gkYzvcSg6Gx-Ad9-hbOg][development_products][2]:
SearchParseException[[development_products][2]:
query[items.name:optimus],from[-1],size[-1]: Parse Failure [Failed to parse
source [{ query: { text: { "items.name": "optimus", "catalog.name":
"optimus", "name": "optimus" } } }]]]; nested:
SearchParseException[[development_products][2]:
query[items.name:optimus],from[-1],size[-1]: Parse Failure [No parser for
element [name]]]; }{[V6gkYzvcSg6Gx-Ad9-hbOg][development_products][1]:
SearchParseException[[development_products][1]:
query[items.name:optimus],from[-1],size[-1]: Parse Failure [Failed to parse
source [{ query: { text: { "items.name": "optimus", "catalog.name":
"optimus", "name": "optimus" } } }]]]; nested:
SearchParseException[[development_products][1]:
query[items.name:optimus],from[-1],size[-1]: Parse Failure [No parser for
element [name]]]; }{[V6gkYzvcSg6Gx-Ad9-hbOg][development_products][3]:
SearchParseException[[development_products][3]:
query[items.name:optimus],from[-1],size[-1]: Parse Failure [Failed to parse
source [{ query: { text: { "items.name": "optimus", "catalog.name":
"optimus", "name": "optimus" } } }]]]; nested:
SearchParseException[[development_products][3]:
query[items.name:optimus],from[-1],size[-1]: Parse Failure [No parser for
element [name]]]; }{[V6gkYzvcSg6Gx-Ad9-hbOg][development_products][4]:
SearchParseException[[development_products][4]:
query[items.name:optimus],from[-1],size[-1]: Parse Failure [Failed to parse
source [{ query: { text: { "items.name": "optimus", "catalog.name":
"optimus", "name": "optimus" } } }]]]; nested:
SearchParseException[[development_products][4]:
query[items.name:optimus],from[-1],size[-1]: Parse Failure [No parser for
element [name]]]; }]",
"status" : 500
}


(Clinton Gormley) #9

Thanks, Clint. That's working a lot better. However, I've noticed that
I can't combine certain fields in the same text query. It seems to be
related to nested fields. Any idea why that might be?

You can't pass multiple field/search_text pairs to a single text query.
ES needs to know how to combine your various queries, so you need to make
each one a separate 'text' query and combine them using either bool or
dis_max.
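
A minimal sketch of that structure, built as a Python dictionary and printed
as JSON. The helper name multi_field_query is made up for illustration; the
field names are the ones from the failing request.

```python
import json


def multi_field_query(fields, text):
    """Wrap one text query per field in a single dis_max query."""
    return {
        "query": {
            "dis_max": {
                "queries": [{"text": {field: text}} for field in fields]
            }
        }
    }


body = multi_field_query(["name", "catalog.name", "items.name"], "optimus")
print(json.dumps(body, indent=2))
```

dis_max scores a document by its best-matching clause; a bool query with the
same text queries as "should" clauses would instead add the scores together.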

clint


(Nick Hoffman) #10

That makes sense. Thanks. With dis_max, though, it looks like the
edge-ngrams aren't being used.

For example, there're 3 documents whose "name" field is "Optimus Primal",
and 35 whose "name" field is "Optimus Prime". I figured that a dis_max-text
query for "primal" would match "Optimus Primal" and "Optimus Prime" docs.
Unfortunately, only the 3 "Optimus Primal" docs matched. Why might that be?

curl -X GET -s \
  "http://localhost:9200/development_products/product/_search?pretty=true" \
  -d '
{
  fields: [ "name" ],
  query: {
    dis_max: {
      queries: [
        { text: { "name" : "primal" } }
      ]
    }
  }
}
'


(Clinton Gormley) #11

For example, there're 3 documents whose "name" field is "Optimus
Primal", and 35 whose "name" field is "Optimus Prime". I figured that
a dis_max-text query for "primal" would match "Optimus Primal" and
"Optimus Prime" docs. Unfortunately, only the 3 "Optimus Primal" docs
matched. Why might that be?

Because you are no longer using ngrams in your search analyzer, we're
essentially doing a search for "primal*", and "prime" doesn't start with
"primal".

Try the same thing but search for "prim" instead
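
Here is a toy Python model of why that happens. It only imitates the
behaviour (lowercase, split on whitespace, edge ngrams at index time and
plain tokens at search time); the function names and gram sizes are
assumptions, not the real analysis chain.

```python
def index_terms(text, min_gram=1, max_gram=20):
    """Index-time analysis: lowercase, split on whitespace, emit edge ngrams."""
    terms = set()
    for token in text.lower().split():
        for size in range(min_gram, min(max_gram, len(token)) + 1):
            terms.add(token[:size])
    return terms


def search_terms(text):
    """Search-time analysis: lowercase and split only, no ngrams."""
    return set(text.lower().split())


def matches(doc_text, query_text):
    """A document matches if any search term is among its indexed terms."""
    return bool(search_terms(query_text) & index_terms(doc_text))


docs = ["Optimus Primal", "Optimus Prime"]

primal_hits = [doc for doc in docs if matches(doc, "primal")]
prim_hits = [doc for doc in docs if matches(doc, "prim")]

print(primal_hits)
print(prim_hits)
```

"primal" is only among the indexed prefixes of "Primal", while "prim" is a
prefix of both "Primal" and "Prime", so the shorter search finds both names.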

clint

