Synonym Problem


(RobBob) #1

Hello,

I am running into an issue with a synonym pair I have set up. I have the synonyms "bar,pub", and when returning results from about 3000 different categories, one of the two terms will not appear even when I search for it directly.

Below are example results for each query:

And here is my current configuration. Any suggestions on what I could do to make sure that bar and pub each appear in the first 10 results when queried for?

curl -XPUT http://localhost:9200/categories -d '{
"settings": {
  "analysis": {
     "filter": {
        "edge_ngram_filter": {
           "type": "edge_ngram",
           "min_gram": 1,
           "max_gram": 20
        },
        "category_synonym_filter": {
           "type": "synonym",
           "synonyms": ["bike,bicycle", "bar,pub", "shop,store", "burger,hamburger", "bbq,barbecue", "isp,internet service provider", "exterminator,pest control service", "adult entertainment club,strip club"]
        }
     },
     "analyzer": {
        "edge_ngram_analyzer": {
           "type": "custom",
           "tokenizer": "standard",
           "filter": [
              "lowercase",
              "asciifolding",
              "edge_ngram_filter"
           ]
        },
        "search_analyzer": {
           "type": "custom",
           "tokenizer": "standard",
           "filter": [
              "lowercase",
              "asciifolding",
              "category_synonym_filter"
           ]
        }
     }
  }
},
"mappings": {
  "category": {
     "properties": {
        "category_description": {
           "type": "string",
           "analyzer": "edge_ngram_analyzer",
           "search_analyzer": "search_analyzer"
        },
        "type": {
           "type": "string",
           "index": "not_analyzed"
        }
     }
  }
}
}'

Thank you for any help in advance!


(RobBob) #2

Is there any more information I can provide that would help someone guide me in the right direction? Thanks!


(Mark Harwood) #3

A user's search input does not have to be modelled as a single query clause.
Often it's beneficial to try several different interpretations of their input in a single search request, using an array of queries in the should clause of a containing bool query. The more clauses in the should array that match, the better the score.
You could try (in descending order of importance):

  1. An exact-match phrase query on full-words
  2. An exact match query on full words
  3. A partial match query using n-grams.

Currently you are only doing 3). If you also index minus the n-grams, you can do 2) as well and give an extra boost to a match on that query using the boost parameter. If someone searches for "irish bar", then 1) would help rank matches better too.


(RobBob) #4

Hello! Thank you for the reply :)

Okay, I believe I see what you are saying about using several should clauses instead of one must.

Right now my current query looks like this:

{
  "size": 100,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "category_description": {
              "fuzziness": "AUTO",
              "query": "bar",
              "operator": "and"
            }
          }
        },
        { "term": { "type": "BUSINESS" } }
      ]
    }
  },
  "from": 0
}

So instead of the one must I should have several shoulds. Something like this?

{
  "size": 100,
  "query": {
    "bool": {
      "should": [
        {
          "match_phrase": { "category_description": "bar" },
          "term": { "category_description": "bar" },
          "match": {
            "category_description": {
              "fuzziness": "AUTO",
              "query": "bar",
              "operator": "and"
            }
          }
        },
        { "term": { "type": "BUSINESS" } }
      ]
    }
  },
  "from": 0
}

What do you mean when you say, "also index minus n-grams"?

Thank you again for the reply!


(RobBob) #5

So my latest query looks like this:

{
  "size": 100,
  "query": {
    "bool": {
      "should": [
        { "match_phrase": { "category_description": "bar" } },
        { "term": { "category_description": "bar" } },
        { "match": { "category_description": { "fuzziness": "AUTO", "query": "bar", "operator": "and" } } }
      ],
      "must": [{ "term": { "type": "BUSINESS" } }]
    }
  },
  "from": 0
}

And it has definitely solved my problem. I am still testing the other queries to make sure they are behaving properly, but it sure looks like it! If you have a minute to look over my changes and confirm I understood you, that would be fantastic. And as I mentioned before, I wasn't quite sure what you meant by minus the n-grams.

Thanks again!


(Mark Harwood) #6

You can index the one source field in multiple different ways, e.g. with an edge n-gram based analyzer and also with a standard analyzer. They end up as two differently named fields in the search index. See https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html
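
As a rough illustration of that idea, the sketch below builds a mapping body in which category_description keeps its edge n-gram analysis and gains a second, full-word sub-field. The sub-field name "full" and the helper name build_mapping are assumptions made for this example, not something from the original configuration. The "string" type matches the mapping posted earlier in the thread; newer Elasticsearch versions use "text" instead.

```python
import json


def build_mapping():
    """Return a mapping body that indexes category_description two ways."""
    return {
        "mappings": {
            "category": {
                "properties": {
                    "category_description": {
                        "type": "string",
                        # Existing analysis from the thread: edge n-grams at index time.
                        "analyzer": "edge_ngram_analyzer",
                        "search_analyzer": "search_analyzer",
                        "fields": {
                            # Hypothetical sub-field, addressed in queries as
                            # "category_description.full"; indexes whole words only.
                            "full": {
                                "type": "string",
                                "analyzer": "standard",
                            }
                        },
                    }
                }
            }
        }
    }


if __name__ == "__main__":
    # Print the JSON that would go in the body of the index-creation request.
    print(json.dumps(build_mapping(), indent=2))
```

The settings block (filters and analyzers) from the original request would still need to be sent alongside this mapping, and the index has to be rebuilt for the new sub-field to be populated.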


(RobBob) #7

Ahh okay, I see. And you were suggesting I use the standard-analyzed field for 1) and 2)?


(Mark Harwood) #8

That would make sense. I was illustrating a general pattern of using a range of matching methods (exact through to partial) where each can be given different levels of boost.
Ultimately it's up to you how much disk space/CPU/disk-seeks you want to throw at the matching problem with all these different approaches.
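
A minimal sketch of that pattern, under stated assumptions: the full-word field is taken to be a sub-field called category_description.full (a hypothetical name), and the boost values 3 and 2 are placeholders to be tuned rather than recommendations. The helper name build_query is also made up for this example. minimum_should_match is set so that a document has to match at least one of the three interpretations, not just the type filter.

```python
import json


def build_query(text,
                ngram_field="category_description",
                full_field="category_description.full"):
    """Build a bool query that tries exact, full-word and partial matching."""
    return {
        "size": 100,
        "from": 0,
        "query": {
            "bool": {
                # Hard requirement, same as in the queries posted above.
                "must": [{"term": {"type": "BUSINESS"}}],
                "should": [
                    # 1) Exact phrase on full words (illustrative boost).
                    {"match_phrase": {full_field: {"query": text, "boost": 3}}},
                    # 2) All full words present, any order (illustrative boost).
                    {"match": {full_field: {"query": text,
                                            "operator": "and",
                                            "boost": 2}}},
                    # 3) Partial match against the edge n-gram field.
                    {"match": {ngram_field: {"query": text,
                                             "operator": "and",
                                             "fuzziness": "AUTO"}}},
                ],
                # Require at least one should clause to match.
                "minimum_should_match": 1,
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_query("irish bar"), indent=2))
```

Documents that satisfy more of the should clauses accumulate a higher score, so an exact "irish bar" category would be expected to rank above one that only matches on n-grams.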


(RobBob) #9

Okay, thank you for all your help. You have definitely enlightened me to the options I have, though they seem more obvious now. I don't know how I missed this line of thought, but I appreciate your help! Thank you :)

