Getting a result that I don't want: how can I check why?

Hi,

I am using Elasticsearch 6.1.3 and have the following situation.
Let me start by sharing the query I am using.

POST develop/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "omega 7",
            "fields": ["merk.naam^2", "categorie.naam^14", "naam^70", "did_you_mean"],
            "operator": "and"
          }
        }
      ]
    }
  },
  "track_scores": true,
  "sort": [
    {
      "type.keyword": {
        "order": "asc"
      }
    },
    {
      "populariteitscijfer": {
        "order": "desc"
      }
    }
  ]
}

With this query I get the results I want, with the expected scores (the omega 3 category does not appear).
Seen in: http://drops.3ws.nl/qstszh

But when I change "omega 7" to "omega 3", the omega 7 category is returned as well.
Seen in: http://drops.3ws.nl/NWHLl5

I cannot figure out why this is happening, since the reverse query ("omega 7") does not return the omega 3 category.
When I run the analyzer like this:

GET develop/_analyze
{
  "analyzer": "didYouMean",
  "text": ["omega 3"]
}

This is the result for it:

{
  "tokens": [
    {
      "token": "omega",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "visolie",
      "start_offset": 0,
      "end_offset": 7,
      "type": "SYNONYM",
      "position": 0,
      "positionLength": 2
    },
    {
      "token": "3",
      "start_offset": 6,
      "end_offset": 7,
      "type": "<NUM>",
      "position": 1
    }
  ]
}

I am scratching my head over this. I hope someone can help out.

Greetings,
Peter

Edit:

These are the contents of the did_you_mean field for the two hits returned by the query:

  {
    "_index": "development-2018032812",
    "_type": "doc",
    "_id": "categorie-485",
    "_score": 4.3321357,
    "fields": {
      "did_you_mean": [
        "Omega-7"
      ]
    },
    "sort": [
      "categorie",
      -9223372036854776000
    ]
  },
  {
    "_index": "development-2018032812",
    "_type": "doc",
    "_id": "categorie-1",
    "_score": 632.3207,
    "fields": {
      "did_you_mean": [
        "Omega-3"
      ]
    },
    "sort": [
      "categorie",
      -9223372036854776000
    ]
  }

You can add "explain" : true to a query, to see how the score for each hit was calculated. That should help you figure out why a certain document is a hit.
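
As a minimal sketch, this is the multi_match query from the first post with explain enabled. The sort and track_scores parts are left out here only to keep the request short; that trimming is an assumption about what matters for the explanation, not part of the advice above.

POST develop/_search
{
  "explain": true,
  "query": {
    "multi_match": {
      "query": "omega 3",
      "fields": ["merk.naam^2", "categorie.naam^14", "naam^70", "did_you_mean"],
      "operator": "and"
    }
  }
}

Each hit then carries an _explanation object that lists the field and term that contributed to its score.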

If that doesn't help, could you post the full documents with the IDs categorie-485 and categorie-1, as well as your index settings and mappings (the output of GET develop), and I'll gladly take a look.

Hello Abdon,

Thank you for helping me out.
It was too many characters, so I created a gist for it:

Thanks in advance, greetings

Thanks for posting the additional information.

You're getting back document categorie-485 as a hit for the query omega 3 because of the synonyms you have set up, specifically these two synonym definitions: "visolie, omega 3" and "visolie, omega-3". These synonyms are applied at index time as well as at query time.

At index time, the did_you_mean field of document categorie-485 will contain the value of, for example, the naam field (through copy_to). The naam field contains the value Omega-7, which gets tokenized into:

GET develop/_analyze
{
  "analyzer": "didYouMean",
  "text": ["Omega-7"]
}

{
  "tokens": [
    {
      "token": "omega",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "visolie",
      "start_offset": 0,
      "end_offset": 7,
      "type": "SYNONYM",
      "position": 0,
      "positionLength": 2
    },
    {
      "token": "7",
      "start_offset": 6,
      "end_offset": 7,
      "type": "<NUM>",
      "position": 1
    }
  ]
}

All of these terms will end up in the inverted index, including visolie (as a synonym of omega).

At query time, the omega 3 query on the did_you_mean field will also query for the same visolie synonym, as you can see from the output of _analyze that you posted. That's why this document is a match: the query-time synonym visolie matches the term visolie that was added to the index for this document by the index-time synonyms.
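
One way to check this, as a sketch: a term query is not analyzed, so it looks up the exact term visolie in the inverted index of did_you_mean. Assuming the original mapping is still in place, both categorie-485 and categorie-1 would be expected among the hits.

GET develop/_search
{
  "query": {
    "term": {
      "did_you_mean": "visolie"
    }
  }
}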

Now, you should ask yourself if you really want to apply synonyms both at index time and at query time. Generally, that is not the case. It's double the work and it can lead to unexpected search results as you have experienced.

I'd go with query-time synonyms only. You can achieve this by setting up a search_analyzer that uses synonyms and an analyzer that does not. For the did_you_mean field, the mapping would become:

        "did_you_mean": {
          "type": "text",
          "analyzer": "standard", 
          "search_analyzer": "didYouMean"
        }

With that mapping, document categorie-485 is no longer a hit for the query omega 3. But synonyms still work. A query for visolie will still return document categorie-1.
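
To verify what would be indexed under that mapping, you could run the index-time analyzer on the same text. This is only a sketch of the check; it assumes the built-in standard analyzer is used exactly as in the mapping above.

GET develop/_analyze
{
  "analyzer": "standard",
  "text": ["Omega-7"]
}

The response should contain only the tokens omega and 7, with no visolie token. Note that the analyzer of an existing field cannot be changed in place, so applying this mapping means creating a new index and reindexing the documents into it.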

Synonyms are tricky to set up properly. If you want to read more I can really recommend the excellent book "Relevant Search" by @softwaredoug that covers synonyms in great depth.

Thank you for the helpful reply. I thought the AND operator would also apply here, so that a document containing 7 instead of 3 would not match. Is there a reason it is not applied for this kind of query?

Just bought the book and will read up on it. The solution you gave helped me out.

Yes, there is an AND for omega 3, but it would not make sense to search for all synonyms with an AND. omega AND 3 AND visolie would probably not return any hits. Instead Elasticsearch will search for one of the synonyms. You can think of the query as: (omega AND 3) OR visolie.
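
Written out by hand for the did_you_mean field alone, that logic looks roughly like the bool query below. This is an illustrative sketch, not the exact query Elasticsearch builds internally; the real multi_match also combines the other fields and their boosts.

GET develop/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "bool": {
            "must": [
              { "term": { "did_you_mean": "omega" } },
              { "term": { "did_you_mean": "3" } }
            ]
          }
        },
        { "term": { "did_you_mean": "visolie" } }
      ],
      "minimum_should_match": 1
    }
  }
}

To see the query that is actually generated, GET develop/_validate/query?explain=true with the original query as the body prints its rewritten form.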

Ah, and in this case visolie was also indexed for Omega-7.
Thank you for the clarification.
