Unexpected Behavior of OR Match Query With Synonym Graph

I don't know whether the following behavior is intended or not. Consider this index definition and these documents:

PUT /test
{
  "settings": {
    "analysis": {
      "filter": {
          "syn": {
            "synonyms": ["ysl, yves saint laurent"],
            "type": "synonym_graph"
          }
      },
      "analyzer": {
        "index": {
          "type": "custom", 
          "tokenizer": "standard",
          "filter": [ "lowercase", "asciifolding", "trim"]
        },
        "query": {
          "type": "custom", 
          "tokenizer": "standard",
          "filter": [ "lowercase", "asciifolding", "trim", "syn"]
        }
      }
    }
  }, 
  "mappings": {
    "properties": {
      "field1": { "type": "text", "analyzer": "index", "search_analyzer": "query" }
    }
  }
}

POST test/_doc/1
{
  "field1": "saint nicolas"
}

POST test/_doc/2
{
  "field1": "new ysl bag"
}

POST test/_doc/3
{
  "field1": "new yves saint laurent shoes"
}

When I run the query

GET test/_search
{
  "query": {
    "match": {
      "field1": {
        "query": "yves saint laurent",
        "operator": "or"
      }
    }
  }
}

I get back documents 2 and 3, but not document 1. Why not? Since I specified the OR operator, shouldn't the presence of the token saint alone be enough to return document 1?

If I run the query

GET test/_search
{
  "query": {
    "match": {
      "field1": {
        "query": "yves saint",
        "operator": "or"
      }
    }
  }
}

I do get back documents 1 and 3, which is expected.

Just for info, when I run

GET test/_analyze
{
  "analyzer": "query",
  "text": "yves saint laurent"
}

I get back

{
  "tokens" : [
    {
      "token" : "ysl",
      "start_offset" : 0,
      "end_offset" : 18,
      "type" : "SYNONYM",
      "position" : 0,
      "positionLength" : 3
    },
    {
      "token" : "yves",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "saint",
      "start_offset" : 5,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "laurent",
      "start_offset" : 11,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 2
    }
  ]
}

So I can see that the token saint is present, which makes me unsure whether the behavior described above is expected or not.

I am using Elasticsearch 7.10.

I reproduced the behavior with 8.10.2, which you should use BTW as it has a brand new API for synonyms management :wink:
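For reference, with the synonyms API introduced in 8.10 you can manage the synonym set as a standalone resource instead of inlining it in the index settings. A rough sketch (the set name my-synonyms-set and rule id ysl-rule are placeholders):

PUT _synonyms/my-synonyms-set
{
  "synonyms_set": [
    {
      "id": "ysl-rule",
      "synonyms": "ysl, yves saint laurent"
    }
  ]
}

The token filter would then reference it with "synonyms_set": "my-synonyms-set" (plus "updateable": true) instead of the inline "synonyms" array.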

It looks like a bug to me, but let me check that a bit more... I'll get back to you.

Thanks for the detailed report!

It appears that this gives a clue:

GET test/_validate/query?rewrite=true
{
  "query": {
    "match": {
      "field1": {
        "query": "yves saint laurent"
      }
    }
  }
}

which shows how the query is actually rewritten:

{
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "valid": true,
  "explanations": [
    {
      "index": "test",
      "valid": true,
      "explanation": "field1:ysl field1:\"yves saint laurent\""
    }
  ]
}

Because yves saint laurent exactly matches a synonym, that part of the query is rewritten as a phrase query.
If you try with saint laurent, it will work as you were expecting.
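You can confirm this with the same _validate call on saint laurent; since it is not a complete synonym, it should be rewritten into plain term queries (something like field1:saint field1:laurent) rather than a phrase:

GET test/_validate/query?rewrite=true
{
  "query": {
    "match": {
      "field1": {
        "query": "saint laurent"
      }
    }
  }
}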

If you change the query to this:

GET test/_validate/query?rewrite=true
{
  "query": {
    "match": {
      "field1": {
        "query": "yves saint laurent saint"
      }
    }
  }
}

This will produce:

{
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "valid": true,
  "explanations": [
    {
      "index": "test",
      "valid": true,
      "explanation": "field1:ysl field1:\"yves saint laurent\" field1:saint"
    }
  ]
}

I'd say that's a "side effect" of the way the synonym_graph filter works behind the scenes. But I'm not sure there's anything to do to "fix" that, as it seems to be expected...

And my colleague @jpountz just told me about auto_generate_synonyms_phrase_query.
Could you try with this option set to false?
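For a match query, the option that controls this behavior is auto_generate_synonyms_phrase_query (default true). A sketch of the original query with it disabled:

GET test/_search
{
  "query": {
    "match": {
      "field1": {
        "query": "yves saint laurent",
        "operator": "or",
        "auto_generate_synonyms_phrase_query": false
      }
    }
  }
}

With it set to false, multi-term synonym expansions should be built from individual term queries instead of a phrase query, which may let the OR operator apply to the terms again; worth verifying the rewrite with _validate/query?rewrite=true.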
