Individual Tokens are not searched when a synonym rule is matched

Elasticsearch 7.10.2

Description:
When using the synonym_graph filter, if a synonym rule matches, Elasticsearch does not search for the individual tokens.

I have defined the synonym rule below.

"synonyms": [
                          "Country Federal Police, CFP"
                          ]

When I search for "Country Federal Police", I get documents containing "CFP" or the full phrase "Country Federal Police" (all 3 words together). It does not match documents that contain only some of the tokens "Country", "Federal" and "Police". However, if I remove the synonym rule, documents containing any 1 of the 3 tokens are also returned. I am using the standard tokenizer.

Is this the expected behaviour? I would expect it to also consider the individual tokens of the original search string.

Steps to reproduce:

Mapping

PUT testsynonymgraph
{
  "settings": {
    "analysis": {
      "filter": {
        "search_synonyms": {
          "type": "synonym_graph",
          "synonyms": [
            "Country Federal Police, CFP"
          ]
        }
      },
      "analyzer": {
        "default_search": {
          "filter": ["lowercase", "asciifolding", "search_synonyms", "stop", "kstem"],
          "type": "custom",
          "tokenizer": "standard"
        },
        "default": {
          "filter": ["lowercase", "asciifolding", "stop", "kstem"],
          "type": "custom",
          "tokenizer": "standard"
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "Name": {
        "type": "text"
      }
    }
  }
}

Indexing 4 documents

POST _bulk
{ "index" : { "_index" : "testsynonymgraph", "_id" : "1" } }
{ "FaCS Name" : "Country Federal Police" }
{ "index" : { "_index" : "testsynonymgraph", "_id" : "2" } }
{ "FaCS Name" : "Country Defence Force" }
{ "index" : { "_index" : "testsynonymgraph", "_id" : "3" } }
{ "FaCS Name" : "Country Reserve Police" }
{ "index" : { "_index" : "testsynonymgraph", "_id" : "4" } }
{ "FaCS Name" : "CFP" }

Search:

POST /testsynonymgraph/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "Country Federal Police",
            "fuzziness": "Auto"
          }
        }
      ]
    }
  },
  "size": 10
}

Search Result:

"hits" : [
      {
        "_index" : "testsynonymgraph",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 2.08334,
        "_source" : {
          "FaCS Name" : "Country Federal Police"
        }
      },
      {
        "_index" : "testsynonymgraph",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 1.5956267,
        "_source" : {
          "FaCS Name" : "CFP"
        }
      }
    ]

Problems:
The other documents, "Country Defence Force" and "Country Reserve Police", should also be returned in the result.

I'd consider that to be desirable behaviour.
The synonym rule defines that multiple words collectively mean a single thing. For example, "eye liner" and "eyeliner" both refer to the thing Robert Smith puts on his face, and neither should match a cruise liner. There are many of these compound words, and the multi-word version is treated as a single entity.
If you still want the partial-matching behaviour, use a sub-field with a different analyzer and query both fields (the one with synonyms and the one without).
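
For anyone who wants to try the sub-field suggestion, here is a minimal sketch in Python that only builds the request bodies (it does not talk to a cluster). The analyzer name "no_synonyms" and the sub-field name "plain" are made up for illustration. It assumes the index is recreated with this body and the documents are indexed again. The sub-field sets search_analyzer explicitly, on the assumption that the index-level default_search analyzer would otherwise still be used for it at search time.

```python
import json

FIELD = "FaCS Name"  # field name taken from the bulk request in this thread


def build_index_body():
    """Settings and mappings: synonyms on the main field, none on the sub-field."""
    plain_filters = ["lowercase", "asciifolding", "stop", "kstem"]
    synonym_filters = ["lowercase", "asciifolding", "search_synonyms", "stop", "kstem"]
    return {
        "settings": {
            "analysis": {
                "filter": {
                    "search_synonyms": {
                        "type": "synonym_graph",
                        "synonyms": ["Country Federal Police, CFP"],
                    }
                },
                "analyzer": {
                    "default": {"type": "custom", "tokenizer": "standard",
                                "filter": plain_filters},
                    "default_search": {"type": "custom", "tokenizer": "standard",
                                       "filter": synonym_filters},
                    # Hypothetical analyzer name: same chain as "default", no synonyms.
                    "no_synonyms": {"type": "custom", "tokenizer": "standard",
                                    "filter": plain_filters},
                },
            }
        },
        "mappings": {
            "properties": {
                FIELD: {
                    "type": "text",
                    "fields": {
                        # Hypothetical sub-field name; searched without synonym expansion.
                        "plain": {
                            "type": "text",
                            "analyzer": "no_synonyms",
                            "search_analyzer": "no_synonyms",
                        }
                    },
                }
            }
        },
    }


def build_query(text):
    """multi_match over the synonym field and the synonym-free sub-field."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                "fields": [FIELD, FIELD + ".plain"],
                "type": "most_fields",  # combine the scores from both fields
                "fuzziness": "AUTO",
            }
        },
        "size": 10,
    }


if __name__ == "__main__":
    print("PUT testsynonymgraph")
    print(json.dumps(build_index_body(), indent=2))
    print("POST /testsynonymgraph/_search")
    print(json.dumps(build_query("Country Federal Police"), indent=2))
```

The printed bodies can be pasted into Kibana Dev Tools. The idea is that the main field keeps the "whole phrase or CFP" behaviour, while the sub-field matches the individual tokens, so documents such as "Country Reserve Police" should come back with a lower score.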

Thanks for your response Mark. I get the point.

I added another document, shown below.

{ "index" : { "_index" : "testsynonymgraph", "_id" : "6" } }
{ "FaCS Name" : "DFP" }

With the query below, fuzziness is applied to single-word synonyms.

POST /testsynonymgraph/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "Country Federal Police",
            "fuzziness": "Auto"
          }
        }
      ]
    }
  },
  "size": 10
}

Query result: DFP is returned (I assume fuzziness is applied to the synonym cfp). I thought fuzziness was not applied when a synonym rule matched.

This is also discussed in GitHub issue #25518 on elastic/elasticsearch, "Synonyms break fuzziness".

"hits" : [
      {
        "_index" : "testsynonymgraph",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 2.9233453,
        "_source" : {
          "FaCS Name" : "CF Police"
        },
        "highlight" : {
          "FaCS Name" : [
            "<em>CF</em> <em>Police</em>"
          ]
        }
      },
      {
        "_index" : "testsynonymgraph",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 2.395439,
        "_source" : {
          "FaCS Name" : "Country Federal Police"
        },
        "highlight" : {
          "FaCS Name" : [
            "<em>Country</em> <em>Federal</em> <em>Police</em>"
          ],
          "FaCS Name.keyword" : [
            "<em>Country Federal Police</em>"
          ]
        }
      },
      {
        "_index" : "testsynonymgraph",
        "_type" : "_doc",
        "_id" : "6",
        "_score" : 1.3402741,
        "_source" : {
          "FaCS Name" : "DFP"
        },
        "highlight" : {
          "FaCS Name" : [
            "<em>DFP</em>"
          ]
        }
      }
    ]

The docs are not fully clear on this.
If you add this doc to your index you should see the discrepancy:

{ "index" : { "_index" : "testsynonymgraph", "_id" : "7" } }
{ "Name" : "County Federal Police" }

Note the use of "County" instead of "Country".
This document doesn't fuzzy-match your query, whereas DFP does. That's because multi-word synonyms (Country+Federal+Police) are run as phrase queries, whereas single-word synonyms (like CFP) can use fuzzy matching.
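
To make that asymmetry concrete, here is a toy model in Python. It is not Elasticsearch or Lucene code, just a sketch of the rule described above: a multi-word synonym path must match as an exact phrase, while a single-word path is matched with AUTO fuzziness (0 edits for 1-2 characters, 1 edit for 3-5, 2 edits for longer terms). It lowercases and splits on whitespace only, ignores stemming and stop words, and uses plain Levenshtein distance, all of which are simplifications.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute or keep
        prev = cur
    return prev[-1]


def auto_fuzziness(term):
    """Edits allowed under fuzziness AUTO for a term of this length."""
    n = len(term)
    return 0 if n <= 2 else 1 if n <= 5 else 2


def contains_phrase(tokens, phrase):
    """True if the phrase tokens appear consecutively and in order."""
    k = len(phrase)
    return any(tokens[i:i + k] == phrase for i in range(len(tokens) - k + 1))


def matches(doc, paths):
    """Toy rule: multi-word paths need an exact phrase, single words may be fuzzy."""
    tokens = doc.lower().split()
    for path in paths:
        if len(path) > 1:
            if contains_phrase(tokens, path):
                return True
        else:
            term = path[0]
            if any(edit_distance(term, t) <= auto_fuzziness(term) for t in tokens):
                return True
    return False


# The two sides of the rule "Country Federal Police, CFP", already lowercased.
PATHS = [["country", "federal", "police"], ["cfp"]]

if __name__ == "__main__":
    for doc in ["Country Federal Police", "County Federal Police",
                "Country Reserve Police", "CFP", "DFP"]:
        print(doc, "->", matches(doc, PATHS))
```

In this model "DFP" matches because it is one edit away from "cfp", while "County Federal Police" does not, because the phrase path gets no fuzziness. To see what your cluster actually builds, the _validate/query API with explain=true can show the generated query.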


Thanks Mark for clarifying this.
