Multi-term synonyms: How can this be used in practice?

I'm evaluating ElasticSearch 7.6 and its handling of multi-term (multi-word) synonyms, and I'm having a lot of trouble figuring out how to make practical use of it. In short:

How can I take a user's query and present it to ElasticSearch in a way that it will expand multi-term synonyms correctly?

Here's what I've done so far:

Based on the documentation's recommendations, I'm using the synonym_graph filter to load my synonym list:

"filter": {
  "graph_synonyms": {
    "type": "synonym_graph",
    "synonyms_path": "my_synonyms.txt",
    "updateable": true
  }
}

I have implemented an analyzer that uses the above filter:

"synonym_analyzer": {
  "tokenizer": "keyword",
  "filter": [
    "lowercase",
    "graph_synonyms"
  ]
}

After some experimentation, I discovered that I needed to use the keyword tokenizer to handle the multi-term synonyms in my synonym list correctly. (The whitespace and standard tokenizers both incorrectly broke up multi-term entries in the list, and thus failed to map those synonyms properly.)

In addition, I have created an analyzer that uses the standard tokenizer to break up the source text:

"analyzer-standard": {
  "type": "custom",
  "tokenizer": "standard",
  "filter": "lowercase"
}

Finally, I set up the field mapping so that the field I want to search uses the custom analyzer above as well as a search_analyzer that applies my synonym filter at search time:

"text_field": {
  "type": "text",
  "analyzer": "analyzer-standard",
  "search_analyzer": "synonym_analyzer"
}

Now, let's go right to an example. I have the following synonym pair in my list:

sri lanka, ceylon

Let's say that a user searches for the following:

Sri Lanka history 1972

How can I give this query to ElasticSearch so that it properly expands "Sri Lanka" to "Ceylon" when it executes the search?

I've tried with a simple_query_string query, as below:

"query": {
  "simple_query_string": {
    "query": "Sri Lanka history 1972",
    "fields": [
      "text_field"
    ],
    "default_operator": "and"
  }
}

But this will only find records with "Sri Lanka", not with "Ceylon".

I found another topic in which a user reported a similar problem:

@abdon offers a helpful suggestion: Use the flags parameter so that the query parser doesn't break the query up by spaces, and will instead allow the synonym filter to find a match.

This solution works just fine, but only if we search for "Sri Lanka" (or any other multi-term synonym) alone, like this:

"query": {
  "simple_query_string": {
    "query": "sri lanka",
    "fields": [
      "text_field"
    ],
    "default_operator": "and",
    "flags": "OR|AND|PREFIX"
  }
}

But that's not how we receive free text queries from our users. They send us queries that mix synonyms with other terms.

Does this mean that we have to pre-parse the query for ElasticSearch, and use our own code to detect if the user happens to search for a multi-term synonym, so that it can be handed to ElasticSearch correctly? Wouldn't that defeat the entire purpose of using ElasticSearch's search-time query parser? If we have to handle all of the logic behind synonyms in our own code, then we don't need ElasticSearch to do the query expansion for us; we could do it ourselves.

I feel that I must be missing something fundamental here, because I can't understand how to put ElasticSearch's support for multi-term synonyms into practice.

Thanks in advance for any help or insight!

My advice would be: don't use the simple_query_string query. Use the match query instead (or the multi_match query if you want to query multiple fields). These queries handle multi-word synonyms much better.

For example:

PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "graph_synonyms": {
          "type": "synonym_graph",
          "synonyms": [
            "sri lanka, ceylon"
          ],
          "updateable": true
        }
      },
      "analyzer": {
        "synonym_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "graph_synonyms"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text_field": {
        "type": "text",
        "analyzer": "standard",
        "search_analyzer": "synonym_analyzer"
      }
    }
  }
}

PUT my_index/_doc/1
{
  "text_field": "Ceylon"
}

# No result
GET my_index/_search
{
  "query": {
    "simple_query_string": {
      "query": "Sri Lanka history 1972",
      "fields": [
        "text_field"
      ]
    }
  }
}

# Finds the document
GET my_index/_search
{
  "query": {
    "multi_match": {
      "query": "Sri Lanka history 1972",
      "fields": [
        "text_field"
      ]
    }
  }
}

Notice I'm using the standard tokenizer here, not the keyword tokenizer.

Thanks so much for the helpful reply.

The multi_match query does indeed seem to do the trick. I didn't realize from the documentation that simple_query_string doesn't support synonym expansion.

How would you suggest that I handle queries that come in using various operators (like AND, grouping, and wildcard prefixes) that are best suited to a simple_query_string query? In other words, how can I get the best of both worlds?

Let's say that a user searched for:

Ceylon hist*

In this case, we want to honor the wildcard operator, but we also want to expand "Ceylon" to match its synonym.

Would it make sense to use a boolean query, like this:

"bool": {
  "should": [
    {
      "multi_match": {
        "query": "Ceylon hist",
        "type": "cross_fields",
        "fields": [
          "text_field",
          "text_field2"
        ],
        "operator": "and"
      }
    },
    {
      "simple_query_string": {
        "query": "Ceylon hist*",
        "fields": [
          "text_field",
          "text_field2"
        ],
        "default_operator": "and"
      }
    }
  ]
}

I suppose I'd have to do some pre-processing of the search terms, such as by removing any search operators, before feeding them into the multi_match portion of the query. (If we leave the wildcard operator after "hist", then it doesn't find anything.)
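
A minimal sketch of that pre-processing step, in Python, might look like the following. It only builds the request body as a dict and does not talk to a cluster. The function names strip_operators and build_hybrid_query are made up for illustration, and the set of characters treated as operators is an assumption that should be adjusted to whatever syntax users are actually allowed to type.

```python
import re


def strip_operators(user_query):
    """Remove simple_query_string operator characters from the raw query text."""
    # Fuzziness / slop suffixes such as "~2" (assumption: we simply drop them).
    text = re.sub(r'~\d*', ' ', user_query)
    # A '-' at the start of a term (negation). Hyphens inside words are kept.
    # Note: this turns a negated term into a plain term, so negation is NOT
    # preserved in the cleaned text.
    text = re.sub(r'(?<!\S)-', ' ', text)
    # Remaining operator characters: + | " * ( )
    text = re.sub(r'[+|"*()]', ' ', text)
    # Collapse the whitespace left behind by the removals.
    return ' '.join(text.split())


def build_hybrid_query(user_query, fields):
    """Build the bool/should body: cleaned text for multi_match, raw text for simple_query_string."""
    return {
        "bool": {
            "should": [
                {
                    "multi_match": {
                        "query": strip_operators(user_query),
                        "type": "cross_fields",
                        "fields": fields,
                        "operator": "and",
                    }
                },
                {
                    "simple_query_string": {
                        "query": user_query,
                        "fields": fields,
                        "default_operator": "and",
                    }
                },
            ]
        }
    }
```

Calling build_hybrid_query("Ceylon hist*", ["text_field", "text_field2"]) gives the multi_match clause the text "Ceylon hist" and leaves the wildcard in place for the simple_query_string clause.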

Or do you perhaps know of a more elegant way to do this?

Thanks again for the help!

The simple_query_string query does support synonym expansion, but there are some limitations with multi-word synonyms, as you have noticed, because of its query parser.

I think your bool query solution is quite elegant. If it does the trick for you, then that would seem like a good solution to me.

Thanks again for the feedback.

I have done a little more testing with the bool query concept, in which I mix two types of queries to get the desired results. However, it quickly falls apart when I introduce the NOT operator into a query.

For example, if I feed the query:

"Ceylon -history"

into the JSON above, I end up with results that don't obey the "-history" restriction. That's because the multi_match query type doesn't support the NOT operator, and therefore won't remove documents that contain the term "history".

It seems like the only way to handle this properly is to go back to my original idea, which means developing a custom query parser to handle all allowed operators and break them into ElasticSearch query types that respect multi-term synonym expansion. That sounds like a huge job -- and one that I would expect ElasticSearch to be able to handle out-of-the-box.
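
To give a sense of what even the smallest version of such a parser involves, here is a rough Python sketch that handles only one operator: a leading "-" on a single term. Negated terms are moved into a bool must_not clause and the remaining text goes to multi_match so that synonym expansion still applies. The function names are hypothetical, and grouping, phrases, wildcards and OR are deliberately not handled, which is exactly the part that makes the full job large.

```python
def split_negations(user_query):
    """Separate '-term' tokens from the rest of a whitespace-delimited query."""
    positive, negative = [], []
    for token in user_query.split():
        if token.startswith('-') and len(token) > 1:
            negative.append(token[1:])   # drop the leading '-'
        else:
            positive.append(token)
    return ' '.join(positive), negative


def build_query_with_negation(user_query, fields):
    """Build a bool query body: positive text in must, each negated term in must_not."""
    positive, negative = split_negations(user_query)
    bool_body = {}
    if positive:
        bool_body["must"] = [
            {"multi_match": {"query": positive, "fields": fields, "operator": "and"}}
        ]
    if negative:
        bool_body["must_not"] = [
            {"multi_match": {"query": term, "fields": fields}} for term in negative
        ]
    return {"bool": bool_body}
```

For the query "Ceylon -history" this produces a must clause for "Ceylon" and a must_not clause for "history".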

Am I wrong in expecting ElasticSearch to do this? Is there another tool I should be using?

Thanks again in advance for any tips!