I'm evaluating Elasticsearch 7.6 and its handling of multi-term (multi-word) synonyms, and I'm having a lot of trouble figuring out how to make practical use of it. In short:
How can I take a user's query and present it to Elasticsearch in a way that it will expand multi-term synonyms correctly?
Here's what I've done so far:
Based on the documentation's recommendations, I'm using the synonym_graph filter to load my synonym list:
"filter": {
"graph_synonyms": {
"type": "synonym_graph",
"synonyms_path": "my_synonyms.txt",
"updateable": true
}
}
I have implemented an analyzer that uses the above filter:
"synonym_analyzer": {
"tokenizer": "keyword",
"filter": [
"lowercase",
"graph_synonyms"
]
}
After some experimentation, I discovered that I needed to use the keyword tokenizer to handle the multi-term synonyms in my synonym list correctly. (The whitespace and standard tokenizers both broke up multi-term entries in the list, and thus failed to map those synonyms properly.)
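For anyone who wants to reproduce this, the tokens an analyzer emits can be inspected with the _analyze API. Below is a minimal sketch in Python that only builds the request path and body (it does not send anything). The index name my_index and the helper build_analyze_request are placeholders I made up for illustration, not part of my real setup.

```python
import json


def build_analyze_request(index, analyzer, text):
    """Return the path and JSON body for an _analyze call (hypothetical helper)."""
    path = f"/{index}/_analyze"
    body = {"analyzer": analyzer, "text": text}
    return path, body


# "my_index" is an assumed index name; "synonym_analyzer" is the analyzer defined above.
path, body = build_analyze_request("my_index", "synonym_analyzer", "Sri Lanka history 1972")
print("GET", path)
print(json.dumps(body, indent=2))
```

Sending that body to the cluster should list the tokens the search analyzer produces for the full query string, which makes it easier to see whether the synonym rule fired.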
In addition, I have created an analyzer that uses the standard tokenizer to break up the source text:
"analyzer-standard": {
"type": "custom",
"tokenizer": "standard",
"filter": "lowercase"
}
Finally, I set up the field mapping so that the field I want to search uses the custom analyzer above as well as a search_analyzer that applies my synonym filter at search time:
"text_field": {
"type": "text",
"analyzer": "analyzer-standard",
"search_analyzer": "synonym_analyzer"
}
Now, let's go right to an example. I have the following synonym pair in my list:
sri lanka, ceylon
Let's say that a user searches for the following:
Sri Lanka history 1972
How can I give this query to Elasticsearch so that it properly expands "Sri Lanka" to "Ceylon" when it executes the search?
I've tried a simple_query_string query, as below:
"query": {
"simple_query_string": {
"query": "Sri Lanka history 1972",
"fields": [
"text_field"
],
"default_operator": "and"
}
}
But this only finds records containing "Sri Lanka", not records containing "Ceylon".
I found another topic in which a user reported a similar problem:
@abdon offers a helpful suggestion: use the flags parameter so that the query parser doesn't split the query on whitespace, which lets the synonym filter see the whole phrase and find a match.
This solution works just fine, but only if we search for "Sri Lanka" (or any other multi-term synonym) alone, like this:
"query": {
"simple_query_string": {
"query": "sri lanka",
"fields": [
"text_field"
],
"default_operator": "and",
"flags": "OR|AND|PREFIX"
}
}
But that's not how we receive free text queries from our users. They send us queries that mix synonyms with other terms.
Does this mean that we have to pre-parse the query for Elasticsearch, and use our own code to detect if the user happens to search for a multi-term synonym, so that it can be handed to Elasticsearch correctly? Wouldn't that defeat the entire purpose of using Elasticsearch's search-time query parser? If we have to handle all of the logic behind synonyms in our own code, then we don't need Elasticsearch to do the query expansion for us; we could do it ourselves.
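To make concrete what I mean by pre-parsing, here is a rough sketch in Python of the kind of code I would rather not have to write and maintain. It assumes the synonym file uses the usual comma-separated (and optionally "=>") line format, and it does a greedy longest-match over the user's query to pull out known multi-term entries. The function names load_multi_term_synonyms and segment_query are mine and purely illustrative.

```python
def load_multi_term_synonyms(lines):
    """Collect multi-word entries from synonym lines such as 'sri lanka, ceylon'."""
    phrases = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and comment lines
            continue
        # Handle both 'a, b' and 'a, b => c' rule forms.
        for side in line.split("=>"):
            for entry in side.split(","):
                words = tuple(entry.lower().split())
                if len(words) > 1:
                    phrases.add(words)
    return phrases


def segment_query(query, phrases):
    """Greedy longest-match split of a query into known phrases and single terms."""
    words = query.lower().split()
    max_len = max((len(p) for p in phrases), default=1)
    segments, i = [], 0
    while i < len(words):
        # Try the longest candidate phrase first, down to two words.
        for n in range(min(max_len, len(words) - i), 1, -1):
            if tuple(words[i:i + n]) in phrases:
                segments.append(" ".join(words[i:i + n]))
                i += n
                break
        else:
            # No multi-term entry starts here; keep the single word.
            segments.append(words[i])
            i += 1
    return segments


phrases = load_multi_term_synonyms(["sri lanka, ceylon"])
print(segment_query("Sri Lanka history 1972", phrases))
```

Each segment would then have to be sent to Elasticsearch as its own clause, which is exactly the duplication of synonym logic I'm hoping to avoid.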
I feel that I must be missing something fundamental here, because I can't understand how to put Elasticsearch's support for multi-term synonyms into practice.
Thanks in advance for any help or insight!