Synonym Token Filter questions

jjarthur · August 26, 2019, 5:26am

I am attempting to implement synonyms into my search and ran into a couple of questions.

Firstly, I have a synonyms.txt file with the following mappings:

foo, bar, baz

When I query "baz" in Elasticsearch, it correctly returns results for all three tokens. How do I tell ES to give a boost/priority to "baz", since that is the term that was actually searched? Ideally, it would return all the "baz" results first, followed by the rest.

Separately, I am also having trouble with multi-word synonyms. I have a synonyms.txt file with the following mapping:

internet corporation for assigned names and numbers => icann

When I query for 'internet corporation for assigned names and numbers', I want to return results with that whole phrase (or the abbreviated 'icann') only. At the moment, it is also returning results for the individual tokens, e.g. 'internet', 'corporation', etc. What do I need to put in my analyzer settings to achieve this? It's worth noting that this custom analyzer is only used for search_analyzer fields.
```
 "analysis": {
     "analyzer": {
         "synonym": {
             "tokenizer": "standard",
             "filter": ["lowercase", "synonym"]
         }
     },
     "filter": {
         "synonym": {
             "type": "synonym",
             "synonyms_path": "synonyms",
             "expand": false
         }
     }
 }
```

Any help is greatly appreciated -- cheers!

abdon · August 26, 2019, 8:07am

To get exact matches ranked higher than synonym matches, you need to write a query that searches with and without synonyms at the same time. For example, a bool query like this:

GET my_index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "my_field": "baz"
          }
        }
      ],
      "should": [
        {
          "match": {
            "my_field": {
              "query": "baz",
              "analyzer": "standard"
            }
          }
        }
      ]
    }
  }
}

The idea of the query above is that the optional should clause will only match exact hits, because it overrides the search analyzer. Documents that match this should clause will get a higher score.

If you want to search for phrases, then that is something you would not necessarily achieve with an analyzer. Take a look at using the match_phrase query instead. If you do want to go with an analyzer then check out this blog post for a useful pattern.
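As a minimal sketch of the match_phrase suggestion, the body below could be sent to `GET my_index/_search`. The names `my_index` and `my_field` are the same placeholders used in the bool query earlier in this post, not names taken from the original poster's index.

```json
{
  "query": {
    "match_phrase": {
      "my_field": "internet corporation for assigned names and numbers"
    }
  }
}
```

With the default `slop` of 0, match_phrase only matches documents where the analyzed terms appear next to each other and in the same order, so a document that contains only 'internet' or only 'corporation' would not match.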

By the way, I see you are using the synonym filter. With multi-word synonyms, the synonym_graph filter will work better. You may want to switch to that filter.
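A hedged sketch of what the switch to synonym_graph might look like, as the body of an index-creation request. It assumes the rules are given inline through `synonyms` rather than through the `synonyms_path` file from the question, and it reuses the analyzer and filter names from the question. The synonym_graph filter is meant for search-time analysis, which fits the search_analyzer-only usage described above.

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "synonym": {
          "tokenizer": "standard",
          "filter": ["lowercase", "synonym"]
        }
      },
      "filter": {
        "synonym": {
          "type": "synonym_graph",
          "synonyms": [
            "internet corporation for assigned names and numbers => icann",
            "foo, bar, baz"
          ]
        }
      }
    }
  }
}
```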

jjarthur · August 27, 2019, 1:40am

Thanks for the resources @abdon. The biggest thing I was missing was using match_phrase instead of just match. This looks like it almost solves the problem on its own. I have also swapped to synonym_graph as it seems to fit my use-case better. I'll post back if I run into any more issues.

jjarthur · August 27, 2019, 3:58am

Okay, I may have jumped the gun a bit here. match_phrase works great for synonyms that are acronyms because it stops the query from being matched word by word. Unfortunately, it also does the same for every other query that has no synonym. Is there a simple way to have the match_phrase behaviour apply only when a synonym is found?

abdon · August 27, 2019, 7:42am

Did you read this blog post that I shared in my earlier post? I think it covers exactly what you want to do.

jjarthur · August 27, 2019, 10:01pm

Sorry, yes I did. It seems to be a bit outdated (filters no longer support a 'tokenizer' field), but I was going to attempt to use one of the methods described today.

jjarthur · August 27, 2019, 11:42pm

I believe I have solved this for now using the 'Autophrase with synonym step' approach from the linked blog post. My settings look something like this:

    "analysis": {
        "analyzer": {
            "synonym": {
                "tokenizer": "standard",
                "filter": ["lowercase", "autophrase_synonym", "synonym"]
            }
        },
        "filter": {
            "autophrase_synonym": {
                "type": "synonym",
                "synonyms": [
                  "internet corporation for assigned names and numbers => internet_corporation_for_assigned_names_and_numbers"
                ]
            },
            "synonym": {
                "type": "synonym",
                "synonyms": [
                    "internet_corporation_for_assigned_names_and_numbers, icann", "foo, bar, baz"
                ]
            }
        }
    }
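One way to check a chain like this is the `_analyze` API. The body below is a sketch that could be sent to `GET my_index/_analyze`, where `my_index` is a placeholder for an index created with the settings above.

```json
{
  "analyzer": "synonym",
  "text": "internet corporation for assigned names and numbers"
}
```

If the two filters behave as intended, the response should list the underscore-joined token and `icann` rather than the individual words 'internet', 'corporation', and so on.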

jjarthur · August 28, 2019, 1:28am

I tried your bool query idea posted above in order to rank exact matches higher in the results. The results returned a higher score for every entry, which leads to no changes in the rankings overall. I believe this is because my synonym filter needs to run on the index analyzer as well as the search analyzer.

Is there another approach I can take or any way around this that you are aware of?

abdon · August 28, 2019, 7:32am

You can use multi-fields to index your data twice: once with synonyms (using your synonym analyzer), and once without (using the standard analyzer). The docs have an example on how to do this.
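A minimal sketch of such a mapping, as the body of an index-creation request in the typeless 7.x format. The sub-field name `exact`, the field name `my_field`, and the inline synonym rule are assumptions for illustration. The main field is analyzed with the synonym analyzer and the sub-field with the standard analyzer.

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "synonym": {
          "tokenizer": "standard",
          "filter": ["lowercase", "synonym"]
        }
      },
      "filter": {
        "synonym": {
          "type": "synonym",
          "synonyms": ["foo, bar, baz"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "analyzer": "synonym",
        "fields": {
          "exact": {
            "type": "text",
            "analyzer": "standard"
          }
        }
      }
    }
  }
}
```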

In your bool query's must clause you can then query the field with synonyms, and in the should clause the field without synonyms.
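Sketched as a request body for `GET my_index/_search`, and assuming a hypothetical sub-field `my_field.exact` that is indexed with the standard analyzer (no synonyms), the adjusted query could look like this:

```json
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "my_field": "baz"
          }
        }
      ],
      "should": [
        {
          "match": {
            "my_field.exact": "baz"
          }
        }
      ]
    }
  }
}
```

The must clause decides which documents match (synonyms included), while the should clause only adds to the score of documents that literally contain the searched term.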

jjarthur · August 29, 2019, 5:04am

This works very well. Thank you @abdon.

system · September 26, 2019, 5:04am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
