Synonym Token Filter questions

I am attempting to implement synonyms in my search and have run into a couple of questions.

  1. Firstly, I have a synonyms.txt file with the following mappings:

    foo, bar, baz

    When "baz" is queried in Elasticsearch, it correctly returns the results for all three tokens, but how do I tell ES to give a 'boost'/priority to "baz" (as that was the term that was actually searched)? Preferably, it would return all the "baz" results first, followed by the rest.

  2. Separately, I also have some trouble with multi-word synonyms. I have a synonyms.txt file with the following mappings:

    internet corporation for assigned names and numbers => icann

    When I query for 'internet corporation for assigned names and numbers', I want to return results with that whole phrase (or the abbreviated 'icann') only. At the moment, it is also returning results for the individual tokens, e.g. 'internet', 'corporation', etc. What do I need to put in my analyzer settings to achieve this? It's worth noting that this custom analyzer is only used for search_analyzer fields.

     "analysis": {
         "analyzer": {
             "synonym": {
                 "tokenizer": "standard",
                 "filter": ["lowercase", "synonym"]
             }
         },
         "filter": {
             "synonym": {
                 "type": "synonym",
                 "synonyms_path": "synonyms",
                 "expand": false
             }
         }
     }
    

Any help is greatly appreciated -- cheers!

To get exact matches ranked higher than synonym matches, you need to write a query that searches with and without synonyms at the same time. For example, a bool query like this:

GET my_index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "my_field": "baz"
          }
        }
      ],
      "should": [
        {
          "match": {
            "my_field": {
              "query": "baz",
              "analyzer": "standard"
            }
          }
        }
      ]
    }
  }
}

The idea of the query above is that the optional should clause will only match exact hits, because it overrides the search analyzer. Documents that match this should clause will get a higher score.

If you want to search for phrases, then that is something you would not necessarily achieve with an analyzer. Take a look at using the match_phrase query instead. If you do want to go with an analyzer then check out this blog post for a useful pattern.
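
As a minimal sketch, a match_phrase version of the search could look like the body below, sent to `GET my_index/_search`. The index and field names (`my_index`, `my_field`) are the same placeholders used in the bool query above, not names taken from your setup.

```json
{
  "query": {
    "match_phrase": {
      "my_field": "internet corporation for assigned names and numbers"
    }
  }
}
```

With the default slop of 0, this only matches documents where the analyzed terms appear next to each other and in the same order, rather than any document containing one of the words.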

By the way, I see you are using the synonym filter. With multi-word synonyms, the synonym_graph filter will work better. You may want to switch to that filter.
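
A rough sketch of that switch, assuming the analyzer settings from the question stay otherwise unchanged and the analyzer is only used at search time (the synonym_graph filter is intended for search analyzers, not index analyzers). The only change is the filter's `type`:

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "synonym": {
          "tokenizer": "standard",
          "filter": ["lowercase", "synonym"]
        }
      },
      "filter": {
        "synonym": {
          "type": "synonym_graph",
          "synonyms_path": "synonyms",
          "expand": false
        }
      }
    }
  }
}
```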

Thanks for the resources @abdon. The biggest thing I was missing was using match_phrase instead of just match. This looks like it almost solves the problem on its own. I have also swapped to synonym_graph as it seems to fit my use-case better. I'll post back if I run into any more issues.

Okay, maybe I jumped the gun a bit here. match_phrase works great for synonyms that are acronyms because it stops the tokenizing of each word in the query. Unfortunately, however, it also stops tokenizing all other queries that do not have a synonym. Is there a simple way to have the match_phrase functionality apply only when a synonym is found?

Did you read this blog post that I shared in my earlier post? I think it covers exactly what you want to do.

Sorry, yes I did. It seems to be a bit outdated (filters no longer support a 'tokenizer' field), but I was going to attempt to use one of the methods described today.

I believe I have solved this for now using 'Autophrase with synonym step' in the linked blog post. My settings look something like this:

    "analysis": {
        "analyzer": {
            "synonym": {
                "tokenizer": "standard",
                "filter": ["lowercase", "autophrase_synonym", "synonym"]
            }
        },
        "filter": {
            "autophrase_synonym": {
                "type": "synonym",
                "synonyms": [
                  "internet corporation for assigned names and numbers => internet_corporation_for_assigned_names_and_numbers"
                ]
            },
            "synonym": {
                "type": "synonym",
                "synonyms": [
                    "internet_corporation_for_assigned_names_and_numbers, icann", "foo, bar, baz"
                ]
            }
        }
    }
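
One way to check what this chain actually emits is the _analyze API. A sketch, assuming the settings above are applied to an index called `my_index` (a placeholder name), with the body sent to `GET my_index/_analyze`:

```json
{
  "analyzer": "synonym",
  "text": "internet corporation for assigned names and numbers"
}
```

The response lists the tokens the analyzer produces, which makes it easy to see whether the individual words still come through.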

I tried your bool query idea posted above in order to rank exact matches higher in the results. The results returned a higher score for every entry, which leads to no changes in the rankings overall. I believe this is because my synonym filter needs to run on the index analyzer as well as the search analyzer.

Is there another approach I can take or any way around this that you are aware of?

You can use multi-fields to index your data twice: once with synonyms (using your synonym analyzer), and once without (using the standard analyzer). The docs have an example on how to do this.
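
A sketch of such a mapping, assuming Elasticsearch 7 or later (typeless mappings) and that the `synonym` analyzer is already defined in the index settings. The sub-field name `exact` is made up for illustration, and `my_field` is the placeholder from the earlier query. This would be part of the body of the create-index request:

```json
{
  "mappings": {
    "properties": {
      "my_field": {
        "type": "text",
        "analyzer": "synonym",
        "fields": {
          "exact": {
            "type": "text",
            "analyzer": "standard"
          }
        }
      }
    }
  }
}
```

Existing documents would need to be reindexed for the new sub-field to be populated.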

In your bool query's must clause you can then query the field with synonyms, and in the should clause the field without synonyms.
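
Roughly like this, assuming a hypothetical sub-field `my_field.exact` that is analyzed with the standard analyzer (the name is only an example), with the body sent to `GET my_index/_search`:

```json
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "my_field": "baz"
          }
        }
      ],
      "should": [
        {
          "match": {
            "my_field.exact": "baz"
          }
        }
      ]
    }
  }
}
```

Documents that contain the literal term match both clauses and so score higher than documents that only match through a synonym.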

This works very well. Thank you @abdon.
