Synonym token filter and edge_ngram tokenizer conflicts

Hello,

I'm facing a small issue with my implementation of the synonym token filter combined with an edge_ngram tokenizer.

For example, I have a synonym mapping sa => south_australia and a document whose field contains sandcastles are fun. Now if I query Elasticsearch for south_australia, it will also return my sandcastles are fun document in the results. This happens because the edge_ngram tokenizer has indexed that field as {sa, san, sand, ...}, so once south_australia is mapped to sa it unfortunately gets picked up.
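To illustrate, running the field value through the edge_ngram tokenizer from the settings below (the index name here is made up) shows the leading grams, including the sa token that the synonym mapping then latches onto:

GET /my_index/_analyze
{
    "tokenizer": "autocomplete",
    "filter": ["lowercase"],
    "text": "sandcastles are fun"
}

The token list starts with sa, san, sand, and so on.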

I have added a stripped-down version of my settings file that reproduces the issue:

{
    "settings": {
        "analysis": {
            "analyzer": {
                "synonym": {
                    "tokenizer": "autocomplete",
                    "filter": ["lowercase", "synonym"]
                }
            },
            "filter": {
                "synonym": {
                    "type": "synonym",
                    "synonyms_path": "synonyms"
                }
            },
            "tokenizer": {
                "autocomplete": {
                    "type": "edge_ngram",
                        "min_gram": 2,
                        "max_gram": 20,
                        "token_chars": [
                            "letter",
                            "digit"
                        ]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "standard",
                "fields": {
                  "synonyms": {
                    "type": "text",
                    "analyzer": "synonym"
                  }
                }
            }
        }
    }
}

Curious if anyone has dealt with this before or has some advice to get around it?

As you seem to need the edge_ngram tokenizer only for autocomplete, would it be possible to use two fields instead: one for the autocomplete functionality and one that uses the synonyms to increase search recall? Another option might be to apply synonyms only at query time.
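For the query-time-only option, the mappings section could look roughly like this (just a sketch based on your snippet, untested): the standard analyzer is used at index time and the synonym analyzer is applied only through search_analyzer:

"mappings": {
    "properties": {
        "title": {
            "type": "text",
            "analyzer": "standard",
            "fields": {
                "synonyms": {
                    "type": "text",
                    "analyzer": "standard",
                    "search_analyzer": "synonym"
                }
            }
        }
    }
}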

I was under the impression that, with the way I had set up my synonyms, I had to apply the synonym token filter at both index time and query time. Thank you for your suggestion -- it seems to have fixed my problem without breaking anything else in the process (I think 🙂).

After some more testing, I found that this fixed the cases of incomplete token mapping, e.g. a query for sou will no longer map to sa (via sout => south_australia => sa). However, the problem still persists for a full synonym match, e.g. a query for south_australia will still map to sa.

I'm looking at your other suggestion and not sure I fully understand. Do you mean something like:

{
    "settings": {
        "analysis": {
            "analyzer": {
                "synonym": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "synonym"]
                },
                "autocomplete": {
                    "tokenizer": "autocomplete",
                    "filter": ["lowercase"]
                }
            },
            "filter": {
                "synonym": {
                    "type": "synonym",
                    "synonyms_path": "synonyms"
                }
            },
            "tokenizer": {
                "autocomplete": {
                    "type": "edge_ngram",
                        "min_gram": 2,
                        "max_gram": 20,
                        "token_chars": [
                            "letter",
                            "digit"
                        ]
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "standard",
                "fields": {
                  "synonyms": {
                    "type": "text",
                    "analyzer": "synonym"
                  },
                  "autocomplete": {
                     "type": "text",
                     "analyzer": "autocomplete"
                  }
                }
            }
        }
    }
}
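And then at query time I would hit the matching sub-field directly, e.g. something like this for the autocomplete part (index name made up):

GET /my_index/_search
{
    "query": {
        "match": {
            "title.autocomplete": "sou"
        }
    }
}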

Yes, at a quick glance your last example looks like what I wanted to point out to you: use two independent analyzers for autocomplete and regular search, so that synonyms don't interfere with edge ngrams. I don't know if this completely covers your use case, but it should make things easier to reason about than mixing autocomplete functionality and search on the same field.

Also, I wonder if you ever tried the dedicated completion suggester for the autocomplete functionality? Maybe you did and discarded it for some reason; I'm just curious and don't want to steer this thread towards suggesters, but it might be worth another look.
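For reference, the completion suggester is built on its own field type; a rough, untested sketch (index and field names made up) looks like this:

PUT /my_index
{
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "fields": {
                    "suggest": {
                        "type": "completion"
                    }
                }
            }
        }
    }
}

GET /my_index/_search
{
    "suggest": {
        "title_suggest": {
            "prefix": "sou",
            "completion": {
                "field": "title.suggest"
            }
        }
    }
}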

I ended up implementing both of your suggestions. It is very similar to the above snippet, except I am only using the synonyms analyzer at search time and only using the autocomplete analyzer at index time.
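For reference, the relevant part of my mapping now looks roughly like this (simplified; the analyzer and tokenizer definitions are the same as above, and showing standard as the query-time analyzer for the autocomplete field is just how I sketched it here):

"title": {
    "type": "text",
    "analyzer": "standard",
    "fields": {
        "synonyms": {
            "type": "text",
            "analyzer": "standard",
            "search_analyzer": "synonym"
        },
        "autocomplete": {
            "type": "text",
            "analyzer": "autocomplete",
            "search_analyzer": "standard"
        }
    }
}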

In regards to the completion suggester... to be honest I had no idea that existed! I have just gotten everything working as is, but will keep that in mind for future changes.

Thanks for taking the time, I appreciate the feedback!
