Unable to bypass restriction with synonym token filter

Greetings,

I am trying to migrate from ES 5.5 to 7.3, but I got stuck on an issue with the synonym filter. The reference states: "This filter tokenizes synonyms with whatever tokenizer and token filters appear before it in the chain." From this I understand that when I define a custom analyzer, the synonyms get tokenized and filtered by the preceding token filters, but I don't want that. I need my synonyms untouched. Here is a small example to show what I am doing:

PUT index_1
{
  "mappings": {
    "properties": {
      "text": {
        "analyzer": "custom_analyzer_all",
        "norms": false,
        "type": "text"
      }
    }
  },
  "settings": {
    "index.number_of_replicas": 1,
    "analysis": {
      "analyzer": {
        "custom_analyzer_all": {
          "char_filter": [],
          "filter": [
            "lowercase",
            "custom_lemma",
            "custom_wordpack"
          ],
          "tokenizer": "standard",
          "type": "custom"
        }
      },
      "char_filter": {},
      "filter": {
        "custom_wordpack": {
          "type": "synonym",
          "tokenizer": "whitespace",
          "synonyms": [
            "gym => gym, _amenities"
          ]
        },
        "custom_lemma": {
          "type": "lemmagen",
          "lexicon": "en"
        }
      }
    },
    "index.number_of_shards": 1
  }
}

I am creating an index with a custom analyzer that uses the lowercase, synonym, and lemmagen filters. Lemmagen is a plugin (link to repo).
Then I insert a sample document:

POST index_1/_doc/1
{
  "text":"Great to see Ivan can get things right in a gym - clean and everything works."
}

Now when I check the terms of the text field with the _termvectors API:

POST index_1/_termvectors/1
{
  "fields": ["text"]
}

I get this for the added synonym:

{
  "_amenity" : {
    "term_freq" : 1,
    "tokens" : [
      {
        "position" : 10,
        "start_offset" : 44,
        "end_offset" : 47
      }
    ]
  }
}

"_amenity" is the lemmatized version of the synonym "_amenities". Even if I swap the places of the custom_lemma and custom_wordpack filters, it will still have the same effect(the synonym is inserted in its original form, but the custom_lemma filter again lematizes it). I need somehow to prevent the synonyms from getting analyzed by the filters or tokenizer. It used to be possible to specify a tokenizer inside the synonym token filter, but that no longer works. Could someone please suggest anything on how to have the text field analyzed as it is, but the synonyms are left untouched.

Welcome to the Elasticsearch forums! :wave:

This is an interesting case. It looks like the behavior changed with 6.0.0. (See under "Analysis" in the release notes' New Features.) You're right about the way that synonyms go through the entire token filter chain regardless of where the synonym filter is placed. For anyone following along, you can reproduce this behavior without installing any plugins by using a stemmer token filter:

PUT index_2
{
  "mappings": {
    "properties": {
      "text": {
        "analyzer": "stems_then_synonyms",
        "norms": false,
        "type": "text"
      }
    }
  },
  "settings": {
    "index.number_of_replicas": 1,
    "analysis": {
      "analyzer": {
        "stems_then_synonyms": {
          "char_filter": [],
          "filter": [
            "lowercase",
            "my_stemmer",
            "my_synonyms"
          ],
          "tokenizer": "standard",
          "type": "custom"
        },
        "synonyms_then_stems": {
          "char_filter": [],
          "filter": [
            "lowercase",
            "my_synonyms",
            "my_stemmer"
          ],
          "tokenizer": "standard",
          "type": "custom"
        }
      },
      "char_filter": {},
      "filter": {
        "my_synonyms": {
          "type": "synonym",
          "tokenizer": "whitespace",
          "synonyms": [
            "gym => gym, _amenities"
          ]
        },
        "my_stemmer": {
          "type": "stemmer",
          "name": "english"
        }
      }
    },
    "index.number_of_shards": 1
  }
}

POST index_2/_analyze
{
  "analyzer": "stems_then_synonyms",
  "text": "gym"
}

POST index_2/_analyze
{
  "analyzer": "synonyms_then_stems",
  "text": "gym"
}

Both of those calls to _analyze result in the same thing:

{
  "tokens" : [
    {
      "token" : "gym",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "_amen",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "SYNONYM",
      "position" : 0
    }
  ]
}

Can you tell us a little bit more about your use case? Are you using the synonym token filter as a way of adding document tags? I wonder if you could use a multi field to capture both kinds of analysis for every document:

PUT index_3
{
  "mappings": {
    "properties": {
      "text": {
        "analyzer": "custom_analyzer_lemma",
        "norms": false,
        "type": "text",
        "fields": {
          "synonymized": {
            "analyzer": "custom_analyzer_wordpack",
            "type": "text"
          }
        }
      }
    }
  },
  "settings": {
    "index.number_of_replicas": 1,
    "analysis": {
      "analyzer": {
        "custom_analyzer_lemma": {
          "char_filter": [],
          "filter": [
            "lowercase",
            "custom_lemma"
          ],
          "tokenizer": "standard",
          "type": "custom"
        },
        "custom_analyzer_wordpack": {
          "char_filter": [],
          "filter": [
            "lowercase",
            "custom_wordpack"
          ],
          "tokenizer": "standard",
          "type": "custom"
        }
      },
      "char_filter": {},
      "filter": {
        "custom_wordpack": {
          "type": "synonym",
          "tokenizer": "whitespace",
          "synonyms": [
            "gym => gym, _amenities"
          ]
        },
        "custom_lemma": {
          "type": "lemmagen",
          "lexicon": "en"
        }
      }
    },
    "index.number_of_shards": 1
  }
}

PUT index_3/_doc/1
{
  "text": "cars park at the gym"
}

POST index_3/_termvectors/1
{
  "fields": ["text", "text.synonymized"]
}
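
To see the tag land in the synonymized sub-field without pulling term vectors, you can also run that sub-field's analyzer directly; its output should include the _amenities token alongside gym, while the main text field's analyzer produces the lemmatized terms instead:

POST index_3/_analyze
{
  "analyzer": "custom_analyzer_wordpack",
  "text": "cars park at the gym"
}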

-William

Hi @William_Brafford
Thank you for taking the time to look at my issue. Yes, you are correct: I am using the synonym filter to tag texts and then perform aggregations on that tag. I have tried the multi-field approach, but it doesn't work for me, because I need to stem/lemmatize the field before matching it against the synonyms. In your multi-field example, "synonymized" would hold only the lowercased terms of the original text. Say the example text doesn't contain the word "gym" but its plural form "gyms" instead: the synonym filter won't capture "gyms", whereas if the text is lemmatized first, "gyms" becomes "gym" and the filter adds the tag to the terms.
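
To illustrate against your index_3 example (a quick check along the lines of the _analyze calls above):

POST index_3/_analyze
{
  "analyzer": "custom_analyzer_wordpack",
  "text": "cars park at the gyms"
}

The output contains the token "gyms" but no "_amenities", because the rule "gym => gym, _amenities" only matches the exact token "gym".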
There are other workarounds, like putting every possible form of a given word in the synonym filter, but I'd rather not do that. Another is simply to use the lemmatized form of the tag instead of an arbitrary one; I am still considering that.

@radoslav.sholev,

Thanks for explaining the use case. I think I understand the issue better now.

I'm looking at some of the discussions from when this change happened (before my time as an employee), and it seems we intended the synonym_graph search-time filter (docs here) to cover some of the cases that the old synonym filter handled before 6.0. See, for example, the discussion here.

I tried it out in Kibana. It looks like what I'm doing here doesn't store the tag on the document, but does let you use the tag in search:

PUT index_4
{
  "mappings": {
    "properties": {
      "text": {
        "analyzer": "custom_index_analyzer",
        "norms": false,
        "type": "text"
      }
    }
  },
  "settings": {
    "index.number_of_replicas": 1,
    "analysis": {
      "analyzer": {
        "custom_index_analyzer": {
          "char_filter": [],
          "filter": [
            "lowercase",
            "custom_lemma"
          ],
          "tokenizer": "standard",
          "type": "custom"
        },
        "custom_search_analyzer": {
          "char_filter": [],
          "filter": [
            "lowercase",
            "custom_wordpack"
          ],
          "tokenizer": "standard",
          "type": "custom"
        }
      },
      "char_filter": {},
      "filter": {
        "custom_wordpack": {
          "type": "synonym_graph",
          "synonyms": [
            "gym, _amenities"
          ]
        },
        "custom_lemma": {
          "type": "lemmagen",
          "lexicon": "en"
        }
      }
    },
    "index.number_of_shards": 1
  }
}

At index time, we use the lemmagen-based custom_index_analyzer on the text field, so the tag isn't added.
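
As a quick check (the exact tokens depend on the lemmagen lexicon, but no _amenities token appears in the output):

POST index_4/_analyze
{
  "analyzer": "custom_index_analyzer",
  "text": "nice chairs at the gym"
}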

However, a search for _amenities using the search-time analyzer will also search for the gym token.

POST index_4/_analyze
{
  "analyzer": "custom_search_analyzer",
  "text": "_amenities"
}

Result:

{
  "tokens" : [
    {
      "token" : "gym",
      "start_offset" : 0,
      "end_offset" : 10,
      "type" : "SYNONYM",
      "position" : 0
    },
    {
      "token" : "_amenities",
      "start_offset" : 0,
      "end_offset" : 10,
      "type" : "<ALPHANUM>",
      "position" : 0
    }
  ]
}

Now if we index one document with "gym" and another with "gyms":

PUT index_4/_doc/1
{
  "text": "nice chairs at the gym"
}

PUT index_4/_doc/2
{
  "text": "there are plenty of gyms in the city"
}

Our query will find and highlight both terms:

GET index_4/_search
{
  "query": {
    "match": {
      "text": {
        "query": "_amenities",
        "analyzer": "custom_search_analyzer"
      }
    }
  },
  "highlight": {
    "fields": {
      "text": {}
    }
  }
}

Result:

{
  [...],
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.27884474,
    "hits" : [
      {
        "_index" : "index_4",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.27884474,
        "_source" : {
          "text" : "nice chairs at the gym"
        },
        "highlight" : {
          "text" : [
            "nice chairs at the <em>gym</em>"
          ]
        }
      },
      {
        "_index" : "index_4",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.27884474,
        "_source" : {
          "text" : "there are plenty of gyms in the city"
        },
        "highlight" : {
          "text" : [
            "there are plenty of <em>gyms</em> in the city"
          ]
        }
      }
    ]
  }
}

And here's a stab at an aggregation that counts documents matching tags:

GET index_4/_search
{
  "query": {
    "match_all": {}
  },
  "size": 0,
  "aggs": {
    "tag_agg": {
      "filters": {
        "filters": {
          "amenities": {
            "match": {
              "text": {
                "query": "_amenities",
                "analyzer": "custom_search_analyzer"
              }
            }
          },
          "no_amenities": {
            "bool": {
              "must_not": {
                "match": {
                  "text": {
                    "query": "_amenities",
                    "analyzer": "custom_search_analyzer"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

Result:

{
  [...],
  "aggregations" : {
    "tag_agg" : {
      "buckets" : {
        "amenities" : {
          "doc_count" : 2
        },
        "no_amenities" : {
          "doc_count" : 0
        }
      }
    }
  }
}

I think this approach gets closer to your use case than the multi-field suggestion does, though I'm not sure it fully solves the problem. Please let me know what you think.

-William

Hi @William_Brafford

Sorry for the late response.
The solution looks great. I hadn't considered using a separate analyzer for searches before, and it opens up an entirely new perspective on search queries. I have been playing with this to see how I can fit it into my project. Sadly, I can't get around the missing term (_amenities) in the _termvectors output. I need it, since I use a lot of span queries for additional analysis of the texts. I have tried converting the spans into an intervals query, but hit the same bump, since the "match" rule needs the actual terms in order to calculate max_gaps.
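
For reference, the intervals variant I tried was roughly shaped like this (a sketch using the example data from this thread, not my real query):

GET index_4/_search
{
  "query": {
    "intervals": {
      "text": {
        "match": {
          "query": "nice _amenities",
          "max_gaps": 5,
          "ordered": true,
          "analyzer": "custom_search_analyzer"
        }
      }
    }
  }
}

Since the _amenities term never actually exists in the index, that side of the interval can only match through the expanded gym token, so the tag term still isn't there to anchor the intervals on.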

I ended up replacing the tag with a unique identifier. This way neither the tokenizer nor the lemmagen filter changes the term. In the UI the user can still write their own tag name, but in the backend it is replaced with the ID in order to perform searches and add new synonym groups to the index. It's a cheap and easy workaround.
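
For example, something like this, where "zz0001" is a made-up placeholder for the generated ID (the real IDs come from our backend):

"custom_wordpack": {
  "type": "synonym",
  "synonyms": [
    "gym => gym, zz0001"
  ]
}

The standard tokenizer keeps "zz0001" as a single token and the lemmagen lexicon shouldn't change it, so it survives the whole analysis chain intact.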

Thank you for the support @William_Brafford!


@radoslav.sholev ,

I'm glad this worked for you! I learned a lot while thinking through your use case. Thanks for the detailed write-ups.

-William
