Synonyms for Sankt and St

Hi there,

I'm trying to get synonyms working with my existing setup. Currently I have these settings:

PUT city
{
	"settings": {
		"analysis": {
			"analyzer": {
				"autocomplete": {
					"tokenizer": "autocomplete",
					"filter": [
						"lowercase",
						"german_normalization",
						"my_ascii_folding"
					]
				},
				"autocomplete_search": {
					"tokenizer": "lowercase",
					"filter": [
						"lowercase",
						"german_normalization",
						"my_ascii_folding"
					]
				}
			},
			"filter": {
				"my_ascii_folding": {
					"type": "asciifolding",
					"preserve_original": true
				}
			},
			"tokenizer": {
				"autocomplete": {
					"type": "edge_ngram",
					"min_gram": 1,
					"max_gram": 15,
					"token_chars": [
						"letter",
						"digit",
						"symbol"
					]
				}
			}
		}
	},
	"mappings": {
		"city": {
			"properties": {
				"name": {
					"type": "text",
					"analyzer": "autocomplete",
					"search_analyzer": "autocomplete_search"
				}
			}
		}
	}
}

In this city index I have documents like St. Wolfgang or Sankt Wolfgang, and so on. For me, St. and Sankt are synonyms, so if I search for Sankt, both documents should appear.
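
For reference, the documents look roughly like this (the IDs here are just examples):

PUT city/city/1
{
	"name": "Sankt Wolfgang"
}

PUT city/city/2
{
	"name": "St. Wolfgang"
}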

I created a new filter:

"my_synonym_filter": {
   "type": "synonym",
    "ignore_case": "true",
    "synonyms": [
    	"sankt, st."
    ]
} 
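
The filter goes into the search analyzer roughly like this (a sketch reusing the names from my settings above):

"autocomplete_search": {
	"tokenizer": "lowercase",
	"filter": [
		"lowercase",
		"my_synonym_filter",
		"german_normalization",
		"my_ascii_folding"
	]
}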

So far so good. But these are the issues I'm facing:

It's clear that the dot after "st" is currently stripped during analysis and is therefore not searchable. But for the synonym, the dot is important.

The second issue: if I search for "sankt", the synonym expansion also produces "st", which matches all documents starting with "st", such as Stuttgart. Again, this happens because the dot is not kept.
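
To illustrate, the analyzer output can be inspected with the _analyze API (assuming the synonym filter is wired into autocomplete_search as sketched above):

POST city/_analyze
{
	"analyzer": "autocomplete_search",
	"text": "sankt"
}

As described, the expansion contains "st" without the dot, and since the indexed edge n-grams of Stuttgart also start with "st", that document matches too.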

Do you have any idea how I can achieve this? If you need any more information, please let me know.

Thanks in advance

You could modify your tokenizer to not remove punctuation by adding "punctuation" to the "token_chars" in your autocomplete tokenizer. However, maybe there is some punctuation that you do want to remove?
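
For example (a sketch that keeps the rest of your tokenizer settings as they are):

      "tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 15,
          "token_chars": [
            "letter",
            "digit",
            "symbol",
            "punctuation"
          ]
        }
      }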

Alternatively you could normalize "St." to "Sankt" in the character filter phase, which takes place before tokenization. You could for example add the following character filter definition to the analysis settings:

      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "St. => sankt",
            "st. => sankt"
          ]
        }
      }

and update your analyzers to use this char filter:

        "autocomplete": {
          "char_filter": [
            "my_char_filter"
          ],
          "tokenizer": "autocomplete",
          "filter": [
            "lowercase",
            "german_normalization",
            "my_ascii_folding"
          ]
        },
        "autocomplete_search": {
          "char_filter": [
            "my_char_filter"
          ],
          "tokenizer": "lowercase",
          "filter": [
            "lowercase",
            "german_normalization",
            "my_ascii_folding"
          ]
        }
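
You can then check the result with the _analyze API, for example:

      POST city/_analyze
      {
        "analyzer": "autocomplete",
        "text": "St. Wolfgang"
      }

The output should contain the edge n-grams of "sankt" rather than of "st", because the character filter rewrites "St." before the tokenizer runs.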

Thanks abdon for your solution. I will check that.

I also found a solution that works for me. I had to change my analyzers. These are the final settings I am using for now:

{
	"settings": {
		"analysis": {
			"analyzer": {
				"autocomplete": {
					"tokenizer": "whitespace",
					"filter": [
						"lowercase",
						"my_synonym_filter",
						"german_normalization",
						"edge_filter"
					]
				},
				"autocomplete_search": {
					"tokenizer": "whitespace",
					"filter": [
						"lowercase",
						"my_synonym_filter",
						"german_normalization"
					]
				}
			},
			"filter": {
				"edge_filter": {
					"type": "edgeNGram",
					"min_gram": 1,
					"max_gram": 15
				},
				"my_synonym_filter": {
					"type": "synonym",
					"ignore_case": "true",
					"synonyms": [
						"sankt, st., santa, saint",
						"sveti, sv."
					]
				}
			}
		}
	},
	"mappings": {
		"city": {
			"properties": {
				"name": {
					"type": "text",
					"analyzer": "autocomplete",
					"search_analyzer": "autocomplete_search"
				}
			}
		}
	}
}
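
A quick check with a simple match query on the name field:

GET city/_search
{
	"query": {
		"match": {
			"name": "sankt"
		}
	}
}

With the synonym filter in the search analyzer, this should return both Sankt Wolfgang and St. Wolfgang, but not Stuttgart, because the whitespace tokenizer keeps the dot in "st." and that term never matches Stuttgart's edge n-grams.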

What do you think?

Looks good to me. You switched to a tokenizer that doesn't remove punctuation (the whitespace tokenizer). If that works for you: great! 🙂
