Synonyms for Sankt and St

Hi there,

I'm trying to get synonyms working with my existing setup. Currently I have these settings:

PUT city
{
	"settings": {
		"analysis": {
			"analyzer": {
				"autocomplete": {
					"tokenizer": "autocomplete",
					"filter": [
						"lowercase",
						"german_normalization",
						"my_ascii_folding"
					]
				},
				"autocomplete_search": {
					"tokenizer": "lowercase",
					"filter": [
						"lowercase",
						"german_normalization",
						"my_ascii_folding"
					]
				}
			},
			"filter": {
				"my_ascii_folding": {
					"type": "asciifolding",
					"preserve_original": true
				}
			},
			"tokenizer": {
				"autocomplete": {
					"type": "edge_ngram",
					"min_gram": 1,
					"max_gram": 15,
					"token_chars": [
						"letter",
						"digit",
						"symbol"
					]
				}
			}
		}
	},
	"mappings": {
		"city": {
			"properties": {
				"name": {
					"type": "text",
					"analyzer": "autocomplete",
					"search_analyzer": "autocomplete_search"
				}
			}
		}
	}
}

In this city index I have documents like St. Wolfgang or Sankt Wolfgang, and so on. For me, St. and Sankt are synonyms, so if I search for Sankt, both documents should appear.
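
For reference, the documents look roughly like this (the IDs here are just examples):

PUT city/city/1
{
	"name": "Sankt Wolfgang"
}

PUT city/city/2
{
	"name": "St. Wolfgang"
}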

I created a new filter:

"my_synonym_filter": {
   "type": "synonym",
    "ignore_case": "true",
    "synonyms": [
    	"sankt, st."
    ]
} 
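
The filter goes into the search analyzer roughly like this (a sketch reusing the names from my settings above):

"autocomplete_search": {
	"tokenizer": "lowercase",
	"filter": [
		"lowercase",
		"my_synonym_filter",
		"german_normalization",
		"my_ascii_folding"
	]
}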

So far so good. But these are the issues I'm facing:

It's clear that the dot after "st" is currently stripped during analysis and is therefore not searchable. But for the synonym, the dot is important.

The second issue: if I search for "sankt", the synonym expansion also produces "st", which matches all documents starting with "st", such as Stuttgart. Again, this happens because the dot is not kept.
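
To illustrate, the analyzer output can be inspected with the _analyze API (assuming the synonym filter is wired into autocomplete_search as sketched above):

POST city/_analyze
{
	"analyzer": "autocomplete_search",
	"text": "sankt"
}

As described, the expansion contains "st" without the dot, and since the indexed edge n-grams of Stuttgart also start with "st", that document matches too.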

Do you have any idea how I can achieve this? If you need any more information, please let me know.

Thanks in advance

You could modify your tokenizer to not remove punctuation by adding "punctuation" to the "token_chars" in your autocomplete tokenizer. However, maybe there is some punctuation that you do want to remove?
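
For example (a sketch that keeps the rest of your tokenizer settings as they are):

      "tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 15,
          "token_chars": [
            "letter",
            "digit",
            "symbol",
            "punctuation"
          ]
        }
      }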

Alternatively you could normalize "St." to "Sankt" in the character filter phase, which takes place before tokenization. You could for example add the following character filter definition to the analysis settings:

      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "St. => sankt",
            "st. => sankt"
          ]
        }
      }

and update your analyzers to use this char filter:

        "autocomplete": {
          "char_filter": [
            "my_char_filter"
          ],
          "tokenizer": "autocomplete",
          "filter": [
            "lowercase",
            "german_normalization",
            "my_ascii_folding"
          ]
        },
        "autocomplete_search": {
          "char_filter": [
            "my_char_filter"
          ],
          "tokenizer": "lowercase",
          "filter": [
            "lowercase",
            "german_normalization",
            "my_ascii_folding"
          ]
        }
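
You can then check the result with the _analyze API, for example:

      POST city/_analyze
      {
        "analyzer": "autocomplete",
        "text": "St. Wolfgang"
      }

The output should contain the edge n-grams of "sankt" rather than of "st", because the character filter rewrites "St." before the tokenizer runs.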

Thanks abdon for your solution. I will check that.

I also found a solution that works for me. I had to change my analyzers. These are the final settings I am using for now:

{
	"settings": {
		"analysis": {
			"analyzer": {
				"autocomplete": {
					"tokenizer": "whitespace",
					"filter": [
						"lowercase",
						"my_synonym_filter",
						"german_normalization",
						"edge_filter"
					]
				},
				"autocomplete_search": {
					"tokenizer": "whitespace",
					"filter": [
						"lowercase",
						"my_synonym_filter",
						"german_normalization"
					]
				}
			},
			"filter": {
				"edge_filter": {
					"type": "edgeNGram",
					"min_gram": 1,
					"max_gram": 15
				},
				"my_synonym_filter": {
					"type": "synonym",
					"ignore_case": "true",
					"synonyms": [
						"sankt, st., santa, saint",
						"sveti, sv."
					]
				}
			}
		}
	},
	"mappings": {
		"city": {
			"properties": {
				"name": {
					"type": "text",
					"analyzer": "autocomplete",
					"search_analyzer": "autocomplete_search"
				}
			}
		}
	}
}
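
A quick check with a simple match query on the name field:

GET city/_search
{
	"query": {
		"match": {
			"name": "sankt"
		}
	}
}

With the synonym filter in the search analyzer, this should return both Sankt Wolfgang and St. Wolfgang, but not Stuttgart, because the whitespace tokenizer keeps the dot in "st." and that term never matches Stuttgart's edge n-grams.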

What do you think?

Looks good to me. You switched to a tokenizer that doesn't remove punctuation (the whitespace tokenizer). If that works for you: great! 🙂
