Searching for abbreviated German street names

I'm looking for advice on which query and/or analyzer settings to use for German street names of the form "Johann-Sebastian-Bach-Straße".

I'm currently using the standard analyzer and a "match_phrase_prefix" query. This way, both the query term "Johann-Sebastian-Bach-Str." and the shorter "Bach Str." match.
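For reference, a minimal sketch of the request body such a query would use, written as a Python dict. The field name `street_name` is an assumption; the thread does not say what the field is called.

```python
import json

# Hypothetical field name; not given in the thread.
FIELD = "street_name"

def phrase_prefix_query(text: str) -> dict:
    """Build a match_phrase_prefix request body for the street name field."""
    return {"query": {"match_phrase_prefix": {FIELD: {"query": text}}}}

# Print the JSON that would be sent as the search request body.
print(json.dumps(phrase_prefix_query("Bach Str."), ensure_ascii=False, indent=2))
```

Only the last term of the phrase is treated as a prefix here, which is why "Bach Str." matches but "J.-S.-Bach-Str." does not.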

But if someone searches for "J.-S.-Bach-Str.", "Johann-S.-Bach-Str." or "Joh.-Seb.-Bach-Str.", there is no match. How could I construct a query that would accept such an abbreviated variant as a query term matching "Johann-Sebastian-Bach-Straße"? It would have to be something like a phrase query where every term can be a prefix... Any suggestions?

It depends on how much precision you need, and how much context you have.

You could use the following reduction method to build an acronym. It's like stemming, but with more brute force. Take the capital letters (or first letters) of the partial words, join them, lowercase the result, and index it:

"Johann-Sebastian-Bach-Straße" -> "jsbs"
"Johann-Sebastian-Bach-Str." -> "jsbs"
"J.-S.-Bach-Str." -> "jsbs"
"Johann-S.-Bach-Str." -> "jsbs"
"Joh.-Seb.-Bach-Str." -> "jsbs"

Note that because the code is very short, other street names will match as well, such as this artificial street name:

"Josef-Stefan-Baum-Str." -> "jsbs"

so it really depends how much precision you want. The more context of the street name you have, the better.
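The reduction described above can be sketched in a few lines of Python. This is only an illustration of the idea, done outside Elasticsearch; the function name `street_acronym` and the choice to split on hyphens, whitespace and periods are assumptions.

```python
import re

def street_acronym(name: str) -> str:
    """Reduce a street name to the lowercase first letters of its parts."""
    # Split on hyphens, whitespace and periods, then drop empty pieces.
    parts = [p for p in re.split(r"[-\s.]+", name) if p]
    return "".join(p[0] for p in parts).lower()

variants = [
    "Johann-Sebastian-Bach-Straße",
    "J.-S.-Bach-Str.",
    "Joh.-Seb.-Bach-Str.",
    "Josef-Stefan-Baum-Str.",  # the artificial name that collides
]
for v in variants:
    print(v, "->", street_acronym(v))
```

All four inputs reduce to the same code, which shows both the benefit (abbreviations collapse together) and the loss of precision (unrelated streets collapse too).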

For known streets, you can index the acronym code along with a city name, for example:

"Joh.-Seb.-Bach-Str." -> "jsbs münchen"

Usually, you have a city name, or postal code, or other contextual information.
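As a rough sketch of how the acronym and the city could be combined into one lookup key, assuming a small in-memory list of known streets (the sample data and the names `build_index` and `lookup` are illustrative, not part of any Elasticsearch API):

```python
import re
from collections import defaultdict

def street_acronym(name: str) -> str:
    """Lowercase first letters of the hyphen/space/period separated parts."""
    parts = [p for p in re.split(r"[-\s.]+", name) if p]
    return "".join(p[0] for p in parts).lower()

def build_index(streets):
    """Map (acronym, casefolded city) to the list of canonical entries."""
    index = defaultdict(list)
    for street, city in streets:
        key = (street_acronym(street), city.casefold())
        index[key].append(f"{street}, {city}")
    return index

def lookup(index, street, city):
    """Return all canonical entries sharing the acronym within that city."""
    return index.get((street_acronym(street), city.casefold()), [])

# Illustrative sample data only.
KNOWN = [
    ("Johann-Sebastian-Bach-Straße", "München"),
    ("Johann-Sebastian-Bach-Straße", "Gießen"),
]
index = build_index(KNOWN)
print(lookup(index, "Joh.-Seb.-Bach-Str.", "München"))
print(lookup(index, "Joh.-Seb.-Bach-Str.", "Köln"))
```

`str.casefold()` is used for the city so that "Gießen" and "Giessen" produce the same key. A result list with more than one entry means the acronym is still ambiguous within that city.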

If the data must be verified, you can try to validate street names for existence, by a reference lookup of e.g.

With a city name, you could disambiguate acronyms:

"jsbs münchen" -> "Johann-Sebastian-Bach-Straße, München"
"jsbs giessen" -> "Johann-Sebastian-Bach-Straße, Gießen"
"jsbs köln" -> does not exist

Thank you very much for your suggestion.

My use case is the following: I want to verify that a street name entered in a form field exists in our town, and I have a database that contains all official street names. My ES index is built from this database and contains the canonical street names.

If someone enters "Joh.-Seb.-Bach-Str." I consider this to be a valid, unambiguous street name. So I want it to match the official spelling "Johann-Sebastian-Bach-Straße".

One way I have been thinking of would be to add the alternative and abbreviated notations to the index. But this means that I have to guess all possible spellings and abbreviations for a street name at index time. I don't think this is possible.

Now I am considering your suggestion to add an acronym of the street name to the index. Can this be generated with an Elasticsearch analyzer? I suspect, though, that it would not be accurate enough for my use case...

If you must deal with a form field, and you have a list of all valid names, why not use autocompletion?

In fact, I do use autocompletion in the form (using the completion suggester), but I don't want to rely on it alone. Firstly, if a user quickly enters something like "J.S.", it will not suggest "Johann ..." once the user has typed more than one character. The same is true if the user pastes the street name. Secondly, I want to be able to validate existing address data from other sources.

The policy for accepting street names should be that they are considered valid as long as they are unambiguous. So "J.-S.-Bach-Straße" should be accepted as long as no other street name matches the initials. In fact, even on the official city map, street names are written this way. "J.-S.-Bach-Str." is a fictional example, but real examples from our official city map are "F.-Chr.-Baur-Str.", "Geschw.-Scholl-Platz" and "Pl. d. unbek. Deserteurs". As a (German) human being, I understand that the label "Pl. d. unbek. Deserteurs" on the map matches "Platz des unbekannten Deserteurs" in the index of streets. So how can I tell Elasticsearch that?
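To make the acceptance policy concrete, here is a minimal sketch of it in plain Python, outside Elasticsearch: every token of the input must be a prefix of the token at the same position in an official name, and the input is accepted only if exactly one official name qualifies. The function names and the sample list are assumptions. "Josef-Stefan-Baum-Straße" is an artificial entry added to show an ambiguous case.

```python
import re

def tokens(name: str) -> list:
    """Casefold, then split on hyphens, whitespace and periods."""
    return [t for t in re.split(r"[-\s.]+", name.casefold()) if t]

def is_abbreviation_of(query: str, canonical: str) -> bool:
    """True if each query token is a prefix of the matching canonical token."""
    q, c = tokens(query), tokens(canonical)
    return len(q) == len(c) and all(ct.startswith(qt) for qt, ct in zip(q, c))

def resolve(query: str, canonical_names: list):
    """Return the single official name the query abbreviates, else None."""
    matches = [n for n in canonical_names if is_abbreviation_of(query, n)]
    return matches[0] if len(matches) == 1 else None

# Illustrative sample of official names; the Baum entry is artificial.
STREETS = [
    "Johann-Sebastian-Bach-Straße",
    "Josef-Stefan-Baum-Straße",
    "Platz des unbekannten Deserteurs",
]
print(resolve("Pl. d. unbek. Deserteurs", STREETS))
print(resolve("Joh.-Seb.-Bach-Str.", STREETS))
print(resolve("J.-S.-B.-Str.", STREETS))  # ambiguous, so None
```

This only handles abbreviations that are true prefixes and that keep every word of the name, so "Bach Str." alone would be rejected. `casefold()` turns "Straße" into "strasse", so the "Strasse" spelling matches as well.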
