ElasticSearch as a (reverse) Geocoding tool

(Karussell) #1

Hi all,

komoot has built an address search called photon, with Sarah from Nominatim, Svantulden from Route.nl, sdole, yohanboniface and me involved a bit in development. Photon imports OpenStreetMap data into Nominatim (based on a Postgres DB), cleans up the data there and feeds it into ElasticSearch.

I use photon for our GraphHopper Geocoding API - e.g. try it here via address autocompletion.

First of all I would like to invite anyone to contribute to this open source project. The setup is easy and now includes weekly worldwide data updates provided by us.

The reason I'm also posting here is that we would like to improve quality in areas like Austria, where the current geocoding does not work that well for house-number-precise requests because address data is missing in OpenStreetMap there. We could all map houses better in these areas :smile: but for the time being we could also use other data, e.g. from openaddresses.io, which also has millions of addresses.

We would really like to have this integration which could work e.g. as follows:

  • make photon feed the openaddresses data into a separate ElasticSearch index
  • remove duplicates in the openaddresses index, which are already present in OpenStreetMap (MLT queries?)
  • merge the two indices
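The duplicate check in the second step could be sketched roughly as below. This is only an illustration, not photon's actual code: the field names (`street`, `city`, `housenumber`, `coordinate`) and the 50 m radius are assumptions, and the body follows the Elasticsearch query DSL (`more_like_this` with free-text `like`, plus `term` and `geo_distance` filters) as found in recent versions. The function only builds the search body; sending it to the OSM index is left to whatever client is used.

```python
def build_duplicate_query(address, max_distance="50m"):
    """Build a search body that looks for an OSM document resembling `address`.

    `address` is a dict with the (hypothetical) keys
    street, city, housenumber, lat and lon.
    """
    return {
        "size": 1,  # one hit is enough to call the address a duplicate
        "query": {
            "bool": {
                # Fuzzy textual similarity on street and city names.
                "must": {
                    "more_like_this": {
                        "fields": ["street", "city"],
                        "like": f"{address['street']} {address['city']}",
                        "min_term_freq": 1,
                        "min_doc_freq": 1,
                    }
                },
                # Hard constraints: same house number, and close by.
                "filter": [
                    {"term": {"housenumber": address["housenumber"]}},
                    {
                        "geo_distance": {
                            "distance": max_distance,
                            "coordinate": {
                                "lat": address["lat"],
                                "lon": address["lon"],
                            },
                        }
                    },
                ],
            }
        },
    }
```

An openaddresses record would then be skipped (or flagged) whenever this query returns a hit from the OSM index.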

It is probably not such a simple task, but we have even acquired a certain budget and are now seeking both: 1. financial support and 2. possible contributors, companies or freelancers to do the necessary work.

Let me know what you think about photon in general. And also ping me (public or private) if you are interested in supporting this work somehow.

(Jillesvangurp) #2

Having attempted geocoding with OpenStreetMap myself, I'd say you need a gazetteer to stand a chance of doing a good job of geocoding and reverse geocoding.

These guys are building a worldwide gazetteer of geo data on GitHub. This should be a good source of data. Currently this seems to be the most actively maintained, most comprehensive gazetteer out there. If at all possible, collaborate with these people instead of (partially) duplicating the work. For reverse geocoding, this might get you things like neighborhoods and city information that is way better than OSM. Most geocoders are really lousy at figuring out the correct neighborhood for a coordinate.

I built a little reverse geocoder a few years ago that also used OSM on top of Elasticsearch. I was using hybrid info from Yahoo geoplaces, Foursquare Quattroshapes, OSM, several sources with POIs, etc.

IMHO, any good reverse geocoding solution ideally needs to work with many different and overlapping data sources, one of which is OpenStreetMap. Data is nasty and there are differences in quality between data sets that vary depending on location, the type of thing you are geocoding, etc. Instead of removing duplicates from the data, I would fix things on the querying side and de-duplicate based on quality indicators. For some coordinates OSM might provide the better match, and for some things e.g. a whosonfirst venue might be better if it has a house number and a coordinate; it all depends. So, get a match for a coordinate from all the data sources you have and then sort them by quality (completeness of information, interpolated or absolute house number, distance to the input coordinate, confidence or other scores, etc.). This has two advantages: you don't throw away data, and you can fix any issues by tweaking your algorithm instead of reimporting all your data.
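A minimal sketch of such query-time ranking is shown below. The `Candidate` fields and the order in which the criteria are applied are assumptions made for illustration; a real ranking would need tuning per region and per data set.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    source: str             # e.g. "osm", "whosonfirst", "openaddresses"
    has_house_number: bool  # completeness of the address
    interpolated: bool      # True if the house number position was interpolated
    distance_m: float       # distance to the input coordinate, in metres
    confidence: float       # 0..1 score supplied by (or derived for) the source


def quality_key(c):
    """Sort key: smaller tuples mean better matches."""
    return (
        not c.has_house_number,  # complete addresses first
        c.interpolated,          # absolute positions before interpolated ones
        -c.confidence,           # higher confidence first
        c.distance_m,            # then the closest match
    )


def rank(candidates):
    """Return all candidates, best first; nothing is thrown away."""
    return sorted(candidates, key=quality_key)
```

Because the ranking lives in `quality_key`, changing the priorities means editing one function and re-running the query, not reimporting the data.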

The way my reverse geocoder worked was that I did a geoshape query to figure out which indexed shapes overlapped with the coordinate. I had different indices for different types of shape data. One advantage of that is that it allowed me to index data in different ways while still targeting all indices with one query. E.g. a high resolution polygon for New Zealand has very different indexing requirements from e.g. a simple house shape. From the query results I was able to get a short list of POIs, street segments, neighborhoods, cities, and countries. I then postprocessed the results to calculate the nearest street segment (perpendicular distance to the coordinate) and from that figure out the neighborhood and city. Neighborhoods were important for my use case and the main reason I developed my own reverse geocoder. I still think this is a good approach. My main issue was worldwide coverage for neighborhoods. I'd love to have another go at this project but sadly it doesn't fit with my current project.
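The post-processing step of picking the nearest street segment might look something like the sketch below. It is not the original implementation: it assumes segments arrive as `(name, start, end)` tuples with `(lat, lon)` endpoints, and it uses a flat local projection, which is only reasonable for the short distances involved here. The distance is the perpendicular distance when the foot of the perpendicular lies on the segment, and the distance to the nearer endpoint otherwise.

```python
import math

EARTH_RADIUS_M = 6371000.0


def to_local_xy(lat, lon, lat0, lon0):
    """Project (lat, lon) to metres on a plane centred on (lat0, lon0)."""
    x = math.radians(lon - lon0) * math.cos(math.radians(lat0)) * EARTH_RADIUS_M
    y = math.radians(lat - lat0) * EARTH_RADIUS_M
    return x, y


def distance_to_segment(point, seg_start, seg_end):
    """Distance in metres from `point` to a segment; all arguments are (lat, lon)."""
    # Put the query point at the origin of the local plane.
    ax, ay = to_local_xy(*seg_start, *point)
    bx, by = to_local_xy(*seg_end, *point)
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        return math.hypot(ax, ay)  # degenerate segment: a single point
    # Position of the perpendicular foot along the segment, clamped to [0, 1].
    t = max(0.0, min(1.0, -(ax * dx + ay * dy) / length_sq))
    return math.hypot(ax + t * dx, ay + t * dy)


def nearest_segment(point, segments):
    """Return the (name, start, end) tuple closest to `point`."""
    return min(segments, key=lambda s: distance_to_segment(point, s[1], s[2]))
```

The neighborhood and city can then be looked up from whichever shapes contain the winning segment.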

(Karussell) #3

Thanks @jillesvangurp - I'll have a look into the project and also think about your suggestion! I also need to highlight that it is not only about reverse geocoding; currently the even more important part is normal geocoding. I just added the 'reverse' to the title to show that photon can do both.

(Jillesvangurp) #4

You can make the same argument for geocoding. There will be many things from different datasets whose names match your query, and you need to figure out the best matches based on a lot of different criteria. In general it's better to throw away data at query time than before index time. You should focus preprocessing on enriching the data so that it is easier to filter at query time. One nice thing about whosonfirst is that they include a confidence score. So, they actually tell you how trustworthy the data is.

When I was still at Nokia we had loads of fun with different POI and landmark data sets. One of the fun issues we had was that a well-known hotel POI data set had unreliable coordinates because hotel owners deliberately positioned their hotel icon near the beach instead of at their real location for marketing purposes. This made the coordinates useless for navigation (our main use case). Other fun stuff you may run into is that some data sets provide super fine-grained information. For example, airports can have multiple terminals, multiple entrances, many POIs, etc. So, when geocoding, the best option actually depends on the use case. If you are navigating by car you want to go to the right terminal or parking. If you are delivering goods you might want to navigate to the freight terminal; if you are coming by public transport, to a nearby station; if you are displaying something on a map, you want a centroid, etc. Geocoding is a hard problem.
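One way to model this use-case dependence is sketched below. Everything in it is hypothetical: the point kinds, the preference orders and the `place` layout are made up for illustration and do not come from any particular data set.

```python
from enum import Enum


class UseCase(Enum):
    CAR = "car"
    FREIGHT = "freight"
    PUBLIC_TRANSPORT = "public_transport"
    MAP_DISPLAY = "map_display"


# Which kind of point to prefer per use case, best first (assumed ordering).
PREFERRED_POINTS = {
    UseCase.CAR: ["parking", "terminal_entrance"],
    UseCase.FREIGHT: ["freight_terminal"],
    UseCase.PUBLIC_TRANSPORT: ["station"],
    UseCase.MAP_DISPLAY: [],
}


def pick_point(place, use_case):
    """Return (kind, (lat, lon)) for the best available point of `place`.

    `place["points"]` maps a point kind to a coordinate and is assumed
    to always contain a "centroid" entry, which is the fallback.
    """
    for kind in PREFERRED_POINTS[use_case]:
        if kind in place["points"]:
            return kind, place["points"][kind]
    return "centroid", place["points"]["centroid"]
```

For a place that only has a centroid and a parking entry, `pick_point(place, UseCase.CAR)` returns the parking coordinate, while `UseCase.FREIGHT` falls back to the centroid.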

Btw. I know some of the guys behind http://www.opencagedata.com/. They're good guys to talk to as well. I believe they also use OSM as part of their input. You might want to talk to them to see if they have something useful to share on house number data.

(Jillesvangurp) #5

If you are coming to the Berlin wherecamp in November, the opencage guys might be represented there. I believe Gary Gale is going to be there at least.
