I would like to get some feedback on different approaches for doing
geospatial search in Elasticsearch. I realize this is going to be a bit of
an open-ended discussion and it will probably touch on many different things.
My company (localstre.am) is building a location-based publishing
platform. We want to location tag content blobs, index them using
Elasticsearch and then retrieve them using a combination of range
searches and polygon searches. Important for us is going to be influencing
sorting and ranking so that we can come up with the most relevant content
given the context of a place or neighborhood, and other factors (categories,
ownership, timestamps, etc.). A typical query would be "give me everything
in this neighborhood near this POI with category foo, sorted by creation
date".
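To make that typical query concrete, here is a rough sketch of what the request body could look like, written as a Python dict. It assumes a recent Elasticsearch release where the bool query with filter clauses is available and where a geo_shape query can run against a geo_point field. The field names (location, category, created_at) and the helper name build_query are made up for illustration.

```python
def build_query(neighborhood_ring, poi_lat, poi_lon, radius, category):
    """Build a search body: inside a polygon, near a POI, one category, newest first.

    neighborhood_ring is a closed GeoJSON ring: [lon, lat] pairs, first == last.
    All field names below are hypothetical.
    """
    return {
        "query": {
            "bool": {
                "filter": [
                    # exact match on a keyword-style field
                    {"term": {"category": category}},
                    # keep only documents within `radius` of the POI
                    {"geo_distance": {
                        "distance": radius,
                        "location": {"lat": poi_lat, "lon": poi_lon},
                    }},
                    # keep only documents whose point falls in the polygon
                    {"geo_shape": {
                        "location": {
                            "shape": {
                                "type": "polygon",
                                "coordinates": [neighborhood_ring],
                            }
                        }
                    }},
                ]
            }
        },
        "sort": [{"created_at": {"order": "desc"}}],
    }


ring = [[13.35, 52.50], [13.45, 52.50], [13.40, 52.56], [13.35, 52.50]]
body = build_query(ring, 52.52, 13.40, "500m", "foo")
```

Filters do not contribute to scoring, so any ranking beyond the creation date sort would have to come from additional scoring clauses.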
We're looking at roughly 60M places (i.e. points) worldwide and a few
hundred thousand to a few million polygons for things like cities,
neighborhoods, important areas, etc.
Right now there are two out-of-the-box options for geospatial querying and
indexing in Elasticsearch.
- use pin locations and distance search. This is probably too limited for
our needs since we really want some kind of polygonal search.
- use geo_shape and polygon search. I'm using this currently but haven't
done any benchmarking yet. Functionally it is sort of alright but I worry
about scaling this.
So, both have their issues and both seem to be somewhat experimental.
I actually have a third option, which is based on what I think geo_shape is
doing behind the scenes but doing it manually (thus giving me more control).
In case you are not familiar with geohashes, here's a brief explanation.
A geohash encodes a geo coordinate as a base32 string. Geohashes have the
useful property that nearby coordinates usually share a geohash prefix (the
exception being points that sit on opposite sides of a cell boundary).
Basically a geohash is the string encoded path to the coordinate in a tree
that keeps halving the longitude and latitude ranges, five bits per
character, much like a quad tree. Simply put, geohashes are rectangles on a
map. Geohashes are commonly capped at 12 characters, which is an area
smaller than a square meter. A 7 character geohash is roughly a city block
and a one character geohash covers a large part of a continent/country.
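To illustrate the prefix property, here is a minimal Python sketch of the standard geohash encoding. It is not taken from any particular library, and the function name geohash_encode is just for illustration.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lon, length=12):
    """Encode a latitude/longitude pair as a geohash of the given length."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate between longitude and latitude
    while len(chars) < length:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1  # upper half of the range
            rng[0] = mid
        else:
            bits <<= 1  # lower half of the range
            rng[1] = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


a = geohash_encode(57.64911, 10.40744, 5)  # northern tip of Jutland
b = geohash_encode(57.64, 10.40, 5)        # roughly a kilometer away
c = geohash_encode(-33.86, 151.21, 5)      # Sydney
print(a, b, c)  # a and b are both "u4pru"; c starts with "r"
```

Because the prefix identifies the enclosing cell, a prefix query on the geohash field amounts to a "points inside this rectangle" query.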
These properties mean geohashes are great for indexing purposes. So in a
Lucene context, you can take a polygon, calculate which geohashes cover it
and then index the polygon by simply associating the list of geohashes with
a field. For the inside of the polygon you use large cells (short
geohashes) and for the borders you use smaller cells (longer geohashes).
You can go nuts here with very fine grained coverage but that results in
many thousands of geohashes. Typically you cap the geohash length at some
level to avoid this.
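The covering idea can be sketched in Python as a recursive subdivision: keep a cell as soon as it lies fully inside the polygon, split cells that straddle the border, and stop splitting at a maximum length. This is only an illustration under simplifying assumptions (a simple polygon without holes, given as (lat, lon) tuples, no special handling of edges that exactly touch a cell corner), and all function names are made up.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_bbox(gh):
    """Return (south, west, north, east) of a geohash cell; "" is the world."""
    lat, lon = [-90.0, 90.0], [-180.0, 180.0]
    even = True  # bits alternate, starting with longitude
    for ch in gh:
        idx = BASE32.index(ch)
        for shift in range(4, -1, -1):
            rng = lon if even else lat
            mid = (rng[0] + rng[1]) / 2
            if (idx >> shift) & 1:
                rng[0] = mid
            else:
                rng[1] = mid
            even = not even
    return lat[0], lon[0], lat[1], lon[1]


def point_in_polygon(lat, lon, polygon):
    """Ray casting test; polygon is a list of (lat, lon) vertices."""
    inside = False
    for i in range(len(polygon)):
        lat1, lon1 = polygon[i]
        lat2, lon2 = polygon[(i + 1) % len(polygon)]
        if (lat1 > lat) != (lat2 > lat):
            crossing = lon1 + (lat - lat1) * (lon2 - lon1) / (lat2 - lat1)
            if lon < crossing:
                inside = not inside
    return inside


def _orient(a, b, c):
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def segments_cross(p, q, r, s):
    """True if segment pq properly crosses segment rs."""
    return ((_orient(r, s, p) > 0) != (_orient(r, s, q) > 0)
            and (_orient(p, q, r) > 0) != (_orient(p, q, s) > 0))


def cover_polygon(polygon, max_length, prefix=""):
    """Geohashes covering the polygon: short inside, longer along the border."""
    south, west, north, east = geohash_bbox(prefix)
    corners = [(south, west), (south, east), (north, east), (north, west)]
    cell_edges = list(zip(corners, corners[1:] + corners[:1]))
    poly_edges = list(zip(polygon, polygon[1:] + polygon[:1]))
    crosses = any(segments_cross(a, b, c, d)
                  for a, b in cell_edges for c, d in poly_edges)
    if not crosses:
        if point_in_polygon(south, west, polygon):
            return [prefix]  # cell is entirely inside: keep the large cell
        plat, plon = polygon[0]
        if not (south <= plat <= north and west <= plon <= east):
            return []  # cell is entirely outside
    if len(prefix) >= max_length:
        return [prefix]  # border cell at the cap: accept some overshoot
    hashes = []
    for ch in BASE32:  # split into the 32 child cells
        hashes.extend(cover_polygon(polygon, max_length, prefix + ch))
    return hashes


polygon = [(40.1, 0.2), (41.3, 12.3), (50.3, 11.7), (49.2, 1.1)]
hashes = cover_polygon(polygon, max_length=4)
print(len(hashes), sorted({len(h) for h in hashes}))
```

Raising max_length tightens the fit along the border at the cost of more terms per polygon, which is exactly the trade-off the cap is meant to control.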
I actually have a supporting library for working with geohashes and
polygons here: https://github.com/jillesvangurp/geotools. Relative to
spatial4j (used for the geo_shape implementation) this library mainly
distinguishes itself by not relying on model objects (shapes, points, etc)
and instead it uses simple double arrays to represent points and 2d double
arrays to represent polygons. Another useful feature of my library is that
I can specify the maximum prefix length for the geohashes when generating
the list of hashes for a polygon. So I have fine grained control over how
many geohashes I end up with. Generally, I'm OK with about city block level
accuracy (geohash lengths between 6 and 8), so typically a few hundred
geohashes for things like cities and neighborhoods. I also have a contains
algorithm, so I can use coarse grained coverage and then simply filter out
anything that doesn't pass the contains check.
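The "coarse coverage, then filter" idea can be sketched as follows in Python. To keep the example short it covers the polygon's bounding box with fixed-length geohashes rather than the polygon itself, and a plain dict stands in for the index. None of this is the library's API; every name here is made up for illustration.

```python
import math

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, length):
    """Standard geohash encoding of a (lat, lon) pair."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    chars, bits, bit_count, use_lon = [], 0, 0, True
    while len(chars) < length:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid
        else:
            bits <<= 1
            rng[1] = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


def point_in_polygon(lat, lon, polygon):
    """Ray casting test; polygon is a list of (lat, lon) vertices."""
    inside = False
    for i in range(len(polygon)):
        lat1, lon1 = polygon[i]
        lat2, lon2 = polygon[(i + 1) % len(polygon)]
        if (lat1 > lat) != (lat2 > lat):
            crossing = lon1 + (lat - lat1) * (lon2 - lon1) / (lat2 - lat1)
            if lon < crossing:
                inside = not inside
    return inside


def coarse_cover(polygon, length):
    """Geohashes of one fixed length overlapping the polygon's bounding box."""
    lat_step = 180.0 / (1 << (5 * length // 2))        # cell height in degrees
    lon_step = 360.0 / (1 << ((5 * length + 1) // 2))  # cell width in degrees
    lats = [p[0] for p in polygon]
    lons = [p[1] for p in polygon]
    south, north, west, east = min(lats), max(lats), min(lons), max(lons)
    rows = math.ceil((north - south) / lat_step) + 1
    cols = math.ceil((east - west) / lon_step) + 1
    return {
        geohash_encode(min(south + r * lat_step, north),
                       min(west + c * lon_step, east), length)
        for r in range(rows) for c in range(cols)
    }


def search(index, polygon, length):
    """Cheap geohash lookup first, exact contains check second."""
    candidates = [doc for gh in coarse_cover(polygon, length)
                  for doc in index.get(gh, [])]
    return [doc_id for doc_id, lat, lon in candidates
            if point_in_polygon(lat, lon, polygon)]


polygon = [(52.50, 13.35), (52.50, 13.45), (52.56, 13.40)]
docs = [("inside", 52.52, 13.40), ("bbox_only", 52.55, 13.36),
        ("far_away", 48.85, 2.35)]
index = {}
for doc in docs:  # index each point under its 6 character geohash
    index.setdefault(geohash_encode(doc[1], doc[2], 6), []).append(doc)
hits = search(index, polygon, 6)
print(hits)  # ['inside']
```

The second document is picked up by the coarse cover but rejected by the contains check, which is the whole point of the two-step approach: the index lookup can afford to be sloppy as long as the post-filter is exact.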
At this point, I have working code for the geo_shape approach as well as my
manual approach.
What I'm looking for at this point is mainly some feedback on what to do
with my project. I'm worried about geo_shape on several fronts:
- reliance on spatial4j and its underlying use of model objects in my view
results in a lot of memory overhead at query time and at indexing time.
- lack of control over granularity with respect to coverage and the
resulting explosion of geohashes, query terms and the associated index
bloat.
- lack of information on how this scales
- a general impression on my side that this code is relatively
young/immature and a related worry about the lack of activity since it
landed.
- some concerns about how to do some more advanced things like ranking and
creating more complex queries
So, I'm leaning towards doing things manually as outlined above. I'd love
to hear what others think about that approach, any design alternatives
people can think of, and any comments regarding current or upcoming
geospatial features in Elasticsearch that I missed.
Additionally, I'm open to any collaboration to integrate this into
Elasticsearch as well. Given my lack of familiarity with the Elasticsearch
code base, that would be a somewhat steep learning curve for me.