Spatial search in Elasticsearch (discussion)

I would like to get some feedback on different approaches to doing
geospatial search in Elasticsearch. I realize this is going to be a bit of
an open-ended discussion and it will probably touch on many different things.

My company (localstre.am) is building a location-based publishing
platform. We want to location-tag content blobs, index them using
Elasticsearch, and then retrieve them using a combination of range
searches and polygon searches. Important for us is going to be influencing
sorting and ranking so that we can come up with the most relevant content
given the context of a place or neighborhood, plus other factors (categories,
ownership, timestamps, etc.). A typical query would be: give me everything
in this neighborhood near this POI with category foo, sorted by creation
date.

We're looking at roughly 60M places (i.e. points) worldwide and a few
hundred thousand to a few million polygons for things like cities,
neighborhoods, important areas, etc.

Right now there are two out-of-the-box options for geospatial querying and
indexing in Elasticsearch:

  1. Use pin locations and distance search. This is probably too limited for
    our needs, since we really want some kind of polygonal search.
  2. Use geo_shape and polygon search. I'm using this currently but haven't
    done any benchmarking yet. Functionally it is sort of alright, but I worry
    about how it scales.

So, both have their issues and both seem to be somewhat experimental.

I actually have a third option, which is based on what I think geo_shape is
doing behind the scenes, but done manually (thus giving me more control).

I'm guessing many of you are unfamiliar with geohashes, so here's a brief
explanation. A geohash translates a geo coordinate into a string. Geohashes
have the useful property that nearby coordinates share the same geohash
string prefix. Essentially, a geohash is the string-encoded path to the
coordinate in a quadtree. Simply put, geohashes are rectangular cells on a map.
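To make the prefix property concrete, here is a minimal geohash encoder sketched in Python. This is just an illustration of the standard algorithm, not the geotools or Lucene implementation:

```python
# Minimal geohash encoder: alternately bisects the longitude and latitude
# ranges (longitude first) and packs every 5 bits into a base32 character.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lon, length=12):
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    bits = []
    even = True  # even bit positions refine longitude, odd ones latitude
    while len(bits) < length * 5:
        if even:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits.append(1)
                lon_lo = mid
            else:
                bits.append(0)
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits.append(1)
                lat_lo = mid
            else:
                bits.append(0)
                lat_hi = mid
        even = not even
    # pack groups of 5 bits into base32 characters
    return "".join(
        BASE32[int("".join(map(str, bits[i:i + 5])), 2)]
        for i in range(0, length * 5, 5)
    )
```

A shorter geohash of the same coordinate is always a prefix of the longer one, which is exactly what makes prefix matching on an indexed geohash field work.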

The maximum length of a geohash is 12 characters, which corresponds to an
area smaller than a square meter. A 7-character geohash is roughly a city
block, and a 1-character geohash covers a large part of a continent.

These properties make geohashes great for indexing purposes. In a Lucene
context, you can take a polygon, calculate which geohashes cover it, and
then index the polygon by simply associating the list of geohashes with a
field. For the inside of the polygon you use large geohashes, and for the
borders you use smaller ones. You can go nuts here with very fine-grained
coverage, but that results in many thousands of geohashes, so typically
you cap the geohash length at some level to avoid this.
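As a rough sketch of what such a coverage computation can look like, here is a Python toy that approximates the set of fixed-length geohash cells covering a bounding box by sampling it at sub-cell resolution. All names here are hypothetical; real implementations work on arbitrary polygons and mix cell sizes between interior and border:

```python
import math

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lon, length):
    # compact geohash encoder: longitude bit first, then alternate axes
    lo, hi, pos = [-90.0, -180.0], [90.0, 180.0], [lat, lon]
    axis, acc, chars = 1, 0, []
    for i in range(length * 5):
        mid = (lo[axis] + hi[axis]) / 2
        if pos[axis] >= mid:
            acc = acc * 2 + 1
            lo[axis] = mid
        else:
            acc = acc * 2
            hi[axis] = mid
        if i % 5 == 4:
            chars.append(BASE32[acc])
            acc = 0
        axis ^= 1
    return "".join(chars)

def cell_size(length):
    # width/height in degrees of a geohash cell of the given length
    lon_bits = math.ceil(length * 5 / 2)
    lat_bits = length * 5 - lon_bits
    return 360.0 / (1 << lon_bits), 180.0 / (1 << lat_bits)

def cover_bbox(south, west, north, east, length):
    # Sample the box on a grid finer than the cell size and collect the
    # distinct geohashes. An approximate cover, but it illustrates how
    # the geohash length caps the number of index terms.
    width, height = cell_size(length)

    def samples(lo_v, hi_v, step):
        vals = []
        v = lo_v
        while v < hi_v:
            vals.append(v)
            v += step
        vals.append(hi_v)  # always include the far edge
        return vals

    return {
        geohash_encode(lat, lon, length)
        for lat in samples(south, north, height / 2)
        for lon in samples(west, east, width / 2)
    }
```

The key knob is `length`: one step shorter means cells roughly 4-8x larger and correspondingly fewer terms per indexed shape.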

I actually have a supporting library for working with geohashes and
polygons here: https://github.com/jillesvangurp/geotools. Relative to
Spatial4j (used for the geo_shape implementation), this library mainly
distinguishes itself by not relying on model objects (shapes, points, etc.);
instead it uses simple double arrays to represent points and 2D double
arrays to represent polygons. Another useful feature of my library is that
I can specify the maximum prefix length for the geohashes when generating
the list of hashes for a polygon, so I have fine-grained control over how
many geohashes I end up with. Generally, I'm OK with about city-block-level
accuracy (geohash lengths between 6 and 8), so typically a few hundred
geohashes for things like cities and neighborhoods. I also have a contains
algorithm, so I can use coarse-grained coverage and then simply filter out
anything that doesn't pass the contains check.
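The contains step can be as simple as the textbook ray-casting point-in-polygon test. Here is a Python sketch in the same spirit as the double-array representation described above (an illustration, not geotools' actual code):

```python
def contains(polygon, lon, lat):
    """Ray-casting point-in-polygon test. `polygon` is a 2D array of
    [lon, lat] vertices; the ring is closed implicitly."""
    inside = False
    j = len(polygon) - 1
    for i in range(len(polygon)):
        xi, yi = polygon[i]
        xj, yj = polygon[j]
        # does a horizontal ray from the point cross edge (j, i)?
        if (yi > lat) != (yj > lat):
            x_cross = (xj - xi) * (lat - yi) / (yj - yi) + xi
            if lon < x_cross:
                inside = not inside  # odd number of crossings = inside
        j = i
    return inside
```

The point of pairing this with coarse coverage is that the coarse geohash filter runs first, so this per-point test only has to look at a few hundred candidates instead of millions of documents.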

At this point, I have working code for the geo_shape approach as well as my
manual approach.

What I'm looking for at this point is mainly some feedback on what to do
with my project. I'm worried about geo_shape on several fronts:

  1. Reliance on Spatial4j and its underlying use of model objects, which in
    my view results in a lot of memory overhead at query time and at indexing
    time.
  2. Lack of control over granularity with respect to coverage, and the
    resulting explosion of geohashes, query terms, and the associated index
    bloat.
  3. Lack of information on how this scales.
  4. A general impression on my side that this code is relatively
    young/immature, and a related worry about the lack of activity since it
    landed.
  5. Some concerns about how to do more advanced things like ranking and
    creating more complex queries.

So, I'm leaning towards doing things manually as outlined above. I'd love
to hear what others think about that approach, any design alternatives
people can think of, and any comments regarding current or upcoming
geospatial features in Elasticsearch that I missed.

Additionally, I'm open to any collaboration to integrate this into
Elasticsearch. Given my lack of familiarity with the ES code base, that
would be somewhat of a steep learning curve for me.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hi Jilles,
We have been doing spatial search with Elasticsearch but are running into
some performance issues. We have about 50 million records. I was wondering
how you guys fared with this and if you could recommend an indexing
strategy. Basically we have a point (lat, lon) and want to know the
closest N records, where N is normally between 1 and 50.

I'm currently doing a geo_distance query and sorting by _geo_distance, and
it takes about 30 seconds per query (similar to this:
http://www.elasticsearchtutorial.com/spatial-search-tutorial.html).

If you have any thoughts on this I would certainly appreciate it!

Dave

On Thursday, January 31, 2013 7:23:28 AM UTC-5, Jilles van Gurp wrote:


Hi Dave,

Basically with that many records, you'll want to do two things:

  1. avoid doing distance calculations until you narrow down the results
    to something manageable (i.e. a few hundred results)
  2. utilize more nodes to ensure all indexes you use fit into the
    collective memory.

For the first, you may be able to use e.g. geohashes to pre-filter your
queries. Thirty seconds certainly sounds like you are spending a lot of
time calculating stuff.
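Not tested against your data, but the two-phase idea might look roughly like this in Python: bucket documents by a short geohash prefix, then do the exact haversine sort only on that bucket. The prefix length and the fallback behavior here are illustrative choices, not a recipe:

```python
import math

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lon, length):
    # compact geohash encoder: longitude bit first, then alternate axes
    lo, hi, pos = [-90.0, -180.0], [90.0, 180.0], [lat, lon]
    axis, acc, chars = 1, 0, []
    for i in range(length * 5):
        mid = (lo[axis] + hi[axis]) / 2
        if pos[axis] >= mid:
            acc = acc * 2 + 1
            lo[axis] = mid
        else:
            acc = acc * 2
            hi[axis] = mid
        if i % 5 == 4:
            chars.append(BASE32[acc])
            acc = 0
        axis ^= 1
    return "".join(chars)

def haversine_km(lat1, lon1, lat2, lon2):
    # exact great-circle distance; only run this on filtered candidates
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def nearest(docs, lat, lon, n, prefix_len=4):
    # docs: list of (doc_id, lat, lon) tuples
    # phase 1: cheap prefix filter; a real system would also scan the
    # 8 neighboring cells to avoid missing hits near cell boundaries
    q = geohash_encode(lat, lon, prefix_len)
    candidates = [d for d in docs
                  if geohash_encode(d[1], d[2], prefix_len) == q]
    if len(candidates) < n:
        candidates = list(docs)  # sparse cell: fall back to a wider scan
    # phase 2: exact distance sort on the (hopefully small) candidate set
    candidates.sort(key=lambda d: haversine_km(lat, lon, d[1], d[2]))
    return candidates[:n]
```

In Elasticsearch terms, phase 1 would be a cheap term/prefix filter on an indexed geohash field, so the expensive _geo_distance sort only ever sees a small candidate set.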

Jilles

On Mon, Dec 2, 2013 at 10:50 PM, Dave O mdoakes42@gmail.com wrote:
