Geo_shape problems

I'm currently having major issues with geo_shape.

I have about 120,000 GeoJSON objects across the Berlin/Brandenburg area. A
mix of points, polygons, and linestrings.

I've created a polygon query

{"geo_shape"=>{"geometry"=>{"shape"=>{"type"=>"Polygon",
"coordinates"=>[[[52.52977589117429, 13.402019455925734],
[52.52991195268828, 13.401960657252268], [52.53003030129237,
13.401835622652401], [52.53011935220061, 13.401656591383937],
[52.53017038848964, 13.401441088274817], [52.5301784143719,
13.401210208270859], [52.53014264421813, 13.400986551515487],
[52.53006657946019, 13.400792011090223], [52.52995766584657,
13.400645629967338], [52.52982656460061, 13.400561736951035],
[52.52968610882571, 13.400548544074267], [52.529550047311716,
13.400607342747733], [52.529431698707626, 13.4007323773476],
[52.52934264779939, 13.400911408616064], [52.52929161151036,
13.401126911725184], [52.52928358562809, 13.401357791729142],
[52.529319355781865, 13.401581448484514], [52.52939542053981,
13.401775988909778], [52.52950433415343, 13.401922370032663],
[52.529635435399385, 13.402006263048966], [52.52977589117429,
13.402019455925734]]]}, "relation"=>"intersects"}}}

The polygon is a roughly 50 m radius circular shape around a point on
Rosenthaler Platz (I've verified this on a map using the Google Maps API).
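[Editor's note: one thing worth double-checking when reproducing this: GeoJSON, which geo_shape consumes, puts longitude first, while the pairs in the query above read latitude-first for Berlin (52.5…, 13.4…). A minimal sketch of building such a circle approximation in [lon, lat] order; the helper name and the exact centre coordinates are illustrative, not from the original post:]

```python
import math

def circle_polygon(lat, lon, radius_m, n_points=20):
    """Approximate a circle around (lat, lon) as a closed GeoJSON Polygon ring.

    Note the output order: GeoJSON positions are [longitude, latitude].
    """
    # Rough metres-per-degree conversion; longitude degrees shrink with cos(lat).
    dlat = radius_m / 111_320.0
    dlon = radius_m / (111_320.0 * math.cos(math.radians(lat)))
    ring = [
        [lon + dlon * math.cos(2 * math.pi * i / n_points),
         lat + dlat * math.sin(2 * math.pi * i / n_points)]
        for i in range(n_points)
    ]
    ring.append(list(ring[0]))  # GeoJSON rings must repeat the first point
    return {"type": "Polygon", "coordinates": [ring]}

# Roughly Rosenthaler Platz (illustrative coordinates)
shape = circle_polygon(52.5297, 13.4013, 50)
query = {"geo_shape": {"geometry": {"shape": shape, "relation": "intersects"}}}
```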

Using the quadtree implementation, I currently get about 1800 results for
this query, many of them miles away from where I searched. At the same
time, if I broadened the radius of my circle to actually include those
results, I would expect tens of thousands of results, not 1800. With the
radius this small, I would expect something in the range of a few dozen at
most.

With the geohash implementation, I get somewhat better results: only 263.
However, most of them are well outside the circle polygon (up to several
hundred meters away).

So both the quadtree and geohash implementations return massive numbers of
false positives. The quadtree implementation is much worse than the geohash
implementation.

The difference in index size is also massive: 868 MB for the geohash
implementation and 27 MB for the quadtree implementation. Most of the
geohash data seems to end up in the term dictionary. So from that point of
view, I'd very much prefer to use the quadtree implementation. However,
with the current inaccuracy that is not acceptable. I've tried setting
levels to 50 (the maximum according to the source code), but that doesn't
seem to have much effect.
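[Editor's note: for reference, the precision knobs being discussed live on the geo_shape field mapping. A sketch of such a mapping follows; the index/field names are made up, but "tree", "tree_levels" and "distance_error_pct" are the options named in this thread:]

```python
import json

# Sketch of a geo_shape mapping with explicit tree settings. The type and
# field names ("poi", "geometry") are hypothetical; the three geo_shape
# options are the ones under discussion in this thread.
mapping = {
    "poi": {
        "properties": {
            "geometry": {
                "type": "geo_shape",
                "tree": "quadtree",           # or "geohash"
                "tree_levels": 20,            # prefix-tree depth, i.e. precision
                "distance_error_pct": 0.025,  # allowed error relative to shape size
            }
        }
    }
}

# Body for an index-creation request carrying this mapping
print(json.dumps({"mappings": mapping}, indent=2))
```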

BTW, I updated my snapshot build this afternoon to fix some much worse
issues where I couldn't get any results at all until I increased the radius
of the circle to 500 km, at which point it would match everything at once.
So it seems things have improved somewhat over the past few days. The
previous snapshot build I used was only a few days old.

Jilles

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

My setting for levels was wrong; it should be tree_levels. With quadtree
and tree_levels set to 50, it gets somewhat better. The index size grows to
707 MB and I get 255 results within a few hundred meters.

I think that is still way too many false positives.

Jilles

On Friday, March 8, 2013 4:35:11 PM UTC+1, Jilles van Gurp wrote:


hey jilles,

I think we need to work on some better defaults here, can you open an
issue to adjust the defaults?
There are also upstream improvements coming in from Lucene 4.2, which
should be close, as in the next couple of days. I hope this improves things as well

simon

On Friday, March 8, 2013 4:35:11 PM UTC+1, Jilles van Gurp wrote:


Done, https://github.com/elasticsearch/elasticsearch/issues/2756

On Saturday, March 9, 2013 10:33:27 PM UTC+1, simonw wrote:


Hey Jilles,

have you seen
https://github.com/elasticsearch/elasticsearch/commit/881cb7900c1376b62516f541b12c3bb02c1cdfba

you can change the strategy with that patch already.

--Alex

On Sun, Mar 10, 2013 at 11:29 AM, Jilles van Gurp
<jillesvangurp@gmail.com> wrote:


Yes, I've seen it and used snapshots with that patch. There are still
severe accuracy problems that are addressed by the Lucene 4.2 branch that
Simon has been working on (which also includes the patch). So that is good
news. The current pre-4.2 snapshots are not very usable for geo_shape.

The ticket above is about changing the defaults because the current ones
are problematic.

Jilles

On Sunday, March 10, 2013 11:54:12 AM UTC+1, Alexander Reelsen wrote:

Hey Jilles,

have you seen
https://github.com/elasticsearch/elasticsearch/commit/881cb7900c1376b62516f541b12c3bb02c1cdfba

you can change the strategy with that patch already.

--Alex


Hi Jilles,

I feel your pain here: false geo_shape positives are a known bug (Issue
2361: https://github.com/elasticsearch/elasticsearch/issues/2361) and I
even have a pull request that can fix it
(https://github.com/elasticsearch/elasticsearch/pull/2460) once and for
all, though there are known ways to make this implementation more
efficient. At the time I posted it, Elasticsearch did not support
Lucene's binary stored fields, which is really desirable for the approach
my patch takes (but not strictly necessary; the patch fixes the problem
as-is).

My application does not allow false positives at all, so we end up having
to post-filter Elasticsearch's results based on their GeoJSON, which is
extremely expensive. Any tree index-based approach for geo querying will
have false positives, no matter how finely tuned. Post-processing is
necessary to eliminate these, but ideally this could be done efficiently
inside Elasticsearch using a binary representation of the shape, rather
than using GeoJSON outside Elasticsearch. (See the above issue for metrics
about this efficiency.)
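[Editor's note: the post-filtering described here can be sketched for point documents with a plain ray-casting point-in-polygon test. This is an illustrative sketch, not the code from the pull request; polygon and linestring documents would need a real geometry library:]

```python
def point_in_ring(lon, lat, ring):
    """Ray-casting test: is (lon, lat) inside a GeoJSON ring of [lon, lat] pairs?"""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i]
        xj, yj = ring[j]
        # Count crossings of a horizontal ray extending to the right (+x)
        # from the point; an odd count means the point is inside.
        if (yi > lat) != (yj > lat):
            x_cross = xi + (lat - yi) * (xj - xi) / (yj - yi)
            if lon < x_cross:
                inside = not inside
        j = i
    return inside

def post_filter(hits, query_ring):
    """Drop false positives: keep only hits whose point truly falls in the ring."""
    return [h for h in hits if point_in_ring(h["lon"], h["lat"], query_ring)]
```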

In the meantime, I agree with you that the quadtree is superior to geohash
for the tradeoff between accuracy and index size. However, I recommend you
tune tree_levels for the particular shapes you are indexing and for the
spatial extents of your queries.

In particular, if tree_levels is too small, you will get tons of false
positives; if tree_levels is too large you will experience very high
latency at index time and/or query time. The latter is because the ES
mapping must generate all the tiles over your shape at the resolution
dictated by tree_levels and distance_error_pct, which ends up being an
algorithm exponential in the number of levels. For instance, if I am
indexing shapes on the scale of countries or timezones (say, Germany), and
I set quadtree tree_levels greater than about 12, each index operation
generates tens of thousands of cells for the Lucene index, driving up
latency. Here are some rough numbers on my laptop for indexing the outline
of Germany.

quadtree - Indexing Shape of Germany

tree_levels   cells indexed   index time
          4               4        21 ms
         12           1,224        71 ms
         14           4,917       527 ms
         20         320,973    10,618 ms

Exactly the same consideration applies at query time, but there it's the
spatial size of your query that can increase latency if tree_levels is too
high. Implicitly, this means that you need to consider both query shape
extents and document shape extents when you choose an ideal tree_levels.
If those are on drastically different scales (e.g. 3 or more orders of
magnitude apart), or if multiple shapes in the same mapping have
drastically different scales, there is no way to get around the pain with
geo_shape.
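[Editor's note: the table above is consistent with boundary cells roughly doubling per quadtree level, since the cell count scales with perimeter divided by cell size. A back-of-envelope model (my own approximation, not the actual Elasticsearch/Lucene tiling code):]

```python
import math

def boundary_cells(perimeter_km, level):
    """Rough count of quadtree cells a shape's outline crosses at a given level.

    Assumed model: a level-n cell spans about 40_000 km / 2**n at the equator,
    and the outline crosses on the order of perimeter / cell_size cells.
    This is an illustrative approximation, not ES internals.
    """
    cell_km = 40_000 / 2 ** level
    return max(1, math.ceil(perimeter_km / cell_km))

# Germany's outline is very roughly 6,000 km; the count doubles with each
# extra level, matching the growth pattern of the measurements above.
for level in (4, 12, 14, 20):
    print(level, boundary_cells(6_000, level))
```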

In short:

  • I think the default of 12 tree_levels for quadtree is not a bad one;
    there are negative consequences to setting it much higher. If any
    default were to change, it should be to make quadtree the default instead
    of geohash. However, this upgrade would break existing indexes that merely
    rely on the default tree type.
  • Effectively using the geo_shape mapping really requires an
    understanding of its implementation (which is not the best; it's rather
    naive) and trial and error around an ideal tree_levels setting. The ES
    documentation falls short on guidance here. I think the
    RecursivePrefixTree Simon posted is less naive than geohash/quadtree, and
    might address the latency problem better. (?)
  • No matter how fine the index (including the RecursivePrefixTree),
    there will always be false positives unless a post-filter step, such as
    the one suggested in Issue 2361
    (https://github.com/elasticsearch/elasticsearch/issues/2361), is
    implemented.

Jeff

On Friday, March 8, 2013 7:35:11 AM UTC-8, Jilles van Gurp wrote:


Jeff, you may want to participate
in https://github.com/elasticsearch/elasticsearch/issues/2756

simon

On Sunday, March 10, 2013 11:07:32 PM UTC+1, Jeffrey Gerard wrote:

Hi Jilles,
I feel your pain here: false geo_shape positives are a known bug (Issue
2361 https://github.com/elasticsearch/elasticsearch/issues/2361) and I
even have a pull request that can fix ithttps://github.com/elasticsearch/elasticsearch/pull/2460once and for all, though there are known ways to make this implementation
more efficient. At the time I posted it, Elasticsearch did not support
Lucene's binary stored fields, which is really desirable for the approach
my patch takes (but not strictly necessary; the patch fixes the problem
as-is).

My application does not allow false positives at all, so we end up having
to post-filter Elasticsearch's results based on their GeoJSON, which is
extremely expensive. Any tree index-based approach for geo querying will
have false positives, no matter how finely tuned. Post-processing is be
necessary to eliminate these, but ideally this could be done efficiently
inside Elasticsearch using binary representation of the shape, rather than
using GeoJSON outside Elasticsearch. (see the above issue for metrics
about this efficiency)

In the meantime, I agree with you that the quadtree is superior to geohash
for tradeoff between accuracy and index size. However, I recommend you
tune the tree_levels for your the particular shapes you are indexing and
for the spatial extents of your queries.

In particular, if tree_levels is too small, you will get tons of false
positives; if tree_levels is too large you will experience very high
latency at index time and/or query time. The latter is because the ES
mapping must generate all the tiles over your shape at the resolution
dictated by tree_levels and distance_error_pct, which ends up being an
algorithm exponential in the number of levels. For instance, if I am
indexing shapes on the scale of countries or timezones (say, Germany), and
I set quadtree tree_levels greater than about 12, each index operation
generates tens of thousands of cells for the Lucene index, driving up
latency. Here are some rough numbers on my laptop for indexing the outline
of Germany.

quadtree - Indexing Shape of Germany -
tree_levels cells indexed index time

  4                       4            21ms
 12                   1,224            71ms
 14                   4,917           527ms
 20                 320,973        10,618ms

Exactly the same consideration applies at query time, except that there it
is the spatial size of your query that drives up latency when tree_levels
is too high. Implicitly, this means you need to consider both the query
shape extents and the document shape extents when choosing an ideal
tree_levels. If those are on drastically different scales (e.g. 3 or more
orders of magnitude apart), or if shapes in the same mapping are on
drastically different scales, there is no way to avoid the pain with
geo_shape.
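One way to reason about a starting value is to work backwards from the precision you need: pick the smallest level whose cell edge is at most your acceptable margin of error. A hedged sketch, again with my own helper name (this is not an Elasticsearch API, though ES does a similar computation internally):

```python
import math

# Approximate equatorial circumference in metres.
EQUATOR_M = 40_075_000.0

def levels_for_precision(precision_m):
    """Smallest quadtree depth whose equatorial cell edge is at most
    precision_m metres (hypothetical helper for picking tree_levels)."""
    return math.ceil(math.log2(EQUATOR_M / precision_m))
```

By this estimate the default of 12 levels gives a cell edge of roughly 10 km, which lines up with the kilometer-scale error discussed in this thread, while a ~50 m margin like the Rosenthaler Platz query would need around 20 levels.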

In short:

  • I think the default of 12 tree_levels for quadtree is not a bad one;
    there are negative consequences to setting it much higher. If any
    default were to change, it should be to make quadtree the default instead
    of geohash. However, that change would break existing indexes that merely
    rely on the default tree type.
  • Using the geo_shape mapping effectively really requires an
    understanding of its implementation (which is not the best; it's rather
    naive) and some trial and error around an ideal tree_levels setting. The
    ES documentation falls short on guidance here. I think the
    RecursivePrefixTree Simon posted is less naive than geohash/quadtree, and
    might handle the latency problem better. (?)
  • No matter how fine the index (including the RecursivePrefixTree),
    there will always be false positives unless a post-filter step, such as
    the one suggested in Issue 2361
    (https://github.com/elasticsearch/elasticsearch/issues/2361), is
    implemented.

Jeff

On Friday, March 8, 2013 7:35:11 AM UTC-8, Jilles van Gurp wrote:

[quoted message trimmed]

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hi Jeff,

That is valuable information. I'd love to see your patch integrated.

Meanwhile, I think the quadtree default is too low unless you have a
margin of error measured in kilometers, which would not be adequate for
most geospatial search solutions. But I agree that the best configuration
is highly dependent on the use case.

I think OpenStreetMap data is a somewhat worse case than the average use
case: it is quite dense and mixes all sorts of shapes and sizes. I've
extracted administrative boundaries, POIs, streets, building polygons,
etc. For this use case you need to constrain the margin of error perhaps
a bit more than for a very sparsely populated index. At the defaults, you
get thousands of results for any query.

I've yet to try the recursive prefix tree and am wondering how to configure
that.

Jilles

On Sunday, March 10, 2013 11:07:32 PM UTC+1, Jeffrey Gerard wrote:

[quoted message trimmed]

I've uploaded my data set to dropbox for people to play with. Simply
indexing it with different settings and querying it is quite revealing
w.r.t. the above issues.

https://dl.dropbox.com/u/18756426/osm-geojson.zip

This file contains GeoJSON that I've extracted from the OSM XML for the
Brandenburg area; it includes about 120K features: streets, POIs, building
polygons, and some neighborhood and city polygons.

I've also included a public domain file with country borders as well. My
project for converting the osm xml into geojson is very much a work in
progress but feel free to jump
in: https://github.com/jillesvangurp/osm2geojson/
