Hi Jilles,
I feel your pain here: false geo_shape positives are a known bug (issue 2361:
https://github.com/elasticsearch/elasticsearch/issues/2361), and I even have a
pull request that can fix it once and for all
(https://github.com/elasticsearch/elasticsearch/pull/2460), though there are
known ways to make this implementation
more efficient. At the time I posted it, Elasticsearch did not support
Lucene's binary stored fields, which is really desirable for the approach
my patch takes (but not strictly necessary; the patch fixes the problem
as-is).
My application does not allow false positives at all, so we end up having
to post-filter Elasticsearch's results based on their GeoJSON, which is
extremely expensive. Any tree index-based approach for geo querying will
have false positives, no matter how finely tuned. Post-processing is
necessary to eliminate these, but ideally it would be done efficiently inside
Elasticsearch using a binary representation of the shape, rather than using
GeoJSON outside Elasticsearch (see the issue above for metrics on this
efficiency).
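To illustrate, here is roughly what such a post-filter looks like (a minimal
sketch in Python with the Shapely library and the official Elasticsearch
client; the "places" index name and the polygon below are placeholders, not
our actual setup). Every candidate hit's GeoJSON has to be parsed and
re-tested against the exact query geometry, which is why this is so expensive
per request:

    # Post-filter sketch: the geo_shape tree index returns candidates, and
    # Shapely re-tests each candidate against the exact query geometry to
    # throw out the false positives.
    from elasticsearch import Elasticsearch
    from shapely.geometry import shape

    es = Elasticsearch()

    # A small polygon near Rosenthaler Platz, in GeoJSON [lon, lat] order.
    query_shape = {
        "type": "Polygon",
        "coordinates": [[[13.4005, 52.5293], [13.4020, 52.5293],
                         [13.4020, 52.5302], [13.4005, 52.5302],
                         [13.4005, 52.5293]]],
    }
    query_geom = shape(query_shape)

    resp = es.search(index="places", body={
        "query": {
            "geo_shape": {
                "geometry": {"shape": query_shape, "relation": "intersects"}
            }
        },
        "size": 1000,
    })

    # The expensive part: parse the full GeoJSON of every candidate and do
    # an exact intersection test outside Elasticsearch.
    exact_hits = [hit for hit in resp["hits"]["hits"]
                  if shape(hit["_source"]["geometry"]).intersects(query_geom)]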
In the meantime, I agree with you that quadtree is superior to geohash for
the tradeoff between accuracy and index size. However, I recommend you tune
tree_levels for the particular shapes you are indexing and for the spatial
extents of your queries.
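Concretely, experimenting looks something like this (a sketch only, using the
Python client; the index and type names are placeholders, and quadtree plus
explicit tree_levels and distance_error_pct are the knobs worth varying):

    # Mapping sketch (hypothetical "places" index): quadtree with explicit
    # tree_levels and distance_error_pct, the two settings to tune.
    from elasticsearch import Elasticsearch

    es = Elasticsearch()
    es.indices.create(index="places", body={
        "mappings": {
            "place": {                        # mapping type, pre-1.0 style
                "properties": {
                    "geometry": {
                        "type": "geo_shape",
                        "tree": "quadtree",
                        "tree_levels": 12,           # start near the default
                        "distance_error_pct": 0.025  # and tune from there
                    }
                }
            }
        }
    })

Note that changing these settings requires reindexing, so it pays to benchmark
a representative sample of your shapes at a few different tree_levels before
loading everything.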
In particular, if tree_levels is too small, you will get tons of false
positives; if tree_levels is too large, you will experience very high
latency at index time and/or query time. The latter is because the ES mapping
must generate all the tiles covering your shape at the resolution dictated by
tree_levels and distance_error_pct, and the number of tiles grows
exponentially with the number of levels. For instance, if I am
indexing shapes on the scale of countries or timezones (say, Germany), and
I set quadtree tree_levels greater than about 12, each index operation
generates tens of thousands of cells for the Lucene index, driving up
latency. Here are some rough numbers on my laptop for indexing the outline
of Germany.
Quadtree, indexing the shape of Germany:

  tree_levels | cells indexed | index time
            4 |             4 |      21 ms
           12 |         1,224 |      71 ms
           14 |         4,917 |     527 ms
           20 |       320,973 |  10,618 ms
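(If you do the arithmetic on those numbers, the cell count roughly doubles
with each additional level: 1,224 x 2^2 is about 4,900, and 4,917 x 2^6 is
about 315,000, both close to what was measured. Doubling per level is what
you would expect for the cells tracing a polygon's outline in a quadtree, and
it is the exponential growth I mean above.)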
Exactly the same consideration is relevant at query time, but here it's the
spatial size of your query that can increase latency if tree_levels is too
high. Implicitly, this means that you need to consider both query shape
extents and document shape extents when you choose an ideal tree_levels.
If those are on drastically different scales from each other (e.g. 3 or more
orders of magnitude), or if multiple shapes in the same mapping have
drastically different scales, there is no way to get around the pain with
geo_shape.
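As a rough rule of thumb (assuming each quadtree level halves the cell in
both dimensions, starting from the whole world), a level-n cell is on the
order of 360/2^n degrees across, i.e. roughly 40,000 km / 2^n at the equator:
about 10 km at level 12, about 40 m at level 20, and well under a meter at
level 26. So resolving a ~50 m query shape wants something like level 20,
while country-sized document shapes already become painful to index well
before that, which is exactly the kind of scale mismatch I mean.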
In short:
- I think the default of 12 tree_levels for quadtree is not a bad one;
there are negative consequences to setting it much higher. If any default
were to change, it should be to make quadtree the default instead of geohash.
However, that change would break existing indexes that merely rely on the
default tree type.
- Effectively using the geo_shape mapping really requires an understanding
of its implementation (which is not the best; it's rather naive) and some
trial and error around an ideal tree_levels setting. The ES documentation
falls short on guidance here. I think the RecursivePrefixTree Simon posted is
less naive than geohash/quadtree and might address the latency problem
better (?).
- No matter how fine the index (including the RecursivePrefixTree), there
will always be false positives unless a post-filter step such as the one
suggested in issue 2361
(https://github.com/elasticsearch/elasticsearch/issues/2361) is implemented.
Jeff
On Friday, March 8, 2013 7:35:11 AM UTC-8, Jilles van Gurp wrote:
I'm currently having major issues with geo_shape.
I have about 120,000 GeoJSON objects across the Berlin/Brandenburg area: a
mix of points, polygons, and linestrings.
I've created a polygon query
{"geo_shape"=>{"geometry"=>{"shape"=>{"type"=>"Polygon",
"coordinates"=>[[[52.52977589117429, 13.402019455925734],
[52.52991195268828, 13.401960657252268], [52.53003030129237,
13.401835622652401], [52.53011935220061, 13.401656591383937],
[52.53017038848964, 13.401441088274817], [52.5301784143719,
13.401210208270859], [52.53014264421813, 13.400986551515487],
[52.53006657946019, 13.400792011090223], [52.52995766584657,
13.400645629967338], [52.52982656460061, 13.400561736951035],
[52.52968610882571, 13.400548544074267], [52.529550047311716,
13.400607342747733], [52.529431698707626, 13.4007323773476],
[52.52934264779939, 13.400911408616064], [52.52929161151036,
13.401126911725184], [52.52928358562809, 13.401357791729142],
[52.529319355781865, 13.401581448484514], [52.52939542053981,
13.401775988909778], [52.52950433415343, 13.401922370032663],
[52.529635435399385, 13.402006263048966], [52.52977589117429,
13.402019455925734]]]}, "relation"=>"intersects"}}}}}}
The polygon is a roughly 50m-radius circular shape around a point on
Rosenthaler Platz (I've verified this on a map using the Google Maps API).
Using the quad tree implementation, I currently get about 1800 results for
this query, many of them miles away from where I searched. At the same
time, if I broadened the radius of my circle to actually include those
results, I would expect tens of thousands of results, not 1800. With the
radius this small, I would expect something in the range of a few dozen at
most.
With the geohash implementation, I get somewhat better results: only 263.
However, most of them are still well outside the circle polygon (up to
several hundred meters away).
So both the quadtree and geohash implementation return massive amounts of
false positives. The quadtree implementation is much worse than the geohash
implementation.
The difference in index size is also massive: 868MB for the geohash
implementation and 27MB for the quadtree implementation. Most of the
geohash data seems to end up in the term dictionary. So from that point of
view, I'd very much prefer to use the quad tree implementation. However,
with the current inaccuracy that is not acceptable. I've tried setting
levels to 50 (the maximum according to the source code) but that doesn't
seem to have much effect.
BTW, I updated my snapshot build this afternoon to fix some much worse
issues where I couldn't get any results at all until I increased the radius
of the circle to 500km, at which point it would match everything at once.
So it seems things were improved somewhat over the past few days. The
previous snapshot build I used was only a few days old.
Jilles