GeoShape - rectangle intersection performance


(Roman Margolis) #1

Suppose I have documents that each carry a geo bounding box (for example, [bottom, left], [top, right]).
Given another bounding box as input, I'd like to find all documents whose bounding box intersects it.

Based on what I know about geo support in Elasticsearch, I believe bounding box filters will not suffice for my needs, because they operate on geo points: two bounding boxes can intersect each other even though neither box contains any of the other's corner points. The geohash_cell filter falls short for the same reason. Is this correct?

If it is, I'm left with two possible choices:

  1. GeoShape indexing and filtering
  2. Script filter that will use simple rectangle intersection logic of the form:
    rect1.left <= rect2.right && rect2.left <= rect1.right && rect1.bottom <= rect2.top && rect2.bottom <= rect1.top
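To make the condition above concrete, here is a minimal Python sketch of the same test, together with a case where two boxes cross each other without either one containing a corner of the other. The `Rect` class and helper names are made up for illustration, and the sketch assumes plain planar coordinates with no box crossing the antimeridian.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned box; hypothetical helper type, not an Elasticsearch class."""
    left: float
    bottom: float
    right: float
    top: float


def intersects(a: Rect, b: Rect) -> bool:
    # Same four comparisons as the expression in option 2.
    return (a.left <= b.right and b.left <= a.right
            and a.bottom <= b.top and b.bottom <= a.top)


def contains_point(r: Rect, x: float, y: float) -> bool:
    # What a point-based bounding box filter effectively checks.
    return r.left <= x <= r.right and r.bottom <= y <= r.top


def corners(r: Rect):
    return [(r.left, r.bottom), (r.left, r.top),
            (r.right, r.bottom), (r.right, r.top)]


def any_corner_inside(a: Rect, b: Rect) -> bool:
    # True if a corner of one box lies inside the other box.
    return (any(contains_point(b, x, y) for x, y in corners(a))
            or any(contains_point(a, x, y) for x, y in corners(b)))


# A wide box and a tall box forming a cross.
wide = Rect(left=0, bottom=4, right=10, top=6)
tall = Rect(left=4, bottom=0, right=6, top=10)

print(intersects(wide, tall))         # the boxes overlap
print(any_corner_inside(wide, tall))  # yet no corner point is inside the other box
```

The cross-shaped pair is exactly the case a corner-point filter would miss.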

My question is which of these do you think will be preferable in terms of query latency and/or index storage?

Thanks


(Roman Margolis) #2

I ran a simple benchmark of both options, using tens of millions of simulated documents with geolocations on a single shard.

In terms of required storage, an envelope geoshape with a precision of ~1km (tree levels 6 with geohash encoding) surprisingly did not require more storage than two geopoints (upper left, bottom right) with doc values. This may be explained by the relatively small bounding boxes in my simulated documents: the smaller the bounding box, the fewer geohash terms it generates. Using tree levels 7, however, incurs a significant ~600% increase in required storage.
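As a rough illustration of why box size and tree level matter, the sketch below counts how many geohash cells of a given length a box overlaps. It relies only on the standard geohash layout (5 bits per character, alternating longitude and latitude starting with longitude). It is an assumption-laden approximation, not how Elasticsearch counts terms: the real prefix tree also indexes coarser cells and treats interior cells differently, so only the trend is meaningful. The function names are hypothetical.

```python
import math


def geohash_cell_size(level: int):
    """Return (width, height) in degrees of a geohash cell with `level` characters."""
    bits = 5 * level
    lon_bits = (bits + 1) // 2  # longitude gets the extra bit when the total is odd
    lat_bits = bits // 2
    return 360.0 / 2 ** lon_bits, 180.0 / 2 ** lat_bits


def cells_overlapping(left, bottom, right, top, level):
    """Count finest-level cells touched by a box (ignores the antimeridian and poles)."""
    width, height = geohash_cell_size(level)
    cols = math.floor((right + 180.0) / width) - math.floor((left + 180.0) / width) + 1
    rows = math.floor((top + 90.0) / height) - math.floor((bottom + 90.0) / height) + 1
    return cols * rows


# A small and a larger simulated box, compared at tree levels 6 and 7.
small = (10.00, 50.00, 10.02, 50.01)
large = (10.00, 50.00, 10.50, 50.25)
for name, box in (("small", small), ("large", large)):
    for level in (6, 7):
        print(name, "level", level, "cells:", cells_overlapping(*box, level))
```

Each level 6 cell splits into 32 level 7 cells, so the count of finest-level cells grows quickly with both box size and tree level.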

In terms of query latency, there were no surprises. The geoshape query always completed in under a second, usually in no more than 100 ms, even with a cold page cache. In contrast, the scripted filter needed a couple of seconds to retrieve the results.

I also noticed that indexing geoshape envelopes was slightly slower than indexing the two doc-valued geopoints when using tree levels 6, and much slower when using tree levels 7. Again, this is understandable given the many more geohash terms being generated.

In conclusion, geoshape indexing and querying seems preferable in most cases, depending on your accuracy demands. The sole benefit of the variant with two doc-valued geopoints is that it is 100% accurate (in terms of rectangle intersection).
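For reference, a geoshape intersection query against an envelope can be built roughly as in the sketch below. It only constructs the request body as a Python dict. The field name `bbox` is a placeholder, and the exact wrapping (query versus filter, availability of `relation`) depends on the Elasticsearch version, so treat this as an assumption to check against your version's documentation.

```python
import json


def envelope_intersects_query(field, left, bottom, right, top):
    """Build a geo_shape query body that matches shapes intersecting the given box.

    Envelope coordinates are [[left, top], [right, bottom]], that is,
    upper-left then lower-right, each as [lon, lat].
    """
    return {
        "query": {
            "geo_shape": {
                field: {
                    "shape": {
                        "type": "envelope",
                        "coordinates": [[left, top], [right, bottom]],
                    },
                    "relation": "intersects",
                }
            }
        }
    }


# "bbox" is a hypothetical field name mapped as a geo shape.
body = envelope_intersects_query("bbox", left=10.0, bottom=50.0, right=10.5, top=50.25)
print(json.dumps(body, indent=2))
```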

