How are Geo Querys working in Elasticsearch?

Hello,

I’m currently writing my master thesis about the the geo functionalities of Elasticsearch. I’m testing the geo querys and compare them to PostGIS. For this purpose I’m trying to understand how this querys work. I’m trying to figure it out from the source code but it’s sometimes a bit difficult. So I wrote down how I understood the algorithms for the querys and some questions that I still have. It would really help me if you take the time to read and answer my post and I think it will be informative for others too.

Geo Bounding Box Query
Uses LatLonPoint.newBoxQuery() → newBoxInternal() → PointRangeQuery()

So at search time, the BKD-tree is recursed based on whether each of left or right child overlap with the query shape, and once a leaf block is reached, all documents in that leaf block are collected if the cell is fully enclosed by the query shape.

And if the cell just intersects the box, all 1024 points will be checked against the box one by one ?

Geo Distance Query
LatLonPoint.newDistanceQuery() → LatLonPointDistanceQuery()

  1. creates bounding box from distance and center point (Rectangle.fromPointDistance())

  2. going down the search tree, if a subtree is outside the bounding box discard it (Relation compare())

  3. if a subtree is inside the bounding box or intersects, check if it’s inside or outside the circle (GeoUtilsrelate())

  4. If it crosses the circle check points in leaf if inside or outside

Geo Polygon Query
LatLonPoint.newPolygonQuery() → LaLonPointInPolygonQuery()

So here at first you calculate a bounding box from min/max lat/lon of the polygon and discard subtrees outside of the bounding box.

The other Subtrees/ cells are tested against the Query Polygon.

Therefor you use Polygon2D to create a 2d kd-tree out of the polygon:

// each component/hole is a node in an augmented 2d kd-tree: we alternate splitting between latitude/longitude,

// and pull up max values for both dimensions to each parent node (regardless of split).

And as I understand you use this kd-Tree to test the cells of the bkd-tree that are inside the bounding box if they are inside the polygon, as seen here

This algorithm really is hard for me to understand in detail.

Geohash grid Aggregation
A Geohash grid is created and every document gets checked in which bucket (Geohash) it belongs.

Geo Distance Aggregation
Here the documents get sorted into buckets based on their distance to the origin point.
Is there some kind of preselection if there is no bucket without an upper distance limit or will the distance be calculated for every document ?

Geo Distance Sorting
If sorting descending, distances from origin to every point gets calculated and then sorted.

If sorting ascending, a bounding box from the min competitive distance is build to preselect points to sort.

This is already discussed here

Geo Centroid Aggregation
Here the moving average for latitude and longitude are calculated.

Geo Bounds Aggregation
Minimum and maximum for latitude and longitude is calculated.

So if I understand this correctly, the functionalities of the bkd-Tree are only used for Geo Bounding Box Query, Geo Distance Query, Geo Polygon Query and ascending Geo Distance Sorting ?

Sorry, this is only a quick reply, I might be able to reply more in detail in a few weeks:

All queries works the same when traversing the tree, the way you explained in the bounding box queries. The reason there are different implementations is to try to optimise for each case.

For polygons, have a look into this pull request, it is a refactoring of the edge tree because it is indeed a bit confusing. There are two interval trees, one for components (think about a geometry collections, aka geometry == component) and one for edges (geometries are composed of edges).

In general indexes (like BKD-tree) are used for search and docValues are used for aggregations.

(Note: I will have limited connectivity in the next weeks so it might take longer to reply)

1 Like

Thank you for your fast reply.
So the Polygon in the query gets represented as one or two (if it's a multipolygon) interval trees. Then the cells of the indexed BKD-Tree get compared to the one or two interval trees to see if the cells intersects with the polygon ?

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.