Pre-filter points in geo(tile) aggregations

Elasticsearch supports bounds for geo-tile aggregations. Is there any way to filter the points before the transformation to tiles?

We're considering scripted fields, plugins, and other things. But it would be ideal if we could do this with native ES features.

Thanks!

I am not sure I fully understand your question, but to filter points you might want to use any of the provided geo queries: Geo queries | Elasticsearch Guide [8.17] | Elastic

Note that the bounds you are referring to do not filter points; they filter the generated tiles. In other words, the bounds are applied to the tiles rather than to the points: any tile that is disjoint from the bounds is filtered out.
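To illustrate the difference, here is a minimal standalone sketch. It is not the Elasticsearch code: the tile math is the standard Web Mercator formula used as a stand-in for GeoTileUtils, and the Bounds class and keptByTileFilter are hypothetical names made up for this example. It shows a point that lies outside the bounds but still survives, because the tile it falls into intersects the bounds.

```java
class TileBoundsDemo {
    // Standard Web Mercator tile math; a stand-in for GeoTileUtils, not its actual code.
    static int xTile(double lon, int tiles) {
        int x = (int) Math.floor((lon + 180.0) / 360.0 * tiles);
        return Math.max(0, Math.min(tiles - 1, x));
    }

    static int yTile(double lat, int tiles) {
        double latRad = Math.toRadians(lat);
        double y = (1.0 - Math.log(Math.tan(latRad) + 1.0 / Math.cos(latRad)) / Math.PI) / 2.0 * tiles;
        return Math.max(0, Math.min(tiles - 1, (int) Math.floor(y)));
    }

    // Western edge (longitude) of tile column x.
    static double tileWest(int x, int tiles) {
        return x / (double) tiles * 360.0 - 180.0;
    }

    // Northern edge (latitude) of tile row y.
    static double tileNorth(int y, int tiles) {
        double n = Math.PI - 2.0 * Math.PI * y / tiles;
        return Math.toDegrees(Math.atan(Math.sinh(n)));
    }

    // Hypothetical bounding box; ignores boxes that cross the dateline.
    static final class Bounds {
        final double top, left, bottom, right;

        Bounds(double top, double left, double bottom, double right) {
            this.top = top; this.left = left; this.bottom = bottom; this.right = right;
        }

        // Point-level check: what the proposed change would add.
        boolean containsPoint(double lon, double lat) {
            return lon >= left && lon <= right && lat >= bottom && lat <= top;
        }

        // Tile-level check: a tile is kept unless it is disjoint from the bounds.
        boolean intersectsTile(int x, int y, int tiles) {
            double west = tileWest(x, tiles), east = tileWest(x + 1, tiles);
            double north = tileNorth(y, tiles), south = tileNorth(y + 1, tiles);
            return west < right && east > left && south < top && north > bottom;
        }
    }

    // Mimics tile-level filtering: the point is judged by its tile, not by its own position.
    static boolean keptByTileFilter(Bounds b, double lon, double lat, int precision) {
        int tiles = 1 << precision;
        return b.intersectsTile(xTile(lon, tiles), yTile(lat, tiles), tiles);
    }

    public static void main(String[] args) {
        Bounds b = new Bounds(10.0, 5.0, 1.0, 10.0);
        System.out.println("point in bounds: " + b.containsPoint(15.0, 5.0));
        System.out.println("kept by tile filter: " + keptByTileFilter(b, 15.0, 5.0, 4));
    }
}
```

At precision 4 the point (lon 15, lat 5) shares a tile with the bounds, so a tile-level filter keeps it even though a point-level check would reject it. At higher precisions the tiles shrink and the two checks disagree only close to the edges of the bounds.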

Hi Ignacio,

Sorry for not being clearer earlier. That's exactly my point, though. We're already using a geoshape query to filter documents, but we're working with data where documents have multiple locations. So a document might match the geoshape query but also contain points outside of that query. These points are still fed through the aggregation.

We can filter the resulting tiles through the bounds of the geo-tile aggregation. But I was wondering whether the possibility exists to filter the points before transformation to tiles.

I.e., effectively, in GeoTileCellIdSource I would like to make the following change:

@Override
protected NumericDocValues boundedCellSingleValue(GeoPointValues values, GeoBoundingBox boundingBox) {
    final GeoTileBoundedPredicate predicate = new GeoTileBoundedPredicate(precision(), boundingBox);
    final int tiles = 1 << precision();
    return new CellSingleValue(values, precision()) {
        @Override
        protected boolean advance(org.elasticsearch.common.geo.GeoPoint target) {

            // ---- added filtering -----------------------------------------
            if (boundingBox.pointInBounds(target.getLon(), target.getLat()) == false) {
                return false;
            }
            // ---- end of added filtering ----------------------------------

            final int x = GeoTileUtils.getXTile(target.getLon(), tiles);
            final int y = GeoTileUtils.getYTile(target.getLat(), tiles);
            if (predicate.validTile(x, y, precision)) {
                value = GeoTileUtils.longEncodeTiles(precision, x, y);
                return true;
            }
            return false;
        }
    };
}

(and something similar for boundedCellMultiValues)

I understand this gives some potentially odd results if the bounds of the query filter don't match the bounds for the aggregation. Also, when the bounds don't align with the geo-tile grid you'll get some 'interesting' results at the edges.

If there is no other option, we'll probably filter the points with a runtime mapping with a painless script.

I just noticed that there seems to be a difference between geo-tile aggregations and geohash/-hex aggregations. The latter two seem to also filter points before aggregation in contrast to geo-tile. E.g., GeoHashCellIdSource#boundedCellSingleValue contains pointInBounds(target.getLon(), target.getLat()):

final String hash = Geohash.stringEncode(target.getLon(), target.getLat(), precision);
if (pointInBounds(target.getLon(), target.getLat()) || predicate.validHash(hash)) {
    value = Geohash.longEncode(hash);
    return true;
}
return false;

What I essentially would like to do is be able to change the || to &&.

I understand. Unfortunately the changes you are proposing will break the implemented functionality in the vector tile search API. The current behaviour is intentional and what you are proposing is a new functionality that currently does not exist.

The way it currently works is that we search for all geometries intersecting the query geometry, then we tile the full geometries.

What you want is to search for all geometries intersecting the query, then tile the intersection of those geometries with the query.

At the moment your only option is to filter the points (build the intersection) with a runtime mapping.

I understand this would break existing queries, which obviously is a no-go.

We'll give runtime fields a try then. Are there any performance considerations to bear in mind?

Just make sure you are reading the points using doc values, e.g. via the doc["field"] construct.

Thanks Ignacio!

Looking a bit deeper into runtime fields, we're hitting the AbstractFieldScript.MAX_VALUES limit. In our solution, the use of such fields would be ideal, but we have a lot of documents with many more than 100 values for a field.

I understand that we're approaching the limits of ES here. I would be grateful for suggestions.

I can only think of one solution, but it requires the cluster to be on a license higher than basic:

The idea is to create a geo_shape runtime field instead of a geo_point runtime field, so that the script emits a single multipoint value per document rather than one value per point. You can then apply the geotile aggregation on the geo_shape field, although this requires a license.

Something like:

GET points/_search
{
  "fields": [
    "multipoint"
  ],
  "runtime_mappings": {
    "multipoint": {
      "type": "geo_shape",
      "script": """
        StringBuilder sb = new StringBuilder("multipoint(");
        def v = doc["location"];
        for (int i = 0; i < v.length; i++) {
          sb.append(v[i].lon).append(" ").append(v[i].lat);
          if (i != v.length - 1) sb.append(",");
        }
        sb.append(")");
        emit(sb.toString());
        """
    }
  }
}
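The script above emits every point of the document. To drop the points outside the query area, the loop needs a bounds check. Below is a minimal sketch of that loop in plain Java, so it can be run and checked outside Elasticsearch. The class name, the toWkt method and the top/left/bottom/right parameters are hypothetical; in a Painless script they would presumably come from script params, and the script would skip emit when no point remains. It does not handle boxes that cross the dateline.

```java
class FilteredMultipoint {
    /**
     * Builds a WKT multipoint from the [lon, lat] pairs that fall inside the box.
     * Returns null when no point is inside, so the caller can skip emitting a value.
     */
    static String toWkt(double[][] lonLats, double top, double left, double bottom, double right) {
        StringBuilder sb = new StringBuilder("multipoint(");
        int kept = 0;
        for (double[] p : lonLats) {
            double lon = p[0];
            double lat = p[1];
            // Skip points outside the bounding box (assumption: box does not cross the dateline).
            if (lon < left || lon > right || lat < bottom || lat > top) {
                continue;
            }
            if (kept > 0) {
                sb.append(",");
            }
            sb.append(lon).append(" ").append(lat);
            kept++;
        }
        if (kept == 0) {
            return null;
        }
        return sb.append(")").toString();
    }

    public static void main(String[] args) {
        double[][] points = { { 4.9, 52.37 }, { 13.4, 52.52 }, { -74.0, 40.7 } };
        System.out.println(toWkt(points, 55.0, 0.0, 50.0, 15.0));
    }
}
```

Because the separator is written before each kept point rather than after each point, the output stays valid WKT no matter which points are skipped.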