Elastic Performance for Spatial Queries Slows Down when Geometries are in Millions

I have a use case where I query Elasticsearch with a spatial intersection query against a MultiPolygon. I tried two approaches: in the first I store documents in a flat structure, where 1 polygon represents 1 document; in the second I aggregate docs before ingesting, so 1 polygon now represents n documents using the nested type.

When I run the spatial intersection query, I get very similar query times (an average difference of ±2 s), even though the second approach reduces the document count by roughly 5x: it has only 3 million documents compared to 13 million previously.

I want to understand what I can optimise in my spatial query or index settings. My target is for all spatial queries to complete in around 1 s irrespective of polygon complexity. The goal of each query is to get the unique values of a metadata key, in this case areaName, for everything that intersects the query geometry. I'm aware that polygons of different complexity will give different response times, but comparing the 13M-doc index with the 3M-doc one I expected a considerable improvement. I'm running everything through the Python Elasticsearch API.

Checking with the Kibana Search Profiler, I can see that IndexOrDocValuesQuery takes the longest time, and within it build_scorer accounts for about 90% of the time.
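For reference, I can reproduce the profiler numbers outside Kibana by setting "profile": true on the same search body from the Python client. This is just a minimal sketch using the index and query geometry shown further down (es and country_variable are my client and query geometry):

profile_response = es.search(
    index="catalog.a",
    body={
        "size": 0,
        "profile": True,
        "query": {
            "bool": {
                "filter": [
                    {
                        "geo_shape": {
                            "location": {
                                "shape": {"type": "MultiPolygon", "coordinates": country_variable},
                                "relation": "intersects"
                            }
                        }
                    }
                ]
            }
        }
    }
)

# Each shard reports per-query timings; build_scorer shows up in the "breakdown".
for shard in profile_response["profile"]["shards"]:
    for search in shard["searches"]:
        for q in search["query"]:
            print(q["type"], q["time_in_nanos"], q["breakdown"].get("build_scorer"))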

Initially, I tried to ingest my documents in a flat structure with this kind of mapping:

{
  "mappings": {
    "properties": {
      "areaName": {
        "type": "keyword"
      },
      "location": {
        "type": "geo_shape"
      }
    }
  }
}
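For completeness, this is roughly how I create the flat index from Python; a sketch only, the settings shown are illustrative rather than my exact ones:

# Sketch: create the flat index with the mapping above via the Python client.
# The shard count mirrors the 3-primary-shard setup described further down.
es.indices.create(
    index="catalog.a",
    body={
        "settings": {"number_of_shards": 3},
        "mappings": {
            "properties": {
                "areaName": {"type": "keyword"},
                "location": {"type": "geo_shape"}
            }
        }
    }
)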

Here's a sample of docs for this index:

[
  {
    "_index": "catalog.a",
    "_id": "71c0fca6-45a0-4c15-942a-a31a674f5d0d",
    "_score": 0.0025355585,
    "_source": {
      "location": "POLYGON ((-76.4891423210711707 37.0441858681596372, -76.4891423210711707 37.0503207513172370))",
      "areaName": "Dataset A"
    }
  },
  {
    "_index": "catalog.a",
    "_id": "71c0fca6-45a0-4c15-942a-a31a674f5d0d",
    "_score": 0.0015355585,
    "_source": {
      "location": "POLYGON ((-76.4891423210711707 37.0441858681596372, -76.4891423210711707 37.0503207513172370))",
      "areaName": "Dataset B"
    }
  },
  {
    "_index": "catalog.a",
    "_id": "71c0fca6-45a0-4c15-942a-a31a674f5d0d",
    "_score": 0.002355585,
    "_source": {
      "location": "POLYGON ((-76.4891423210711707 37.0441858681596372, -76.4891423210711707 37.0503207513172370))",
      "areaName": "Dataset C"
    }
  }
]

This index is around 80 GB in size, has 3 primary shards, and holds around 13 million such docs. I'm testing performance with this query:

response = es.search(
    index="catalog.a",
    body={
        "size": 0,
        "query": {
            "bool": {
                "filter": [
                    {
                        "geo_shape": {
                            "location": {
                                "shape": {
                                    "type": "MultiPolygon",
                                    "coordinates": country_variable
                                },
                                "relation": "intersects"
                            }
                        }
                    }
                ]
            }
        },
        "aggs": {
            "unique_soilProperty": {
                "terms": {
                    "field": "areaName",
                    "size": 15
                }
            }
        }
    },
    request_cache=False
)
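Since all I ultimately need are the unique areaName values, I read them out of the terms aggregation like this (a small sketch of the post-processing, nothing more):

# Sketch: pull the distinct areaName values out of the terms aggregation.
unique_area_names = [
    bucket["key"]
    for bucket in response["aggregations"]["unique_soilProperty"]["buckets"]
]
print(unique_area_names)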

Next, I changed my mapping to this:

{
  "mappings": {
    "properties": {
      "location": {
        "type": "geo_shape"
      },
      "items": {
        "type": "nested",
        "properties": {
          "areaName": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

My documents now look like this:

[
  {
    "_index": "catalog.b",
    "_id": "71c0fca6-45a0-4c15-942a-a31a674f5d0d",
    "_score": 0.0025355585,
    "_source": {
      "location": "POLYGON ((-76.4891423210711707 37.0441858681596372, -76.4891423210711707 37.0503207513172370))",
      "items": [{ "areaName": "Dataset A" }]
    }
  },
  {
    "_index": "catalog.b",
    "_id": "71c0fca6-45a0-4c15-942a-a31a674f5d0d",
    "_score": 0.0015355585,
    "_source": {
      "location": "POLYGON ((-76.4891423210711707 37.0441858681596372, -76.4891423210711707 37.0503207513172370))",
      "items": [{ "areaName": "Dataset B" }]
    }
  }
]
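The aggregation before ingest is essentially a group-by on the geometry. Roughly, the bulk load looks like the sketch below; flat_rows is an illustrative iterable of my original (polygon, areaName) rows, not a real variable from my code:

from collections import defaultdict
from elasticsearch.helpers import bulk

# Sketch: group flat rows by polygon so each polygon becomes one document
# with a nested "items" list, then bulk-index them.
grouped = defaultdict(list)
for row in flat_rows:  # each row: {"location": "<WKT polygon>", "areaName": "..."}
    grouped[row["location"]].append({"areaName": row["areaName"]})

actions = (
    {"_index": "catalog.b", "_source": {"location": wkt, "items": items}}
    for wkt, items in grouped.items()
)

bulk(es, actions)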

This index is around 25 GB in size, has 1 primary shard, and holds around 3 million such docs. I'm testing performance with this query:

response = es.search(
    index="catalog.b",
    body={
        "size": 0,
        "query": {
            "bool": {
                "filter": [
                    {
                        "geo_shape": {
                            "location": {
                                "shape": {
                                    "type": "MultiPolygon",
                                    "coordinates": country_variable
                                },
                                "relation": "intersects"
                            }
                        }
                    }
                ]
            }
        },
        "aggs": {
            "nested_items": {
                "nested": {
                    "path": "items"
                },
                "aggs": {
                    "unique_dataSources": {
                        "terms": {
                            "field": "items.areaName",
                            "size": 15
                        }
                    }
                }
            }
        }
    },
    request_cache=False
)
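Reading the unique values back is the same idea as before, just one level deeper because of the nested aggregation (sketch):

# Sketch: the terms buckets now sit under the nested aggregation.
unique_area_names = [
    bucket["key"]
    for bucket in response["aggregations"]["nested_items"]["unique_dataSources"]["buckets"]
]
print(unique_area_names)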

I get very similar query performance even with this setup. What further optimisations could I make to speed it up?

Hello,

Could you provide the following information?

  1. Elasticsearch version
  2. Index topology, e.g. number of shards.
  3. The output of the hot threads API while the query is running.
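If you are on the Python client, something along these lines will grab all three (just a sketch, any equivalent API calls are fine):

# Sketch: gather the requested details with the Python Elasticsearch client.
print(es.info()["version"]["number"])     # 1. Elasticsearch version
print(es.cat.shards(index="catalog.*"))   # 2. index topology / shard layout
print(es.nodes.hot_threads())             # 3. hot threads while the query is running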

Thanks
