I have a use case where I query ES with a spatial intersection query using a MultiPolygon. I tried two approaches: in the first I store documents in a flat structure, where 1 polygon represents 1 document; in the second I aggregate docs before ingesting, so 1 polygon now represents n documents, using the nested type.
When I run the spatial intersection query, I get very similar query times (an average difference of +/- 2 seconds), even though the second approach reduces the document count by roughly 5x: only 3 million documents compared to the 13 million previously.
I want to understand what I can optimise in my spatial query or index settings. I intend to have all my spatial queries respond in around 1 second, irrespective of polygon complexity. My goal is to get the unique values of a metadata key (in this case, areaName) based on a spatial intersection. I'm aware that polygons of different complexity will give different response times, but comparing a 13M-doc index with a 3M-doc index, I expected a considerable improvement. I'm running everything through the Python Elasticsearch client.
Checking with the Kibana Profiler, I can see that IndexOrDocValuesQuery takes the longest, and within it build_scorer accounts for 90% of the time.
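For reference, the same breakdown can also be pulled outside Kibana by adding "profile": true to the search body; roughly like this (a sketch that reuses the same filter and placeholder index name as the queries below):
# Sketch: run the same geo_shape filter with profiling enabled and print the
# per-phase timings (in nanoseconds), including build_scorer, for each shard.
profile_response = es.search(
    index="catalog.a",
    body={
        "size": 0,
        "profile": True,
        "query": {
            "bool": {
                "filter": [
                    {
                        "geo_shape": {
                            "location": {
                                "shape": {"type": "MultiPolygon", "coordinates": country_variable},
                                "relation": "intersects"
                            }
                        }
                    }
                ]
            }
        }
    },
    request_cache=False
)

for shard in profile_response["profile"]["shards"]:
    for search in shard["searches"]:
        for query in search["query"]:
            # "breakdown" holds the per-phase timings, build_scorer among them
            print(shard["id"], query["type"], query["breakdown"].get("build_scorer"))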
Initially, I tried to ingest my documents in a flat structure with this kind of mapping:
{
  "mappings": {
    "properties": {
      "areaName": {
        "type": "keyword"
      },
      "location": {
        "type": "geo_shape"
      }
    }
  }
}
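For completeness, this is roughly how the flat index is created through the Python client (a sketch; the 3-shard setting matches what I describe further down, everything else is left at the defaults):
# Sketch of creating the flat index "catalog.a" with the mapping above.
es.indices.create(
    index="catalog.a",
    body={
        "settings": {"number_of_shards": 3},
        "mappings": {
            "properties": {
                "areaName": {"type": "keyword"},
                "location": {"type": "geo_shape"}
            }
        }
    }
)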
Here's a sample of docs for this index:
{
  "_index": "catalog.a",
  "_id": "71c0fca6-45a0-4c15-942a-a31a674f5d0d",
  "_score": 0.0025355585,
  "_source": {
    "location": "POLYGON ((-76.4891423210711707 37.0441858681596372, -76.4891423210711707 37.0503207513172370))",
    "areaName": "Dataset A"
  }
},
{
  "_index": "catalog.a",
  "_id": "71c0fca6-45a0-4c15-942a-a31a674f5d0d",
  "_score": 0.0015355585,
  "_source": {
    "location": "POLYGON ((-76.4891423210711707 37.0441858681596372, -76.4891423210711707 37.0503207513172370))",
    "areaName": "Dataset B"
  }
},
{
  "_index": "catalog.a",
  "_id": "71c0fca6-45a0-4c15-942a-a31a674f5d0d",
  "_score": 0.002355585,
  "_source": {
    "location": "POLYGON ((-76.4891423210711707 37.0441858681596372, -76.4891423210711707 37.0503207513172370))",
    "areaName": "Dataset C"
  }
}
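Ingestion for this flat setup is one polygon per document, bulk-indexed; roughly like this (a sketch, where generate_flat_docs() is a placeholder for wherever the WKT polygons and areaName values actually come from):
# Sketch of the flat ingestion: one (polygon, areaName) pair per document.
from elasticsearch.helpers import bulk

def flat_actions():
    for wkt_polygon, area_name in generate_flat_docs():  # placeholder source
        yield {
            "_index": "catalog.a",
            "_source": {
                "location": wkt_polygon,   # WKT string, as in the sample docs above
                "areaName": area_name
            }
        }

bulk(es, flat_actions())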
This index is around 80GB in size, has 3 primary shards, and contains around 13 million such docs. I'm testing performance with this query:
response = es.search(
    index="catalog.a",
    body={
        "size": 0,
        "query": {
            "bool": {
                "filter": [
                    {
                        "geo_shape": {
                            "location": {
                                "shape": {
                                    "type": "MultiPolygon",
                                    "coordinates": country_variable
                                },
                                "relation": "intersects"
                            }
                        }
                    }
                ]
            }
        },
        "aggs": {
            "unique_soilProperty": {
                "terms": {
                    "field": "areaName",
                    "size": 15
                }
            }
        }
    },
    request_cache=False
)
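What I ultimately read out of the response are just the bucket keys of the terms aggregation, roughly like this (a sketch; the path follows the aggregation name above):
# The unique areaName values come from the terms-aggregation buckets;
# the per-bucket doc_count is not needed for my use case.
unique_area_names = [
    bucket["key"]
    for bucket in response["aggregations"]["unique_soilProperty"]["buckets"]
]
print(unique_area_names)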
Next, I changed my mapping to this:
{
  "mappings": {
    "properties": {
      "location": {
        "type": "geo_shape"
      },
      "items": {
        "type": "nested",
        "properties": {
          "areaName": {
            "type": "keyword"
          }
        }
      }
    }
  }
}
My documents now look like this:
{
  "_index": "catalog.b",
  "_id": "71c0fca6-45a0-4c15-942a-a31a674f5d0d",
  "_score": 0.0025355585,
  "_source": {
    "location": "POLYGON ((-76.4891423210711707 37.0441858681596372, -76.4891423210711707 37.0503207513172370))",
    "items": [{ "areaName": "Dataset A" }]
  }
},
{
  "_index": "catalog.b",
  "_id": "71c0fca6-45a0-4c15-942a-a31a674f5d0d",
  "_score": 0.0015355585,
  "_source": {
    "location": "POLYGON ((-76.4891423210711707 37.0441858681596372, -76.4891423210711707 37.0503207513172370))",
    "items": [{ "areaName": "Dataset B" }]
  }
}
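The aggregation before ingest is essentially a group-by on the polygon, collecting all areaName values for the same geometry into the nested items array; roughly like this (a sketch, with generate_flat_docs() being the same placeholder as before):
# Sketch of the pre-ingest aggregation: group the flat records by polygon
# so that one document carries all areaName values for that geometry.
from collections import defaultdict
from elasticsearch.helpers import bulk

grouped = defaultdict(set)
for wkt_polygon, area_name in generate_flat_docs():  # placeholder source
    grouped[wkt_polygon].add(area_name)

def nested_actions():
    for wkt_polygon, area_names in grouped.items():
        yield {
            "_index": "catalog.b",
            "_source": {
                "location": wkt_polygon,
                "items": [{"areaName": name} for name in sorted(area_names)]
            }
        }

bulk(es, nested_actions())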
This index is around 25GB in size, has 1 primary shard, and contains around 3 million such docs. I'm testing performance with this query:
response = es.search(
    index="catalog.b",
    body={
        "size": 0,
        "query": {
            "bool": {
                "filter": [
                    {
                        "geo_shape": {
                            "location": {
                                "shape": {
                                    "type": "MultiPolygon",
                                    "coordinates": country_variable
                                },
                                "relation": "intersects"
                            }
                        }
                    }
                ]
            }
        },
        "aggs": {
            "nested_items": {
                "nested": {
                    "path": "items"
                },
                "aggs": {
                    "unique_dataSources": {
                        "terms": {
                            "field": "items.areaName",
                            "size": 15
                        }
                    }
                }
            }
        }
    },
    request_cache=False
)
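The values are read out the same way as before, just one level deeper because of the nested aggregation (a sketch following the aggregation names above):
# Unique areaName values from the nested terms aggregation.
unique_area_names = [
    bucket["key"]
    for bucket in response["aggregations"]["nested_items"]["unique_dataSources"]["buckets"]
]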
I get very similar query performance even with this setup. What further optimisations could I make to speed it up?