I have some doubts about how the filter context works in the Elasticsearch/Lucene ecosystem, which I would be grateful if someone could clear up. This is related to another problem I have been facing recently.
I have an index that stores DNS requests/responses. Indices are created per day, and older indices are force-merged down to 1 segment. Each daily index has around 50-60M docs (12-15 GB total), split into 2 shards with 1 replica.
Typical documents have the following schema:

```json
{
  "source_ip": IP,
  "destination_ip": IP,
  "source_port": long,
  "destination_port": long,
  "sensor": keyword,
  "query": keyword,
  "answers": text,
  "@timestamp",
  ...
}
```
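For reference, the corresponding index mapping would look something like the sketch below; the types are inferred from the schema above, and the actual mapping may differ:

```json
{
  "mappings": {
    "properties": {
      "source_ip":        {"type": "ip"},
      "destination_ip":   {"type": "ip"},
      "source_port":      {"type": "long"},
      "destination_port": {"type": "long"},
      "sensor":           {"type": "keyword"},
      "query":            {"type": "keyword"},
      "answers":          {"type": "text"},
      "@timestamp":       {"type": "date"}
    }
  }
}
```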
Most entries have the same `destination_ip`, while the rest of the fields vary. The `destination_ip` breakdown is:

- 10.11.100.100 - 50%
- 10.138.100.100 - 35%
- Others - 15%
Typically I need to search for (`source_ip`, `destination_ip`, `query`, `sensor`) within a time window and only return documents that match ALL of these criteria. This search runs against older indices (which have been force-merged). From the docs I understand a `bool` query with `filter` clauses is the best bet in such cases, as that uses the filter context.
Using `elasticsearch-py` and `elasticsearch-dsl`, I usually form queries like so:

```python
src_srch = (ESD.Search(using=es, index="dns-YYYY.MM.DD")
            .filter("range", **{"@timestamp": {"lte": lseen, "gte": fseen}})
            .filter("terms", **{"source_ip": ["xxxx"]})
            .filter("terms", **{"destination_ip": ["xxxx"]})
            .filter("terms", **{"sensor": ["xxxx"]})
            .filter("terms", **{"query": ["xxxx"]}))
```
This gets translated to a `bool` query with multiple `filter` clauses.
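For reference, the request body it generates (via `Search.to_dict()`) looks roughly like this, with placeholder values:

```json
{
  "query": {
    "bool": {
      "filter": [
        {"range": {"@timestamp": {"gte": "...", "lte": "..."}}},
        {"terms": {"source_ip": ["xxxx"]}},
        {"terms": {"destination_ip": ["xxxx"]}},
        {"terms": {"sensor": ["xxxx"]}},
        {"terms": {"query": ["xxxx"]}}
      ]
    }
  }
}
```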
Does Elasticsearch do any query optimization on filters, or does it pass that responsibility to Lucene? How does Lucene execute them? Are the filters run in a particular order? Since the filter context caches results, will a separate cache entry be created for each filter in the query above, or a single entry with the aggregate result of all the filters?
The problem I am hitting is that after upgrading to 7.16.1 the queries have become very slow (as shown in the linked issue). Profiling a query of the form:

{"source_ip": XX, "destination_ip": "10.11.100.100", "query": YY, "sensors": ZZ} in time period (a, b)

shows that it takes ~300ms on one shard (hits < 10), with the bulk of the time spent filtering on `destination_ip:10.11.100.100`.
If I trim the query to remove `destination_ip`:

{"source_ip": XX, "query": YY, "sensors": ZZ} in time period (a, b)

the profiler shows the same query completing in < 1ms (with the same number of hits), since the most expensive filter is no longer run.
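For completeness, this is roughly how I build the profiled request body. It is a minimal sketch as a plain dict (placeholder parameter names, no client-specific calls assumed): the same `bool`/`filter` clauses as above, plus the top-level `"profile": true` flag that makes Elasticsearch return per-query timings.

```python
# Build the bool/filter body by hand and enable the Profile API,
# so the response includes a breakdown of time spent per filter.
def profiled_body(fseen, lseen, source_ip, destination_ip, sensor, query):
    filters = [
        {"range": {"@timestamp": {"gte": fseen, "lte": lseen}}},
        {"terms": {"source_ip": [source_ip]}},
        {"terms": {"destination_ip": [destination_ip]}},
        {"terms": {"sensor": [sensor]}},
        {"terms": {"query": [query]}},
    ]
    return {"profile": True, "query": {"bool": {"filter": filters}}}
```

This body can then be sent as-is to the `_search` endpoint of the daily index.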
Are there any particular reasons this could be happening?