Understanding Elasticsearch filter context

I have some doubts about how filter context works in the Elasticsearch/Lucene ecosystem, and I would be grateful if someone could clear them up. This is related to another problem I have been facing recently.

I have an index that stores DNS requests/responses. There is one index per day, and older indices are force-merged to 1 segment. Each daily index has around 50-60M docs (12-15 GB in total), divided into 2 shards with replicas set to 1.

Typical documents have the following schema:

{"source_ip": IP,
 "destination_ip": IP,
 "source_port": long,
 "destination_port": long,
 "sensor": keyword,
 "query": keyword,
 "answers": text,
 ... }

Most entries have the same destination_ip, while the rest of the fields differ. The destination_ip breakdown is:

  • - 50%
  • - 35%
  • Others - 15%

Typically I need to search for (source_ip, destination_ip, query, sensor) within a time window and return only documents that match ALL of these criteria. The search runs against older indices (that have been force-merged). From the docs I understand that a bool query with filter clauses is the best bet in such cases, as it uses filter context.

Using elasticsearch-py and elasticsearch-dsl, I usually form queries like so:

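import elasticsearch_dsl as ESD  # alias assumed from the ESD.Search usage below

# es is an existing Elasticsearch client; fseen/lseen bound the time window
src_srch = (ESD.Search(using=es, index="dns-YYYY.MM.DD")
            .filter("range", **{"@timestamp": {"lte": lseen, "gte": fseen}})
            .filter("terms", **{"source_ip": ["xxxx"]})
            .filter("terms", **{"destination_ip": ["xxxx"]})
            .filter("terms", **{"sensor": ["xxxx"]})
            .filter("terms", **{"query": ["xxxx"]}))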

This gets translated to a bool query with multiple filter clauses.
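To make the translation concrete, here is a minimal sketch of the request body I believe the chained .filter() calls end up producing. It is plain Python with no Elasticsearch dependency; the helper name build_filter_body and all the values passed to it are hypothetical placeholders, not taken from my real data.

```python
import json


def build_filter_body(fseen, lseen, source_ip, destination_ip, sensor, query):
    """Return a bool query whose clauses all sit in filter context (no scoring)."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"range": {"@timestamp": {"gte": fseen, "lte": lseen}}},
                    {"terms": {"source_ip": [source_ip]}},
                    {"terms": {"destination_ip": [destination_ip]}},
                    {"terms": {"sensor": [sensor]}},
                    {"terms": {"query": [query]}},
                ]
            }
        }
    }


# Placeholder values for illustration only.
body = build_filter_body(
    fseen="2021-12-01T00:00:00Z",
    lseen="2021-12-01T01:00:00Z",
    source_ip="192.0.2.10",
    destination_ip="192.0.2.53",
    sensor="sensor-1",
    query="example.com",
)
print(json.dumps(body, indent=2))
```

Every clause is a sibling under bool.filter, so a document has to match all five to be returned.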

Does Elasticsearch do some query optimization on filters, or does it pass this responsibility to Lucene? How does Lucene do this? Are the filters run in a particular order? Filter context caches results, so for the above query, will a separate cache entry be created for each filter, or a single one holding the combined result of all the filters?

The problem I am hitting is that after upgrading to 7.16.1 the queries have become very slow (as shown in the linked issue). Profiling a query of the type:

{"source_ip": XX, "destination_ip": "", "query": YY, "sensors": ZZ} in time period (a, b)

shows the query taking 300ms on one shard (hits < 10), with the bulk of the time spent filtering on destination_ip.

If I trim the query to remove destination_ip:

{"source_ip": XX, "query": YY, "sensors": ZZ} in time period (a, b)

the profiler shows the query completing in < 1ms (same number of hits), as the most expensive filter is not run.

Are there any particular reasons this could be happening?
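For anyone wanting to reproduce the comparison, here is a rough sketch of how the per-clause timings can be pulled out of a Profile API response. The helper names with_profile and slowest_clauses are hypothetical, and the sample response is made up purely to exercise the code; it is not my real profiler output. It assumes the usual response layout of profile -> shards -> searches -> query, with nested children.

```python
def with_profile(body):
    """Return a copy of a search body with the Profile API switched on."""
    profiled = dict(body)
    profiled["profile"] = True
    return profiled


def slowest_clauses(response, top=5):
    """Flatten the profile tree into (time_in_nanos, type, description), slowest first.

    A parent's time includes the time of its children, so the root clause
    normally comes out on top.
    """
    rows = []

    def walk(node):
        rows.append((node["time_in_nanos"], node["type"], node["description"]))
        for child in node.get("children", []):
            walk(child)

    for shard in response["profile"]["shards"]:
        for search in shard["searches"]:
            for node in search["query"]:
                walk(node)
    return sorted(rows, reverse=True)[:top]


# Made-up response fragment, for demonstration only.
sample = {"profile": {"shards": [{"searches": [{"query": [{
    "type": "BooleanQuery", "description": "bool", "time_in_nanos": 900,
    "children": [
        {"type": "PointRangeQuery", "description": "destination_ip", "time_in_nanos": 800},
        {"type": "TermQuery", "description": "sensor", "time_in_nanos": 50},
    ],
}]}]}]}}

for nanos, clause_type, description in slowest_clauses(sample):
    print(nanos, clause_type, description)

# With elasticsearch-py 7.x the real call would look roughly like:
#   response = es.search(index="dns-YYYY.MM.DD", body=with_profile(body))
```

Running slowest_clauses on the responses for the full query and the trimmed query side by side makes it easy to see which clause accounts for the difference.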


Filters in a bool query sound like the most efficient approach here, as long as you don't need any scoring. Internally, Lucene is able to cache each filter separately once it has been reused a certain number of times. See also the Elastic blog post "Elasticsearch caching deep dive: Boosting query speed one cache at a time".
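To picture what "cache each filter separately" means, here is a toy model in plain Python. It is not Lucene's implementation: the class ToyFilterCache, the min_uses threshold, and the in-memory docs are all invented for illustration, and the real cache works per segment and applies further conditions that this sketch ignores.

```python
from collections import Counter


class ToyFilterCache:
    """Toy model: cache each filter's matching doc IDs once it has been used min_uses times."""

    def __init__(self, min_uses=2):
        self.min_uses = min_uses
        self.uses = Counter()         # how often each uncached filter has been seen
        self.evaluations = Counter()  # how often each filter was actually computed
        self.cached = {}              # filter name -> frozenset of doc IDs

    def doc_ids(self, name, predicate, docs):
        if name in self.cached:
            return self.cached[name]
        self.uses[name] += 1
        self.evaluations[name] += 1
        matches = frozenset(i for i, doc in enumerate(docs) if predicate(doc))
        if self.uses[name] >= self.min_uses:
            self.cached[name] = matches
        return matches


def run_filters(cache, filters, docs):
    """AND the filters together by intersecting their per-filter doc ID sets."""
    sets = [cache.doc_ids(name, pred, docs) for name, pred in filters.items()]
    return sorted(frozenset.intersection(*sets))


docs = [
    {"source_ip": "a", "destination_ip": "x"},
    {"source_ip": "b", "destination_ip": "x"},
    {"source_ip": "a", "destination_ip": "y"},
]
dest_x = lambda d: d["destination_ip"] == "x"
query_a = {"source_ip=a": lambda d: d["source_ip"] == "a", "destination_ip=x": dest_x}
query_b = {"source_ip=b": lambda d: d["source_ip"] == "b", "destination_ip=x": dest_x}

cache = ToyFilterCache(min_uses=2)
first = run_filters(cache, query_a, docs)
second = run_filters(cache, query_b, docs)
third = run_filters(cache, query_a, docs)
print(first, second, third, dict(cache.evaluations))
```

The point of the model is that the destination_ip filter is shared between two different bool queries and gets its own cache entry, instead of one entry being stored for each whole combination of filters.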

That said, I suppose it makes more sense to focus on the issue in the other post you wrote first, see if that can get fixed and then go from there.


Thanks @spinscale for the explanation regarding Lucene caching!

I found a post that addresses some of these questions. It's not a recent one, but I suspect the general framework is still the same.

Also, though I'm not sure it will mitigate the problem, Index Sorting looks relevant here.

Thanks, I had seen this link but was hoping for something more recent.

Index sorting unfortunately will not fit the bill, as the indices are write-heavy.
