Inconsistent results from repeated queries when specifying a custom _cache_key for terms lookup filters

Looking for a little guidance/suggestions on how to debug an issue I'm having. I'm performing a non-trivial search, and each time I execute the same search I get back one of two results. One result has 3 hits and the other has about a dozen hits (which include the 3 hits from the first result). So, for example, I will get back document IDs A, D, and G on one execution and A, B, C, D, E, F, G on the next. If I keep executing the search in succession, it toggles back and forth between the two results.

My search contains a number of nested boolean queries, function score queries, etc. I am also using a post-filter with 6 different terms lookup filters that are cached and set to expire every hour.

From my elasticsearch.yml:

indices.cache.filter.terms.expire_after_access: 3600s
indices.cache.filter.terms.expire_after_write: 3600s

I have a 3-node cluster running v1.7.1. The index has 5 shards and 2 replicas, so each node holds a copy of every shard (either a primary or a replica).

This issue happens sporadically and isn't reproducible (at least I haven't discovered how to reproduce it on demand yet). When it does happen, I can query each of the nodes and the cluster, and the problem only occurs on a single node. I haven't been able to figure out yet whether it is always the same node that has the problem. There is activity (indexing/searching) on my cluster while I'm executing these searches, but nothing that would affect the documents that should match this query.
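To make the "only one node is off" observation easier to pin down, here is a minimal Python sketch that compares the hit IDs collected from each node and reports the ones that disagree with the majority. The node names and the sample IDs are hypothetical (they just reuse the A..G example above), and how the hits are collected per node is left out. As far as I recall, 1.x accepts a search preference such as `_only_node:<node_id>` for that, but treat that as an assumption to check against the docs.

```python
from collections import Counter


def find_divergent_nodes(hits_by_node):
    """Return the nodes whose hit IDs differ from the most common result.

    hits_by_node maps a node name to the list of document IDs that node
    returned for the same search.
    """
    # Compare results as sets so hit ordering does not matter.
    as_sets = {node: frozenset(ids) for node, ids in hits_by_node.items()}
    # The most frequently seen result is treated as the "expected" one.
    majority, _ = Counter(as_sets.values()).most_common(1)[0]
    return {
        node: {
            "missing": sorted(majority - ids),
            "extra": sorted(ids - majority),
        }
        for node, ids in as_sets.items()
        if ids != majority
    }


# Hypothetical sample data mirroring the example in the post.
observed = {
    "node-1": ["A", "B", "C", "D", "E", "F", "G"],
    "node-2": ["A", "D", "G"],
    "node-3": ["A", "B", "C", "D", "E", "F", "G"],
}
divergent = find_divergent_nodes(observed)
print(divergent)
```

Logging this on every run would show whether the same node is the odd one out each time, which is the part that is still unknown here.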

Any help anyone can offer would be greatly appreciated. At this point I'm just trying to isolate the issue so that I can make a targeted fix rather than blindly changing my query structure (like simplifying my post filters).

Thanks

I've been able to make some progress debugging this and it appears the issue lies in the terms lookup filters. My filters look like...

{
    "terms": {
        "_id": {
            "type": "group",
            "id": "7",
            "path": "groupvals"
        },
        "_cache_key": "group_7_vals"
    }
}

I specify the _cache_key so that I can clear the cache manually when I update the group document. If I remove my _cache_key from the query, I get back consistent results. Additionally, if I specify "_cache" : false in my terms lookup filters, I get consistent results as well.

So, it seems to be an issue with manually specifying the _cache_key on a terms lookup filter.
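For anyone who wants to repeat the comparison, the three variants described above (custom `_cache_key`, default cache key, and `"_cache": false`) can be generated from one helper. This is only a sketch: the helper names are made up, the `type`/`path` values are copied from the filter above, and the clear-cache path with `filter_keys` is how I understand the 1.x clear cache API, so verify it before relying on it.

```python
from urllib.parse import urlencode


def terms_lookup_filter(group_id, cache_key=None, cache=None):
    """Build a terms lookup filter on _id for the given group document."""
    terms = {
        "_id": {
            "type": "group",
            "id": str(group_id),
            "path": "groupvals",
        }
    }
    # Only add the cache options when explicitly requested, so the
    # "default" variant matches a filter with neither option set.
    if cache_key is not None:
        terms["_cache_key"] = cache_key
    if cache is not None:
        terms["_cache"] = cache
    return {"terms": terms}


def clear_cache_path(cache_key):
    """Path for clearing one filter cache key (assumed 1.x API shape)."""
    return "/_cache/clear?" + urlencode({"filter_keys": cache_key})


with_key = terms_lookup_filter(7, cache_key="group_7_vals")  # inconsistent
default_key = terms_lookup_filter(7)                         # consistent
uncached = terms_lookup_filter(7, cache=False)               # consistent
print(clear_cache_path("group_7_vals"))
```

Swapping only the filter variant while keeping the rest of the search identical should confirm whether the custom key is really the deciding factor.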

Anyone run into this before?

We are facing a similar problem with _cache_key: we see inconsistent results while doing a simple terms lookup. We are currently using Elasticsearch 1.6.3 and have a three-node cluster.

Strangely, the output changes on every third query we make to the cluster. We consistently receive 0, 0, 2, 0, 0, 2... as the doc count, and it appears that the cached value is not being synchronized across all the nodes. The logs on each node do not indicate anything untoward either.
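One way to picture the 0, 0, 2 pattern is a toy model, not a description of Elasticsearch internals. It assumes (nothing in this thread confirms it) that requests rotate round-robin over three copies of the data and that each copy keeps its own cached term list under the same key. All names and values below are hypothetical.

```python
from itertools import cycle, islice

# Hypothetical cached term lists, one per copy, all stored under the
# same cache key. Two copies hold an empty (stale) entry.
cached_terms_by_copy = {
    "copy-0": set(),
    "copy-1": set(),
    "copy-2": {"doc1", "doc2"},
}

# Hypothetical documents that the lookup could match.
docs_on_shard = {"doc1", "doc2", "doc3"}


def doc_counts(n_queries):
    """Doc count seen by n_queries successive requests, served round-robin."""
    copies = cycle(cached_terms_by_copy)  # iterates the copy names in order
    return [
        len(docs_on_shard & cached_terms_by_copy[copy])
        for copy in islice(copies, n_queries)
    ]


print(doc_counts(6))
```

If the real behaviour follows this model, pinning the search to a single node should make the count stable for that node, which would be a quick way to test the hypothesis.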

I see this was raised in 2015. @pmichel Were you able to find out the reason for the inconsistency?

Has anyone else faced such a problem?

I just ended up removing use of the cache. It wasn't doing much for me.