Cluster eventually starts giving absurdly wrong counts on search

We have a cluster running version 5.6.16. It has ~5.7k primary shards, ~2k indices and 28 nodes (3 masters, 3 coordinators and 22 data nodes):

{
  "cluster_name": "foo",
  "status": "green",
  "timed_out": false,
  "number_of_nodes": 28,
  "number_of_data_nodes": 22,
  "active_primary_shards": 5778,
  "active_shards": 11556,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 0,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 100
}

Eventually, any search we run against certain indices returns an absurdly high doc count, even when no results are found.

For instance:

curl -s 'http://localhost:9200/*/_search?q=nope:thiswillneverexist&terminate_after=1' | jq -r '.'
{
  "took": 871,
  "timed_out": false,
  "terminated_early": false,
  "num_reduce_phases": 12,
  "_shards": {
    "total": 5778,
    "successful": 5778,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 9787770,
    "max_score": 2,
    "hits": []
  }
}
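
In case it helps with diagnosing, here's a rough sketch of how the same impossible query can be run per index, to flag which indices report a non-zero total (it assumes jq, as above, and reuses the same made-up field name):

for idx in $(curl -s 'http://localhost:9200/_cat/indices?h=index'); do
  # size=0: we only care about hits.total, not the documents themselves
  total=$(curl -s "http://localhost:9200/${idx}/_search?q=nope:thiswillneverexist&size=0" | jq -r '.hits.total')
  # in 5.6, hits.total is a plain number, so a simple string compare is enough
  if [ "$total" != "0" ]; then
    echo "${idx}: ${total}"
  fi
done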

I couldn't find anything in the logs, any correlation with other events, or any similar issues/forum threads/etc. (maybe I just don't know exactly what to search for).

The only workaround we found was to restart all the data nodes :frowning:

Anyway, has anyone seen anything like this? Anything I could investigate?

Thanks!

FWIW, this happens even when no results should be found, and once the issue appears, the count is always exactly the same...
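
To be concrete about "always the same", this is the kind of check I mean (field names are made up; each query should obviously match nothing):

for field in nope1 nope2 nope3; do
  # once the issue kicks in, every one of these prints the same bogus hits.total
  curl -s "http://localhost:9200/*/_search?q=${field}:thiswillneverexist&size=0" | jq -r '.hits.total'
done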

That looks weird.

Unfortunately this version is not maintained anymore so even if it's a bug, it won't be fixed.

I wonder if you can still see this problem after upgrading to 6.8 or, better, 7.9.

Unfortunately this version is not maintained anymore so even if it's a bug, it won't be fixed.

Yes, that was my fear...

I wonder if you can still see this problem after upgrading to 6.8 or, better, 7.9.

Not that easy to do in our case, unfortunately...


Still hoping I can find a workaround a bit better than restarting the whole cluster, though :slightly_frowning_face:
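
In the meantime, when we do have to bounce nodes, I'd at least rather do it one data node at a time, roughly the usual rolling-restart sequence (the service name is just an example, adjust for your setup):

# disable shard allocation before restarting a data node
curl -s -XPUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' \
  -d '{"transient": {"cluster.routing.allocation.enable": "none"}}'

# restart Elasticsearch on that node (service name is just an example)
sudo systemctl restart elasticsearch

# once the node has rejoined, re-enable allocation
curl -s -XPUT 'http://localhost:9200/_cluster/settings' -H 'Content-Type: application/json' \
  -d '{"transient": {"cluster.routing.allocation.enable": "all"}}'

# wait for green before moving on to the next data node
curl -s 'http://localhost:9200/_cluster/health?wait_for_status=green&timeout=30m'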