Cluster flooded with tasks and growing!

Logstash is no longer able to ingest into the cluster, and tasks are still growing rapidly with no sign of slowing down. There are already 400k+ as of now, including tasks that have been running for days while new ones keep arriving.

Data node A returns the following from GET /.tasks:

"node_failures" : [
  {
    "type" : "failed_node_exception",
    "reason" : "Failed node [Fq7Dm5VRRXGbYanMeysPRw]",
    "node_id" : "Fq7Dm5VRRXGbYanMeysPRw",
    "caused_by" : {
      "type" : "transport_serialization_exception",
      "reason" : "Failed to deserialize response from handler [org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler]",
      "caused_by" : {
        "type" : "illegal_argument_exception",
        "reason" : "Unknown NamedWriteable [org.elasticsearch.tasks.Task$Status][resync]"
      }
    }
  }
]

Most of the 400k tasks are running on data node A. Restarting it just shifted the tasks to data node B. Restarting the elected master does not help either.
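
In case it helps others reproduce this, the running tasks can also be inspected with the task management API (GET _tasks) rather than the .tasks results index. A sketch; the node id is taken from the error above, and the action filter is just an example:

```
# Running tasks grouped by parent task, with descriptions
GET _tasks?detailed&group_by=parents

# Tasks currently executing on one node, filtered by action pattern
GET _tasks?nodes=Fq7Dm5VRRXGbYanMeysPRw&actions=indices:*
```

The `detailed` flag adds per-task descriptions, which makes it easier to see what kind of task is piling up.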

Most ES nodes are also logging the error below:

failed to index audit event: [access_granted]. internal queue is full, which may be caused by a high indexing rate or issue with the destination.
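
That message comes from the X-Pack audit trail's "index" output, whose internal queue fills up when the cluster cannot keep up with indexing. A sketch of a workaround, assuming security auditing is enabled in elasticsearch.yml on a 6.x cluster; switching to the logfile output removes the audit trail's dependency on cluster indexing:

```yaml
# elasticsearch.yml (6.x) -- sketch; requires a node restart
xpack.security.audit.enabled: true
xpack.security.audit.outputs: [ logfile ]
```

This only silences the symptom (audit events no longer compete for indexing capacity); it does not address whatever is overloading the cluster.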

Cluster health still reports green:

{
  "cluster_name" : "prod-cluster",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 8,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 4821,
  "active_shards" : 9643,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

This started after an upgrade from 6.4.1 to 6.5.4. How can I troubleshoot this?

Thanks much.

This was fixed in 6.6.0 - see below. It is a symptom of an overloaded cluster and not the source of your problem, but unfortunately it gets in the way of diagnosing the actual cause.

Thanks for your reply.

We stopped the Logstash nodes a few hours ago, brought them back up, and the error went away.

As you said, it's not the source of the problem, but how can we flush or reduce the tasks running there, assuming we are not sure of the root cause?
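
For what it's worth, cancellable tasks can be dropped through the cancel API. A sketch only; the action pattern and node id are examples from this thread, and not every task type (resync replication tasks in particular) is necessarily cancellable:

```
# Cancel cancellable tasks matching an action pattern
POST _tasks/_cancel?actions=*reindex*

# Or cancel everything cancellable on a single node
POST _tasks/_cancel?nodes=Fq7Dm5VRRXGbYanMeysPRw
```

If the root cause is still producing tasks, cancellation only buys time; the backlog will rebuild.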

Step 1 is to find the root cause.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.