Cluster overflooded with tasks and growing!

dropchew · March 1, 2019, 5:48am

Logstash can no longer ingest into the cluster, and the number of tasks keeps growing rapidly with no sign of slowing down. There are already more than 400k. Some tasks have been running for days, and new ones keep being added.

Data node A returns the following from GET /_tasks:

"node_failures" : [
  {
    "type" : "failed_node_exception",
    "reason" : "Failed node [Fq7Dm5VRRXGbYanMeysPRw]",
    "node_id" : "Fq7Dm5VRRXGbYanMeysPRw",
    "caused_by" : {
      "type" : "transport_serialization_exception",
      "reason" : "Failed to deserialize response from handler [org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler]",
      "caused_by" : {
        "type" : "illegal_argument_exception",
        "reason" : "Unknown NamedWriteable [org.elasticsearch.tasks.Task$Status][resync]"
      }
    }
  }
],

Most of the 400k tasks are running on data node A. Restarting it just shifted the tasks to data node B. Restarting the elected master did not help either.

Most ES nodes are also logging the following error:

failed to index audit event: [access_granted]. internal queue is full, which may be caused by a high indexing rate or issue with the destination.

The cluster health still reports green:

{
  "cluster_name" : "prod-cluster",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 8,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 4821,
  "active_shards" : 9643,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

This started after an upgrade from 6.4.1 to 6.5.4. How can I troubleshoot this?

Thanks much.

DavidTurner · March 1, 2019, 7:28am

This was fixed in 6.6.0 - see below. It is a symptom of an overloaded cluster and not the source of your problem, but unfortunately it gets in the way of diagnosing the actual cause.

github.com/elastic/elasticsearch

Register ResyncTask.Status as a NamedWriteable

elastic:master ← DaveCTurner:2018-12-13-resync-task-status-namedwriteable-registration

opened 05:57PM - 13 Dec 18 UTC

DaveCTurner

+21 -0

Today, ResyncTask.Status is not registered, but appears as a task status sometimes, leading to `Failed to deserialize response from handler` exceptions: java.lang.IllegalArgumentException: Unknown NamedWriteable [org.elasticsearch.tasks.Task$Status][resync]. This commit adds the missing registration.
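
The failure mode described in the pull request can be illustrated with a toy model: the receiving side looks up a reader by category and name, and an unregistered name fails before any data is decoded. This is a simplified Python analogy, not Elasticsearch's Java implementation; the class, its method names, the "other" status, and the payload are all hypothetical.

```python
class NamedRegistry:
    """Toy name-to-reader registry (an analogy, not Elasticsearch code)."""

    def __init__(self):
        self._readers = {}

    def register(self, category, name, reader):
        # Associate a (category, name) pair with a function that decodes it.
        self._readers[(category, name)] = reader

    def read(self, category, name, payload):
        # An unregistered name cannot be decoded, whatever the payload is.
        try:
            reader = self._readers[(category, name)]
        except KeyError:
            raise ValueError(
                f"Unknown NamedWriteable [{category}][{name}]"
            ) from None
        return reader(payload)


registry = NamedRegistry()
registry.register("Task$Status", "other", dict)  # hypothetical status type

# Before the fix: "resync" is missing, so reading the response fails.
try:
    registry.read("Task$Status", "resync", {"phase": "finished"})
    message = None
except ValueError as error:
    message = str(error)

# After the fix: the missing registration is added and the read succeeds.
registry.register("Task$Status", "resync", dict)
status = registry.read("Task$Status", "resync", {"phase": "finished"})
```

In this model the lookup fails only when a response actually carries the unregistered status, which matches the pull request's remark that the status appears "sometimes".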

dropchew · March 1, 2019, 7:48am

Thanks for your reply.

We stopped the Logstash nodes hours ago and brought them back up, and the error went away.

As you said, it's not the source of the problem, but how can we flush or reduce the tasks running there, assuming we are not sure of the root cause?

DavidTurner · March 1, 2019, 8:11am

Step 1 is to find the root cause.
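
One way to begin narrowing this down is to group the output of the task list API by node and by action, to see which action accounts for most of the 400k tasks. Below is a minimal sketch in Python. It assumes the response has been saved and parsed as JSON, with a top-level "nodes" object whose entries each contain a "tasks" object. The function name, the one-hour threshold, and the sample data are made up for illustration.

```python
from collections import Counter


def summarize_tasks(tasks_response, slow_after_seconds=3600):
    """Count tasks per node and per action, and list long-running ones."""
    per_node = Counter()
    per_action = Counter()
    long_running = []
    for node_id, node in tasks_response.get("nodes", {}).items():
        node_name = node.get("name", node_id)
        for task_id, task in node.get("tasks", {}).items():
            action = task.get("action", "unknown")
            per_node[node_name] += 1
            per_action[action] += 1
            running_seconds = task.get("running_time_in_nanos", 0) / 1e9
            if running_seconds >= slow_after_seconds:
                long_running.append((task_id, action, running_seconds))
    # Longest-running tasks first.
    long_running.sort(key=lambda item: item[2], reverse=True)
    return per_node, per_action, long_running


# Hypothetical sample shaped like a task list response (not real data).
sample = {
    "nodes": {
        "node-a-id": {
            "name": "data-node-a",
            "tasks": {
                "node-a-id:101": {
                    "action": "indices:data/write/bulk[s]",
                    "running_time_in_nanos": 172_800_000_000_000,
                },
                "node-a-id:102": {
                    "action": "indices:data/write/bulk[s]",
                    "running_time_in_nanos": 5_000_000_000,
                },
            },
        },
        "node-b-id": {
            "name": "data-node-b",
            "tasks": {
                "node-b-id:7": {
                    "action": "indices:data/write/bulk",
                    "running_time_in_nanos": 7_200_000_000_000,
                },
            },
        },
    }
}

per_node, per_action, long_running = summarize_tasks(sample)
print(per_node.most_common(5))
print(per_action.most_common(5))
print(long_running[:10])
```

Nodes that show up under "node_failures" return no tasks, so they will be missing from these counts until the deserialization error is out of the way.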
