Logstash is no longer able to ingest into the cluster, and the task count keeps growing rapidly with no sign of slowing down; it is already past 400k. Some tasks have been running for days while new ones keep being created.
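For reference, the count above is rough; I'm tracking it with the task APIs (default host/port):

GET _cat/tasks
GET _tasks?group_by=parents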
Data node A shows the following node failure in the response from GET _tasks:
"node_failures" : [
{
"type" : "failed_node_exception",
"reason" : "Failed node [Fq7Dm5VRRXGbYanMeysPRw]",
"node_id" : "Fq7Dm5VRRXGbYanMeysPRw",
"caused_by" : {
"type" : "transport_serialization_exception",
"reason" : "Failed to deserialize response from handler [org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler]",
"caused_by" : {
"type" : "illegal_argument_exception",
"reason" : "Unknown NamedWriteable [org.elasticsearch.tasks.Task$Status][resync]"
}
}
}
],
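Given the [resync] status named in that error, I tried filtering the task list by action to confirm what these tasks actually are. I don't know the exact resync action name, so this uses a wildcard:

GET _tasks?detailed=true&actions=*resync*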
Most of the 400k tasks are running on data node A. Restarting that node only shifted the tasks to data node B, and restarting the elected master does not help either.
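To confirm where the tasks live, I listed them per node, using the node id from the failure above:

GET _tasks?nodes=Fq7Dm5VRRXGbYanMeysPRw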
Most ES nodes are also logging the following error:
failed to index audit event: [access_granted]. internal queue is full, which may be caused by a high indexing rate or issue with the destination.
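As a stopgap I'm considering routing audit events to the log file instead of the index output, assuming the index output is what's filling that internal queue. This is an elasticsearch.yml change, so it would need a rolling restart, and I haven't applied it yet:

xpack.security.audit.enabled: true
xpack.security.audit.outputs: [ logfile ]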
Despite all of this, cluster health still reports green:
{
  "cluster_name" : "prod-cluster",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 8,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 4821,
  "active_shards" : 9643,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}
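Since health is green, I also grabbed hot threads on the busy data node in case it shows where these tasks are stuck:

GET _nodes/hot_threads?threads=5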
This all started after an upgrade from 6.4.1 to 6.5.4. How can I troubleshoot this further?
Thanks much.