Failed to process cluster event (create-index-template) within 30s

If I am willing to throw away all the data and start indexing from scratch, will there still be a lot of configuration changes?

Sorry, it's hard to answer that without knowing a lot more. That's what the upgrade assistant and deprecation loggers are for.

Update ... all WORKING!!


  • We have 5 different types of services feeding this Kibana server
  • All use filebeat
  • The 2 heaviest run about 700 instances each

It seems that when filebeat loses its connection with the master and then re-establishes it, it sends these create-index-template calls by default (I checked the documentation: that option is on by default, and it was omitted in our config files).

It was these create-index-template calls that were essentially crushing the master.

Your suggestion to block or turn off some of those requests was a good one, and fortunately I was able to block some of my 5 different service types with separate firewall rules.

I blocked the two 700-instance service types first (the brunt of our logging).
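For reference, the blocking was along these lines. This is a sketch only: the subnets, the port, and the chain are hypothetical stand-ins, assuming filebeat ships directly to Elasticsearch on 9200.

```
# Hypothetical example: drop filebeat traffic from the two heavy service
# subnets before it reaches the Elasticsearch HTTP port. The subnets and
# port here are placeholders, not our real values.
iptables -A INPUT -s 10.1.0.0/16 -p tcp --dport 9200 -j DROP
iptables -A INPUT -s 10.2.0.0/16 -p tcp --dport 9200 -j DROP

# Later, to let one group of instances back in, delete its rule:
iptables -D INPUT -s 10.1.0.0/16 -p tcp --dport 9200 -j DROP
```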

Then monitored the master using:

$ curl -s 'localhost:9200/_cluster/health?pretty'

{
  "cluster_name" : "logsearch",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 6,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 65,
  "active_shards" : 130,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

I was watching number_of_pending_tasks and task_max_waiting_in_queue_millis in particular.
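To avoid re-running the command by hand, a small wrapper works too. This is a sketch, assuming the cluster is reachable on localhost:9200; the function name pending_stats is mine, not anything built in.

```shell
# pending_stats: pull just the two fields worth watching out of a
# _cluster/health response read from stdin.
pending_stats() {
  grep -E '"(number_of_pending_tasks|task_max_waiting_in_queue_millis)"'
}

# Poll every 10 seconds (assumes the health endpoint on localhost:9200):
# while true; do
#   curl -s 'localhost:9200/_cluster/health?pretty' | pending_stats
#   sleep 10
# done
```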

The other useful command (provided by you) was:

$ curl -s 'localhost:9200/_cluster/pending_tasks?pretty' | grep '"source"' | sort | uniq -c

   3       "source": "_add_listener_",
   1       "source": "cluster_reroute(async_shard_fetch)",
 533       "source": "create-index-template [filebeat-6.1.2], cause [api]",
   6       "source": "put-mapping",
   1       "source": "put-pipeline-filebeat-6.1.1-iot-log4j-pipeline",
316087       "source": "shard-failed",
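The grep | sort | uniq -c tally can be checked on canned input, which makes it clear what the leading counts mean. The sample lines below are made up, and the trailing sort -rn (biggest offender first) is my addition, not part of the original command.

```shell
# Count duplicate "source" lines: uniq -c prefixes each distinct line
# with how many times it appeared, and sort -rn puts the largest first.
printf '%s\n' \
  '      "source": "shard-failed",' \
  '      "source": "shard-failed",' \
  '      "source": "shard-failed",' \
  '      "source": "put-mapping",' \
  | grep '"source"' | sort | uniq -c | sort -rn
# shard-failed is counted 3 times, put-mapping once.
```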

When these numbers settled down (i.e. zeroed out) and the server had caught up with the load, I removed one of the firewall blocks and let 700 new instances start sending in their data.

All these numbers climbed again, but settled to hover around

number_of_pending_tasks: 130ish
task_max_waiting_in_queue_millis: 30 seconds

instead of the massive numbers I was seeing earlier.

It took about 4 hours to catch up, ending with no pending tasks and zero queue waiting time.

Then I did the same for the next 700 instances.

Now that filebeat is connected, it does not appear to be re-sending the create-index-template calls; logging is succeeding and the server is doing what it should.

Really appreciated your insights and suggestions.


Glad to hear it's working, and thanks for the useful summary!

I would still suggest setting the filebeat option setup.template.enabled: false across your estate as soon as you can, or else I think this might come back to bite you the next time you restart your cluster.
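In filebeat.yml that looks like the fragment below. Only the setup.template.enabled line is the actual fix; the output section is illustrative, assuming filebeat ships straight to Elasticsearch, and the hostname is a placeholder.

```
# filebeat.yml
# Stop filebeat from re-submitting the index template on every
# (re)connection; load the template once out of band instead.
setup.template.enabled: false

# Illustrative output section (hostname is a placeholder):
output.elasticsearch:
  hosts: ["es-master.example.com:9200"]
```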
