Failed to process cluster event (create-index-template) within 30s

If I am willing to throw away all the data and start indexing from scratch, will there still be a lot of configuration changes?

Sorry, it's hard to answer that without knowing a lot more. That's what the upgrade assistant and deprecation loggers are for.

Update ... all WORKING!!


  • We have 5 different types of services feeding this Kibana server
  • All use filebeat
  • The 2 heaviest run about 700 instances each

It seems that when filebeat loses its connection with the master and then re-establishes it, it sends these create-index-template calls by default (I checked the documentation: that option is on by default, and it was omitted in our config files).

It was these create-index-template calls that were essentially crushing the master.

Your suggestion to block or turn off some of those requests was a good one, and fortunately I was able to block some of my 5 different service types with separate firewall rules.

I blocked the two 700-instance service types first (the brunt of our logging).
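For reference, the blocking was along these lines. This is a sketch only: the subnets, the port, and the chain are hypothetical stand-ins, assuming filebeat ships directly to Elasticsearch on 9200.

```
# Hypothetical example: drop filebeat traffic from the two heavy service
# subnets before it reaches the Elasticsearch HTTP port. The subnets and
# port here are placeholders, not our real values.
iptables -A INPUT -s 10.1.0.0/16 -p tcp --dport 9200 -j DROP
iptables -A INPUT -s 10.2.0.0/16 -p tcp --dport 9200 -j DROP

# Later, to let one group of instances back in, delete its rule:
iptables -D INPUT -s 10.1.0.0/16 -p tcp --dport 9200 -j DROP
```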

Then monitored the master using:

$ curl -s 'localhost:9200/_cluster/health?pretty'

{
  "cluster_name" : "logsearch",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 6,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 65,
  "active_shards" : 130,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

I was watching number_of_pending_tasks and task_max_waiting_in_queue_millis in particular.
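To avoid re-running the command by hand, a small wrapper works too. This is a sketch, assuming the cluster is reachable on localhost:9200; the function name pending_stats is mine, not anything built in.

```shell
# pending_stats: pull just the two fields worth watching out of a
# _cluster/health response read from stdin.
pending_stats() {
  grep -E '"(number_of_pending_tasks|task_max_waiting_in_queue_millis)"'
}

# Poll every 10 seconds (assumes the health endpoint on localhost:9200):
# while true; do
#   curl -s 'localhost:9200/_cluster/health?pretty' | pending_stats
#   sleep 10
# done
```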

The other useful command (provided by you) was:

$ curl -s 'localhost:9200/_cluster/pending_tasks?pretty' | grep '"source"' | sort | uniq -c

   3       "source": "_add_listener_",
   1       "source": "cluster_reroute(async_shard_fetch)",
 533       "source": "create-index-template [filebeat-6.1.2], cause [api]",
   6       "source": "put-mapping",
   1       "source": "put-pipeline-filebeat-6.1.1-iot-log4j-pipeline",
316087       "source": "shard-failed",
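The grep | sort | uniq -c tally can be checked on canned input, which makes it clear what the leading counts mean. The sample lines below are made up, and the trailing sort -rn (biggest offender first) is my addition, not part of the original command.

```shell
# Count duplicate "source" lines: uniq -c prefixes each distinct line
# with how many times it appeared, and sort -rn puts the largest first.
printf '%s\n' \
  '      "source": "shard-failed",' \
  '      "source": "shard-failed",' \
  '      "source": "shard-failed",' \
  '      "source": "put-mapping",' \
  | grep '"source"' | sort | uniq -c | sort -rn
# shard-failed is counted 3 times, put-mapping once.
```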

When these numbers settled down (i.e. zeroed out) and the server had caught up with the load, I removed one of the firewall blocks and let 700 new instances start sending in their data.

All these numbers climbed again, but settled to hover around

number_of_pending_tasks: 130ish
task_max_waiting_in_queue_millis: 30 seconds

instead of the massive numbers I was seeing earlier.

It took about 4 hours to catch up, ending with no pending tasks and zero queue waiting time.

Then I did the same for the next 700 instances.

Now that filebeat is connected, it does not appear to be re-sending the create-index-template calls; logging is succeeding and the server is doing what it should.

Really appreciated your insights and suggestions.


Glad to hear it's working, and thanks for the useful summary!

I would still suggest setting the filebeat option setup.template.enabled: false across your estate as soon as you can, or else I think this might come back to bite you the next time you restart your cluster.
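In filebeat.yml that looks like the fragment below. Only the setup.template.enabled line is the actual fix; the output section is illustrative, assuming filebeat ships straight to Elasticsearch, and the hostname is a placeholder.

```
# filebeat.yml
# Stop filebeat from re-submitting the index template on every
# (re)connection; load the template once out of band instead.
setup.template.enabled: false

# Illustrative output section (hostname is a placeholder):
output.elasticsearch:
  hosts: ["es-master.example.com:9200"]
```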
