Failed to process cluster event (create-index-template) within 30s

If I am willing to throw away all the data and start indexing from scratch, will there still be a lot of configuration changes needed?

Sorry, it's hard to answer that without knowing a lot more. That's what the upgrade assistant and deprecation loggers are for.

Update ... all WORKING!!

Background:

  • We have 5 different types of services feeding this Kibana server
  • All use filebeat
  • The 2 heaviest types have around 700 instances each

It seems that when Filebeat loses its connection to the master and then re-establishes it, it sends these create-index-template calls by default (I checked the documentation: that option is on by default, and it was omitted from our config files).

It was these create-index-template calls that were essentially crushing the master.

Your suggestion to block or turn off some of those requests was a good one; fortunately, I was able to block some of my 5 different types of services with separate firewall rules.

I blocked the two 700-instance service types first (the brunt of our logging).
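In case it helps anyone else, the blocking itself was nothing fancy. A rough sketch, assuming the Elasticsearch nodes run iptables and each service type sits on its own subnet (the subnet below is a placeholder):

# Block one service type's subnet from reaching Elasticsearch on 9200
# (10.249.50.0/24 is a placeholder; substitute the real source range):
$ sudo iptables -I INPUT -p tcp --dport 9200 -s 10.249.50.0/24 -j DROP

# Later, delete the same rule to let that group reconnect:
$ sudo iptables -D INPUT -p tcp --dport 9200 -s 10.249.50.0/24 -j DROP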

Then monitored the master using:

$ curl http://10.249.1.121:9200/_cluster/health?pretty=true

{
  "cluster_name" : "logsearch",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 6,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 65,
  "active_shards" : 130,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

I watched number_of_pending_tasks and task_max_waiting_in_queue_millis.
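In case it's useful, something along these lines can poll just those two fields (assuming watch is available on the box):

# Re-run the health check every 10s and show only the two fields of interest:
$ watch -n 10 'curl -s http://10.249.1.121:9200/_cluster/health?pretty=true | grep -E "number_of_pending_tasks|task_max_waiting_in_queue_millis"'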

The other useful command (provided by you) was:

$ curl http://10.249.1.121:9200/_cluster/pending_tasks?pretty=true | grep source | sort | uniq -c

   3       "source": "_add_listener_",
   1       "source": "cluster_reroute(async_shard_fetch)",
 533       "source": "create-index-template [filebeat-6.1.2], cause [api]",
   6       "source": "put-mapping",
   1       "source": "put-pipeline-filebeat-6.1.1-iot-log4j-pipeline",
316087       "source": "shard-failed",

When these numbers settled down (i.e. zeroed out) and the server had caught up with the load, I removed one of the firewall blocks and let 700 instances start sending in their data again.

All these numbers climbed again, but they hovered around

number_of_pending_tasks: ~130
task_max_waiting_in_queue_millis: ~30 seconds (30,000 ms)

instead of the massive numbers I was seeing earlier.

It took about 4 hours for the cluster to catch up, with no pending tasks and zero queue waiting time.

Then I did it again for the next 700 instances.
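Roughly, the gate before each unblock looked something like this (jq is assumed to be installed; the subnet is again a placeholder):

# Wait until the pending-task queue is empty, checking once a minute:
$ until [ "$(curl -s http://10.249.1.121:9200/_cluster/health | jq '.number_of_pending_tasks')" -eq 0 ]; do sleep 60; done

# Then remove the firewall rule for the next group:
$ sudo iptables -D INPUT -p tcp --dport 9200 -s 10.249.60.0/24 -j DROP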

Now that Filebeat is connected, it does not appear to be re-sending the create-index-template calls; the logging is flowing and the server is doing what it should.

Really appreciated your insights and suggestions.
David


Glad to hear it's working, and thanks for the useful summary!

I would still suggest setting the filebeat option setup.template.enabled: false across your estate as soon as you can, or else I think this might come back to bite you the next time you restart your cluster.
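Concretely, that's a one-line change in filebeat.yml on each host; something like this should do it (config path and service name assumed for a typical package install):

# Disable automatic index template setup in Filebeat, then restart it:
$ echo 'setup.template.enabled: false' | sudo tee -a /etc/filebeat/filebeat.yml
$ sudo systemctl restart filebeat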
