If I am willing to throw away all the data and start indexing from scratch, will there still be a lot of configuration changes?
Sorry, it's hard to answer that without knowing a lot more. That's what the upgrade assistant and deprecation loggers are for.
Update ... all WORKING!!
Background:
- We have 5 different types of services feeding this Kibana server
- All use filebeat
- The 2 heaviest are 700 instances each
It seems that when filebeat loses its connection to the master and then re-establishes it, it sends these create-index-template calls by default (I checked the documentation; that option is on by default, and it was omitted in our config files).
It was these create-index-template calls that were essentially crushing the master.
Your suggestion to block or turn off some of those requests was a good one, and fortunately I was able to block some of my 5 different types of services with different firewall rules.
I blocked the two 700-instance types first (the brunt of our logging).
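For reference, a minimal sketch of what such a per-source block might look like with iptables (the 10.249.2.0/24 subnet and port 9200 here are placeholders, not our actual layout; filebeat may instead be shipping to 5044 if you go through Logstash):

```
# Temporarily drop filebeat traffic from one service subnet
# (10.249.2.0/24 is a placeholder -- substitute your real source range)
iptables -A INPUT -p tcp --dport 9200 -s 10.249.2.0/24 -j DROP

# Once the pending-task queue drains, remove the block again
iptables -D INPUT -p tcp --dport 9200 -s 10.249.2.0/24 -j DROP
```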
Then I monitored the master using:
$ curl http://10.249.1.121:9200/_cluster/health?pretty=true
{
"cluster_name" : "logsearch",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 6,
"number_of_data_nodes" : 3,
"active_primary_shards" : 65,
"active_shards" : 130,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 0,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 0,
"active_shards_percent_as_number" : 100.0
}
Watching the number_of_pending_tasks and task_max_waiting_in_queue_millis.
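To watch just those two fields without eyeballing the whole JSON, the health response can be run through a small filter. A sketch, with the sample document hard-coded; in practice the echo would be replaced by `curl -s http://10.249.1.121:9200/_cluster/health`:

```shell
# Sample /_cluster/health response, trimmed to the two backlog indicators
health='{"number_of_pending_tasks": 130, "task_max_waiting_in_queue_millis": 30000}'

# Print only the two fields worth watching during recovery
echo "$health" | python3 -c 'import sys, json
d = json.load(sys.stdin)
print(d["number_of_pending_tasks"], d["task_max_waiting_in_queue_millis"])'
```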
The other useful command (provided by you) was:
$ curl http://10.249.1.121:9200/_cluster/pending_tasks?pretty=true | grep source | sort | uniq -c
3 "source": "_add_listener_",
1 "source": "cluster_reroute(async_shard_fetch)",
533 "source": "create-index-template [filebeat-6.1.2], cause [api]",
6 "source": "put-mapping",
1 "source": "put-pipeline-filebeat-6.1.1-iot-log4j-pipeline",
316087 "source": "shard-failed",
When these numbers settled down (i.e. zeroed out) and the server had caught up to the load, I removed one of the firewall blocks and let 700 new instances start sending in their data.
All these numbers climbed again, but they hovered around
number_of_pending_tasks: ~130
task_max_waiting_in_queue_millis: ~30 seconds
instead of the massive numbers I was seeing earlier.
It took about 4 hours to catch up and have no pending tasks and 0 queue waiting time.
Then I did it again for the next 700 instances.
Now that filebeat is connected, it does not appear to be re-sending the create-index-template calls; the logging is successful and the server is doing what it should.
Really appreciated your insights and suggestions.
David
Glad to hear it's working, and thanks for the useful summary!
I would still suggest setting the filebeat option setup.template.enabled: false across your estate as soon as you can, or else I think this might come back to bite you the next time you restart your cluster.
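For anyone finding this later, that is a one-line fragment in filebeat.yml (the rest of your filebeat configuration stays as-is; you would then load the index template once manually instead):

```yaml
# filebeat.yml -- stop filebeat from (re)loading the index template
# every time it (re)connects to the cluster
setup.template.enabled: false
```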