We run Curator where I work, and we are seeing two issues with it:
- The command I give it errors out, so I have to run it several times to delete all of the indices that meet the criteria.
- While the command runs, the cluster health goes red rather than staying green. Once the command completes, the status goes back to green.
This is the command I am using. I also run it from cron:
/usr/bin/curator --host elasticsearch-host --http_auth curator:(key goes here) delete indices --older-than 7 --time-unit days --timestring '%Y.%m.%d' --prefix 'logstash-'
Any ideas as to why this would be happening with the command provided?
This will be very hard to troubleshoot without some indication of what errors you're seeing. Can you put the logs into a gist and link them here? Debug logs would be more helpful, but normal logs will likely reveal much.
Also, can you tell me which version of Elasticsearch you're operating against?
Without seeing those, I'm guessing that you have a rather large number of shards per node, and that the cluster state is getting beat up trying to delete them all in real time. There will likely be all kinds of timeouts visible in the logs.
Thank you for your reply. We are currently running Elasticsearch 2.2.3.
Here is the logfile you requested. Please let me know if there is anything else you need:
It's exactly what I thought it was. You're trying to delete an enormous number of indices in big chunks, and the server is timing out. See line #130 of your gist:
2016-01-25 19:08:31,926 ERROR Got a TIMEOUT response from Elasticsearch. Error message: HTTPConnectionPool(host=u'xxx1-elasticsearch-prod-vip', port=9200): Read timed out. (read timeout=30)
For cluster stability, I recommend deleting such a large number of indices in smaller batches. Since you have indices from last year, try deleting the oldest month's worth first, then work forward a month at a time.
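As a rough sketch of that batch approach, the loop below reuses the command from the original post and lowers --older-than one step at a time, so each run only removes about a month of indices. The variable names (START_DAYS, STEP_DAYS, KEEP_DAYS, PAUSE_SECONDS, CURATOR, AUTH) and their values are assumptions for illustration, not anything Curator defines. By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: delete old indices in month-sized batches, oldest first.
# All variable names and values here are illustrative assumptions.
START_DAYS=${START_DAYS:-365}   # first threshold: only indices older than this
STEP_DAYS=${STEP_DAYS:-30}      # lower the threshold by this much per batch
KEEP_DAYS=${KEEP_DAYS:-7}       # final threshold, as in the original command
AUTH='curator:(key goes here)'  # placeholder copied from the original post

# Default only prints each command. Set CURATOR=/usr/bin/curator to really delete.
CURATOR=${CURATOR:-echo /usr/bin/curator}

# Print the --older-than thresholds, largest first, always ending at KEEP_DAYS.
thresholds() {
    days=$START_DAYS
    while [ "$days" -gt "$KEEP_DAYS" ]; do
        echo "$days"
        days=$((days - STEP_DAYS))
    done
    echo "$KEEP_DAYS"
}

for days in $(thresholds); do
    # Stop at the first failing batch instead of piling more work on the cluster.
    $CURATOR --host elasticsearch-host --http_auth "$AUTH" \
        delete indices --older-than "$days" --time-unit days \
        --timestring '%Y.%m.%d' --prefix 'logstash-' || exit 1
    # Optional pause so the cluster state can settle between batches.
    sleep "${PAUSE_SECONDS:-0}"
done
```

Once the backlog is gone, the plain 7-day command in cron should only have a day's worth of indices to delete per run.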
If you want to try the painful, all-at-once approach, which may cause the cluster to become overburdened and very slow to respond, you still can. You'll need to increase the --timeout value at the command line. The default is 30 seconds. You can go as high as 300 seconds for an all-at-once shot, but even that may not be enough. At the 300-second range you'll be fighting the "master" timeout as well, not just the client timeout. The master timeout is how long the master node is permitted to take before responding. Curator sets it to match the --timeout (client timeout) value, up to 300 seconds, and will not increase it beyond that point. It's not a good thing to have a master node be so unresponsive.
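For reference, here is roughly what the all-at-once variant could look like. This assumes --timeout is given alongside the other connection options, before "delete indices", the same way --host and --http_auth are in the original command. The wrapper function name and the CURATOR, AUTH and TIMEOUT variables are made up for this sketch, and by default it only prints the command rather than running it.

```shell
#!/bin/sh
# Sketch: the original delete command with a longer client timeout.
# Function and variable names are illustrative assumptions.
TIMEOUT=${TIMEOUT:-300}         # seconds; the default without the flag is 30
AUTH='curator:(key goes here)'  # placeholder copied from the original post

# Default only prints the command. Set CURATOR=/usr/bin/curator to really delete.
CURATOR=${CURATOR:-echo /usr/bin/curator}

delete_all_at_once() {
    $CURATOR --host elasticsearch-host --http_auth "$AUTH" \
        --timeout "$TIMEOUT" \
        delete indices --older-than 7 --time-unit days \
        --timestring '%Y.%m.%d' --prefix 'logstash-'
}

delete_all_at_once
```

Values above 300 would only raise the client timeout, since the master timeout stops following it at that point.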