Elasticsearch - Curator script for deleting indices

Good day.

elasticsearch 5.3.0
elasticsearch-curator (4.2.6)

We have a Python script that deletes old indices from Elasticsearch. The script receives a list of indices to remove:

delete = curator.DeleteIndices(indices)
delete.do_action()

This script runs every day and works about half the time: on some days every index is deleted cleanly, on others it raises an exception:

2018-04-07 11:00:07,201 - curator.actions.delete_indices - DEBUG - master_timeout value: 30s
2018-04-07 11:00:07,202 - curator - INFO - Going to delete ['filebeat-90d-2018.01.07', 'filebeat-180d-2017.10.09']
2018-04-07 11:00:07,202 - curator.indexlist - DEBUG - Checking for empty list
2018-04-07 11:00:07,202 - curator.actions.delete_indices - INFO - Deleting selected indices: ['filebeat-90d-2018.01.07', 'filebeat-180d-2017.10.09']
2018-04-07 11:00:07,202 - curator.actions.delete_indices - INFO - ---deleting index filebeat-90d-2018.01.07
2018-04-07 11:00:07,202 - curator.actions.delete_indices - INFO - ---deleting index filebeat-180d-2017.10.09
....
2018-04-07 11:00:37,559 - curator.actions.delete_indices - ERROR - The following indices failed to delete on try #1:
2018-04-07 11:00:37,559 - curator.actions.delete_indices - ERROR - ---filebeat-180d-2017.10.09

2018-04-07 11:00:37,560 - curator.actions.delete_indices - INFO - ---deleting index filebeat-180d-2017.10.09
2018-04-07 11:00:47,155 - curator - ERROR - Exception occurred!
Traceback (most recent call last):
  File "/var/lib/elasticsearch/.local/lib/python3.4/site-packages/curator/actions.py", line 482, in do_action
    self.__chunk_loop(l)
  File "/var/lib/elasticsearch/.local/lib/python3.4/site-packages/curator/actions.py", line 455, in __chunk_loop
    index=to_csv(working_list), master_timeout=self.master_timeout)
  File "/var/lib/elasticsearch/.local/lib/python3.4/site-packages/elasticsearch/client/utils.py", line 69, in _wrapped
    return func(*args, params=params, **kwargs)
  File "/var/lib/elasticsearch/.local/lib/python3.4/site-packages/elasticsearch/client/indices.py", line 201, in delete
    params=params)
  File "/var/lib/elasticsearch/.local/lib/python3.4/site-packages/elasticsearch/transport.py", line 327, in perform_request
    status, headers, data = connection.perform_request(method, url, params, body, ignore=ignore, timeout=timeout)
  File "/var/lib/elasticsearch/.local/lib/python3.4/site-packages/elasticsearch/connection/http_urllib3.py", line 110, in perform_request
    self._raise_error(response.status, raw_data)
  File "/var/lib/elasticsearch/.local/lib/python3.4/site-packages/elasticsearch/connection/base.py", line 114, in _raise_error
    raise HTTP_EXCEPTIONS.get(status_code, TransportError)(status_code, error_message, additional_info)
elasticsearch.exceptions.NotFoundError: TransportError(404, 'index_not_found_exception', 'no such index')

In my opinion, the problem is that Curator tries to delete the indices, waits 30 seconds (11:00:07 + 30 s) without receiving a response from the node, and then tries to delete them again, but by then the index has already been removed.
I set master_timeout to 60 seconds:

delete = curator.DeleteIndices(indices, 60)

The log shows that master_timeout is set to 60 seconds, and Elasticsearch sees the request:

DELETE /Name_indices?master_timeout=60s

But according to the Python script's log, the Curator module still retries the delete after 30 seconds.

Connection settings the script uses for Elasticsearch:

client = elasticsearch.Elasticsearch([{ 'host': es_host, 'port': es_port, }], timeout=120)

How can I set a timeout longer than 30 seconds in the Python script, so that Curator waits that long before retrying the delete when the first attempt fails? Or could the error be caused by something else?

It sounds like your cluster is overloaded. A simple retry mechanism was added to the delete_indices action for cases where users expect the delete to happen immediately but it takes a long time. The master_timeout only helps so much here: its job is to give the elected master node longer than the default 30 seconds to complete the index deletes. Curator waits for the client to return, and then checks whether the indices were in fact deleted before proceeding. Sometimes the call returns signalling that the delete completed, but the cluster state has not yet been updated, so an immediate call to get all indices makes it look as if the delete never happened, because the cluster state still reports the index as present.

Based on what you've shared, I'm fairly certain this is what's happening to you. Your cluster is overloaded, and Curator gets an outdated cluster state. When Curator retries the delete, it gets the index_not_found exception because the cluster state has caught up in the meantime. I repeat: this only happens on overloaded clusters. A healthy cluster will never run into this problem. Curator naïvely attempts to compensate for it, but can only do so much.
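
If the calling script needs to tolerate this race itself, something like the following may help. It is only a minimal sketch, not Curator's own logic. It assumes a client object shaped like elasticsearch-py's (client.indices.delete(index=...) and client.indices.exists(index=...)) and that a failed request raises an exception carrying a status_code attribute, as elasticsearch-py's TransportError does. The helper names delete_index_tolerant and wait_until_gone, and the attempts/delay values, are hypothetical.

```python
import time


def delete_index_tolerant(client, index):
    """Delete one index, treating a 404 as "already deleted".

    Returns True if this call deleted the index, False if it was already gone.
    """
    try:
        client.indices.delete(index=index)
        return True
    except Exception as exc:
        # Assumption: the client's exceptions expose `status_code`
        # (elasticsearch-py's TransportError does). Anything else is re-raised.
        if getattr(exc, "status_code", None) == 404:
            return False
        raise


def wait_until_gone(client, index, attempts=10, delay=3.0, sleep=time.sleep):
    """Poll until the index no longer exists, so a lagging cluster state can catch up.

    Returns True once the index is gone, False if it is still reported
    after `attempts` checks.
    """
    for attempt in range(attempts):
        if not client.indices.exists(index=index):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Written this way, a 404 on a repeated delete is read as success rather than an error, which matches the situation in the log above. It does not address the underlying cluster load.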

As another side note, you're using a very old version of Curator (4.2.6), while the most recent (5.5.1) works with your version of Elasticsearch (5.3.0). You will probably have to upgrade your Python version from 3.4 to 3.5 or 3.6 to use it, however.

Thank you very much! We will check our cluster for load.

Cluster load takes many shapes. Even if the CPU/Memory load does not look very high, having too many shards on a node will result in the delete problem you shared. You should never have more than 1000 open shards on a node that has a 31G heap. That number shrinks correspondingly with the heap size.
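
As a rough way to check that guideline, the sketch below counts shards per node and compares each count with a budget scaled from 1000 shards at a 31 GB heap. The linear scaling is an assumption (the guideline only says the number shrinks with heap size), and the function names are hypothetical. The rows are assumed to look like the JSON form of the _cat/shards output, for example from elasticsearch-py's client.cat.shards(format='json'): one dict per shard with a node key.

```python
from collections import Counter


def shard_budget(heap_gb, shards_at_31gb=1000):
    """Scale the 1000-shards-per-31-GB guideline linearly with heap size (assumption)."""
    return int(shards_at_31gb * heap_gb / 31)


def shards_per_node(rows):
    """Count shards per node from rows shaped like _cat/shards JSON output."""
    # Unassigned shards have no node, so they are skipped.
    return Counter(row["node"] for row in rows if row.get("node"))


def overloaded_nodes(rows, heap_gb_by_node):
    """Return {node: (shard_count, budget)} for every node over its budget.

    `heap_gb_by_node` maps node name to heap size in GB and must be
    supplied by the caller; it is not read from the cluster here.
    """
    over = {}
    for node, count in shards_per_node(rows).items():
        budget = shard_budget(heap_gb_by_node[node])
        if count > budget:
            over[node] = (count, budget)
    return over
```

An empty result only means no node exceeds this particular rule of thumb; it says nothing about disk or CPU pressure.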

I suspect the cluster load may be tied to overloaded disks: our two 2 TB disks average 100 IOPS on writes.

That would do it too :)
