Hey there, I am working on an Elasticsearch cluster upgrade automation tool. For demo purposes (to show that my upgrade achieves zero downtime), I have written a Python program that continuously streams data into the cluster while it is being upgraded:
# Get a Python client (elasticsearch-py)
from elasticsearch import Elasticsearch

es = Elasticsearch(
    [HOST_NAME + ":" + str(HTTP_PORT)],
    retry_on_timeout=True,           # on a timeout, retry the request on another node
    sniff_on_start=True,             # discover the cluster's nodes before the first request
    sniff_on_connection_fail=True,   # re-sniff the node list whenever a connection fails
    sniffer_timeout=60,              # also re-sniff every 60 seconds (as in the traceback below)
)
In the above code snippet, HOST_NAME and HTTP_PORT are the IP address and HTTP port of one of the nodes in the cluster (prior to the upgrade). However, I have chosen an out-of-place upgrade strategy, so all of the old nodes (running the lower Elasticsearch version) are eventually decommissioned once all of their shards have been relocated to newly created nodes running the higher version. When the old nodes are decommissioned, the Python client encounters the following error:
Traceback (most recent call last):
  File "main.py", line 51, in <module>
    start()
  File "main.py", line 48, in start
    ingest_log_stream(INDEX_NAME, INPUT_DATA_FILE, GAP)
  File "data_stream_ingestor.py", line 19, in ingest_log_stream
    ingest_log_entry(indexName, logEntry)
  File "data_ingestor.py", line 25, in ingest_log_entry
    es = get_es_connection()
  File "es_connector.py", line 19, in get_es_connection
    ], sniff_on_start=True, sniff_on_connection_fail=True, sniffer_timeout=60)
  File "/home/.local/lib/python3.6/site-packages/elasticsearch/client/__init__.py", line 206, in __init__
    self.transport = transport_class(_normalize_hosts(hosts), **kwargs)
  File "/home/.local/lib/python3.6/site-packages/elasticsearch/transport.py", line 141, in __init__
    self.sniff_hosts(True)
  File "/home/.local/lib/python3.6/site-packages/elasticsearch/transport.py", line 261, in sniff_hosts
    node_info = self._get_sniff_data(initial)
  File "/home/.local/lib/python3.6/site-packages/elasticsearch/transport.py", line 230, in _get_sniff_data
    raise TransportError("N/A", "Unable to sniff hosts.")
elasticsearch.exceptions.TransportError: TransportError(N/A, 'Unable to sniff hosts.')
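Reading the traceback, the error is raised inside the client constructor itself: ingest_log_entry builds a fresh client on every call, and sniff_on_start then tries to sniff the cluster via the original (now decommissioned) seed node. A simplified reconstruction of my es_connector.py helper, for context:

# es_connector.py (simplified reconstruction; HOST_NAME/HTTP_PORT as above)
from elasticsearch import Elasticsearch

def get_es_connection():
    # A new client is built per call, so sniff_on_start re-runs the
    # initial sniff against the old seed node every single time.
    return Elasticsearch(
        [HOST_NAME + ":" + str(HTTP_PORT)],
        retry_on_timeout=True,
        sniff_on_start=True,             # raises TransportError once the seed node is gone
        sniff_on_connection_fail=True,
        sniffer_timeout=60,
    )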
The Elasticsearch Python client library docs suggest that:
If a connection to a node fails due to connection issues (raises ConnectionError) it is considered in faulty state. It will be placed on hold for dead_timeout seconds and the request will be retried on another node. If a connection fails multiple times in a row the timeout will get progressively larger to avoid hitting a node that’s, by all indication, down. If no live connection is available, the connection that has the smallest timeout will be used.
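If I understand that paragraph correctly, this retry/dead-timeout behaviour can only kick in once the client knows about more than one node. A minimal sketch of what I believe such a configuration looks like (NEW_NODE_HOST is a placeholder for the address of one of the newly created nodes):

from elasticsearch import Elasticsearch

# Seed the client with more than one node so a dead connection can be
# put on hold and the request retried against another node.
es = Elasticsearch(
    [
        HOST_NAME + ":" + str(HTTP_PORT),       # old node (will be decommissioned)
        NEW_NODE_HOST + ":" + str(HTTP_PORT),   # placeholder: one of the new nodes
    ],
    retry_on_timeout=True,
    max_retries=3,                   # retry a failed request on up to 3 nodes
    sniff_on_connection_fail=True,   # refresh the node list when a connection dies
)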
However, it seems to me that setting retry_on_timeout and the sniffing options does not resolve the issue. What would be the correct way to instantiate an Elasticsearch client so that, when the node it initially connected to goes down, it automatically fails over to the other nodes in the cluster?
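For completeness, here is the kind of reconnect wrapper I have been experimenting with in the meantime. It is only a sketch: ALL_NODES (a list of both old and new node addresses), the attempt count, and the sleep interval are all placeholders:

import time
from elasticsearch import Elasticsearch
from elasticsearch.exceptions import TransportError

def connect_with_fallback(hosts, attempts=5, delay=2):
    """Build a client with the initial sniff deferred, so construction
    cannot fail, then verify the cluster is reachable with a ping."""
    for _ in range(attempts):
        es = Elasticsearch(
            hosts,
            retry_on_timeout=True,
            sniff_on_start=False,            # don't sniff in the constructor
            sniff_on_connection_fail=True,
        )
        if es.ping():                        # returns False instead of raising
            return es
        time.sleep(delay)
    raise TransportError("N/A", "No reachable nodes in %r" % hosts)

es = connect_with_fallback([h + ":" + str(HTTP_PORT) for h in ALL_NODES])

This avoids the constructor-time TransportError, but it feels like a workaround rather than the intended way to use the client. Thanks!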