Hey there, I am working on an Elasticsearch cluster upgrade automation tool. For demo purposes (to show that my upgrade can achieve zero downtime), I have written a Python program that continuously streams data into the cluster while it is being upgraded:
```python
# get a Python client
es = Elasticsearch(
    [HOST_NAME + ":" + str(HTTP_PORT)],
    retry_on_timeout=True,
    sniff_on_start=True,
    sniff_on_connection_fail=True,
    sniffer_timeout=60,
)
```
In the above code snippet, HOST_NAME and HTTP_PORT are the IP address and HTTP port of one of the nodes in the cluster (prior to the upgrade). However, I have chosen an out-of-place upgrade strategy, so all of the old cluster nodes (running the lower Elasticsearch version) will eventually be decommissioned, after all of their shards have been relocated to newly created nodes running the higher Elasticsearch version. When the old nodes are decommissioned, the Python client encounters the following error:
```
Traceback (most recent call last):
  File "main.py", line 51, in <module>
    start()
  File "main.py", line 48, in start
    ingest_log_stream(INDEX_NAME, INPUT_DATA_FILE, GAP)
  File "data_stream_ingestor.py", line 19, in ingest_log_stream
    ingest_log_entry(indexName, logEntry)
  File "data_ingestor.py", line 25, in ingest_log_entry
    es = get_es_connection()
  File "es_connector.py", line 19, in get_es_connection
    ], sniff_on_start=True, sniff_on_connection_fail=True, sniffer_timeout=60)
  File "/home/.local/lib/python3.6/site-packages/elasticsearch/client/__init__.py", line 206, in __init__
    self.transport = transport_class(_normalize_hosts(hosts), **kwargs)
  File "/home/.local/lib/python3.6/site-packages/elasticsearch/transport.py", line 141, in __init__
    self.sniff_hosts(True)
  File "/home/.local/lib/python3.6/site-packages/elasticsearch/transport.py", line 261, in sniff_hosts
    node_info = self._get_sniff_data(initial)
  File "/home/.local/lib/python3.6/site-packages/elasticsearch/transport.py", line 230, in _get_sniff_data
    raise TransportError("N/A", "Unable to sniff hosts.")
elasticsearch.exceptions.TransportError: TransportError(N/A, 'Unable to sniff hosts.')
```
The Elasticsearch Python client library docs suggest that:
If a connection to a node fails due to connection issues (raises ConnectionError) it is considered in faulty state. It will be placed on hold for dead_timeout seconds and the request will be retried on another node. If a connection fails multiple times in a row the timeout will get progressively larger to avoid hitting a node that’s, by all indication, down. If no live connection is available, the connection that has the smallest timeout will be used.
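My reading of the quoted behavior is that each consecutive failure roughly doubles the time a node is kept on hold (the exact formula and any internal cap are my assumptions, not something the docs state explicitly):

```python
# Sketch of my understanding of the "progressively larger" dead timeout.
# The doubling formula is an assumption based on the quoted docs; the real
# client may cap or compute this differently.

def dead_node_hold(dead_timeout, consecutive_failures):
    """Seconds a node stays on hold after `consecutive_failures` failures in a row."""
    return dead_timeout * 2 ** (consecutive_failures - 1)

for failures in range(1, 5):
    print(failures, dead_node_hold(60, failures))
```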
However, it seems to me that setting retry_on_timeout and the various sniffing options does not resolve the issue. I am wondering: what is the correct way to instantiate an Elasticsearch client so that, when the node it connects to goes down, it automatically tries to connect to the other nodes in the cluster? Thanks!
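For reference, the naive workaround I can think of is rotating through a list of seed addresses myself when client creation fails. Below is only a sketch of that idea: the addresses are placeholders, and `make_client` stands in for the real `Elasticsearch(...)` constructor (which would raise `TransportError` on a failed `sniff_on_start`) so the retry logic can be shown without a live cluster:

```python
# Hypothetical fallback sketch: try each known seed address in turn and
# return the first client that can be created. `make_client` is a stand-in
# for the real Elasticsearch(...) constructor.

def connect_with_fallback(addresses, make_client):
    last_error = None
    for address in addresses:
        try:
            return make_client(address)
        except Exception as exc:  # real code would catch TransportError
            last_error = exc
    raise last_error

# Illustration only: pretend the first two "nodes" are down, the third is up.
def make_client(address):
    if address != "10.0.0.3:9200":
        raise ConnectionError("cannot reach " + address)
    return "client connected to " + address

print(connect_with_fallback(
    ["10.0.0.1:9200", "10.0.0.2:9200", "10.0.0.3:9200"], make_client))
# → client connected to 10.0.0.3:9200
```

This still feels like reimplementing what the client's connection pool should already do, which is why I am asking for the intended configuration.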