Elasticsearch Python client: how to resume the connection after a node is stopped

Hey there, I am working on an Elasticsearch cluster upgrade automation tool. For demo purposes (to show that the upgrade achieves zero downtime), I have written a Python program that continuously streams data into the cluster while it is being upgraded:

from elasticsearch import Elasticsearch

# Get a Python client
es = Elasticsearch(
    [HOST_NAME + ":" + str(HTTP_PORT)],
    retry_on_timeout=True,           # retry requests that time out
    sniff_on_start=True,             # inspect the cluster for nodes on startup
    sniff_on_connection_fail=True,   # re-sniff the node list when a connection fails
    sniffer_timeout=60,              # re-sniff the node list every 60 seconds
)
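For context, the streaming side is essentially the following loop (heavily simplified; the real helpers live in the modules shown in the traceback below, and the message field here is just a placeholder):

import time

def ingest_log_stream(index_name, input_data_file, gap):
    # Replay a log file into the cluster, one entry every `gap` seconds.
    with open(input_data_file) as f:
        for line in f:
            ingest_log_entry(index_name, line.rstrip("\n"))
            time.sleep(gap)

def ingest_log_entry(index_name, log_entry):
    # A fresh client is created for every entry, which is where the
    # error below is raised.
    es = get_es_connection()
    es.index(index=index_name, body={"message": log_entry})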

In the first snippet, HOST_NAME and HTTP_PORT are the IP address and HTTP port of one of the nodes in the cluster (prior to the upgrade). However, I have chosen an out-of-place upgrade strategy: all of the old nodes (running the lower Elasticsearch version) are eventually decommissioned, once all their shards have been relocated to newly created nodes running the higher version. When the old nodes are decommissioned, the Python client fails with the following error:

Traceback (most recent call last):
  File "main.py", line 51, in <module>
    start()
  File "main.py", line 48, in start
    ingest_log_stream(INDEX_NAME, INPUT_DATA_FILE, GAP)
  File "data_stream_ingestor.py", line 19, in ingest_log_stream
    ingest_log_entry(indexName, logEntry)
  File "data_ingestor.py", line 25, in ingest_log_entry
    es = get_es_connection()
  File "es_connector.py", line 19, in get_es_connection
    ], sniff_on_start=True, sniff_on_connection_fail=True, sniffer_timeout=60)
  File "/home/.local/lib/python3.6/site-packages/elasticsearch/client/__init__.py", line 206, in __init__
    self.transport = transport_class(_normalize_hosts(hosts), **kwargs)
  File "/home/.local/lib/python3.6/site-packages/elasticsearch/transport.py", line 141, in __init__
    self.sniff_hosts(True)
  File "/home/.local/lib/python3.6/site-packages/elasticsearch/transport.py", line 261, in sniff_hosts
    node_info = self._get_sniff_data(initial)
  File "/home/.local/lib/python3.6/site-packages/elasticsearch/transport.py", line 230, in _get_sniff_data
    raise TransportError("N/A", "Unable to sniff hosts.")
elasticsearch.exceptions.TransportError: TransportError(N/A, 'Unable to sniff hosts.')

The Elasticsearch Python client library docs say:

"If a connection to a node fails due to connection issues (raises ConnectionError) it is considered in faulty state. It will be placed on hold for dead_timeout seconds and the request will be retried on another node. If a connection fails multiple times in a row the timeout will get progressively larger to avoid hitting a node that's, by all indication, down. If no live connection is available, the connection that has the smallest timeout will be used."
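As far as I can tell, the knobs this paragraph refers to are dead_timeout and timeout_cutoff on the connection pool, plus max_retries on the transport. Here is a sketch of how I believe they are set; I am assuming extra keyword arguments are forwarded to the connection pool, and the values below are arbitrary:

from elasticsearch import Elasticsearch

es = Elasticsearch(
    [HOST_NAME + ":" + str(HTTP_PORT)],
    retry_on_timeout=True,
    max_retries=5,      # try a failed request on up to 5 connections
    dead_timeout=10,    # initial hold-off (seconds) for a node marked dead
    timeout_cutoff=3,   # caps the exponential growth of the hold-off
)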
However, setting retry_on_timeout and the sniffing options does not seem to resolve the issue. What is the correct way to instantiate an Elasticsearch client so that, when the node it connects to goes down, it automatically tries the other nodes in the cluster? Would wrapping the client creation in a retry loop, as in the sketch below, be a reasonable approach? Thanks!
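Something like this is what I have in mind (just a sketch: ALL_NODE_ADDRESSES is a hypothetical list covering both the old and the newly created nodes, and the back-off values are arbitrary):

import time
from elasticsearch import Elasticsearch
from elasticsearch.exceptions import TransportError

# Hypothetical list of every node address, old and new.
ALL_NODE_ADDRESSES = ["10.0.0.1:9200", "10.0.0.2:9200", "10.0.1.1:9200"]

def get_es_connection(max_attempts=5):
    # sniff_on_start raises TransportError at construction time if none
    # of the seed nodes can be reached, so retry with a simple back-off.
    for attempt in range(max_attempts):
        try:
            return Elasticsearch(
                ALL_NODE_ADDRESSES,
                retry_on_timeout=True,
                sniff_on_start=True,
                sniff_on_connection_fail=True,
                sniffer_timeout=60,
            )
        except TransportError:
            time.sleep(2 ** attempt)
    raise RuntimeError("no Elasticsearch node reachable")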
