Frequent timeouts when using elasticsearch-py 8.1.0

We have recently upgraded our Elasticsearch to 8.1.0.

We have some Python scripts that use elasticsearch-py to run searches and computations against data in the Elasticsearch cluster. Because of system load, we started hitting timeouts during the early days of the deployment (around Elasticsearch 7.9), so we configured our clients with larger timeouts and more retries. The scripts then worked without issues up to 7.17.0.

After updating to 8.1.0, we have noticed a lot more timeouts in the scripts. We have not changed the client configuration between upgrades. Recently we enabled elastic_transport logging to see if more details could be found.
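For reference, this is roughly how we enable that logging (a minimal sketch; the format string is ours and only approximates the prefixes visible in the log below):

import logging

# Rough sketch: route all log output through a basic handler and turn the
# elastic_transport logger up to DEBUG so retries and node failures appear.
logging.basicConfig(
    format="%(asctime)s %(levelname).5s [%(name)s][%(module)s.%(funcName)s] %(message)s",
    level=logging.INFO,
)
logging.getLogger("elastic_transport").setLevel(logging.DEBUG)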

We found the following error there:

2022-03-28 16:18:27,468 WARNI [elastic_transport.node_pool][_node_pool.mark_dead] Node <Urllib3HttpNode(http://10.44.0.48:9200)> has failed for 1 times in a row, putting on 1 second timeout
2022-03-28 16:18:27,468 WARNI [elastic_transport.transport][_transport.perform_request] Retrying request after failure (attempt 0 of 5)
Traceback (most recent call last):
  File "/home/admin/essproc/venv/lib/python3.6/site-packages/elastic_transport/_transport.py", line 334, in perform_request
    request_timeout=request_timeout,
  File "/home/admin/essproc/venv/lib/python3.6/site-packages/elastic_transport/_node/_http_urllib3.py", line 199, in perform_request
    raise err from None
elastic_transport.ConnectionTimeout: Connection timeout caused by: ReadTimeoutError(HTTPConnectionPool(host='10.44.0.48', port=9200): Read timed out. (read timeout=9.999699419997341))

Our client is configured like so (ES is our alias for the elasticsearch package):

import elasticsearch as ES

es_client = ES.Elasticsearch(
    "http://esmasters:9200",
    sniff_on_connection_fail=True,
    sniff_on_start=True,
    min_delay_between_sniffing=600,
    request_timeout=600,
    sniff_timeout=300,
    max_retries=5,
    retry_on_timeout=True,
)

Despite the large request_timeout (600 seconds), the ReadTimeoutError reports a read timeout of roughly 10 seconds.

Is there some problem with how we are configuring the client?

I also noticed that elastic_transport would not retry on timeouts despite the client being configured as above. To get retries, I had to tinker with the elastic_transport package installed in the virtualenv, manually changing the default value of retry_on_timeout to True in the __init__ method of the Transport class in elastic_transport/_transport.py.
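For what it's worth, this is roughly what I would have expected to work without patching anything, going by the documented Elasticsearch.options() API (es_client is the client constructed above; the index name and query are placeholders):

# Sketch of the per-request settings I expected to take effect without
# patching elastic_transport; index name and query are placeholders.
resp = es_client.options(
    request_timeout=600,
    max_retries=5,
    retry_on_timeout=True,
).search(index="my-index", query={"match_all": {}})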

Can anyone shed some light on why I am seeing this behaviour?

My code uses elasticsearch.helpers.scan to perform long-running search requests. In elasticsearch-py 8.0 and above, the scan() method uses Elasticsearch.options() to set the additional options, including timeouts.

Whether by design or due to a bug, the options() method creates a copy of the Elasticsearch client but drops the timeout- and retry-related parameters.

I have filed a bug report for this. As a workaround, I am explicitly setting the timeout on the scan() call, but I cannot set the retry parameters there.
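Concretely, the workaround looks roughly like this (the index name, query, and process() are placeholders for our actual code; note that scan() has no parameter for max_retries or retry_on_timeout):

from elasticsearch.helpers import scan

# Rough sketch of the workaround: the timeout is passed per call to scan(),
# but retries cannot be configured here.
for hit in scan(
    es_client,                              # client configured as shown above
    index="my-index",                       # placeholder index name
    query={"query": {"match_all": {}}},     # placeholder query
    scroll="10m",
    request_timeout=600,
):
    process(hit)                            # placeholder per-document processing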
