We have recently upgraded our Elasticsearch to 8.1.0.
We have some Python scripts that use elasticsearch-py to run searches and computations on data in the Elasticsearch cluster. Because of system load, we noticed timeouts being hit during the early days of deployment (around Elasticsearch 7.9), so we configured our clients with larger timeouts and more retries. The scripts worked without issues up to 7.17.0.
After the update to 8.1.0, we have noticed a lot more timeouts in the scripts. We have not changed the client configuration between upgrades. Recently we enabled elastic_transport logging to see if more details could be found.
We found the following error there:
2022-03-28 16:18:27,468 WARNI [elastic_transport.node_pool][_node_pool.mark_dead] Node <Urllib3HttpNode(http://10.44.0.48:9200)> has failed for 1 times in a row, putting on 1 second timeout
2022-03-28 16:18:27,468 WARNI [elastic_transport.transport][_transport.perform_request] Retrying request after failure (attempt 0 of 5)
Traceback (most recent call last):
  File "/home/admin/essproc/venv/lib/python3.6/site-packages/elastic_transport/_transport.py", line 334, in perform_request
    request_timeout=request_timeout,
  File "/home/admin/essproc/venv/lib/python3.6/site-packages/elastic_transport/_node/_http_urllib3.py", line 199, in perform_request
    raise err from None
elastic_transport.ConnectionTimeout: Connection timeout caused by: ReadTimeoutError(HTTPConnectionPool(host='10.44.0.48', port=9200): Read timed out. (read timeout=9.999699419997341))
Our client is configured like so:
ES.Elasticsearch(
    "http://esmasters:9200",
    sniff_on_connection_fail=True,
    sniff_on_start=True,
    min_delay_between_sniffing=600,
    request_timeout=600,
    sniff_timeout=300,
    max_retries=5,
    retry_on_timeout=True,
)
Despite the large request_timeout (600), the ReadTimeoutError reports a read timeout of roughly 10 seconds (read timeout=9.999699419997341).
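To make the mismatch concrete, here is a minimal sketch (standard library only) that pulls the read timeout out of a log line like the one above and compares it with the request_timeout we pass to the client. The helper name effective_read_timeout and the regular expression are our own illustration, not part of elastic_transport or urllib3.

```python
import re
from typing import Optional

# Matches the "read timeout=<seconds>" fragment seen in the logged error message.
_READ_TIMEOUT_RE = re.compile(r"read timeout=([0-9]+(?:\.[0-9]+)?)")


def effective_read_timeout(log_line: str) -> Optional[float]:
    """Return the read timeout (in seconds) reported in a log line, or None."""
    match = _READ_TIMEOUT_RE.search(log_line)
    return float(match.group(1)) if match else None


# The request_timeout value from our client configuration.
CONFIGURED_REQUEST_TIMEOUT = 600

# The final line of the traceback from our logs.
line = (
    "elastic_transport.ConnectionTimeout: Connection timeout caused by: "
    "ReadTimeoutError(HTTPConnectionPool(host='10.44.0.48', port=9200): "
    "Read timed out. (read timeout=9.999699419997341))"
)

observed = effective_read_timeout(line)
if observed is not None and observed < CONFIGURED_REQUEST_TIMEOUT:
    print(
        f"observed read timeout {observed:.1f}s is below "
        f"the configured {CONFIGURED_REQUEST_TIMEOUT}s"
    )
```

Running something like this over the whole log would show whether every timed-out request used the short value or only some of them, which might help narrow down which requests ignore the configured timeout.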
Is there some problem with how we are configuring the client?
Also, I noticed that elastic_transport would not retry on timeouts despite the client being configured as above. I had to tinker with the installed elastic_transport in the virtualenv, manually changing the default value of retry_on_timeout to True in the __init__ method of the Transport class in elastic_transport/_transport.py.
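As a stopgap that avoids editing files inside the virtualenv, the retry could in principle be done in our own code instead. The sketch below is only an illustration under stated assumptions: call_with_retries is a hypothetical helper of ours, not an elastic_transport API, and it uses the built-in TimeoutError as a stand-in for elastic_transport.ConnectionTimeout so that it runs without the library installed.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retries(
    func: Callable[[], T],
    max_retries: int = 5,
    retry_on: Tuple[Type[BaseException], ...] = (TimeoutError,),
    delay_seconds: float = 0.0,
) -> T:
    """Call func, retrying up to max_retries times when it raises one of retry_on."""
    attempt = 0
    while True:
        try:
            return func()
        except retry_on:
            if attempt >= max_retries:
                raise  # retries exhausted: surface the last error
            attempt += 1
            time.sleep(delay_seconds)


calls = {"count": 0}


def flaky_search() -> str:
    # Hypothetical stand-in for a search call: times out twice, then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated read timeout")
    return "ok"


result = call_with_retries(flaky_search, max_retries=5)
print(result, "after", calls["count"], "calls")
```

In our scripts, func would wrap the actual search call and retry_on would be (elastic_transport.ConnectionTimeout,), the exception type shown in the traceback. This only works around the symptom; it does not explain why the client-level retry_on_timeout setting appears to be ignored.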
Can anyone shed some light on why I am seeing this behaviour?