Read timeout errors

A single-threaded script I'm working on is experiencing repeated read timeout errors from Elasticsearch. This is happening when bulk-indexing data into an index with 12 primary shards and 0 replicas across 4 nodes. Calls to the _bulk endpoint frequently time out after 60 seconds with this error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 467, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 462, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/local/lib/python3.10/http/client.py", line 1375, in getresponse
    response.begin()
  File "/usr/local/lib/python3.10/http/client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "/usr/local/lib/python3.10/http/client.py", line 279, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/local/lib/python3.10/socket.py", line 717, in readinto
    return self._sock.recv_into(b)
TimeoutError: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/elasticsearch/connection/http_urllib3.py", line 255, in perform_request
    response = self.pool.urlopen(
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 799, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.10/site-packages/urllib3/util/retry.py", line 525, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.10/site-packages/urllib3/packages/six.py", line 770, in reraise
    raise value
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 715, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 469, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 358, in _raise_timeout
    raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='elasticsearch-master', port=9200): Read timed out. (read timeout=60)

One _bulk in particular kept timing out and retrying continuously for over an hour. I tried repeating this call later and it succeeded in a matter of seconds, sending 687199 bytes of data in a single POST request.

A later /<index_name>/_search call, which had previously been working, returned within 100 ms with a strangely nonspecific 503 error I hadn't seen before:

Traceback (most recent call last):
  ...(script/our application code)...
  File "/usr/local/lib/python3.10/site-packages/elasticsearch/client/utils.py", line 347, in _wrapped
    return func(*args, params=params, headers=headers, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/elasticsearch/client/__init__.py", line 1821, in search
    return self.transport.perform_request(
  File "/usr/local/lib/python3.10/site-packages/elasticsearch/transport.py", line 464, in perform_request
    raise e
  File "/usr/local/lib/python3.10/site-packages/elasticsearch/transport.py", line 427, in perform_request
    status, headers_response, data = connection.perform_request(
  File "/usr/local/lib/python3.10/site-packages/elasticsearch/connection/http_urllib3.py", line 291, in perform_request
    self._raise_error(response.status, raw_data)
  File "/usr/local/lib/python3.10/site-packages/elasticsearch/connection/base.py", line 328, in _raise_error
    raise HTTP_EXCEPTIONS.get(status_code, TransportError)(
elasticsearch.exceptions.TransportError: TransportError(503, 'search_phase_execution_exception', None)

I am trying to figure out what happened to cause these errors. Since the request that was repeatedly timing out finished later within seconds, I'm guessing ES was running out of some resource or another, but CPU usage was low during the period of repeated timeouts and none of the nodes appears to have run out of memory at any point during the run. And what does the 503 error mean?

Thanks so much for reaching out, @dpitchford. Do you have a code sample you can share here?

Sorry, what do you mean by a code sample? This script tested successfully on 7 out of 8 of our QA sites, so I don't think the code is the problem; these timeout/transport errors occur on only one site.

The read timeouts are happening on bulk calls like this:

elasticsearch.helpers.streaming_bulk(
    self.es_client,
    self.bulk_index_actions,
    raise_on_error=False,
    raise_on_exception=False,
    request_timeout=60,
)

Where self.es_client is an Elasticsearch object, and self.bulk_index_actions is a list of actions like this, one for each document being indexed (up to 300):

{
    "_index": "index_name",
    "_id": document.pop("_id", None),
    "_source": document,
}

Like I said, when I tried repeating this call later, it POSTed a total of about 680 KB and completed without issue in a second or so.

The code on our side that led to the 503 transport error is a simple call to Elasticsearch.search:

return self.es_client.search(index="index_name", body=body)

Where body is a dict like this:

{
    "query": {"bool": {"must": [{"term": {"partition_id": "0123456789ab"}}]}},
    "size": 10000,
    "_source": ["revision"],
}

Is this enough of a code sample?

Hello! I maintain the Elasticsearch Python client, and can confirm that what you're doing is essentially correct. What you're seeing should not happen.

What version of Elasticsearch and the Python client are you using? The latter stack trace suggests you are still on Elasticsearch 7.x. A huge number of reliability improvements have gone into Elasticsearch 8 over the years, based on issues our users have run into, so I would suggest upgrading.

That said, I do have other suggestions.

  1. We recently added guidance about HTTP client configuration that I encourage you to read, as it applies to Elasticsearch 7.x as well. The existing timeout only makes matters worse when the cluster is misbehaving: Elasticsearch will still try to honor every request, including those the client has already given up on and retried, which compounds the problem. Consider raising it or removing it altogether (see the sketch after this list).
  2. When running search requests, please catch the TransportError as e and call e.info() to get the full message from Elasticsearch: it will contain more details. (This only applies to version 7.x of the client; it looks like this was lost in the refactoring for version 8.x.) You can always re-raise the exception after that.
  3. If you're seeing this problem as it happens, you can capture hot threads (Nodes hot threads API | Elasticsearch Guide [8.15] | Elastic) and share them here; it will help us understand what the cluster is doing (also covered in the sketch below).
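
To make 1 and 3 concrete, here is a rough, untested sketch; es stands in for your existing Elasticsearch client and bulk_index_actions for your existing list of actions, so adjust both to your code:

import logging

from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk

es = Elasticsearch(["http://elasticsearch-master:9200"])  # your existing client

# 1. Raise the per-request timeout on the bulk helper (or drop request_timeout
#    entirely so the client-level default applies).
for ok, item in streaming_bulk(
    es,
    bulk_index_actions,        # placeholder: your existing list of index actions
    raise_on_error=False,
    raise_on_exception=False,
    request_timeout=300,       # or omit altogether
):
    if not ok:
        logging.warning("bulk item failed: %s", item)

# 3. Capture hot threads while the problem is happening; the API returns plain
#    text, so the result can be logged or pasted into this thread as-is.
logging.warning("hot threads:\n%s", es.nodes.hot_threads())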

Also, you mentioned zero replicas. While this does speed up ingestion, it also means you could lose your data at any time.
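
For what it's worth, if zero replicas is only meant to speed up the initial load, replicas can be added back once indexing finishes, along these lines (a sketch using the update index settings API; es as in the sketch above, and the index name is a placeholder):

# Sketch: re-enable replicas after the bulk load completes so a lost node
# no longer means lost data.
es.indices.put_settings(
    index="index_name",                        # placeholder index name
    body={"index": {"number_of_replicas": 1}},
)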

Thanks for the suggestions. The HTTP Client Configuration doc says to configure TCP keepalives if I disable the request timeout on the client. How do I do that with the Python client?

Looking at the code, TransportError.info appears to be a property, not a callable; is that correct?

I will try checking hot threads, but I doubt that high CPU usage is the problem. Here are graphs from Grafana of the 8 nodes' memory and CPU usage during the script run. The continuous timeouts start at around 10:35, after which CPU usage drops off on all but one node, and even on that node it is lower than it was during active indexing. Memory usage on that node is somewhat high but never reaches the 8 GB limit, and it seems to increase monotonically on all the other nodes.

[Grafana graphs: per-node CPU and memory usage during the script run]

Configuring TCP keepalives isn't possible with the Python client, but it should still be possible to bump the timeout to at least 300 seconds, which is the TCP_KEEPIDLE value we're considering.

Correct, sorry: TransportError.info is a property, not a callable.

Right, but hot threads will tell us what the cluster is doing. They can help diagnose a variety of other issues (disk or network problems, for example), although the quality of this API's output does depend on the Elasticsearch version (which you haven't shared yet).

Oh right, sorry; the ES version is 7.17.3. We have been planning to upgrade to ES 8 at some point, but this might make it more urgent. I will try logging ex.info, bumping the request timeout to 300s, and logging hot threads any time we encounter a timeout.
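
For reference, here is roughly what I'm planning for the search side (a sketch only; the index name, partition_id value, and the search_revisions helper are placeholders, and es_client is our existing client):

import logging

from elasticsearch.exceptions import TransportError

def search_revisions(es_client, index_name, partition_id):
    body = {
        "query": {"bool": {"must": [{"term": {"partition_id": partition_id}}]}},
        "size": 10000,
        "_source": ["revision"],
    }
    try:
        return es_client.search(index=index_name, body=body, request_timeout=300)
    except TransportError as e:
        # e.info is a property holding the parsed error body from Elasticsearch
        logging.error("search failed: status=%s info=%s", e.status_code, e.info)
        raise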

Update: the timeout errors did indeed stop when I increased the request timeout to 5 minutes!

On a subsequent run, the script failed because ES itself appears to have crashed, and I'm not sure why. I got this error when trying to bulk-index some documents:

{'index': {'_index': 'data_explorer_search_20240924t180019z', '_type': '_doc', '_id': '62d1cad34be7a001a4a64a84', 'status': 503, 'error': {'type': 'unavailable_shards_exception', 'reason': '[data_explorer_search_20240924t180019z][9] primary shard is not active Timeout: [1m], request: [BulkShardRequest [[data_explorer_search_20240924t180019z][9]] containing [16] requests]'}}, 'partition_id': 'VkwqcE2fiZWq', 'shard_id': None, 'path': {'group': 'fmm', 'project': '62d1cac38dc190f2b40e054a'}}

I then got this error when trying to perform a search:

503 TransportError search_phase_execution_exception {'error': {'root_cause': [{'type': 'no_shard_available_action_exception', 'reason': None, 'index_uuid': '4_FhwxkaQv2f6YaQYSzRwQ', 'shard': '0', 'index': 'data_explorer_search_20240924t180019z'}], 'type': 'search_phase_execution_exception', 'reason': 'all shards failed', 'phase': 'query', 'grouped': True, 'failed_shards': [{'shard': 0, 'index': 'data_explorer_search_20240924t180019z', 'node': None, 'reason': {'type': 'no_shard_available_action_exception', 'reason': None, 'index_uuid': '4_FhwxkaQv2f6YaQYSzRwQ', 'shard': '0', 'index': 'data_explorer_search_20240924t180019z'}}]}, 'status': 503}

The ES logs from this period don't show any obvious cause of failure: there are no error-level messages and no context at all before various processes start stopping, followed by a series of long tracebacks.