A single-threaded script I'm working on is experiencing repeated read timeout errors from Elasticsearch. This happens when bulk-indexing data to an index with 12 primary shards and 0 replicas, spread across 4 nodes. Calls to the _bulk endpoint frequently time out after 60 seconds with this error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 467, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 462, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/local/lib/python3.10/http/client.py", line 1375, in getresponse
    response.begin()
  File "/usr/local/lib/python3.10/http/client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "/usr/local/lib/python3.10/http/client.py", line 279, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/local/lib/python3.10/socket.py", line 717, in readinto
    return self._sock.recv_into(b)
TimeoutError: timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/elasticsearch/connection/http_urllib3.py", line 255, in perform_request
    response = self.pool.urlopen(
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 799, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.10/site-packages/urllib3/util/retry.py", line 525, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.10/site-packages/urllib3/packages/six.py", line 770, in reraise
    raise value
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 715, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 469, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 358, in _raise_timeout
    raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='elasticsearch-master', port=9200): Read timed out. (read timeout=60)
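For context, the client and the failing call are set up roughly like this (a simplified sketch rather than the actual code; the index name and payload variable are placeholders, but the host, port, and 60-second timeout match the traceback):

from elasticsearch import Elasticsearch

es = Elasticsearch(
    ["http://elasticsearch-master:9200"],  # host/port from the error above
    timeout=60,                            # the read timeout being hit
)

def bulk_index(ndjson_payload, index_name="my-index"):  # index name is a placeholder
    # ndjson_payload is a pre-built newline-delimited action/document string,
    # roughly 700 KB per request in the failing case
    return es.bulk(body=ndjson_payload, index=index_name)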
One _bulk request in particular kept timing out and retrying continuously for over an hour. When I repeated the same call later, it succeeded within seconds, sending the same 687,199 bytes of data in a single POST request.
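The retry behaviour is essentially this loop (simplified; ConnectionTimeout is what elasticsearch-py raises when urllib3 hits its read timeout):

import time
from elasticsearch.exceptions import ConnectionTimeout

def bulk_with_retry(ndjson_payload, index_name, delay=5):
    # Retry the same payload indefinitely on read timeouts; this is the loop
    # that spun for over an hour on the one stubborn request.
    while True:
        try:
            return es.bulk(body=ndjson_payload, index=index_name)
        except ConnectionTimeout:
            time.sleep(delay)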
A later /<index_name>/_search call, which had previously been working, returned within 100 ms with a strangely nonspecific 503 error I hadn't seen before:
Traceback (most recent call last):
  ...(script/our application code)...
  File "/usr/local/lib/python3.10/site-packages/elasticsearch/client/utils.py", line 347, in _wrapped
    return func(*args, params=params, headers=headers, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/elasticsearch/client/__init__.py", line 1821, in search
    return self.transport.perform_request(
  File "/usr/local/lib/python3.10/site-packages/elasticsearch/transport.py", line 464, in perform_request
    raise e
  File "/usr/local/lib/python3.10/site-packages/elasticsearch/transport.py", line 427, in perform_request
    status, headers_response, data = connection.perform_request(
  File "/usr/local/lib/python3.10/site-packages/elasticsearch/connection/http_urllib3.py", line 291, in perform_request
    self._raise_error(response.status, raw_data)
  File "/usr/local/lib/python3.10/site-packages/elasticsearch/connection/base.py", line 328, in _raise_error
    raise HTTP_EXCEPTIONS.get(status_code, TransportError)(
elasticsearch.exceptions.TransportError: TransportError(503, 'search_phase_execution_exception', None)
I'm trying to figure out what caused these errors. Since the request that kept timing out completed within seconds when retried later, my guess is that ES was starved of some resource, but CPU usage was low during the period of repeated timeouts and none of the nodes appears to have run out of memory at any point during the run. And what does the 503 error mean?
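In case it points at something, this is the kind of instrumentation I'm planning to add around each bulk call for the next run (a sketch using the cluster health and cat thread_pool APIs; nothing here was captured during the failing run):

def snapshot_cluster_state():
    # Overall cluster status plus per-node write/search thread-pool
    # queue depths and rejection counts.
    health = es.cluster.health()
    pools = es.cat.thread_pool(
        thread_pool_patterns="write,search",
        h="node_name,name,active,queue,rejected",
        v=True,
    )
    print("cluster status:", health["status"])
    print(pools)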