A single-threaded script I'm working on keeps getting read timeout errors from Elasticsearch. It happens while bulk-indexing data into an index with 12 primary shards and 0 replicas spread across 4 nodes. Calls to the `_bulk` endpoint frequently time out after 60 seconds with this error:
```
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 467, in _make_request
    six.raise_from(e, None)
  File "<string>", line 3, in raise_from
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 462, in _make_request
    httplib_response = conn.getresponse()
  File "/usr/local/lib/python3.10/http/client.py", line 1375, in getresponse
    response.begin()
  File "/usr/local/lib/python3.10/http/client.py", line 318, in begin
    version, status, reason = self._read_status()
  File "/usr/local/lib/python3.10/http/client.py", line 279, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/usr/local/lib/python3.10/socket.py", line 717, in readinto
    return self._sock.recv_into(b)
TimeoutError: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/elasticsearch/connection/http_urllib3.py", line 255, in perform_request
    response = self.pool.urlopen(
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 799, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.10/site-packages/urllib3/util/retry.py", line 525, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/usr/local/lib/python3.10/site-packages/urllib3/packages/six.py", line 770, in reraise
    raise value
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 715, in urlopen
    httplib_response = self._make_request(
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 469, in _make_request
    self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 358, in _raise_timeout
    raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='elasticsearch-master', port=9200): Read timed out. (read timeout=60)
```
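If the timeouts are related to how much work each request carries, one thing to try is capping every `_bulk` request by document count and byte size, so each request has a better chance of finishing inside the 60-second read timeout. Below is a minimal standard-library sketch of that idea. The function name `chunk_bulk_body` and the 500-document / 5 MiB limits are hypothetical placeholders, not values tuned for this cluster.

```python
import json
from typing import Iterable, Iterator, List, Tuple

# Hypothetical limits, chosen only for illustration.
MAX_DOCS_PER_REQUEST = 500
MAX_BYTES_PER_REQUEST = 5 * 1024 * 1024  # 5 MiB


def chunk_bulk_body(
    docs: Iterable[Tuple[str, dict]],
    index: str,
    max_docs: int = MAX_DOCS_PER_REQUEST,
    max_bytes: int = MAX_BYTES_PER_REQUEST,
) -> Iterator[bytes]:
    """Yield NDJSON _bulk bodies holding at most max_docs documents each.

    A body also stays under max_bytes unless a single document is larger
    than that on its own, in which case it is yielded in a body by itself.
    """
    pending: List[bytes] = []
    size = 0
    for doc_id, source in docs:
        # Every document is an action line followed by its source line.
        action = json.dumps({"index": {"_index": index, "_id": doc_id}})
        pair = (action + "\n" + json.dumps(source) + "\n").encode("utf-8")
        # Flush the current body if adding this document would exceed a limit.
        if pending and (len(pending) >= max_docs or size + len(pair) > max_bytes):
            yield b"".join(pending)
            pending, size = [], 0
        pending.append(pair)
        size += len(pair)
    if pending:
        yield b"".join(pending)
```

If the script already uses elasticsearch-py's `helpers.bulk()`, it does this kind of chunking itself through its `chunk_size` and `max_chunk_bytes` arguments, so tuning those may be the simpler route.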
One `_bulk` call in particular kept timing out and being retried continuously for over an hour. When I repeated the same call later, it succeeded in a matter of seconds, sending 687,199 bytes of data in a single POST request.
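Retrying the same request indefinitely can hide a problem for a long time. A common alternative is a bounded retry with capped exponential backoff and jitter, so a request that stays stuck fails loudly after a few attempts. This is a generic sketch; `call_with_backoff` and its defaults are hypothetical, and `sleep` is injectable only to make it easy to test.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_backoff(
    func: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func(), retrying only on the listed exception types.

    Before each retry, waits a random time between 0 and
    min(max_delay, base_delay * 2**attempt) ("full jitter"), and re-raises
    the last exception once max_attempts calls have failed.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # give up and let the caller see the real error
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
    raise AssertionError("unreachable")
```

With elasticsearch-py 7.x, a urllib3 read timeout like the one above should reach the caller as `elasticsearch.exceptions.ConnectionTimeout`, so a call might look like `call_with_backoff(lambda: es.bulk(body=chunk), (ConnectionTimeout,))`, where `es` and `chunk` stand in for the script's own client and request body.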
A later `/<index_name>/_search` call, which had previously been working, returned within 100 ms with an oddly nonspecific 503 error that I hadn't seen before:
```
Traceback (most recent call last):
  ...(script/our application code)...
  File "/usr/local/lib/python3.10/site-packages/elasticsearch/client/utils.py", line 347, in _wrapped
    return func(*args, params=params, headers=headers, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/elasticsearch/client/__init__.py", line 1821, in search
    return self.transport.perform_request(
  File "/usr/local/lib/python3.10/site-packages/elasticsearch/transport.py", line 464, in perform_request
    raise e
  File "/usr/local/lib/python3.10/site-packages/elasticsearch/transport.py", line 427, in perform_request
    status, headers_response, data = connection.perform_request(
  File "/usr/local/lib/python3.10/site-packages/elasticsearch/connection/http_urllib3.py", line 291, in perform_request
    self._raise_error(response.status, raw_data)
  File "/usr/local/lib/python3.10/site-packages/elasticsearch/connection/base.py", line 328, in _raise_error
    raise HTTP_EXCEPTIONS.get(status_code, TransportError)(
elasticsearch.exceptions.TransportError: TransportError(503, 'search_phase_execution_exception', None)
```
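The printed exception only shows part of the response. elasticsearch-py's `TransportError` keeps the decoded response body in its `info` attribute, and for a `search_phase_execution_exception` that body usually carries `root_cause`, `phase` and `failed_shards` entries that may say which shards failed and why. The sketch below turns such a body into readable lines; `describe_search_failure` is a hypothetical name, and any field missing from a given response is simply printed as `None`.

```python
from typing import Any, Dict, List, Optional


def describe_search_failure(info: Optional[Dict[str, Any]]) -> List[str]:
    """Turn the decoded body of a failed _search response into readable lines."""
    if not isinstance(info, dict) or not isinstance(info.get("error"), dict):
        # e.g. info is None or a raw, non-JSON response string
        return [f"no structured error body: {info!r}"]
    err = info["error"]
    lines = [f"{err.get('type')}: {err.get('reason')} (phase={err.get('phase')})"]
    for cause in err.get("root_cause") or []:
        lines.append(f"root cause: {cause.get('type')}: {cause.get('reason')}")
    for failure in err.get("failed_shards") or []:
        reason = failure.get("reason") or {}
        lines.append(
            f"shard {failure.get('index')}[{failure.get('shard')}] "
            f"on node {failure.get('node')}: {reason.get('type')}: {reason.get('reason')}"
        )
    if len(lines) == 1:
        lines.append("no root causes or per-shard failures reported")
    return lines


# Hypothetical usage, with `es` and `query` standing in for the script's own objects:
#
# try:
#     es.search(index="<index_name>", body=query)
# except TransportError as e:
#     print("\n".join(describe_search_failure(e.info)))
```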
I'm trying to figure out what caused these errors. Since the request that kept timing out later finished within seconds, my guess is that Elasticsearch was running out of some resource, but CPU usage was low during the period of repeated timeouts, and none of the nodes appears to have run out of memory at any point during the run. Also, what does the 503 error mean?
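Two things that might be worth checking, offered as guesses rather than a diagnosis: whether any shards were unassigned or initializing during the bad period, and whether the write thread pools were queueing or rejecting work. With 0 replicas, each of the 12 shards has exactly one copy, so if the node holding a shard drops out or restarts, that shard has no other copy to serve requests until it recovers. The sketch below only parses the JSON that `GET _cat/shards?format=json&h=index,shard,prirep,state,node,unassigned.reason` and `GET _cat/thread_pool/write?format=json` return (the pool is named `bulk` on older versions); fetching it is left out, and the function names are hypothetical.

```python
from typing import Dict, List, Optional

# Rows are dicts as returned by the _cat APIs with ?format=json; values are
# strings (or null), e.g. {"index": "idx", "shard": "0", "prirep": "p", ...}.
Row = Dict[str, Optional[str]]


def unhealthy_shards(cat_shards: List[Row]) -> List[str]:
    """Describe every shard copy from _cat/shards that is not STARTED."""
    problems = []
    for row in cat_shards:
        if row.get("state") == "STARTED":
            continue
        kind = "primary" if row.get("prirep") == "p" else "replica"
        node = row.get("node") or "no node"
        line = f"{row.get('index')}[{row.get('shard')}] {kind} {row.get('state')} on {node}"
        # unassigned.reason is only present if requested via the h= parameter.
        if row.get("unassigned.reason"):
            line += f" ({row['unassigned.reason']})"
        problems.append(line)
    return problems


def busy_thread_pools(cat_pools: List[Row]) -> List[str]:
    """Flag _cat/thread_pool rows that have queued or rejected tasks.

    Note that `rejected` is a running total since the node started.
    """
    flagged = []
    for row in cat_pools:
        queue = int(row.get("queue") or 0)
        rejected = int(row.get("rejected") or 0)
        if queue or rejected:
            flagged.append(
                f"{row.get('node_name')} {row.get('name')}: queue={queue} rejected={rejected}"
            )
    return flagged
```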