V8.17.3 ES Cloud: Random morning ConnectionTimeout to Elastic Cloud from Django/Kubernetes, same request succeeds seconds later

Hi,

We are seeing intermittent elastic_transport.ConnectionTimeout: Connection timed out errors from our Django application to Elastic Cloud.

Environment:

  • Django app running on Kubernetes (EKS)
  • uWSGI, 1 process per pod
  • Elasticsearch client libraries:
    • elasticsearch-dsl==8.17.1
    • django-elasticsearch-dsl (custom fork)
    • elastic-apm==6.26.1
  • Elastic Cloud in eu-west-1 (v8.17.3)
  • App also sends APM data to Elastic Cloud

Symptom:

  • Intermittent timeouts, most often noticed in the morning between 06:00 and 08:00 UTC
  • The same endpoint/request usually succeeds if retried a few seconds later
  • The timeout is raised in the app as:
    elastic_transport.ConnectionTimeout: Connection timed out

Example failing request:

  • GET /api/v2.7/units/2418/global-search/?search_query_string=store%20t&page_size=50&page=1

Example app log:

  • request failed after about 60 seconds with HTTP 500
  • retry seconds later succeeded in milliseconds

Example logs around the same time:

  • Elasticsearch request timeout:
    ConnectionTimeout: Connection timed out
  • APM send failure:
    Unable to reach APM Server: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

Important detail:
Elastic support checked the cluster and said they only see normal 200 responses, no matching server-side errors/rejections/499s, and cluster metrics look healthy.

That is why this currently looks more like a client/network/path issue than an Elasticsearch query-performance issue.

App/client details:

  • Current ES client timeout is 60 seconds
  • We are using the shared elasticsearch-dsl connection in long-running Django/uWSGI workers
  • Some helper code also creates direct Elasticsearch(...) clients
  • No explicit retry_on_timeout / max_retries configured yet

Infrastructure details:

  • App runs on EKS worker nodes
  • Current workloads are scheduled on spot nodes
  • No obvious Elasticsearch cluster health issue seen from the Elastic side
  • Failures are transient, and the same request works immediately afterwards

Questions:

  1. Does this pattern usually point to stale pooled HTTP connections / keep-alive reuse on the client side?
  2. Have others seen this kind of behavior with Django/uWSGI long-running workers talking to Elastic Cloud?
  3. Are there recommended elastic_transport debug logs to enable temporarily to distinguish:
    • connect timeout
    • TLS handshake issue
    • stale reused socket
    • upstream network idle timeout / load balancer idle timeout
  4. Are there recommended client settings for Elastic Cloud in this situation:
    • request_timeout
    • retry_on_timeout
    • max_retries
    • connection pool tuning
  5. Is there anything on the Elastic Cloud side that could produce this symptom without showing obvious request failures in server logs?

Any guidance on how to prove whether this is one of the following would be very helpful:

  • a client connection reuse issue
  • a Kubernetes node/network issue
  • a DNS/TLS issue

Thanks.

This is our ES-DSL setup below:

# ELASTICSEARCH
elasticsearch_url = "REDACTED"
elasticsearch_port = "9243"

ELASTICSEARCH_DSL = {
    "default": {
        "hosts": f"https://elastic:REDACTED@{elasticsearch_url}:{elasticsearch_port}",
        "timeout": 60,  # Custom timeout
    },
}

# Used to specify which delimiter to use for attribute paths
ATTRIBUTES_PATH_DELIMITER = "_"
TALENT_POOL_PATH_DELIMITER = "_"

# Default document settings
ELASTICSEARCH_DSL_INDEX_SETTINGS = {
    "number_of_shards": 1,
    "analysis": {
        "analyzer": {
            "path_analyzer": {
                "tokenizer": "path_tokenizer",
            },
        },
        "tokenizer": {
            "path_tokenizer": {
                "type": "path_hierarchy",
                "delimiter": ATTRIBUTES_PATH_DELIMITER,
                "reverse": "true",
            },
        },
    },
}

# Set to False to globally disable auto-syncing.
ELASTICSEARCH_DSL_AUTOSYNC = os.environ.get("ELASTICSEARCH_DSL_AUTOSYNC", "True").lower() in ("true", "1")

# Set to False to not force an index refresh with every save.
ELASTICSEARCH_DSL_AUTO_REFRESH = os.environ.get("ELASTICSEARCH_DSL_AUTO_REFRESH", "True").lower() in ("true", "1")

# Class used to handle Django’s signals and keep the search index up-to-date.
ELASTICSEARCH_DSL_SIGNAL_PROCESSOR = os.environ.get(
    "ELASTICSEARCH_DSL_SIGNAL_PROCESSOR",
    "django_elasticsearch_dsl.signals.CelerySignalProcessor",
)

# Run indexing (populate and rebuild) in parallel using ES’ parallel_bulk() method.
ELASTICSEARCH_DSL_PARALLEL = False

And our requirements.txt for ES packages:


# Elasticsearch
# For Elasticsearch 7.0 and later, use the major version 7 (7.x.y) of the library
elasticsearch-dsl==8.17.1
# django-elasticsearch-dsl==8.0.0, we needed use a custom version of this package to handle SoftDeleteManager
# together with CeleryProcessor
#django-elasticsearch-dsl-drf==0.22.5
# unmaintained elasticsearch causes issues
git+https://github.com/millerf/django-elasticsearch-dsl-drf@fix/aggs-proxy-import-error
elastic-apm==6.26.1
git+https://github.com/mojob/mojob-django-elasticsearch-dsl.git@8.0.9#egg=django-elasticsearch-dsl

Hi,

Unfortunately I don't have any immediate solution to the problem, but let me try to get a better understanding of it with some questions.

The example failing request that you are showing has a URL that I do not recognize. I assume this is a URL that your Django server listens to? What happens in the handler for this URL? What is the actual Elasticsearch request that times out? Is it always the same, or does this happen with random Elasticsearch endpoints?

"Elastic support checked the cluster and said they only see normal 200 responses"

I'm not surprised about this. The errors that you are getting are connection timeouts. We do not know why, but the server isn't receiving these connections at all, and for that reason they time out. You will only see these errors in the client.

I don't really have much to go on, but I'll try to answer your questions:

  1. Really hard for me to say with the little information that I have. If you were able to capture the request URLs that fail vs. those that succeed maybe we can draw some conclusions.
  2. I do not know of any other similar issues, but if you suspect your workers could be getting into a bad state after running for too long, a good way to test this might be to restart them at regular intervals. It is unclear to me from your description whether the retry that succeeds a few seconds after the failure is issued from the same worker or from a different one, and whether it goes to the same Elasticsearch node or a different node. Knowing this may help to come up with some theories.
  3. If you can afford to increase the log volume, then logging the underlying HTTP client that you are using could be useful. This would be urllib3 if you are using the default choice of client.
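As a rough sketch of what that logging could look like, assuming the default urllib3-based node: the snippet below attaches one DEBUG handler to the "urllib3" and "elastic_transport" logger hierarchies. The function names and the log format are made up for illustration, and in a Django project the same thing would normally be expressed through the LOGGING setting instead.

```python
import logging
import sys

# Root loggers of the HTTP library and of the transport package.
# Child loggers such as "urllib3.connectionpool" inherit the level
# and propagate their records to the handler attached here.
HTTP_LOGGERS = ("urllib3", "elastic_transport")


def enable_http_debug_logging(stream=sys.stderr):
    """Attach one DEBUG handler to the HTTP client loggers and return it."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter(
            # Process id and thread name help to tell workers apart.
            "%(asctime)s pid=%(process)d %(threadName)s "
            "%(name)s %(levelname)s %(message)s"
        )
    )
    for name in HTTP_LOGGERS:
        logger = logging.getLogger(name)
        logger.setLevel(logging.DEBUG)
        logger.addHandler(handler)
    return handler


def disable_http_debug_logging(handler):
    """Undo enable_http_debug_logging() once enough data has been captured."""
    for name in HTTP_LOGGERS:
        logger = logging.getLogger(name)
        logger.removeHandler(handler)
        logger.setLevel(logging.NOTSET)
```

With this enabled, the urllib3 connection pool should report at DEBUG level when it opens a new connection and when it resets a pooled connection it found to be dropped, which is the kind of evidence that helps tell a fresh connect from a reused socket. Expect a noticeable increase in log volume while it is on.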

I also have a question of my own. Have you investigated if this could be an issue on the EKS side? If you are losing networking then that would also present as timeouts, and is consistent with the fact that the server has no idea you have sent these requests.
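One way to gather evidence about the network path is to run a small probe from inside an affected pod during the problem window. The sketch below uses only the Python standard library and times DNS resolution, TCP connect and the TLS handshake as separate phases. The function name `probe` and the shape of the returned dict are illustrative, not part of any Elastic tooling.

```python
import socket
import ssl
import time


def probe(host, port, use_tls=True, timeout=5.0):
    """Time DNS, TCP connect and TLS handshake separately.

    Returns a dict with per-phase timings in seconds. If a phase fails,
    "error" names the phase and the exception, and later phases stay None.
    """
    result = {"dns_s": None, "connect_s": None, "tls_s": None, "error": None}
    phase = "dns"
    sock = None
    try:
        start = time.monotonic()
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
        result["dns_s"] = time.monotonic() - start
        family, socktype, proto, _canonname, sockaddr = infos[0]

        phase = "connect"
        sock = socket.socket(family, socktype, proto)
        sock.settimeout(timeout)
        start = time.monotonic()
        sock.connect(sockaddr)
        result["connect_s"] = time.monotonic() - start

        if use_tls:
            phase = "tls"
            context = ssl.create_default_context()
            start = time.monotonic()
            sock = context.wrap_socket(sock, server_hostname=host)
            result["tls_s"] = time.monotonic() - start
    except OSError as exc:
        # gaierror, timeouts, refused connections and ssl.SSLError
        # are all subclasses of OSError.
        result["error"] = f"{phase}: {exc!r}"
    finally:
        if sock is not None:
            sock.close()
    return result
```

Calling `probe(elasticsearch_url, 9243)` every minute or so and logging the result would show whether fresh connections also fail in the morning. Note that the probe always opens a new connection, so it cannot detect a stale pooled socket. If the probe stays healthy while the application times out, that points towards connection reuse rather than DNS, routing or TLS.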

Lastly, a word about retries. You could enable max_retries and see if that helps. If the requests are timing out after 60 seconds, then setting 5-10 retries may allow you to ride out this issue and eventually get a connection. The version of the client that you are using issues the retries back to back, without a delay between them. I have recently added a retry backoff feature that we found to be useful in some cases, as it extends the retry period with these pauses, allowing more time for the issue to resolve itself. But this is only in the 9.4.1 client, our most recent one.
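On a client without built-in backoff, a pause between attempts can be approximated in application code. The following is a minimal sketch of a generic wrapper, not the client's own retry feature: `call_with_backoff` and all of its parameters are hypothetical names, and the exception type to retry on (for example `elastic_transport.ConnectionTimeout`) is passed in by the caller.

```python
import random
import time


def call_with_backoff(fn, retry_on, max_retries=5, base_delay=0.5,
                      max_delay=10.0, sleep=time.sleep):
    """Call fn(); if it raises retry_on, wait and try again.

    The wait doubles after every failed attempt, capped at max_delay,
    with up to 10% random jitter so workers do not retry in lockstep.
    After max_retries failed retries the last exception is re-raised.
    """
    attempt = 0
    while True:
        try:
            return fn()
        except retry_on:
            if attempt >= max_retries:
                raise
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(delay + random.uniform(0, delay * 0.1))
            attempt += 1
```

A call site might look like `call_with_backoff(lambda: search.execute(), ConnectionTimeout)`, where `search` stands for whatever query object the view builds. Two caveats: each attempt can still take the full request timeout, so this works best with a timeout well below 60 seconds, and if the client's own max_retries is also set, the two retry layers multiply.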

Miguel

Sorry for the late reply @Miguel_Grinberg.

What helped us was a solution proposed by an Elasticsearch support engineer: adding KeepAliveNode as the node_class for the Elasticsearch connection. Lowering the timeout and configuring max_retries also helped. The exact configuration is included in the snippet below.

Since applying these changes, we haven’t experienced the connection timeout issues that previously occurred in the mornings.

Our suspicion was that during the evenings, when the system had little or no traffic, some of the Elasticsearch connections from our Kubernetes pods became stale. Then, when traffic picked up again in the morning, the first requests would fail with connection timeout errors.

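import socket
from elastic_transport import Urllib3HttpNode

class KeepAliveNode(Urllib3HttpNode):
    """urllib3-based node whose pooled sockets send TCP keep-alive probes."""

    def __init__(self, config):
        super().__init__(config)
        # Socket options applied to every new connection in the pool.
        # TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT are the Linux option names.
        self.pool.conn_kw["socket_options"] = [
            (socket.IPPROTO_TCP, socket.TCP_NODELAY, 1),     # disable Nagle's algorithm
            (socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1),     # turn keep-alive probes on
            (socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60),   # first probe after 60 s idle
            (socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10),  # then one probe every 10 s
            (socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5),     # give up after 5 unanswered probes
        ]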

ELASTICSEARCH_DSL = {
    "default": {
        "hosts": f"https://elastic:<password>@{elasticsearch_url}:{elasticsearch_port}",
        "timeout": 60,
        "retry_on_timeout": True,
        "max_retries": 3,
        "node_class": KeepAliveNode,
    },
}
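For anyone who wants to check what these socket options do on their own platform, here is a small stdlib-only sketch. It sets the same values on a plain socket, reads them back with getsockopt, and estimates how long an unresponsive peer takes to be detected. The helper names are made up for this example, and the detection estimate assumes the usual Linux behaviour of idle time plus interval times probe count.

```python
import socket


def apply_keepalive(sock, idle=60, interval=10, count=5):
    """Enable TCP keep-alive on sock and return the values read back."""
    applied = {}
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    applied["SO_KEEPALIVE"] = sock.getsockopt(
        socket.SOL_SOCKET, socket.SO_KEEPALIVE
    )
    # The tuning constants are platform-specific, so skip any that
    # this Python build does not expose instead of failing.
    for name, value in (
        ("TCP_KEEPIDLE", idle),
        ("TCP_KEEPINTVL", interval),
        ("TCP_KEEPCNT", count),
    ):
        option = getattr(socket, name, None)
        if option is None:
            continue
        sock.setsockopt(socket.IPPROTO_TCP, option, value)
        applied[name] = sock.getsockopt(socket.IPPROTO_TCP, option)
    return applied


def dead_peer_detection_seconds(idle=60, interval=10, count=5):
    """Rough time until an unresponsive peer is declared dead."""
    return idle + interval * count


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    print(apply_keepalive(s))
    s.close()
    print(dead_peer_detection_seconds(), "seconds")
```

With the values from the snippet above, a dead peer would be noticed after roughly 60 + 10 × 5 = 110 seconds of silence. The probes sent on an otherwise idle connection may also be what keeps intermediate devices from dropping it overnight, which would fit the stale-connection suspicion described earlier.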