V8.17.3 ES Cloud: Random morning ConnectionTimeout to Elastic Cloud from Django/Kubernetes, same request succeeds seconds later

Hi,

We are seeing intermittent elastic_transport.ConnectionTimeout: Connection timed out errors from our Django application to Elastic Cloud.

Environment:

  • Django app running on Kubernetes (EKS)
  • uWSGI, 1 process per pod
  • Elasticsearch client libraries:
    • elasticsearch-dsl==8.17.1
    • django-elasticsearch-dsl (custom fork)
    • elastic-apm==6.26.1
  • Elastic Cloud in eu-west-1 (v8.17.3)
  • App also sends APM data to Elastic Cloud

Symptom:

  • Intermittent timeouts with no obvious trigger, most often noticed in the morning between 06:00 and 08:00 UTC
  • The same endpoint/request usually succeeds if retried a few seconds later
  • The timeout is raised in the app as:
    elastic_transport.ConnectionTimeout: Connection timed out

Example failing request:

  • GET /api/v2.7/units/2418/global-search/?search_query_string=store%20t&page_size=50&page=1

Example app log:

  • request failed after about 60 seconds with HTTP 500
  • retry seconds later succeeded in milliseconds

Example logs around the same time:

  • Elasticsearch request timeout:
    ConnectionTimeout: Connection timed out
  • APM send failure:
    Unable to reach APM Server: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

Important detail:
Elastic support checked the cluster and said they only see normal 200 responses, no matching server-side errors/rejections/499s, and cluster metrics look healthy.

That is why this currently looks more like a client/network/path issue than an Elasticsearch query-performance issue.

App/client details:

  • Current ES client timeout is 60 seconds
  • We are using the shared elasticsearch-dsl connection in long-running Django/uWSGI workers
  • Some helper code also creates direct Elasticsearch(...) clients
  • No explicit retry_on_timeout / max_retries configured yet
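
For reference, this is a minimal sketch of the settings change we are considering for that last point. It assumes our django-elasticsearch-dsl fork passes each alias dict through to the Elasticsearch(...) constructor unchanged, as upstream does. The helper name build_es_dsl_settings is hypothetical, and the values (10 seconds, 3 retries) are placeholders rather than recommendations.

```python
def build_es_dsl_settings(hosts, request_timeout=10, max_retries=3):
    """Return an ELASTICSEARCH_DSL-style dict with explicit retry settings.

    Assumption: every key under "default" is forwarded as a keyword
    argument to the elasticsearch-py 8.x client constructor.
    """
    return {
        "default": {
            "hosts": hosts,
            # Per-request timeout in seconds (we currently pass "timeout": 60).
            "request_timeout": request_timeout,
            # Ask the client to retry a request that timed out.
            "retry_on_timeout": True,
            # Upper bound on retries per request.
            "max_retries": max_retries,
        },
    }


# Placeholder host string; real credentials are redacted.
ELASTICSEARCH_DSL = build_es_dsl_settings("https://elastic:REDACTED@REDACTED:9243")
```

With a shorter timeout plus retries, a request that hits a dead socket should fail over to a fresh attempt in seconds instead of returning HTTP 500 after 60 seconds, but that would only mask the underlying cause.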

Infrastructure details:

  • App runs on EKS worker nodes
  • Current workloads are scheduled on spot nodes
  • No obvious Elasticsearch cluster health issue seen from the Elastic side
  • Failures are transient and the same request works immediately afterwards

Questions:

  1. Does this pattern usually point to stale pooled HTTP connections / keep-alive reuse on the client side?
  2. Have others seen this kind of behavior with Django/uWSGI long-running workers talking to Elastic Cloud?
  3. Are there recommended elastic_transport debug logs to enable temporarily to distinguish:
    • connect timeout
    • TLS handshake issue
    • stale reused socket
    • upstream network idle timeout / load balancer idle timeout
  4. Are there recommended client settings for Elastic Cloud in this situation:
    • request_timeout
    • retry_on_timeout
    • max_retries
    • connection pool tuning
  5. Is there anything on the Elastic Cloud side that could produce this symptom without showing obvious request failures in server logs?
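
For question 3, this is roughly the temporary logging we had in mind, using only the standard logging module. The logger names are our assumption based on the elastic-transport 8.x and urllib3 packages we have installed, and the function names are hypothetical. As far as we understand, urllib3.connectionpool at DEBUG reports when a new connection is opened and when a dropped connection is reset, which might help separate a stale reused socket from a fresh connect that times out.

```python
import logging

# Assumed logger names; worth verifying against the installed versions.
TRANSPORT_LOGGERS = (
    "elastic_transport.transport",
    "elastic_transport.node",
    "urllib3.connectionpool",
)


def enable_transport_debug_logging(level=logging.DEBUG):
    """Attach one stream handler to the transport loggers and return it."""
    handler = logging.StreamHandler()
    # Include the process id so log lines can be tied to one uWSGI worker.
    handler.setFormatter(
        logging.Formatter("%(asctime)s pid=%(process)d %(name)s %(levelname)s %(message)s")
    )
    for name in TRANSPORT_LOGGERS:
        logger = logging.getLogger(name)
        logger.setLevel(level)
        logger.addHandler(handler)
    return handler


def disable_transport_debug_logging(handler):
    """Undo enable_transport_debug_logging for the given handler."""
    for name in TRANSPORT_LOGGERS:
        logger = logging.getLogger(name)
        logger.removeHandler(handler)
        logger.setLevel(logging.NOTSET)
```

We would only enable this for a short window around 06:00 to 08:00 UTC, since the node logger at DEBUG may include request and response bodies.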

Any guidance on how to prove which of the following this is would be very helpful:

  • client connection reuse issue
  • Kubernetes node/network issue
  • DNS/TLS issue
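
To separate those cases ourselves, we are thinking of running a small probe from inside a pod during the affected window. The sketch below uses only the standard library and times DNS resolution, TCP connect and the TLS handshake separately for one fresh connection. The function name probe and the result keys are our own; it deliberately bypasses the Elasticsearch client, so it says nothing about pooled connection reuse.

```python
import socket
import ssl
import time


def probe(host, port=9243, timeout=5.0):
    """Time DNS, TCP connect and TLS handshake for one fresh connection.

    Returns a dict with the duration of each completed phase and, if a
    phase failed, its name under "failed_phase" plus the error text.
    """
    result = {"failed_phase": None, "error": None}
    phase = "dns"
    sock = None
    try:
        start = time.monotonic()
        family, socktype, proto, _, sockaddr = socket.getaddrinfo(
            host, port, type=socket.SOCK_STREAM
        )[0]
        result["dns_s"] = time.monotonic() - start

        phase = "tcp"
        start = time.monotonic()
        sock = socket.socket(family, socktype, proto)
        sock.settimeout(timeout)
        sock.connect(sockaddr)
        result["tcp_s"] = time.monotonic() - start

        phase = "tls"
        start = time.monotonic()
        context = ssl.create_default_context()
        # wrap_socket performs the handshake before returning.
        sock = context.wrap_socket(sock, server_hostname=host)
        result["tls_s"] = time.monotonic() - start
    except OSError as exc:
        # gaierror, timeouts and ssl.SSLError are all OSError subclasses.
        result["failed_phase"] = phase
        result["error"] = repr(exc)
    finally:
        if sock is not None:
            sock.close()
    return result
```

If the probe, run in a loop every few seconds, stays fast while the application still times out, that would point toward reuse of pooled connections. If the probe itself stalls in one phase, that phase would be the place to look.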

Thanks.

This is our ES-DSL setup below:

# ELASTICSEARCH
elasticsearch_url = "REDACTED"
elasticsearch_port = "9243"

ELASTICSEARCH_DSL = {
    "default": {
        "hosts": f"https://elastic:REDACTED@{elasticsearch_url}:{elasticsearch_port}",
        "timeout": 60,  # Custom timeout
    },
}

# Used to specify which delimiter to use for attribute paths
ATTRIBUTES_PATH_DELIMITER = "_"
TALENT_POOL_PATH_DELIMITER = "_"

# Default document settings
ELASTICSEARCH_DSL_INDEX_SETTINGS = {
    "number_of_shards": 1,
    "analysis": {
        "analyzer": {
            "path_analyzer": {
                "tokenizer": "path_tokenizer",
            },
        },
        "tokenizer": {
            "path_tokenizer": {
                "type": "path_hierarchy",
                "delimiter": ATTRIBUTES_PATH_DELIMITER,
                "reverse": "true",
            },
        },
    },
}

# Set to False to globally disable auto-syncing.
ELASTICSEARCH_DSL_AUTOSYNC = os.environ.get("ELASTICSEARCH_DSL_AUTOSYNC", "True").lower() in ("true", "1")

# Set to False to not force an index refresh with every save.
ELASTICSEARCH_DSL_AUTO_REFRESH = os.environ.get("ELASTICSEARCH_DSL_AUTO_REFRESH", "True").lower() in ("true", "1")

# Class used to handle Django’s signals and keep the search index up-to-date.
ELASTICSEARCH_DSL_SIGNAL_PROCESSOR = os.environ.get(
    "ELASTICSEARCH_DSL_SIGNAL_PROCESSOR",
    "django_elasticsearch_dsl.signals.CelerySignalProcessor",
)

# Run indexing (populate and rebuild) in parallel using ES’ parallel_bulk() method.
ELASTICSEARCH_DSL_PARALLEL = False

And our requirements.txt for ES packages:


# Elasticsearch
# For Elasticsearch 7.0 and later, use the major version 7 (7.x.y) of the library
elasticsearch-dsl==8.17.1

# django-elasticsearch-dsl==8.0.0, we needed to use a custom version of this package to handle SoftDeleteManager
# together with CeleryProcessor
#django-elasticsearch-dsl-drf==0.22.5
# unmaintained elasticsearch causes issues
git+https://github.com/millerf/django-elasticsearch-dsl-drf@fix/aggs-proxy-import-error
elastic-apm==6.26.1
git+https://github.com/mojob/mojob-django-elasticsearch-dsl.git@8.0.9#egg=django-elasticsearch-dsl