Hi,
We are seeing intermittent elastic_transport.ConnectionTimeout: Connection timed out errors from our Django application to Elastic Cloud.
Environment:
- Django app running on Kubernetes (EKS)
- uWSGI, 1 process per pod
- Elasticsearch client libraries:
  - elasticsearch-dsl==8.17.1
  - django-elasticsearch-dsl (custom fork)
  - elastic-apm==6.26.1
- Elastic Cloud in eu-west-1 (v8.17.3)
- App also sends APM data to Elastic Cloud
Symptom:
- Timeouts occur at seemingly random times, most often in the morning between 06:00 and 08:00 UTC
- The same endpoint/request usually succeeds if retried a few seconds later
- The timeout is raised in the app as:
elastic_transport.ConnectionTimeout: Connection timed out
Example failing request:
GET /api/v2.7/units/2418/global-search/?search_query_string=store%20t&page_size=50&page=1
Example app log:
- The request failed after about 60 seconds with HTTP 500
- A retry a few seconds later succeeded in milliseconds
Example logs around the same time:
- Elasticsearch request timeout:
  ConnectionTimeout: Connection timed out
- APM send failure:
  Unable to reach APM Server: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
Important detail:
Elastic support checked the cluster and said they only see normal 200 responses, no matching server-side errors/rejections/499s, and cluster metrics look healthy.
That is why this currently looks more like a client/network/path issue than an Elasticsearch query-performance issue.
App/client details:
- Current ES client timeout is 60 seconds
- We are using the shared elasticsearch-dsl connection in long-running Django/uWSGI workers
- Some helper code also creates direct Elasticsearch(...) clients
- No explicit retry_on_timeout/max_retries configured yet
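For reference, the retry settings we have not configured yet would look roughly like the sketch below. It assumes that our django-elasticsearch-dsl fork passes each key of the connection dict straight through to the Elasticsearch client constructor, and the numeric values are placeholders picked for illustration, not values we have tested.

```python
# Hypothetical variant of our ELASTICSEARCH_DSL setting.
# The values below are untested placeholders, not recommendations.
ELASTICSEARCH_DSL = {
    "default": {
        "hosts": "https://REDACTED:9243",
        # Fail faster instead of holding a uWSGI worker for 60 seconds.
        "request_timeout": 10,
        # Retry a timed-out request instead of surfacing it as HTTP 500.
        "retry_on_timeout": True,
        "max_retries": 3,
    },
}
```

Our concern with this is that retrying on timeout looks safe for searches, which are idempotent, but could duplicate non-idempotent writes if the first attempt actually reached the cluster.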
Infrastructure details:
- App runs on EKS worker nodes
- Current workloads are scheduled on spot nodes
- No obvious Elasticsearch cluster health issue seen from the Elastic side
- Failures are transient and the same request works immediately afterwards
Questions:
- Does this pattern usually point to stale pooled HTTP connections / keep-alive reuse on the client side?
- Have others seen this kind of behavior with Django/uWSGI long-running workers talking to Elastic Cloud?
- Are there recommended elastic_transport debug logs to enable temporarily to distinguish:
  - connect timeout
  - TLS handshake issue
  - stale reused socket
  - upstream network idle timeout / load balancer idle timeout
- Are there recommended client settings for Elastic Cloud in this situation:
  - request_timeout
  - retry_on_timeout
  - max_retries
  - connection pool tuning
- Is there anything on the Elastic Cloud side that could produce this symptom without showing obvious request failures in server logs?
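To frame the logging question, this is the kind of temporary setup we have in mind, using only the standard logging module. The logger names elastic_transport and urllib3 are our assumption about where the relevant messages come from (the second one only matters if the client uses a urllib3-based node class), and both function names are made up for this sketch.

```python
import logging

# Assumed logger names: the parent loggers of elastic_transport and urllib3.
TRANSPORT_LOGGERS = ("elastic_transport", "urllib3")


def enable_transport_debug_logging(level=logging.DEBUG):
    """Temporarily raise log verbosity for the HTTP transport layers."""
    handler = logging.StreamHandler()
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )
    for name in TRANSPORT_LOGGERS:
        logger = logging.getLogger(name)
        logger.setLevel(level)
        logger.addHandler(handler)
    return handler


def disable_transport_debug_logging(handler):
    """Undo enable_transport_debug_logging() once enough data is captured."""
    for name in TRANSPORT_LOGGERS:
        logger = logging.getLogger(name)
        logger.removeHandler(handler)
        logger.setLevel(logging.NOTSET)
```

The idea would be to enable this in one pod only during the 06:00 to 08:00 UTC window and check whether a failing request opened a new connection or reused an existing one. If there is a better or more targeted set of loggers, we would like to know.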
Any guidance on how to prove whether this is:
- client connection reuse issue
- Kubernetes node/network issue
- DNS/TLS issue
would be very helpful.
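One thing we are considering for the DNS/TLS part is a small standard-library probe, run from inside a pod, that times DNS resolution, TCP connect and TLS handshake as separate phases. This is only a sketch: the function names are hypothetical, the default port matches our setup, and it always opens a fresh connection, so it cannot reproduce a stale reused socket.

```python
import socket
import ssl
import time


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds)."""
    start = time.monotonic()
    result = fn(*args, **kwargs)
    return result, time.monotonic() - start


def probe(host, port=9243, timeout=10.0):
    """Time DNS lookup, TCP connect and TLS handshake separately."""
    timings = {}
    # Phase 1: DNS resolution only.
    infos, timings["dns"] = timed(
        socket.getaddrinfo, host, port, type=socket.SOCK_STREAM
    )
    address = infos[0][4][:2]  # (ip, port) of the first resolved address
    # Phase 2: TCP connect to the resolved IP, so DNS is not counted again.
    sock, timings["tcp_connect"] = timed(
        socket.create_connection, address, timeout=timeout
    )
    try:
        # Phase 3: TLS handshake, verifying the certificate against `host`.
        context = ssl.create_default_context()
        tls_sock, timings["tls_handshake"] = timed(
            context.wrap_socket, sock, server_hostname=host
        )
        tls_sock.close()
    except Exception:
        sock.close()
        raise
    return timings
```

Calling probe() with our Elastic Cloud hostname every few seconds during the problem window and logging the returned dict should show whether any single phase occasionally takes tens of seconds. If all three phases stay fast while the application still times out, that would point more towards connection reuse in the client.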
Thanks.
This is our ES-DSL setup below:
# ELASTICSEARCH
elasticsearch_url = "REDACTED"
elasticsearch_port = "9243"
ELASTICSEARCH_DSL = {
    "default": {
        "hosts": f"https://elastic:REDACTED@{elasticsearch_url}:{elasticsearch_port}",
        "timeout": 60,  # Custom timeout
    },
}
# Used to specify which delimiter to use for attribute paths
ATTRIBUTES_PATH_DELIMITER = "_"
TALENT_POOL_PATH_DELIMITER = "_"
# Default document settings
ELASTICSEARCH_DSL_INDEX_SETTINGS = {
    "number_of_shards": 1,
    "analysis": {
        "analyzer": {
            "path_analyzer": {
                "tokenizer": "path_tokenizer",
            },
        },
        "tokenizer": {
            "path_tokenizer": {
                "type": "path_hierarchy",
                "delimiter": ATTRIBUTES_PATH_DELIMITER,
                "reverse": "true",
            },
        },
    },
}
# Set to False to globally disable auto-syncing.
ELASTICSEARCH_DSL_AUTOSYNC = os.environ.get("ELASTICSEARCH_DSL_AUTOSYNC", "True").lower() in ("true", "1")
# Set to False to not force an index refresh with every save.
ELASTICSEARCH_DSL_AUTO_REFRESH = os.environ.get("ELASTICSEARCH_DSL_AUTO_REFRESH", "True").lower() in ("true", "1")
# Class used to handle Django’s signals and keep the search index up-to-date.
ELASTICSEARCH_DSL_SIGNAL_PROCESSOR = os.environ.get(
    "ELASTICSEARCH_DSL_SIGNAL_PROCESSOR",
    "django_elasticsearch_dsl.signals.CelerySignalProcessor",
)
# Run indexing (populate and rebuild) in parallel using ES’ parallel_bulk() method.
ELASTICSEARCH_DSL_PARALLEL = False
And our requirements.txt for ES packages:
# Elasticsearch
# For Elasticsearch 7.0 and later, use the major version 7 (7.x.y) of the library
elasticsearch-dsl==8.17.1
# django-elasticsearch-dsl==8.0.0, we needed to use a custom version of this package to handle SoftDeleteManager
# together with CeleryProcessor
#django-elasticsearch-dsl-drf==0.22.5
# unmaintained elasticsearch causes issues
git+https://github.com/millerf/django-elasticsearch-dsl-drf@fix/aggs-proxy-import-error
elastic-apm==6.26.1
git+https://github.com/mojob/mojob-django-elasticsearch-dsl.git@8.0.9#egg=django-elasticsearch-dsl
