Hi,
tl;dr: Our Elasticsearch ingest nodes exhibit frequent readiness probe failures, e.g. {"timestamp": "2021-04-19T09:58:01+00:00", "message": "readiness probe failed", "curl_rc": "28"}. We would like to understand why this happens.
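For context, curl exit code 28 means the request timed out. As far as we can tell, the ECK readiness probe boils down to roughly the sketch below (simplified; the real script inside the Pod also handles authentication and reads its timeout from READINESS_PROBE_TIMEOUT, which defaults to 3 seconds):

# Simplified sketch of what the readiness probe does on every check:
curl --output /dev/null --silent --insecure --max-time 3 "https://127.0.0.1:9200/"
rc=$?
if [ "$rc" -ne 0 ]; then
  # curl exit code 28 = operation timed out
  echo "{\"message\": \"readiness probe failed\", \"curl_rc\": \"$rc\"}"
  exit 1
fi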
Our use case is purely log ingestion with the ELKB stack. The setup consists of a 7.12.0 Elasticsearch cluster orchestrated by ECK 1.5.0 on Kubernetes 1.18.9 on AWS. The cluster consists of the following nodeSets:
- master-{a,b,c}: 1 nodeSet per availability zone, 1 node per nodeSet, EC2 instance type c5.4xlarge
- data-hot-{a,b,c}: 1 nodeSet per AZ, 10 nodes per nodeSet, EC2 instance type i3en.2xlarge
- coordinating-{a,b,c}: 1 nodeSet per AZ, 1 node per nodeSet, EC2 instance type m5.2xlarge
- ingest-{a,b,c}: 1 nodeSet per AZ, 1 node per nodeSet, EC2 instance type m5.2xlarge
Or in detail: Elasticsearch on ECK setup · GitHub
We have a Logstash 7.11.1 "indexer" that feeds this cluster from Kafka via its elasticsearch output. That output is configured to send data to a logging-prod-es-ingest Service that is not managed by ECK (see above) and targets only the ingest nodes; everything else connects to the coordinating nodes via the original Service.
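For reference, the ingest Service is roughly the following (abridged sketch; the exact manifest is in the gist above, and the selector shown here relies on ECK's per-role Pod labels):

# Hand-rolled Service (not managed by ECK) that Logstash points at;
# abridged -- see the gist for the exact manifest we actually run:
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: logging-prod-es-ingest
spec:
  selector:
    elasticsearch.k8s.elastic.co/cluster-name: logging-prod
    elasticsearch.k8s.elastic.co/node-ingest: "true"
  ports:
  - name: https
    port: 9200
    targetPort: 9200
EOF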
Our normal production indexing rate is ~40-60k documents per second (including index replicas), sent in by 6 replicas of the aforementioned Logstash indexer. During our performance tests we reached a peak indexing throughput of ~200k DPS, so we know the current setup is capable of much more.
Under such indexing load, the ingest nodes exhibit very frequent readiness probe failures, up to 72 times per hour:
$ kubectl logs logging-prod-es-ingest-a-0 --tail 10000 | grep "readiness probe failed" | grep "2021-04-19T11" | wc -l
72
Or, in more detail:
{"timestamp": "2021-04-19T11:54:11+00:00", "message": "readiness probe failed", "curl_rc": "28"}
{"timestamp": "2021-04-19T11:54:16+00:00", "message": "readiness probe failed", "curl_rc": "28"}
{"timestamp": "2021-04-19T11:55:41+00:00", "message": "readiness probe failed", "curl_rc": "28"}
{"timestamp": "2021-04-19T11:58:51+00:00", "message": "readiness probe failed", "curl_rc": "28"}
{"timestamp": "2021-04-19T11:58:56+00:00", "message": "readiness probe failed", "curl_rc": "28"}
{"timestamp": "2021-04-19T11:59:31+00:00", "message": "readiness probe failed", "curl_rc": "28"}
{"timestamp": "2021-04-19T12:00:41+00:00", "message": "readiness probe failed", "curl_rc": "28"}
{"timestamp": "2021-04-19T12:02:21+00:00", "message": "readiness probe failed", "curl_rc": "28"}
{"timestamp": "2021-04-19T12:02:36+00:00", "message": "readiness probe failed", "curl_rc": "28"}
{"timestamp": "2021-04-19T12:03:11+00:00", "message": "readiness probe failed", "curl_rc": "28"}
{"timestamp": "2021-04-19T12:03:16+00:00", "message": "readiness probe failed", "curl_rc": "28"}
{"timestamp": "2021-04-19T12:04:06+00:00", "message": "readiness probe failed", "curl_rc": "28"}
{"timestamp": "2021-04-19T12:06:56+00:00", "message": "readiness probe failed", "curl_rc": "28"}
{"timestamp": "2021-04-19T12:07:36+00:00", "message": "readiness probe failed", "curl_rc": "28"}
{"timestamp": "2021-04-19T12:07:46+00:00", "message": "readiness probe failed", "curl_rc": "28"}
{"timestamp": "2021-04-19T12:07:51+00:00", "message": "readiness probe failed", "curl_rc": "28"}
{"timestamp": "2021-04-19T12:08:01+00:00", "message": "readiness probe failed", "curl_rc": "28"}
{"timestamp": "2021-04-19T12:09:21+00:00", "message": "readiness probe failed", "curl_rc": "28"}
{"timestamp": "2021-04-19T12:09:56+00:00", "message": "readiness probe failed", "curl_rc": "28"}
{"timestamp": "2021-04-19T12:10:06+00:00", "message": "readiness probe failed", "curl_rc": "28"}
{"timestamp": "2021-04-19T12:10:36+00:00", "message": "readiness probe failed", "curl_rc": "28"}
{"timestamp": "2021-04-19T12:12:01+00:00", "message": "readiness probe failed", "curl_rc": "28"}
{"timestamp": "2021-04-19T12:12:11+00:00", "message": "readiness probe failed", "curl_rc": "28"}
{"timestamp": "2021-04-19T12:12:26+00:00", "message": "readiness probe failed", "curl_rc": "28"}
{"timestamp": "2021-04-19T12:14:16+00:00", "message": "readiness probe failed", "curl_rc": "28"}
{"timestamp": "2021-04-19T12:14:21+00:00", "message": "readiness probe failed", "curl_rc": "28"}
{"timestamp": "2021-04-19T12:17:51+00:00", "message": "readiness probe failed", "curl_rc": "28"}
{"timestamp": "2021-04-19T12:18:51+00:00", "message": "readiness probe failed", "curl_rc": "28"}
{"timestamp": "2021-04-19T12:18:56+00:00", "message": "readiness probe failed", "curl_rc": "28"}
{"timestamp": "2021-04-19T12:19:31+00:00", "message": "readiness probe failed", "curl_rc": "28"}
{"timestamp": "2021-04-19T12:20:11+00:00", "message": "readiness probe failed", "curl_rc": "28"}
{"type": "server", "timestamp": "2021-04-19T12:20:17,021Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "logging-prod", "node.name": "logging-prod-es-ingest-a-0", "message": "sending transport message [MessageSerializer{Request{indices:data/write/bulk[s]}{325491120}{false}{false}{false}}] of size [34966] on [Netty4TcpChannel{localAddress=/10.4.11.132:36922, remoteAddress=10.4.78.211/10.4.78.211:9300, profile=default}] took [5003ms] which is above the warn threshold of [5000ms]", "cluster.uuid": "csGGHrw0SEeSHOuyZtrr2Q", "node.id": "wdc_q_8dQkasmjEdlOajiw" }
{"type": "server", "timestamp": "2021-04-19T12:20:17,022Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "logging-prod", "node.name": "logging-prod-es-ingest-a-0", "message": "sending transport message [MessageSerializer{Request{indices:data/write/bulk[s]}{325491136}{false}{false}{false}}] of size [1650] on [Netty4TcpChannel{localAddress=/10.4.11.132:36922, remoteAddress=10.4.78.211/10.4.78.211:9300, profile=default}] took [5003ms] which is above the warn threshold of [5000ms]", "cluster.uuid": "csGGHrw0SEeSHOuyZtrr2Q", "node.id": "wdc_q_8dQkasmjEdlOajiw" }
{"type": "server", "timestamp": "2021-04-19T12:20:17,022Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "logging-prod", "node.name": "logging-prod-es-ingest-a-0", "message": "sending transport message [MessageSerializer{Request{indices:data/write/bulk[s]}{325491195}{false}{false}{false}}] of size [74151] on [Netty4TcpChannel{localAddress=/10.4.11.132:36922, remoteAddress=10.4.78.211/10.4.78.211:9300, profile=default}] took [5003ms] which is above the warn threshold of [5000ms]", "cluster.uuid": "csGGHrw0SEeSHOuyZtrr2Q", "node.id": "wdc_q_8dQkasmjEdlOajiw" }
{"type": "server", "timestamp": "2021-04-19T12:20:17,022Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "logging-prod", "node.name": "logging-prod-es-ingest-a-0", "message": "sending transport message [MessageSerializer{Request{indices:data/write/bulk[s]}{325491240}{false}{false}{false}}] of size [8272] on [Netty4TcpChannel{localAddress=/10.4.11.132:36922, remoteAddress=10.4.78.211/10.4.78.211:9300, profile=default}] took [5003ms] which is above the warn threshold of [5000ms]", "cluster.uuid": "csGGHrw0SEeSHOuyZtrr2Q", "node.id": "wdc_q_8dQkasmjEdlOajiw" }
{"type": "server", "timestamp": "2021-04-19T12:20:17,024Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "logging-prod", "node.name": "logging-prod-es-ingest-a-0", "message": "sending transport message [MessageSerializer{Request{indices:data/write/bulk[s]}{325491119}{false}{false}{false}}] of size [14220] on [Netty4TcpChannel{localAddress=/10.4.11.132:58278, remoteAddress=10.4.3.191/10.4.3.191:9300, profile=default}] took [5003ms] which is above the warn threshold of [5000ms]", "cluster.uuid": "csGGHrw0SEeSHOuyZtrr2Q", "node.id": "wdc_q_8dQkasmjEdlOajiw" }
{"type": "server", "timestamp": "2021-04-19T12:20:17,024Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "logging-prod", "node.name": "logging-prod-es-ingest-a-0", "message": "sending transport message [MessageSerializer{Request{indices:data/write/bulk[s]}{325491142}{false}{false}{false}}] of size [4930] on [Netty4TcpChannel{localAddress=/10.4.11.132:58278, remoteAddress=10.4.3.191/10.4.3.191:9300, profile=default}] took [5003ms] which is above the warn threshold of [5000ms]", "cluster.uuid": "csGGHrw0SEeSHOuyZtrr2Q", "node.id": "wdc_q_8dQkasmjEdlOajiw" }
{"type": "server", "timestamp": "2021-04-19T12:20:17,024Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "logging-prod", "node.name": "logging-prod-es-ingest-a-0", "message": "sending transport message [MessageSerializer{Request{indices:data/write/bulk[s]}{325491148}{false}{false}{false}}] of size [14673] on [Netty4TcpChannel{localAddress=/10.4.11.132:58278, remoteAddress=10.4.3.191/10.4.3.191:9300, profile=default}] took [5003ms] which is above the warn threshold of [5000ms]", "cluster.uuid": "csGGHrw0SEeSHOuyZtrr2Q", "node.id": "wdc_q_8dQkasmjEdlOajiw" }
{"type": "server", "timestamp": "2021-04-19T12:20:17,024Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "logging-prod", "node.name": "logging-prod-es-ingest-a-0", "message": "sending transport message [MessageSerializer{Request{indices:data/write/bulk[s]}{325491211}{false}{false}{false}}] of size [12301] on [Netty4TcpChannel{localAddress=/10.4.11.132:58278, remoteAddress=10.4.3.191/10.4.3.191:9300, profile=default}] took [5003ms] which is above the warn threshold of [5000ms]", "cluster.uuid": "csGGHrw0SEeSHOuyZtrr2Q", "node.id": "wdc_q_8dQkasmjEdlOajiw" }
{"type": "server", "timestamp": "2021-04-19T12:20:17,029Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "logging-prod", "node.name": "logging-prod-es-ingest-a-0", "message": "sending transport message [MessageSerializer{Request{indices:data/write/bulk[s]}{325491089}{false}{false}{false}}] of size [21716] on [Netty4TcpChannel{localAddress=/10.4.11.132:36408, remoteAddress=10.4.73.130/10.4.73.130:9300, profile=default}] took [5003ms] which is above the warn threshold of [5000ms]", "cluster.uuid": "csGGHrw0SEeSHOuyZtrr2Q", "node.id": "wdc_q_8dQkasmjEdlOajiw" }
{"type": "server", "timestamp": "2021-04-19T12:20:17,030Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "logging-prod", "node.name": "logging-prod-es-ingest-a-0", "message": "sending transport message [MessageSerializer{Request{indices:data/write/bulk[s]}{325491101}{false}{false}{false}}] of size [32805] on [Netty4TcpChannel{localAddress=/10.4.11.132:36408, remoteAddress=10.4.73.130/10.4.73.130:9300, profile=default}] took [5003ms] which is above the warn threshold of [5000ms]", "cluster.uuid": "csGGHrw0SEeSHOuyZtrr2Q", "node.id": "wdc_q_8dQkasmjEdlOajiw" }
{"type": "server", "timestamp": "2021-04-19T12:20:17,030Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "logging-prod", "node.name": "logging-prod-es-ingest-a-0", "message": "sending transport message [MessageSerializer{Request{indices:data/write/bulk[s]}{325491149}{false}{false}{false}}] of size [39488] on [Netty4TcpChannel{localAddress=/10.4.11.132:36408, remoteAddress=10.4.73.130/10.4.73.130:9300, profile=default}] took [5003ms] which is above the warn threshold of [5000ms]", "cluster.uuid": "csGGHrw0SEeSHOuyZtrr2Q", "node.id": "wdc_q_8dQkasmjEdlOajiw" }
{"timestamp": "2021-04-19T12:21:41+00:00", "message": "readiness probe failed", "curl_rc": "28"}
During this time, the ingest nodes use between 2 and 5 CPU cores out of the 7 requested from Kubernetes (8 are available on the worker node).
Looking at a node with VisualVM, maximum GC activity is around 0.3% with G1GC, and VisualVM also indicates that threads are not being blocked.
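The per-Pod CPU numbers can be sampled with something like the following (assumes metrics-server is installed; the label selector uses ECK's cluster-name Pod label):

# Show current CPU/memory usage of the ingest containers (requires metrics-server);
# CPU(cores) can be compared against the 7-core container request:
kubectl top pod --containers \
  -l elasticsearch.k8s.elastic.co/cluster-name=logging-prod | grep ingest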
We would like to understand why Elasticsearch is so often unable to respond to the readiness probe (a simple request to /) within the 3 second timeout. If possible, we would like to avoid raising the probe timeout, as we believe that would only hide the underlying problem. We have tried attaching the Elastic APM Java agent to the ES JVM, but that did not work (and is possibly not supported?), and we have lowered the log level for org.elasticsearch to DEBUG, but that did not reveal any smoking guns either.
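For reference, the logger change was along these lines, via the dynamic cluster settings API (host and credentials below are placeholders for our actual values):

# Bump the org.elasticsearch logger to DEBUG at runtime (transient setting,
# does not survive a full cluster restart); endpoint name assumes ECK's
# default <cluster>-es-http Service:
curl -k -u "elastic:${ELASTIC_PASSWORD}" \
  -H 'Content-Type: application/json' \
  -X PUT "https://logging-prod-es-http:9200/_cluster/settings" \
  -d '{"transient": {"logger.org.elasticsearch": "DEBUG"}}'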
Any suggestions on how to debug further or gather more information are welcome.
Thank you.
Cheers,
János