Elasticsearch readiness probe failures

Hi,

tl;dr: Elasticsearch ingest nodes are exhibiting frequent readiness probe failures e.g. {"timestamp": "2021-04-19T09:58:01+00:00", "message": "readiness probe failed", "curl_rc": "28"}. We would like to understand why this happens?

Our usecase is purely log ingestion with the ELKB stack, the setup consists of a 7.12.0 ES cluster orchestrated by ECK 1.5.0 on K8S 1.18.9 on AWS. The ES cluster consists of the following nodeSets:

  • master-{a,b,c}
    • 1 nodeSet per availability zone, 1 node per nodeSet
    • EC2 c5.4xlarge type
  • data-hot-{a,b,c}
    • 1 nodeSet per AZ, 10 nodes per nodeSet
    • EC2 i3en.2xlarge type
  • coordinating-{a,b,c}
    • 1 nodeSet per availability zone, 1 node per nodeSet
    • EC2 m5.2xlarge type
  • ingest-{a,b,c}
    • 1 nodeSet per availability zone, 1 node per nodeSet
    • EC2 m5.2xlarge type

Or in detail: Elasticsearch on ECK setup · GitHub

We have a logstash 7.11.1 "indexer" that feeds this cluster from kafka via the elasticsearch output. The ES output is configured to send data to a non-ECK configured logging-prod-es-ingest Service (see above) that connects to the ingest nodes (everything else connects to the coordinating nodes via the original Service).

Our normal production indexing flow is ~40-60k (including index replicas), sent in by 6 replicas of the aforementioned indexer (logstash). During our performance tests we reached a peak indexing throughput of ~200k DPS, so we know the current setup is capable of much more.

Under such indexing load, the ingest nodes exhibit very frequent readiness probe failures, up to 72 times per hour:


$ kubectl logs logging-prod-es-ingest-a-0 --tail 10000 | grep "readiness probe failed" | grep "2021-04-19T11" | wc -l

72

Or more verbose:


{"timestamp": "2021-04-19T11:54:11+00:00", "message": "readiness probe failed", "curl_rc": "28"}

{"timestamp": "2021-04-19T11:54:16+00:00", "message": "readiness probe failed", "curl_rc": "28"}

{"timestamp": "2021-04-19T11:55:41+00:00", "message": "readiness probe failed", "curl_rc": "28"}

{"timestamp": "2021-04-19T11:58:51+00:00", "message": "readiness probe failed", "curl_rc": "28"}

{"timestamp": "2021-04-19T11:58:56+00:00", "message": "readiness probe failed", "curl_rc": "28"}

{"timestamp": "2021-04-19T11:59:31+00:00", "message": "readiness probe failed", "curl_rc": "28"}

{"timestamp": "2021-04-19T12:00:41+00:00", "message": "readiness probe failed", "curl_rc": "28"}

{"timestamp": "2021-04-19T12:02:21+00:00", "message": "readiness probe failed", "curl_rc": "28"}

{"timestamp": "2021-04-19T12:02:36+00:00", "message": "readiness probe failed", "curl_rc": "28"}

{"timestamp": "2021-04-19T12:03:11+00:00", "message": "readiness probe failed", "curl_rc": "28"}

{"timestamp": "2021-04-19T12:03:16+00:00", "message": "readiness probe failed", "curl_rc": "28"}

{"timestamp": "2021-04-19T12:04:06+00:00", "message": "readiness probe failed", "curl_rc": "28"}

{"timestamp": "2021-04-19T12:06:56+00:00", "message": "readiness probe failed", "curl_rc": "28"}

{"timestamp": "2021-04-19T12:07:36+00:00", "message": "readiness probe failed", "curl_rc": "28"}

{"timestamp": "2021-04-19T12:07:46+00:00", "message": "readiness probe failed", "curl_rc": "28"}

{"timestamp": "2021-04-19T12:07:51+00:00", "message": "readiness probe failed", "curl_rc": "28"}

{"timestamp": "2021-04-19T12:08:01+00:00", "message": "readiness probe failed", "curl_rc": "28"}

{"timestamp": "2021-04-19T12:09:21+00:00", "message": "readiness probe failed", "curl_rc": "28"}

{"timestamp": "2021-04-19T12:09:56+00:00", "message": "readiness probe failed", "curl_rc": "28"}

{"timestamp": "2021-04-19T12:10:06+00:00", "message": "readiness probe failed", "curl_rc": "28"}

{"timestamp": "2021-04-19T12:10:36+00:00", "message": "readiness probe failed", "curl_rc": "28"}

{"timestamp": "2021-04-19T12:12:01+00:00", "message": "readiness probe failed", "curl_rc": "28"}

{"timestamp": "2021-04-19T12:12:11+00:00", "message": "readiness probe failed", "curl_rc": "28"}

{"timestamp": "2021-04-19T12:12:26+00:00", "message": "readiness probe failed", "curl_rc": "28"}

{"timestamp": "2021-04-19T12:14:16+00:00", "message": "readiness probe failed", "curl_rc": "28"}

{"timestamp": "2021-04-19T12:14:21+00:00", "message": "readiness probe failed", "curl_rc": "28"}

{"timestamp": "2021-04-19T12:17:51+00:00", "message": "readiness probe failed", "curl_rc": "28"}

{"timestamp": "2021-04-19T12:18:51+00:00", "message": "readiness probe failed", "curl_rc": "28"}

{"timestamp": "2021-04-19T12:18:56+00:00", "message": "readiness probe failed", "curl_rc": "28"}

{"timestamp": "2021-04-19T12:19:31+00:00", "message": "readiness probe failed", "curl_rc": "28"}

{"timestamp": "2021-04-19T12:20:11+00:00", "message": "readiness probe failed", "curl_rc": "28"}

{"type": "server", "timestamp": "2021-04-19T12:20:17,021Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "logging-prod", "node.name": "logging-prod-es-ingest-a-0", "message": "sending transport message [MessageSerializer{Request{indices:data/write/bulk[s]}{325491120}{false}{false}{false}}] of size [34966] on [Netty4TcpChannel{localAddress=/10.4.11.132:36922, remoteAddress=10.4.78.211/10.4.78.211:9300, profile=default}] took [5003ms] which is above the warn threshold of [5000ms]", "cluster.uuid": "csGGHrw0SEeSHOuyZtrr2Q", "node.id": "wdc_q_8dQkasmjEdlOajiw" }

{"type": "server", "timestamp": "2021-04-19T12:20:17,022Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "logging-prod", "node.name": "logging-prod-es-ingest-a-0", "message": "sending transport message [MessageSerializer{Request{indices:data/write/bulk[s]}{325491136}{false}{false}{false}}] of size [1650] on [Netty4TcpChannel{localAddress=/10.4.11.132:36922, remoteAddress=10.4.78.211/10.4.78.211:9300, profile=default}] took [5003ms] which is above the warn threshold of [5000ms]", "cluster.uuid": "csGGHrw0SEeSHOuyZtrr2Q", "node.id": "wdc_q_8dQkasmjEdlOajiw" }

{"type": "server", "timestamp": "2021-04-19T12:20:17,022Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "logging-prod", "node.name": "logging-prod-es-ingest-a-0", "message": "sending transport message [MessageSerializer{Request{indices:data/write/bulk[s]}{325491195}{false}{false}{false}}] of size [74151] on [Netty4TcpChannel{localAddress=/10.4.11.132:36922, remoteAddress=10.4.78.211/10.4.78.211:9300, profile=default}] took [5003ms] which is above the warn threshold of [5000ms]", "cluster.uuid": "csGGHrw0SEeSHOuyZtrr2Q", "node.id": "wdc_q_8dQkasmjEdlOajiw" }

{"type": "server", "timestamp": "2021-04-19T12:20:17,022Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "logging-prod", "node.name": "logging-prod-es-ingest-a-0", "message": "sending transport message [MessageSerializer{Request{indices:data/write/bulk[s]}{325491240}{false}{false}{false}}] of size [8272] on [Netty4TcpChannel{localAddress=/10.4.11.132:36922, remoteAddress=10.4.78.211/10.4.78.211:9300, profile=default}] took [5003ms] which is above the warn threshold of [5000ms]", "cluster.uuid": "csGGHrw0SEeSHOuyZtrr2Q", "node.id": "wdc_q_8dQkasmjEdlOajiw" }

{"type": "server", "timestamp": "2021-04-19T12:20:17,024Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "logging-prod", "node.name": "logging-prod-es-ingest-a-0", "message": "sending transport message [MessageSerializer{Request{indices:data/write/bulk[s]}{325491119}{false}{false}{false}}] of size [14220] on [Netty4TcpChannel{localAddress=/10.4.11.132:58278, remoteAddress=10.4.3.191/10.4.3.191:9300, profile=default}] took [5003ms] which is above the warn threshold of [5000ms]", "cluster.uuid": "csGGHrw0SEeSHOuyZtrr2Q", "node.id": "wdc_q_8dQkasmjEdlOajiw" }

{"type": "server", "timestamp": "2021-04-19T12:20:17,024Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "logging-prod", "node.name": "logging-prod-es-ingest-a-0", "message": "sending transport message [MessageSerializer{Request{indices:data/write/bulk[s]}{325491142}{false}{false}{false}}] of size [4930] on [Netty4TcpChannel{localAddress=/10.4.11.132:58278, remoteAddress=10.4.3.191/10.4.3.191:9300, profile=default}] took [5003ms] which is above the warn threshold of [5000ms]", "cluster.uuid": "csGGHrw0SEeSHOuyZtrr2Q", "node.id": "wdc_q_8dQkasmjEdlOajiw" }

{"type": "server", "timestamp": "2021-04-19T12:20:17,024Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "logging-prod", "node.name": "logging-prod-es-ingest-a-0", "message": "sending transport message [MessageSerializer{Request{indices:data/write/bulk[s]}{325491148}{false}{false}{false}}] of size [14673] on [Netty4TcpChannel{localAddress=/10.4.11.132:58278, remoteAddress=10.4.3.191/10.4.3.191:9300, profile=default}] took [5003ms] which is above the warn threshold of [5000ms]", "cluster.uuid": "csGGHrw0SEeSHOuyZtrr2Q", "node.id": "wdc_q_8dQkasmjEdlOajiw" }

{"type": "server", "timestamp": "2021-04-19T12:20:17,024Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "logging-prod", "node.name": "logging-prod-es-ingest-a-0", "message": "sending transport message [MessageSerializer{Request{indices:data/write/bulk[s]}{325491211}{false}{false}{false}}] of size [12301] on [Netty4TcpChannel{localAddress=/10.4.11.132:58278, remoteAddress=10.4.3.191/10.4.3.191:9300, profile=default}] took [5003ms] which is above the warn threshold of [5000ms]", "cluster.uuid": "csGGHrw0SEeSHOuyZtrr2Q", "node.id": "wdc_q_8dQkasmjEdlOajiw" }

{"type": "server", "timestamp": "2021-04-19T12:20:17,029Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "logging-prod", "node.name": "logging-prod-es-ingest-a-0", "message": "sending transport message [MessageSerializer{Request{indices:data/write/bulk[s]}{325491089}{false}{false}{false}}] of size [21716] on [Netty4TcpChannel{localAddress=/10.4.11.132:36408, remoteAddress=10.4.73.130/10.4.73.130:9300, profile=default}] took [5003ms] which is above the warn threshold of [5000ms]", "cluster.uuid": "csGGHrw0SEeSHOuyZtrr2Q", "node.id": "wdc_q_8dQkasmjEdlOajiw" }

{"type": "server", "timestamp": "2021-04-19T12:20:17,030Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "logging-prod", "node.name": "logging-prod-es-ingest-a-0", "message": "sending transport message [MessageSerializer{Request{indices:data/write/bulk[s]}{325491101}{false}{false}{false}}] of size [32805] on [Netty4TcpChannel{localAddress=/10.4.11.132:36408, remoteAddress=10.4.73.130/10.4.73.130:9300, profile=default}] took [5003ms] which is above the warn threshold of [5000ms]", "cluster.uuid": "csGGHrw0SEeSHOuyZtrr2Q", "node.id": "wdc_q_8dQkasmjEdlOajiw" }

{"type": "server", "timestamp": "2021-04-19T12:20:17,030Z", "level": "WARN", "component": "o.e.t.OutboundHandler", "cluster.name": "logging-prod", "node.name": "logging-prod-es-ingest-a-0", "message": "sending transport message [MessageSerializer{Request{indices:data/write/bulk[s]}{325491149}{false}{false}{false}}] of size [39488] on [Netty4TcpChannel{localAddress=/10.4.11.132:36408, remoteAddress=10.4.73.130/10.4.73.130:9300, profile=default}] took [5003ms] which is above the warn threshold of [5000ms]", "cluster.uuid": "csGGHrw0SEeSHOuyZtrr2Q", "node.id": "wdc_q_8dQkasmjEdlOajiw" }

{"timestamp": "2021-04-19T12:21:41+00:00", "message": "readiness probe failed", "curl_rc": "28"}

During this time, the ingest nodes are using between 2-5 cores out of 7 requested from k8s, 8 available on the worker node.

Taking a look at a node with VisualVM, the maximum GC usage is around 0.3% with G1GC. VisualVM also indicated that threads were not being blocked.

We would like to understand why Elasticsearch is unable to respond to the readiness probes (a simple request to /) within the 3 second timeout so often. If possible, we would like to avoid raising the probe timeout, as we believe that would only hide the underlying problem. We have tried attaching the Elastic APM java agent to the ES JVM, but that did not work (and possibly not supported?) and lowering log level for org.elasticsearch to DEBUG but that didn't reveal any smoking guns either.

Any suggestions on how to debug further or gather more information are welcome.

Thank you.

Cheers,

János

Hi @Janos_Csorvasi, thanks for your question and all the details you've already posted.

Couple additional questions/notes:

  1. Are you able to confirm whether the probe failures correlate with high indexing load? I understand this only happens on ingest nodes, so that's a clue. Is the number of failures increasing with the load?
  2. For indexing, how the numbers you've posted (40-60k and 200k) relate to number of API calls to Elasticsearch? I assume you are using bulk APIs to some degree. I wonder if failures could be correlated to the number of API calls instead of DPS being indexed.
  3. What are the response times and failure rates of API calls when probes are failing? In the logs you've posted I see some 5s time outs, so I assume these are not related to probes.
  4. Is the number of failures somewhat even between all ingest nodes?
  5. I don't think this is the culprit here, but it's worth saying that while the probe is a simple request (like you've noted) it does require authentication which requires querying an index.

Thanks,
David

Hi @dkow,

Thank you for taking a look at my question and reading through all that! :slight_smile:

Here are some answers/further questions:

  1. I would say so, yes. Without any indexing going through the ingest nodes, there are no probe failures. The failures only happen on ingest nodes only, indeed, there are none exhibited by the rest. The number of failures seems to be increasing with the incoming load, yes.
  2. For the number of API calls to Elasticsearch, how do we measure that? Do you have a concrete metric in mind? (We're using the prometheus exporter for gathering metrics). We use bulk requests exclusively when feeding the cluster with data. Logstash is configured with pipeline.batch.size: 6000 and pipeline.workers: 14 with 6 instances of LS. Regarding the number of API calls, from testing out different values for the aforementioned 2 settings, when we have lower batch sizes (presumably meaning more batches, more API calls) indexing throughput goes down considerably (from 60k to 20k with pipeline.batch.size: 500) and ingest node CPU usage is maxed out.
  3. We cannot answer this at the moment, as I don't believe we have metrics for these in either Logstash or Elasticsearch.
  4. They are more-or-less even, yes
  5. I think you're onto something here. We've done CPU sampling with VisualVM to try and see where most of the time is spent. We've also gotten the hot_threads output from the ingest nodes and we think authorization might be the culprit here.
$ curl -s -k "https://elastic:PW@localhost:9200/_nodes/logging-prod-es-ingest-jmx-a-0/hot_threads" | grep "cpu usage by thread" -A 3
   100.1% (500.6ms out of 500ms) cpu usage by thread 'elasticsearch[logging-prod-es-ingest-jmx-a-0][transport_worker][T#8]'
     10/10 snapshots sharing following 304 elements
       org.elasticsearch.xpack.security.authz.RBACEngine.loadAuthorizedIndices(RBACEngine.java:367)
       org.elasticsearch.xpack.security.authz.AuthorizationService.lambda$authorizeAction$5(AuthorizationService.java:286)
--
   100.1% (500.6ms out of 500ms) cpu usage by thread 'elasticsearch[logging-prod-es-ingest-jmx-a-0][transport_worker][T#3]'
     4/10 snapshots sharing following 305 elements
       org.elasticsearch.xpack.security.authz.RBACEngine.resolveAuthorizedIndicesFromRole(RBACEngine.java:536)
       org.elasticsearch.xpack.security.authz.RBACEngine.loadAuthorizedIndices(RBACEngine.java:367)
--
   100.1% (500.5ms out of 500ms) cpu usage by thread 'elasticsearch[logging-prod-es-ingest-jmx-a-0][transport_worker][T#2]'
     10/10 snapshots sharing following 304 elements
       org.elasticsearch.xpack.security.authz.RBACEngine.loadAuthorizedIndices(RBACEngine.java:367)
       org.elasticsearch.xpack.security.authz.AuthorizationService.lambda$authorizeAction$5(AuthorizationService.java:286)

Which lead us to Improve Authorization performance in clusters with a large number of indices · Issue #67987 · elastic/elasticsearch · GitHub and Stop Security's heavy authz and audit harming the cluster's stability · Issue #68004 · elastic/elasticsearch · GitHub which we believe is actually the root cause for us. Our cluster at the moment contains 2545 indices and 8324 active shards. Since a bulk request can contain up to 6000 documents, it is likely that each request hits hundreds or thousands of indices and Elasticsearch spends all that CPU time authorizing requests on the transport_worker threads and thus blocking other incoming requests.

(edit for clarification: the issue seems to significantly limit indexing throughput in this setup) This kind of makes x-pack security unusable for large clusters (our current v5 production cluster is several times larger than this, we're looking at upgrading now). Do you know if there is traction on the above GitHub issues? Also, do you know if it's possible to enable TLS in Elasticsearch without authorization?

Thank you very much.

Cheers,

János

Hi @Janos_Csorvasi, I'm glad you were able to advance your investigation, it does look like this could be the culprit.

Do you know if there is traction on the above GitHub issues?

I don't :slight_smile: I think it would be the best to ask the issue author.

Also, do you know if it's possible to enable TLS in Elasticsearch without authorization?

Would you be looking for something like this? But that might be a question suited better for Elasticsearch part of the forum.

Going back to your original concern - if you feel you understand the underlying issue well enough, maybe you could consider temporarily raising your timeout for the probe? I understand that right now it causes issues with Pods falling out of logging-prod-es-ingest endpoints, so this could allow to improve the stability of that Service while you work on the performance issues.

Thanks,
David

Hi @dkow,

I did inquire in parallel in the GitHub issue, no response so far. :frowning:

Would you be looking for something like this ?

We did consider that, but did not end up trying it. Our reasoning was that with anonymous access, ES still ties a role to the anonymous user and thus authorization is still performed (only authentication is bypassed).

We ended up destroying this cluster in question and rebuilt it with x-pack security disabled for now, so we can move forward. (Thank you ECK for making this a super quick and painless process! :heart: ) With a similar infrastructure we were able to again reach a peak of 215k DPS ingestion when backfilling the cluster with data, but this time the CPU usage on the ingest nodes never went above 1 core per node (previously up to 5 cores per node used with a peak of 60k DPS ingestion). With security disabled, there have been 0 readiness probe failures on the ingest nodes (with the timeout untouched).

Could we somehow escalate this issue? I imagine this is present in most medium-to-large size Elasticsearch installations and either presents itself as cluster instability or wasted resources. (I understand the relation to ECK is only due to the increased visibility provided by the probe and k8s itself).

Thank you.

Cheers,

János