Hi @dkow,
Thank you for taking a look at my question and reading through all that! 
Here are some answers/further questions:
- I would say so, yes. Without any indexing going through the ingest nodes, there are no probe failures; the failures occur on the ingest nodes only, and none are exhibited by the rest of the cluster. The number of failures does seem to increase with the incoming load.
- For the number of API calls to Elasticsearch, how do we measure that? Do you have a concrete metric in mind? (We're using the Prometheus exporter for gathering metrics.) We use bulk requests exclusively when feeding the cluster with data. Logstash is configured with `pipeline.batch.size: 6000` and `pipeline.workers: 14`, with 6 instances of Logstash. Regarding the number of API calls: when testing different values for those two settings, lower batch sizes (presumably meaning more batches, hence more API calls) reduce indexing throughput considerably (from 60k to 20k docs/s with `pipeline.batch.size: 500`) while ingest node CPU usage is maxed out.
- We cannot answer this at the moment, as I don't believe we have metrics for these in either Logstash or Elasticsearch.
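On measuring API call volume: lacking a dedicated exporter metric, one rough approach (a sketch; the host, credentials, and the 60 s interval are assumptions, and you'd need to add auth/TLS handling) is to sample `_nodes/stats/http` twice and diff the `total_opened` counter per node. Note that this counts opened HTTP connections rather than individual requests, so with keep-alive it is only a lower bound, but the trend still tracks load:

```python
import json
import urllib.request

# Hypothetical endpoint; real usage needs credentials and a TLS context.
NODES_STATS_URL = "https://localhost:9200/_nodes/stats/http"

def fetch_http_counters(url: str) -> dict:
    """Return {node_name: total_opened} from the _nodes/stats/http API."""
    with urllib.request.urlopen(url) as resp:
        stats = json.load(resp)
    return {
        node["name"]: node["http"]["total_opened"]
        for node in stats["nodes"].values()
    }

def per_second_rate(before: dict, after: dict, interval_s: float) -> dict:
    """Diff two counter samples taken interval_s seconds apart."""
    return {
        name: (after[name] - before[name]) / interval_s
        for name in before
        if name in after
    }

# Example with two hypothetical samples taken 60 s apart:
before = {"ingest-0": 120_000, "ingest-1": 118_500}
after = {"ingest-0": 126_000, "ingest-1": 124_500}
print(per_second_rate(before, after, 60))  # {'ingest-0': 100.0, 'ingest-1': 100.0}
```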
- They are more-or-less even, yes.
- I think you're onto something here. We've done CPU sampling with VisualVM to try to see where most of the time is spent. We've also gotten the `hot_threads` output from the ingest nodes, and we think authorization might be the culprit:
```
$ curl -s -k "https://elastic:PW@localhost:9200/_nodes/logging-prod-es-ingest-jmx-a-0/hot_threads" | grep "cpu usage by thread" -A 3
   100.1% (500.6ms out of 500ms) cpu usage by thread 'elasticsearch[logging-prod-es-ingest-jmx-a-0][transport_worker][T#8]'
     10/10 snapshots sharing following 304 elements
       org.elasticsearch.xpack.security.authz.RBACEngine.loadAuthorizedIndices(RBACEngine.java:367)
       org.elasticsearch.xpack.security.authz.AuthorizationService.lambda$authorizeAction$5(AuthorizationService.java:286)
--
   100.1% (500.6ms out of 500ms) cpu usage by thread 'elasticsearch[logging-prod-es-ingest-jmx-a-0][transport_worker][T#3]'
     4/10 snapshots sharing following 305 elements
       org.elasticsearch.xpack.security.authz.RBACEngine.resolveAuthorizedIndicesFromRole(RBACEngine.java:536)
       org.elasticsearch.xpack.security.authz.RBACEngine.loadAuthorizedIndices(RBACEngine.java:367)
--
   100.1% (500.5ms out of 500ms) cpu usage by thread 'elasticsearch[logging-prod-es-ingest-jmx-a-0][transport_worker][T#2]'
     10/10 snapshots sharing following 304 elements
       org.elasticsearch.xpack.security.authz.RBACEngine.loadAuthorizedIndices(RBACEngine.java:367)
       org.elasticsearch.xpack.security.authz.AuthorizationService.lambda$authorizeAction$5(AuthorizationService.java:286)
```
This led us to "Improve Authorization performance in clusters with a large number of indices" (Issue #67987 · elastic/elasticsearch · GitHub) and "Stop Security's heavy authz and audit harming the cluster's stability" (Issue #68004 · elastic/elasticsearch · GitHub), which we believe describe the actual root cause for us. Our cluster at the moment contains 2545 indices and 8324 active shards. Since a bulk request can contain up to 6000 documents, it is likely that each request touches hundreds or thousands of indices, and Elasticsearch spends all that CPU time authorizing requests on the transport_worker threads, thus blocking other incoming requests.
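This would also be consistent with the batch-size sensitivity we observed: if the expensive authorization work is paid once per bulk request rather than per document, shrinking the batch multiplies the number of authorization passes for the same document volume. A back-of-the-envelope check (pure arithmetic on the numbers above; the one-pass-per-bulk assumption is ours):

```python
def authz_passes_per_sec(docs_per_sec: float, batch_size: int) -> float:
    """Bulk requests/sec, which equals authorization passes/sec
    under the assumption that authorization runs once per bulk request."""
    return docs_per_sec / batch_size

# At the observed 60k docs/s with pipeline.batch.size: 6000:
print(authz_passes_per_sec(60_000, 6_000))  # 10.0 passes per second

# Sustaining the same document volume at batch size 500
# would require 12x as many authorization passes:
print(authz_passes_per_sec(60_000, 500))  # 120.0
```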
(Edit for clarification: the issue seems to significantly limit indexing throughput in this setup.) This effectively makes X-Pack security unusable for large clusters (our current v5 production cluster is several times larger than this one, and we're looking at upgrading now). Do you know if there is traction on the GitHub issues above? Also, do you know whether it's possible to enable TLS in Elasticsearch without authorization?
Thank you very much.
Cheers,
János