Hi @dkow,
Thank you for taking a look at my question and reading through all that! 
Here are some answers/further questions:
- I would say so, yes. Without any indexing going through the ingest nodes, there are no probe failures; the failures occur on the ingest nodes only, and none are exhibited by the rest of the cluster. The number of failures does seem to increase with the incoming load.
- For the number of API calls to Elasticsearch, how do we measure that? Do you have a concrete metric in mind? (We're using the Prometheus exporter for gathering metrics.) We use bulk requests exclusively when feeding the cluster with data. Logstash is configured with `pipeline.batch.size: 6000` and `pipeline.workers: 14`, with 6 instances of Logstash. Regarding the number of API calls: when testing different values for those two settings, lower batch sizes (presumably meaning more batches, hence more API calls) reduce indexing throughput considerably (from 60k to 20k docs/s with `pipeline.batch.size: 500`) while ingest node CPU usage is maxed out.
- We cannot answer this at the moment, as I don't believe we have metrics for these in either Logstash or Elasticsearch.
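On measuring API call volume: lacking a dedicated exporter metric, one rough approach (a sketch; the host, credentials, and the 60 s interval are assumptions, and you'd need to add auth/TLS handling) is to sample `_nodes/stats/http` twice and diff the `total_opened` counter per node. Note that this counts opened HTTP connections rather than individual requests, so with keep-alive it is only a lower bound, but the trend still tracks load:

```python
import json
import urllib.request

# Hypothetical endpoint; real usage needs credentials and a TLS context.
NODES_STATS_URL = "https://localhost:9200/_nodes/stats/http"

def fetch_http_counters(url: str) -> dict:
    """Return {node_name: total_opened} from the _nodes/stats/http API."""
    with urllib.request.urlopen(url) as resp:
        stats = json.load(resp)
    return {
        node["name"]: node["http"]["total_opened"]
        for node in stats["nodes"].values()
    }

def per_second_rate(before: dict, after: dict, interval_s: float) -> dict:
    """Diff two counter samples taken interval_s seconds apart."""
    return {
        name: (after[name] - before[name]) / interval_s
        for name in before
        if name in after
    }

# Example with two hypothetical samples taken 60 s apart:
before = {"ingest-0": 120_000, "ingest-1": 118_500}
after = {"ingest-0": 126_000, "ingest-1": 124_500}
print(per_second_rate(before, after, 60))  # {'ingest-0': 100.0, 'ingest-1': 100.0}
```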
- They are more-or-less even, yes.
- I think you're onto something here. We've done CPU sampling with VisualVM to try to see where most of the time is spent. We've also gotten the `hot_threads` output from the ingest nodes, and we think authorization might be the culprit:
```
$ curl -s -k "https://elastic:PW@localhost:9200/_nodes/logging-prod-es-ingest-jmx-a-0/hot_threads" | grep "cpu usage by thread" -A 3
   100.1% (500.6ms out of 500ms) cpu usage by thread 'elasticsearch[logging-prod-es-ingest-jmx-a-0][transport_worker][T#8]'
     10/10 snapshots sharing following 304 elements
       org.elasticsearch.xpack.security.authz.RBACEngine.loadAuthorizedIndices(RBACEngine.java:367)
       org.elasticsearch.xpack.security.authz.AuthorizationService.lambda$authorizeAction$5(AuthorizationService.java:286)
--
   100.1% (500.6ms out of 500ms) cpu usage by thread 'elasticsearch[logging-prod-es-ingest-jmx-a-0][transport_worker][T#3]'
     4/10 snapshots sharing following 305 elements
       org.elasticsearch.xpack.security.authz.RBACEngine.resolveAuthorizedIndicesFromRole(RBACEngine.java:536)
       org.elasticsearch.xpack.security.authz.RBACEngine.loadAuthorizedIndices(RBACEngine.java:367)
--
   100.1% (500.5ms out of 500ms) cpu usage by thread 'elasticsearch[logging-prod-es-ingest-jmx-a-0][transport_worker][T#2]'
     10/10 snapshots sharing following 304 elements
       org.elasticsearch.xpack.security.authz.RBACEngine.loadAuthorizedIndices(RBACEngine.java:367)
       org.elasticsearch.xpack.security.authz.AuthorizationService.lambda$authorizeAction$5(AuthorizationService.java:286)
```
This led us to "Improve Authorization performance in clusters with a large number of indices" (Issue #67987 · elastic/elasticsearch · GitHub) and "Stop Security's heavy authz and audit harming the cluster's stability" (Issue #68004 · elastic/elasticsearch · GitHub), which we believe describe the actual root cause for us. Our cluster at the moment contains 2545 indices and 8324 active shards. Since a bulk request can contain up to 6000 documents, it is likely that each request touches hundreds or thousands of indices, and Elasticsearch spends all that CPU time authorizing requests on the transport_worker threads, thus blocking other incoming requests.
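This would also be consistent with the batch-size sensitivity we observed: if the expensive authorization work is paid once per bulk request rather than per document, shrinking the batch multiplies the number of authorization passes for the same document volume. A back-of-the-envelope check (pure arithmetic on the numbers above; the one-pass-per-bulk assumption is ours):

```python
def authz_passes_per_sec(docs_per_sec: float, batch_size: int) -> float:
    """Bulk requests/sec, which equals authorization passes/sec
    under the assumption that authorization runs once per bulk request."""
    return docs_per_sec / batch_size

# At the observed 60k docs/s with pipeline.batch.size: 6000:
print(authz_passes_per_sec(60_000, 6_000))  # 10.0 passes per second

# Sustaining the same document volume at batch size 500
# would require 12x as many authorization passes:
print(authz_passes_per_sec(60_000, 500))  # 120.0
```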
(Edit for clarification: the issue seems to significantly limit indexing throughput in this setup.) This effectively makes X-Pack security unusable for large clusters (our current v5 production cluster is several times larger than this one, and we're looking at upgrading now). Do you know if there is traction on the GitHub issues above? Also, do you know whether it's possible to enable TLS in Elasticsearch without authorization?
Thank you very much.
Cheers,
János