Elasticsearch 7.17.10 indexing bottleneck on i3.2xlarge and d3.2xlarge nodes in EKS

There will actually be one connection per Kinesis shard, since the KCL (Kinesis Client Library) allocates shards evenly across the workers. We currently have two streams to handle two types of data, with 29 shards in one and 33 in the other, so we have 62 shards in total and 62 threads actively writing to Elasticsearch.
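For context, each leased shard gets its own record processor, which is why the shard count effectively caps how many bulk requests can be in flight at once. A rough sketch of the shape (KCL 2.x; EsBulkWriter is a made-up name standing in for our wrapper around the Elasticsearch client):

import software.amazon.kinesis.lifecycle.events.InitializationInput;
import software.amazon.kinesis.lifecycle.events.LeaseLostInput;
import software.amazon.kinesis.lifecycle.events.ProcessRecordsInput;
import software.amazon.kinesis.lifecycle.events.ShardEndedInput;
import software.amazon.kinesis.lifecycle.events.ShutdownRequestedInput;
import software.amazon.kinesis.processor.ShardRecordProcessor;

// One instance of this runs per leased Kinesis shard, so the total shard count
// (62 across the two streams) caps how many bulk writes happen concurrently.
public class AlertShardProcessor implements ShardRecordProcessor {

    private final EsBulkWriter writer; // hypothetical wrapper around the ES client

    public AlertShardProcessor(EsBulkWriter writer) {
        this.writer = writer;
    }

    @Override
    public void initialize(InitializationInput input) {}

    @Override
    public void processRecords(ProcessRecordsInput input) {
        // The whole batch from this shard is written as one synchronous bulk
        // request before the next batch is fetched.
        writer.bulkIndex(input.records());
    }

    @Override
    public void leaseLost(LeaseLostInput input) {}

    @Override
    public void shardEnded(ShardEndedInput input) {}

    @Override
    public void shutdownRequested(ShutdownRequestedInput input) {}
}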

I experimented with doubling the number of workers in case there was a local bottleneck, e.g. if the Elasticsearch client used a shared HTTP connection pool with a tiny pool size. If the EC2 instances had bandwidth throttling then I'd expect to see an improvement with more client pods. Doubling the number of workers had no effect on the overall ingestion rate, and those workers don't run at very high CPU anyway.

Even this seems fairly low, given that each of your 12 hot nodes has 8 CPUs available for indexing, and also given that some of the time within ES is spent on overheads like network comms.

AIUI changing the number of workers wouldn't have affected the number of requests which Elasticsearch is processing concurrently, because this is determined by the number of Kinesis shards. Is that right? That seems like the limiting factor to me at the moment.


Exactly.

I increased my shard count for one stream from 29 to 40 to see whether the increased parallelism improved throughput. In the ~5h since that change I haven't seen a dramatic difference, though I won't have a proper comparison until everything in the old shards has been processed.

The merge throttle time spiked shortly after I made that change (based on the Prometheus exporter metrics), but the average merge throttle time since then is about the same as what I normally see during peak load. The refresh time is also about as high as usual (one node is reporting more refresh time than the actual elapsed time - possibly a reporting issue, of course).

How much should I care about merge throttle time and refresh time?

Refresh time accumulates per thread, so can definitely exceed the elapsed time.

Not a lot, at least not while your ES utilisation is otherwise so low. That said, what is your refresh strategy? Are you refreshing on every write, for instance? That's normally inappropriate if you are trying to achieve high throughput.

The refresh interval is set to 40s and 90s for the two types of indices. It's configured in the index template (and the index settings are updated directly after a change). It had been 30s, but I increased it a month or two ago in case less frequent refreshes helped reduce load on the cluster.
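For reference, the direct settings update is nothing special - roughly this with the Java high-level REST client we use (client is our RestHighLevelClient, and the index pattern is a placeholder):

import org.elasticsearch.action.admin.indices.settings.put.UpdateSettingsRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.common.settings.Settings;

// After changing the template, bump the refresh interval on the existing indices too.
var settingsReq = new UpdateSettingsRequest("alerts-*")   // placeholder index pattern
        .settings(Settings.builder()
                .put("index.refresh_interval", "40s")     // 90s for the other index type
                .build());
client.indices().putSettings(settingsReq, RequestOptions.DEFAULT);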

Wrote a quick script to check this. Most indices only have a single shard per node, but one (the busiest) has two shards on each of three nodes: hot-0, hot-9 and hot-10 each hold one primary and one replica (of different shards in that index). I'll reduce the primary shard count for that index to ensure only one per node. The affected nodes aren't in the top two for CPU, but it's tidier this way and avoids concentrating the load.
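(The script itself is nothing clever - it just groups the _cat/shards output by index and node. Something along these lines, with Jackson as an arbitrary choice for parsing:)

import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Count shard copies per (index, node) pair using the _cat/shards API.
var catReq = new Request("GET", "/_cat/shards");
catReq.addParameter("format", "json");
catReq.addParameter("h", "index,node");
var body = EntityUtils.toString(
        client.getLowLevelClient().performRequest(catReq).getEntity());

List<Map<String, String>> rows =
        new ObjectMapper().readValue(body, new TypeReference<List<Map<String, String>>>() {});

Map<String, Integer> copiesPerIndexNode = new HashMap<>();
for (var row : rows) {
    copiesPerIndexNode.merge(row.get("index") + " on " + row.get("node"), 1, Integer::sum);
}
copiesPerIndexNode.forEach((key, count) -> {
    if (count > 1) {
        System.out.println(key + ": " + count + " shard copies");
    }
});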

Hmm in which case I'm a little surprised you're spending so much time on refreshes. Do your refresh stats confirm that you're not inadvertently getting more refreshes from somewhere? Check total and external_total from GET /_stats?human&filter_path=indices.*.total.refresh for instance.

I hadn't come across the external refresh stats before - thank you.

I'm seeing some high numbers on the external refresh for some indices. E.g. this index, which has 36,061,735 documents in it, has the highest total:

"refresh" : {
          "total" : 44806,
          "total_time" : "1.9h",
          "total_time_in_millis" : 6841120,
          "external_total" : 44120,
          "external_total_time" : "1.9h",
          "external_total_time_in_millis" : 6875400,
          "listeners" : 0
        }

And this one has the highest external_total_time_in_millis:

"refresh" : {
          "total" : 5972,
          "total_time" : "3.3h",
          "total_time_in_millis" : 12200181,
          "external_total" : 5040,
          "external_total_time" : "3.3h",
          "external_total_time_in_millis" : 12170869,
          "listeners" : 0

I'm reviewing the code for anything that requests a refresh, but not finding anything so far. Are there any common ways that people accidentally request refresh?

You can do it as a separate request or request one as part of an indexing request. If you are performing frequent updates to documents I believe this can also result in a large number of refreshes.

We bulk index using the Java elasticsearch-rest-high-level-client 7.17.10, where client is a RestHighLevelClient, in the following simplified code:

var req = new BulkRequest();
req.timeout(TimeValue.timeValueMinutes(1));
req.waitForActiveShards(2);

for (var r : alerts) {
    req.add(
        new IndexRequest(index)
            .create(true)
            .source(r.getJson(), XContentType.JSON)
    );
}

var resp = client.bulk(req, RequestOptions.DEFAULT);

It then checks the response for hasFailures, iterating any failures to report them.
Finally it iterates the BulkItemResponses to get each IndexResponse, from which it pulls the Elasticsearch-generated id.
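For completeness, that handling is roughly this (simplified again; log and ids stand in for our real reporting and bookkeeping):

if (resp.hasFailures()) {
    for (var item : resp.getItems()) {
        if (item.isFailed()) {
            log.error("bulk item failed: {}", item.getFailureMessage());
        }
    }
}

for (var item : resp.getItems()) {
    if (!item.isFailed()) {
        IndexResponse indexResponse = item.getResponse();
        ids.add(indexResponse.getId());   // the Elasticsearch-generated id
    }
}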

Anything on this path that could trigger an external refresh?

Other services that interact with Elasticsearch:

  • Our web interface, which doesn't appear to trigger external refreshes
    • I tested this by querying an index that had only 2 external refreshes, using every query path the system supports, and the external refresh count did not increase
  • A cost monitoring service that only calls the following:
    • /_cat/shards
    • ExplainLifecycleRequest
  • The Elasticsearch Prometheus exporter
  • Kibana (but very rarely used - the performance is poor with our shard count)

I've been reviewing the code for each service and haven't found any explicit refreshes, and have also been reviewing the source code for the Elasticsearch REST client in case there's something I missed. No luck so far.

The absolute numbers are not so interesting: the stat is cumulative (or at least it's the total of all the per-shard cumulative stats), so it's best to look at how it increases over a known interval. E.g. 44k refreshes would be totally reasonable for a 3-shard index that's been refreshing every 40s for a week (3 × 604,800s ÷ 40s ≈ 45k). The question is whether it is increasing by more than #shards every 40/90s.

External refreshes are either indexing with the ?refresh=true option, or calls to POST /$INDEX/_refresh.
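In the Java high-level client those correspond to something like this (illustration only; "my-index" is a placeholder):

import org.elasticsearch.action.admin.indices.refresh.RefreshRequest;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.support.WriteRequest;
import org.elasticsearch.client.RequestOptions;

import java.util.Map;

// 1. A refresh requested as part of an indexing request (equivalent to ?refresh=true):
var indexReq = new IndexRequest("my-index")
        .source(Map.of("field", "value"))
        .setRefreshPolicy(WriteRequest.RefreshPolicy.IMMEDIATE); // forces a refresh per request
client.index(indexReq, RequestOptions.DEFAULT);

// 2. A refresh issued as a separate request (POST /my-index/_refresh):
client.indices().refresh(new RefreshRequest("my-index"), RequestOptions.DEFAULT);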

The refresh numbers match reasonably well. I adjusted my script to capture the index creation time and calculate the elapsed time divided by either 45 or 90, depending on the index type.

One example with the 45s refresh and 1 primary + 2 replicas had 38812 refreshes, with 42669 expected (the refresh rate changed in the past month so I expect it to be off by a bit). That index had 37842 external refreshes though, and I'm having trouble identifying the cause of those.
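(The expected figure is roughly shard copies × elapsed time ÷ refresh interval. In the script it's essentially this, where creationMillis comes from the index's creation_date setting and observedTotal/observedExternalTotal are placeholders for the values pulled from the refresh stats above:)

import java.time.Duration;
import java.time.Instant;

// Expected background refreshes since index creation:
//   shard copies (primaries + replicas) * elapsed seconds / refresh interval
long shardCopies = 3;              // e.g. 1 primary + 2 replicas
long refreshIntervalSecs = 45;     // 45 or 90 depending on the index type
long elapsedSecs = Duration.between(
        Instant.ofEpochMilli(creationMillis), Instant.now()).getSeconds();

long expectedRefreshes = shardCopies * elapsedSecs / refreshIntervalSecs;
System.out.printf("expected ~%d refreshes, observed %d (external %d)%n",
        expectedRefreshes, observedTotal, observedExternalTotal);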

I switched to only watching primaries and ran the stats collection job every second to watch when the count changed - it ticks up every ~46s.

Watching the primary shard refreshes for an index with a 90s refresh interval is a bit different. I saw that refresh count go up after 99s, then 120s. The external count increased for both of those, though just before them I saw one tick where total increased and external did not.

This implies that the external refreshes are happening on the refresh interval. Any chance the stats are just being reported wrong, or that there's a path that causes the interval-based refresh to be reported as external? Perhaps when an index can't be refreshed within a timeout period it triggers the external refresh path?

I'm not finding any code that explicitly sets refresh on either the indexing or searching path.

We treat Elasticsearch as an immutable store and don't do any updates.

Oh sorry, periodic refreshes are external too. I meant in terms of API calls.

No worries. It was an interesting tangent to explore and I learned something.

Increased the Kinesis shard count by 50% again to see if there was any measurable impact, and the throughput hasn't changed. This suggests that the Kinesis shard processing isn't our bottleneck.

Has there been any change in the utilisation of the write threadpool, or is it still as low as it was previously?
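If you're not already graphing it, a quick way to sample that is the cat thread pool API, e.g. via your low-level client (just a sketch):

import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;

// Active/queued/rejected counts for the write pool on each node.
var tpReq = new Request("GET", "/_cat/thread_pool/write");
tpReq.addParameter("v", "true");
tpReq.addParameter("h", "node_name,active,queue,rejected,completed");
System.out.println(EntityUtils.toString(
        client.getLowLevelClient().performRequest(tpReq).getEntity()));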

Also if you do GET /_tasks?actions=indices:data/write/bulk&detailed how many tasks are returned? If you save the response to tasks.json you can find this out with cat tasks.json | jq '.nodes[].tasks[]' -cMr | wc -l.

Here's the past 7 days of indexing rate (blue, left axis) vs write queue (orange, right axis). There's a small vertical line showing when I increased Kinesis shards by 50% on the 22nd. The write queue has increased, though indexing isn't higher.
We did have a much higher ingest rate yesterday for an unrelated reason, so the shard change isn't the only factor.

100 tasks from that query

Ok, that does look like an improvement: the queue was often empty prior to the change, which is a sure sign of under-utilisation. But are all nodes now queueing up writes, or is the queue still concentrated on a subset of the nodes?

100 in-flight bulk tasks is still a little on the low side given you have ~100 CPUs in your hot tier, but perhaps not so bad.

Could you share the output of GET _nodes/data_hot:true/hot_threads?threads=9999 while the cluster is trying to process a backlog?

Also I don't think you ever looked at TCP buffer utilisation, which would help to show whether the holdup was in the network or on the ES transport threads: try ss -tonie on all the hot nodes and look for connections with large buffers.