Elasticsearch 8.9.1 indexing bottleneck on i3.2xlarge and d3.2xlarge nodes in EKS using ECK

Last May I asked a similar question about performance in 7.17.10, but ran out of things to try and the discussion was auto-closed due to inactivity. I'll avoid repeating the same context info from that post (the initial post has important details on sharding strategy)

I have since upgraded to 8.9.1 (won the fight and didn't have to move to OpenSearch). Sadly I'm still having similar issues. The indexing rate tops out at ~100k/s (including replicas), with an average of 88.2k/s over the past 15min.

I'm also seeing up to 2s when writing a single document. Most data is written in batches of 600-1000 documents based on the data type, but for the least frequent data type a single write is appropriate. The 2s response time is surprising.

Current cluster sizing:

  • 10 hot i3.2xlarge, 7.8 cpu, 55gi mem, 35g heap
  • 6 warm d3.2xlarge, 7 cpu, 60 gi mem, 30g heap
  • 20 cold d3.2xlarge, 7 cpu, 60 gi mem, 30g heap
  • 3 master as pods w/ 4 cpu, 28gi mem, 22gi heap

In that cluster there are:

  • 6258 indices
  • 17514 total shards (I know this is high - it's split by tenant)
  • 101 billion documents
  • 191.3 TB

Changes since I asked in the earlier thread (other than the upgrade):

  • increased parallelism to have more concurrent writes
    • currently 70 parallel jobs, but it dynamically sizes based on ingest rate
    • this increased the write queue size, but only on some nodes

Output from GET _cluster/settings/?flat_settings=true

  "persistent": {
    "cluster.routing.allocation.cluster_concurrent_rebalance": "30",
    "cluster.routing.allocation.disk.watermark.low": "85%",
    "indices.breaker.request.limit": "30%",
    "indices.recovery.max_bytes_per_sec": "100mb"
  "transient": {}

For cluster.routing.allocation.cluster_concurrent_rebalance I tried the default value, 10, 30, and 50 to see if it allowed the cluster to redistribute shards more evenly with the new heuristics in es 8. I haven't seen a difference in ingestion rates or work distribution. Similar notes on indices.recovery.max_bytes_per_sec.

The write load is very unevenly distributed. Hot tier nodes have

EC2 monitoring stats for the past 3h shows very uneven distribution, with the busiest instance around 96% cpu, and the quietest at ~33%.

Indexing rate by node is also very unbalanced:

hot-3 has an average write queue of 80, and the next busiest (hot-1) only has 10.

With 8.9 I would have expected it to rebalance using write load a bit more than it is. Do you have any recommendations on tuning the allocation, or troubleshooting the indexing bottleneck in general?

A few questions that don't appear to have been covered in the last thread:

  1. How many shards do each of the content/hot indices have?
    • While I've seen the 8.x balance improvements help out in places, there are some "limitations" on what it can do if it is just one index with one shard consuming a majority of resources.
    • For reference, when I setup a high-throughput index, I'll generally have the content/hot state have a number of shards (generally some nice multiple of the number of hot/content nodes). Then when the index moves to warm, I'll shrink the shard count to a more meaningful size.
      • This generally from my experience helps ensure that write loads get more evenly distributed across nodes.
  2. Your hot node memory is a bit weird, these nodes have the least amount of memory then your other data nodes. I believe (and this is somewhat of an educated assumption), hot nodes generally benefit from both HEAP memory, as well as file system memory.
    • The i3 series I know has these weird memory sets, AWS now has the i4i instance type which now has much more standard memory sizes.
    • If you can't go with the i4i, maybe look at the i4g instance type. If you can support ARM instances, I've found that Elasticsearch performs perfectly fine using ARM.
  3. I personally think the memory/heap on your master nodes is overkill, but I don't think this should have a negative impact on performance (though I don't really know this)
  4. How are you routing requests to your cluster?
    • Might be bit of a controversial take, but at the scale, I'd might consider adding some coordinating only nodes in front of all these nodes.
  5. In your previous post you mentioned this cluster is managed by ECK. Could you provide the CR that defines this cluster?
    • I've written about some general improvements that can be made when running Elasticsearch via ECK, here, that I found while doing load testing for a relatively medium-sized project (3 nodes, sustaining at ~30k e/s, 1 index, 3 shards + 1 (3) replicas)
  6. I personally avoid touching most of the Elasticsearch cluster settings, I think the majority of these are tuned to work in the majority of scenarios, and it can quickly become a footgun to change these as they can have some weird side-effects.
  7. Do you do any sort of document pre-processing either via Logstash or Ingest pipelines?
    • In you last post you mention:

    No ingest pipeline within elasticsearch - I have a separate pipeline that predates our use of elasticsearch.

    • Did you ever confirm that your "separate pipeline" was not causing these limitations?
  8. In the last post, I didn't really see any mention of what your index mappings really look like, are we talking just keywords and numbers or are we also dealing with things like text and analyzers.
    • Could you provide a copy of your indexs' mappings?
  9. Entering the world of "really new", and probably not the first thing you should be looking at, but have you looked at _synthetic source or TSDS?
    • These both offer some advantages to performance, but do come with trade-offs/limitations.

Note: I'm probably missing some other questions, but wanted to somewhat avoid rehashing questions that were answered in the previous post, as (I'm assuming) these were already re-evaluated after the upgrade.

The write-load balancing only works for data streams, and apparently requires an Enterprise subscription (it's part of "Advanced cluster rebalancing"). If it's not working as expected and you have an Enterprise subscription then please open a support case.

Ah, I missed this part of the docs:

The estimated write load of the new shard is a weighted average of the actual write loads of recent shards in the data stream. Shards that do not belong to the write index of a data stream have an estimated write load of zero.

The docs could use a note that it requires an Enterprise subscription btw.

Thank you for the suggestions - just got back from holiday and am catching up.

It varies based on ingest. Most have a single primary shard and two replicas, but the top 12 indices have 2 primary shards to spread the load around a bit more. Approximate rates for the top few over the past 30 according to stack monitoring (per second): 2200, 1900, 1700, 800, 750, 600, 600, 550, 530, 500.

I've been cautious in expanding that since shards per node can also be an issue, especially with some smaller indices for smaller customers.

I had been experimenting with different heap sizes in the last version, and it seemed to perform a little better, but I've reverted to the default of half to see how 8.9 behaves.

The i3 series is a bit odd with 8:61:1900 for cpu:mem:disk, so I like that i4i+i4g have 8:64:1875. Wish they also had an m-series equivalent with 16:64:1900 to see how it behaves with more cores. The im4gn has 16:64:7500, which is too big of a storage jump for what I need. I'll experiment with i4g since it has some nice performance improvements and a good price.

Ingest is written to a headless k8s service targeting the hot tier nodes.
Queries are to a pair of coordinating nodes to offload aggregation costs and avoid hitting inappropriate nodes.

Thank you, reviewing that post.

Yeah, I tend to avoid them too. The cluster rebalancing settings are the only area I think might benefit - with a large number of shards it seems strange to only allow a few shards to be relocating at any time unless there's a a high coordination cost. If there were 500 nodes then I'd expect it to support more relocating shards than if there were 5 nodes. Similar for the max bytes per second - that seems like it's partially based on bandwidth, and partially based on cpu/disk pressure on each node. I'm not sure which values are appropriate for my environment though.

I don't use logstash at all, and don't use ingest pipelines. Everything is processed in my separate pipeline, with storage to elasticsearch, followed by storage to S3 for archives once I have the elasticsearch IDs from the request.

That pipeline has more parallelism now and the write queue has increased, so I suspect that my main issue is still the uneven work distribution. Indexing rate by node right now is 9.2k/s to 13.6k/s, and storage is between 41.8% and 64.3%. Write queues tend to be higher on the busier nodes of course.

Some text - it's for logging data, with some fields captured as keywords/numbers/IPs, others as text, or text+keyword.

Mapping for the busiest index, pulled from dev tools (the others will be about the same, but pulled from dev tools to cover any that were not explicitly mapped)

TSDS seems to only be for metrics data, so it wouldn't be appropriate for this use case. The synthetic source looks interesting, but since it's in technical preview I'll wait for it to mature a bit.

Thanks for the additional info, so notes I've come up with:

Approximate rates for the top few over the past 30 according to stack monitoring (per second): 2200, 1900, 1700, 800, 750, 600, 600, 550, 530, 500.

I've been cautious in expanding that since shards per node can also be an issue, especially with some smaller indices for smaller customers.

I'd agree here that for really low volume indices, expanding shards might not be worth it. You can probably try expanding shards on the higher volume indices in the hope that they get scheduled more evenly, but I'm honestly not sure if that is a "good" solution in this scenario.

Regarding the ECK deployment you provided:

I see:

        - preference:
            - key: failure-domain.beta.kubernetes.io/zone
              operator: In
              - us-east-2a

A few times, I'm assuming based off this, that the entire cluster is only in one AZ (us-east-2a)?

I see you're using some plugins:

bin/elasticsearch-plugin install --batch mapper-size repository-s3

For mapper-size, I'm not sure if this plugin has any affect on indexing performance. I wasn't really able to find much related to performance discussions on this plugin.

For repository-s3, I don't think you need to install this anymore, I believe that since 8.0, this plugin is now part of Elasticsearch. (Doesn't affect your performance, just a random note).

You're setting:

- config:
    indices.memory.index_buffer_size: 10%

But the default is already 10%, this can probably be removed.

You should investigate

- config:
    http.compression: true

I don't know if this setting is helping, hurting, or doing nothing to you. If network bandwidth isn't a concern, doing uncompressed is ?probably? better? Might be worth testing.

This is interesting

- config:
    - data_warm
    - data_content
  name: warm
- config:
    - data_cold
    - data_content
  name: cold

I have honestly never considered using the data_content role on any other node type than either dedicated nodes, or sharing with data_hot. I don't know if there are any performance implications here, I do know that data_content stores indices like the security (permissions) ones, so I don't know if them being on "slower" nodes could impact ingest performance.

I wasn't able to find much info/guidance on this topic, so not 100% sure.

- config:
    - remote_cluster_client

I don't think you need this role on all of your nodes, and in reality, if you're not using the Cross Cluster features, I don't believe you need it at all.

Your master node storage is a bit "weird":

- name: leader3
    ephemeral-storage: 10Gi
    ephemeral-storage: 10Gi
  - metadata:
      name: elasticsearch-data
      - ReadWriteOnce
          storage: 1Gi
      storageClassName: gp2-wait-retain-ebs

I don't fully understand what this is doing, but a small recommendation, GP2 is a bit old and slow now, might want to test switching to GP3. I don't think master nodes do a whole lot of I/O that would bottleneck a cluster, but I'm also not 100% sure.

Just a quick mention, but since you've already somewhat stated it,

- env:
  - name: ES_JAVA_OPTS
    value: -Xms22g -Xmx22g -Xlog:gc*,gc+age=trace,safepoint:file=logs/gc.log:utctime,pid,tags:filecount=2,filesize=10m

I don't really know what any of these Java Opts do, so not sure how they might affect your performance.

Regarding the Index mapping you provided, there is one main thing I can see:

Lots of text fields. text fields are relatively expensive for indexing and, I personally, try to avoid them unless absolutely nessecary.

  • When I do need to use a text field, I see if I can get away with using a match_only_text field first.

The synthetic source looks interesting, but since it's in technical preview I'll wait for it to mature a bit.

Yes, it is somewhat unfortunate it is not GA yet, limits me in using it as well.

Unfortunately. The business prioritized savings on the cross-zone data transfer and the additional instances for now.

Good call, will remove since I'm not currently using it.

Will experiment with reverting that change to see if it makes a difference. It didn't seem to make a meaningful difference when I enabled it.

The warm+cold tier are on d3 instances that don't see as much activity since the data aged out and querying for that data is much less frequent. The warm tier runs ~half the cpu as hot, and cold is about 1/4 the cpu load of warm. I'd expect these to still perform well, but I don't know if the indices in data_content would be affected.

These are actually gp3 since I migrated gp2 instances to gp3, but didn't update the nodeset to replace the leaders with the new storage class. The stats for the past week look ok to me - mostly idle, low iops, low throughput in the kib/s, not mib/s.

Xms and Xmx are min/max heap, the remainder is for logging gc info (from a previous support request iirc)

Agreed that dropping some text fields would help, though I'll have to review the consumer of this data to see where it expects the .keyword variant. Changing the mapping can be complicated when readers have expectations, and with 90d of retention it takes awhile to age an old dataset out.

I've made some progress.

The i4g nodes helped, and I've increased the parallelism on my pipeline further. Most hot nodes are now running near max cpu, with an average write queue of 5-80 for each hot tier node. My mean indexing rate is now ~35k/s for primaries, max ~50k/s.

I still see some delays writing cluster state on the warm tier nodes, e.g.

writing cluster state took [23224ms] which is above the warn threshold of [10s]; [skipped writing] global metadata, wrote [0] new mappings, removed [0] mappings and skipped [653] unchanged mappings, wrote metadata for [0] new indices and [1] existing indices, removed metadata for [0] indices and skipped [5840] unchanged indices

Any thoughts on how to address that?

Also, any thoughts on the appropriate cluster_concurrent_rebalance and max_bytes_per_sec for these instance types?

This suggests IO saturation, but it's not really a problem as long as these nodes are not master-eligible.

Nothing concrete unfortunately. Historically we've recommended keeping cluster_concurrent_rebalance low but I think in 8.9.1 that's not really necessary any more, we just haven't done any structured investigation into its tuning. The best value for max_bytes_per_sec is a function of your hardware, workload, and performance goals - the higher you set it, the bigger the performance impact of shard recoveries on other work.

This seems to only be happening on the warm nodes.
Log frequency last night (warm-5 is the one this occurs on the most)

EC2 metrics for those nodes for that time period:

EC2 metrics for the node running warm-5 show a much higher disk read, so that tracks. It was doing an S3 snapshot around that time, so I'd guess that the two are related.

Any thoughts on how the concurrent rebalance should scale as cluster size increases? My thinking was that if you have 50 nodes you can probably handle more concurrent rebalances than if you have 5 nodes (though a per-node limit would still apply).

For the max_bytes_per_sec I'm currently set to 100mb, but I'm not sure what would be reasonable for this cluster, which is using d3.2xlarge for warm+cold and i4g.2xlarge for the hot tier.

I would set that differently for the different nodes. IIRC d3 nodes have spinning disks which are terrible at concurrent IO workloads so you need to keep the speed right down there, but i4g noes should be able to go much faster.

1 Like

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.