My 7.17.10 cluster is hosted in AWS EKS and is managed by ECK. It appears to top out at around 90k documents indexed per second (including replicas), and I haven't been able to identify the bottleneck. Adding more hot-tier nodes hasn't improved the indexing rate, and doubling the number of clients sending bulk index requests also hasn't meaningfully changed it. During peak loads the incoming rate sometimes reaches 35k docs/s; with 2 replicas each document is written three times (one primary plus two replica copies), so that's roughly 105k docs/s of indexing work, which exceeds the 90k ceiling.
Cluster sizing:
- 12 hot i3.2xlarge, 7.8 CPU, 55Gi mem, 35g heap (experimenting with >50% of RAM for heap, which seemed to help a little)
- 5 warm d3.2xlarge, 7 CPU, 60Gi mem, 30g heap
- 22 cold d3.2xlarge, 7 CPU, 60Gi mem, 30g heap
- 3 masters as pods w/ 4 CPU, 28Gi mem, 22g heap
This is a multi-tenant cluster covering hundreds of accounts (ES and Kibana are not customer-facing; they're only used for storage and for searching through the API). Each customer account has two bootstrap indices for different types of storage, managed by an ILM policy with warm/cold/delete phases. The current policy rolls over at 40gb primary shard size or 7d, keeps indices in warm for 6d (with force merge + read-only), then in cold for 88d before deletion. We currently have ~8.8k primary shards, and each tier stays under 20 shards per GB of heap (much lower than that for the hot tier). Total storage is 194 TB.
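For reference, the policy is roughly the following, shown here as it would be applied through the low-level client (the policy name and host are placeholders, and the delete min_age is derived as 6d warm + 88d cold = 94d after rollover rather than copied verbatim):

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class PutIlmPolicy {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // Roll over at 40gb primary shard size or 7d, warm for 6d (read-only + force merge),
            // cold until 94d after rollover, then delete. Policy name is a placeholder.
            Request putPolicy = new Request("PUT", "/_ilm/policy/tenant-storage-policy");
            putPolicy.setJsonEntity("{"
                + "\"policy\": {\"phases\": {"
                + "  \"hot\":    {\"actions\": {\"rollover\": {\"max_primary_shard_size\": \"40gb\", \"max_age\": \"7d\"}}},"
                + "  \"warm\":   {\"min_age\": \"0d\",  \"actions\": {\"readonly\": {}, \"forcemerge\": {\"max_num_segments\": 1}}},"
                + "  \"cold\":   {\"min_age\": \"6d\",  \"actions\": {}},"
                + "  \"delete\": {\"min_age\": \"94d\", \"actions\": {\"delete\": {}}}"
                + "}}}");
            client.performRequest(putPolicy);
        }
    }
}
```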
There are some other notes in this post about the initial setup.
During peak load a backlog forms, though the write queue doesn't exceed 150. Write timeouts occur infrequently, and refresh utilization climbs.
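For context, I'm sampling the write thread pool backlog and rejection counts with something like the following (low-level client; the host is a placeholder):

```java
import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class WatchWriteQueue {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // One row per node: active write threads, queued requests, and cumulative rejections.
            Request req = new Request("GET", "/_cat/thread_pool/write");
            req.addParameter("v", "true");
            req.addParameter("h", "node_name,active,queue,rejected,completed");
            Response resp = client.performRequest(req);
            System.out.println(EntityUtils.toString(resp.getEntity()));
        }
    }
}
```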
Things I've tried without seeing much of a change:
- add metrics to the pipeline stages ahead of Elasticsearch, to verify that the slowdown starts either inside Elasticsearch or at least at the point where requests are handed to the Java REST high level client
- increase the hot tier by 50%
- double the number of client nodes sending batches
- double the bulk size, and separately halve it
- increase the refresh interval (index.refresh_interval) from 30s to 45s on one set of indices, and to 90s on a less important set
- increase primary shard count for the most active indices to spread the write load around more
- enable compression for client connections in case EC2 bandwidth throttling was a factor (though I don't see evidence of that)
- increase maxConnTotal and maxConnPerRoute on the httpClientBuilder (client setup sketched after this list)
- try increasing indices.memory.index_buffer_size
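For reference, the client-side knobs from the list above (bulk sizing, concurrent bulks, compression, connection pool limits) all live in a setup that looks roughly like this; the host and the specific numbers here are placeholders, not my production values:

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.bulk.BackoffPolicy;
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.unit.ByteSizeUnit;
import org.elasticsearch.common.unit.ByteSizeValue;

public class BulkClientSetup {
    public static void main(String[] args) {
        // Request compression and connection pool sizing on the underlying low-level client.
        RestHighLevelClient client = new RestHighLevelClient(
            RestClient.builder(new HttpHost("es.example.internal", 9200, "http"))
                .setCompressionEnabled(true)
                .setHttpClientConfigCallback(http -> http
                    .setMaxConnTotal(100)
                    .setMaxConnPerRoute(100)));

        BulkProcessor.Listener listener = new BulkProcessor.Listener() {
            @Override public void beforeBulk(long executionId, BulkRequest request) {}
            @Override public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {}
            @Override public void afterBulk(long executionId, BulkRequest request, Throwable failure) {
                failure.printStackTrace();
            }
        };

        // Bulk sizing: actions/bytes per bulk request and in-flight bulks per client process.
        BulkProcessor bulkProcessor = BulkProcessor.builder(
                (request, bulkListener) -> client.bulkAsync(request, RequestOptions.DEFAULT, bulkListener),
                listener)
            .setBulkActions(5_000)
            .setBulkSize(new ByteSizeValue(10, ByteSizeUnit.MB))
            .setConcurrentRequests(4)
            .setBackoffPolicy(BackoffPolicy.exponentialBackoff())
            .build();

        // Indexing threads then hand IndexRequests to bulkProcessor.add(...).
    }
}
```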
I'm not sure whether this is CPU, disk, network, or cluster coordination cost. Adding hot nodes didn't appear to improve the ingest rate, which suggests CPU/disk on the hot tier aren't the whole story, though the hot nodes do run hot. Moving the hot tier from i3.2xlarge to an m-series instance type would give me more cores, but switching from local NVMe to EBS would add latency and cut my IOPS significantly.
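To get more signal on where hot-node time actually goes during the backlog (write threads vs. refresh, merge, or GC), I've been grabbing hot threads from the hot tier with something like this (the node-name wildcard is a placeholder matching my ECK pod names):

```java
import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class HotThreadsSnapshot {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // Top busy threads per hot node; node filter uses a name wildcard.
            Request req = new Request("GET", "/_nodes/*hot*/hot_threads");
            req.addParameter("threads", "5");
            req.addParameter("ignore_idle_threads", "true");
            Response resp = client.performRequest(req);
            System.out.println(EntityUtils.toString(resp.getEntity()));
        }
    }
}
```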
The only warning logs I'm seeing are infrequent slow transport message logs. In the past 24h I've only seen them from hot-8. Samples:
From hot-8 to hot-2
sending transport message [Response{25611586}{false}{false}{false}{class org.elasticsearch.action.bulk.BulkShardResponse}] of size [20723] on [Netty4TcpChannel{localAddress=/172.16.4.49:9300, remoteAddress=/172.16.21.255:51360, profile=default}] took [5005ms] which is above the warn threshold of [5000ms] with success [true]
From hot-8 to hot-1
sending transport message [Request{indices:admin/seq_no/retention_lease_background_sync[r]}{63707977}{false}{false}{false}] of size [586] on [Netty4TcpChannel{localAddress=/172.16.4.49:50176, remoteAddress=172.16.21.17/172.16.21.17:9300, profile=default}] took [5005ms] which is above the warn threshold of [5000ms] with success [true]
sending transport message [Request{indices:admin/seq_no/global_checkpoint_sync[r]}{63707946}{false}{false}{false}] of size [387] on [Netty4TcpChannel{localAddress=/172.16.4.49:34570, remoteAddress=172.16.12.17/172.16.12.17:9300, profile=default}] took [5205ms] which is above the warn threshold of [5000ms] with success [true]
sending transport message [Request{indices:data/write/bulk[s][r]}{63671760}{false}{false}{false}] of size [1192199] on [Netty4TcpChannel{localAddress=/172.16.4.49:50068, remoteAddress=172.16.21.17/172.16.21.17:9300, profile=default}] took [5208ms] which is above the warn threshold of [5000ms] with success [true]
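Since all of these warnings originate from hot-8, I've also been comparing that node's JVM/GC, transport, and thread pool stats against the other hot nodes, roughly like this (the node-name wildcard is again a placeholder for the actual pod name):

```java
import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class NodeStatsSnapshot {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // JVM (GC counts/times), transport, and thread pool stats for the suspect node.
            Request req = new Request("GET", "/_nodes/*hot-8*/stats/jvm,transport,thread_pool");
            Response resp = client.performRequest(req);
            System.out.println(EntityUtils.toString(resp.getEntity()));
        }
    }
}
```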
Any recommendations on troubleshooting this further?