Increasing Throughput Until My Servers Catch Fire

I'm curious about settings that will increase my index rate, at the risk of losing data in the event of hardware corruption or similar. My datanodes are heavily CPU-constrained, so anything I can do to reduce the CPU load will cause increased throughput.

I've got a large cluster of about 80 datanodes that are processing analytical data at a rate of ~250k docs/s on the primary shards (~500k/s with replication). As expected, it's taken a significant amount of tweaking (at the cluster, node, & template level) to get the cluster able to process this many documents simultaneously. However, I'm still losing data, as the stream is probably closer to ~350k docs/s. I'm hoping to gather some insight into some of the settings I'm tweaking, as there is no realistic possibility for me to set up a staging cluster of this size for testing purposes.

What I'm hoping to achieve is basically disabling the translog

I have been investigating the translog settings, and set the durability to async, however it will still commit the translog on some sync interval. Would it be sufficient to just set the sync_interval to something greater than ~90 minutes (how long an index is written to before rolling over)? Or can I set it to zero to disable? Curious about the ramifications this may have for memory...

Also, what about soft deletes?

I'm looking to update the soft_deletes.retention_lease.period to something incredibly small, e.g. "1s".

Here's the relevant settings in the index template:

{
  "device_all" : {
    "order" : 10000,
    "version" : 30500,
    "index_patterns" : [
      "device-*"
    ],
    "settings" : {
      "index" : {
        "lifecycle" : {
          "name" : "device_rollover",
          "rollover_alias" : "device_all"
        },
        "codec" : "default",
        "allocation" : {
          "max_retries" : "10"
        },
        "mapping" : {
          "total_fields" : {
            "limit" : "200"
          }
        },
        "refresh_interval" : "30s",
        "number_of_shards" : "25",
        "translog" : {
          "sync_interval" : "600s",
          "durability" : "async"
        },
        "soft_deletes" : {
          "retention_lease" : {
            "period" : "1s"
          }
        },
        "query" : {
          "default_field" : [ ]
        },
        "unassigned" : {
          "node_left" : {
            "delayed_timeout" : "15m"
          }
        },
        "number_of_replicas" : "1"
      }
    },
...
Cluster stats:

ESv7.4.1
Using 26 nodes to bulk_insert 5k docs at a time.
6 Client Nodes -
3 Master Nodes
80 Data nodes

If securely ingesting the data is not a major concern the biggest win would probably be to set the number of replicas to 0.

Thanks Christian! I was considering pinging you directly, as you've helped me on multiple issues, both online and in person.

I didn't explain my use-case properly - apologies for that.
I want to have a backup of each index, as I can't have major gaps in my data. The issue is that I'm currently unable to index all the documents coming through. My processing engines (formerly Logstash instances) get backed up waiting for ES to complete indexing and they just have to drop documents. If I can decrease the time to index any individual document, I can increase my throughput.

I agree that disabling replication would increase my throughput, but my team understandably won't sign off on that :slight_smile: Additionally, I do need to be able to query this data, and my understanding is that the replica can be helpful for faster queries as well.

Are all data nodes of the same type? What type of storage are you using? What is the average event size of your events? Are your events immutable? Are you using nested documents or parent child?

All datanodes are the same type. SSD storage. 31GB Heap, the other 225GB is reserved for the filesystem.
Average size of a document is 900B.
Documents are immutable.
Documents are not nested nor parent-child. Documents do have nested json.

Thanks for your time and please let me know if I can provide additional helpful info.

Edit: I also recently noticed that although I've 25 shards, primaries often co-locate on the same node. I've updated "cluster.routing.allocation.balance.index": 5 to ensure primary shards are placed on different datanodes, which should reduce "hotspots" where servers bottleneck the indexing process.

If all 80 nodes perform indexing, 250k events at around 1kB each in size instinctively sounds a bit low. Have you tried to determine what is limiting performance? Do you have a lot of concurrent querying going on that consumes resources?

That's correct, all 80 datanodes do perform indexing. The cluster is performing ~150 searches/s.

Most of this search load is due to previous ES requests:

I have read the 'Tune for Indexing Speed' document, and have addressed nearly-all of the concerns addressed therein.

I'm taking your advice and disabling replication for hot indices. Upon rollover, the ILM policy will instantiate replication along with shrinking. I'll update this thread once my index rolls over in an hour or so.

edit: remove shrinking & ILM policy info seen below.

So I have ~60 shards being transferred around typically: ~50 shards (the past two indices) consolidating on nodes (pre-shrink) and ~10 shards (another two earlier indices) being redistributed (post-shrink).

GET _ilm/policy

    "version" : 20,
    "modified_date" : "2020-08-12T12:59:41.352Z",
    "policy" : {
      "phases" : {
        "warm" : {
          "min_age" : "0ms",
          "actions" : {
            "forcemerge" : {
              "max_num_segments" : 10
            },
            "set_priority" : {
              "priority" : 10
            },
            "shrink" : {
              "number_of_shards" : 5
            }
          }
        },
        "cold" : {
          "min_age" : "30d",
          "actions" : {
            "freeze" : { },
            "set_priority" : {
              "priority" : 1
            }
          }
        },
        "hot" : {
          "min_age" : "0ms",
          "actions" : {
            "rollover" : {
              "max_size" : "1000gb",
              "max_age" : "30d",
              "max_docs" : 2000000000
            },
            "set_priority" : {
              "priority" : 100
            }
          }
        },
        "delete" : {
          "min_age" : "90d",
          "actions" : {
            "delete" : { }
          }
        }
      }
    }

If the shrinking results in overhead, why not change the rollover policy to generate the indices at the desired shard size?

The reason I'm indexing with 25 shards is to spread out the indexing among many nodes in the cluster. This way each node is only responsible for indexing 10k docs/s (given a cluster index rate of 250k docs/s). Were I to reduce the number of shards to, say, 10, Each node would be responsible for indexing 25k docs/s.

The reason I shrink the index to fewer shards (5:1 currently) is to reduce overhead with the cluster state etc... When I've got 90 days of these indices at 25:1, my cluster becomes unreliable in responsiveness. What I've seen is at that point, an ILM rollover won't get acknowledged by the cluster in 30s and I'll get woken up in the middle of the night 5 hours later, having to reindex a 12TB index.

@Christian_Dahlqvist apologies, as I'd previously thought my 150 searches/s was coming from overhead at the "shrink" stage. It's actually coming from system searches against my ES indices. I might update the es-monitoring ILM to delete the monitoring indices after 2-3 days.

Does the hot threads API give any hints on where the bottleneck is? I don't expect the translog settings will have a big impact on CPU usage, committing the translog takes a bit of IO but is otherwise fairly cheap. Also soft_deletes.retention_lease.period only matters if you're overwriting or deleting docs which it doesn't sound like you're doing, and even then only in a transient failure state (either a lost CCR connection or a failed replica).

Does it help to increase your bulk sizes? 5k docs averaging 900B each is only ~5MB per bulk, spread over 25 shards that's ~200kB per shard bulk.

If each node can handle indexing 10k docs per sec and you have 80 data nodes it would seem possible to handle up to 800k docs per sec by increasing the number of shards?

Sounds like the sort of thing that's fixed by https://github.com/elastic/elasticsearch/issues/48701, I suggest upgrading to a newer version.

Maybe set index.routing.allocation.total_shards_per_node: 1 on the hot indices to force them to spread out. Remove that setting after they roll over to avoid over-constraining the allocator since I suspect that prevents shrinking. Sadly you can't do this automatically in ILM today, see https://github.com/elastic/elasticsearch/issues/44070.

Have you adjusted the recovery settings at all? It sounds like recoveries are struggling to keep up.

Maybe use a separate monitoring cluster to isolate any load it might be causing?

2 Likes

Surprisingly, disabling replication on my hot index did not help throughput on my primaries.

GET /_nodes/hot_threads

  101.3% (506.5ms out of 500ms) cpu usage by thread 'elasticsearch[es-datanode-22][[shrink-device-all-000148][0]: Lucene Merge Thread #24]'
    5/10 snapshots sharing following 12 elements
... 

A spot check of the ~7k lines shows all of the hot threads reported as these Lucene merges specifically of shrunken indices. If I'm shrinking a 2TB index to 5:1 shards in the warm phase, what's a reasonable number of segments to merge to (via ILM) per shard? It'd appear I should just forego Force Merges altogether in the warm phase.

I'll revert these settings - they didn't appear to have much effect anyways. You're correct in my use case that I'm not overwriting or deleting docs in place.

This is the next knob I need to tweak. I'll up this to 50k and see what happens.

It sounds like settings index.routing.allocation.balance.index: 5 was the correct thing to do then, as it still allows for consolidation prior to shrinking. However, after shrinking, the shards immediately redistribute to other nodes, as desired.

I have not adjusted them. I figured indices being transferred around was normal. My understanding of the workflow to shrink docs is as follows:

  1. Rollover to a new index
  2. Consolidate the entire index (primaries or replicas) on one node (pre-shrink)
  3. Reindex to the new shard layout
  4. Redistribute the shards amongst the cluster (post-shrink)

I think I should expect my indices to be transferring around during steps 2 & 4.

Updates to implement:

  • remove / revise merging during the rollover & shrink <- I could use advice here
  • increase bulk shard requests by 10x
  • remove translog/soft_delete changes I made
  • increase sharding to 40:1 (to take advantage of all 80 datanodes' indexing capacity)

Any other changes you'd attempt?

Thanks @DavidTurner for updating the metadata functionality. I have seen a few cases where the cluster state gets unwieldy. It's hard to recover a cluster like that.

This is a bit surprising. It might be worthwhile checking that it is not network performance that is limiting throughput. Are you using 10G networking?

How frequently are your hot indices rolling over? You could set them up with 40 primary shards and 1 replica, use larger bulk sizes and adjust the ILM target size to get them to the desired size without having to shrink. I would probably drop the document count threshold as well and just roll over based on size or age. Each index would roll over less frequently and therefore cover a longer time period but the total number of shards in the cluster would match what you currently get after shrinking and therefore not add to the cluster state size.

If you are forcemerging at any point I would also consider stopping this unless you keep your data for a very long time. It can add a lot of disk I/O.

1 Like

This is surprising to me too, and it's also surprising that the hot threads are mostly Lucene merge threads since this suggests that indexing itself is not actually CPU-bound as claimed in the OP. Do you see any log messages mentioning throttling indexing?

Note that this Lucene merge thread isn't doing a force-merge, it's probably just the usual merge that happens after a shrink. I'd be interested to see the whole hot threads output anyway.

Also 2TB is pretty big for your target index size with just 5 shards. Maybe roll over more frequently, or maybe shrinking isn't the right thing to do.

It is normal for shards to move around, but it is surprising that it's not keeping up better. I suspect your system would benefit from a higher value of indices.recovery.max_bytes_per_sec than the safe-but-somewhat-conservative default of 40MB/s.

Sorta, but this obviously allows up to 5 shards of the hot index on each node. A tighter bound might give you better indexing performance.

One additional thing worth trying might be to introduce a number of coordinating only nodes and run indexing through these. This may reduce the indexing traffic between the nodes which could be beneficial if network throughput is an issue.

If you have 40 problems primary shards and send a bulk request that contains data evenly spread across the shards to one of the nodes most of the data will need to be forwarded to the appropriate nodes. A coordinating only node can split the bulk correctly and reduce this traffic between data nodes.

Thanks so much for looking into all of this you two. I'll try to respond inline to your questions and concerns.

I'm using 25G networking.

They currently roll over at 2TB, which occurs in about 90 minutes.

I've done this about 12 hours ago, not much additional throughput noticed yet.

I had always assumed there's a hard limit for indices at 2B documents, but I misread - that's 2B per shard. I might be able to get rid of this shrink nonsense if I can just make larger indices. The issue I'm currently trying to mitigate with shrinking is a large cluster state, an issue your team mitigated in 7.6. I'm going to up the index size to 20TB / 40 primaries = 50GB shards, and then just get rid of the shrink altogether. That should cause indices to roll over at a more-reasonable rate (every ~15 hours). Thoughts about these numbers?

You're right, I'm disabling all of the shrinking and force-merging. I need to keep this data around for 90 days. Also, if I need to, I saw that nodes default to 1 thread performing merging, I can potentially increase that if I need to revisit merging in the future.

I saw this message frequently before implementing bulk indexing.

I'll DM you the full paste in a few minutes here. edit: @DavidTurner it appears you have messages turned off in the Discuss stack (this site).

After realizing that 2B docs is the hard limit per shard, and not per-index, I'm going to make my indices much larger, and just get rid of shrinking altogether. It's a cool mechanism but the amount of copies it has to do causes my cluster noticeable overhead.

I'm going to increase this as well - just for future recovery scenarios.

This setting is documented as a heuristics setting, thus my understanding is a higher number indicates more preference towards shards being spread out.

I have 6 client nodes which I believe operate as coordinating-only nodes - they also route requests for (3) Kibana instances as well. I have my processing engines setup to look for clients, and sniffing is set to false, to disable discovery of the data / master nodes. I was surprised to see the default for the official clients have sniffing turned on, but I understand that's likely a compatibility feature for newer users.

Then it sounds like networking performance most likely is not an issue. :slight_smile:

I think you guys solved it. I want to let this bake a little bit because some shrinking from old indices is still causing some GC on nodes, but preliminary results look way better:

I think replication is struggling to keep up. That may be where ILM still comes into play - upon rolling over an index, perhaps then is when I'll replicate it..

I'm going to let this bake in for a few hours and then I'll follow up once the cluster has stabilized. Thanks for all your help fellas. There were many key recommendations here which were great advice. With Elasticsearch I've found there are a thousand knobs to turn, and at this scale, if you forget one knob, or turn it the wrong way, your whole cluster can destabilize. What a powerful product :slight_smile:

Nope, index.routing.allocation.total_shards_per_node is not one of the settings mentioned at that link. It's definitely a hard-and-fast limit not a heuristic.

I was using index.routing.allocation.balance.index: 5, which is listed as a heuristic. I agree the setting you mentioned doesn't currently work (due to outstanding ILM issues):
index.routing.allocation.total_shards_per_node

I'm still using index.routing.allocation.total_shards_per_node: 1 since I'm no longer shrinking indices, due to a much larger index size. It reliably allocates 1 shard per server.