I'm curious about settings that will increase my index rate, at the risk of losing data in the event of hardware failure or similar. My data nodes are heavily CPU-constrained, so anything that reduces CPU load should translate directly into increased throughput.
I've got a large cluster of about 80 data nodes processing analytical data at ~250k docs/s on the primary shards (~500k docs/s including replication). As expected, it's taken a significant amount of tuning (at the cluster, node, and template level) to get the cluster processing this many documents simultaneously. However, I'm still losing data, as the incoming stream is probably closer to ~350k docs/s. I'm hoping to gather some insight into the settings I'm tweaking, as there is no realistic possibility of setting up a staging cluster of this size for testing purposes.
What I'm hoping to achieve is basically disabling the translog.

I have been investigating the translog settings and have set durability to async; however, the translog is still fsynced on some sync interval. Would it be sufficient to just set sync_interval to something greater than ~90 minutes (how long an index is written to before rolling over)? Or can I set it to zero to disable it entirely? I'm also curious about the ramifications this may have for memory...
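For reference, my understanding is that both translog settings are dynamic, so they could be pushed to the live indices with something like the following (5400s is just my ~90-minute rollover window; I haven't confirmed whether a minimum or maximum value is enforced for sync_interval):

```
PUT device-*/_settings
{
  "index.translog.durability": "async",
  "index.translog.sync_interval": "5400s"
}
```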
Also, what about soft deletes?
I'm looking to update the soft_deletes.retention_lease.period
to something incredibly small, e.g. "1s".
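Something along these lines, assuming the setting is dynamic on 7.4 (my understanding is that shortening the lease mainly risks replicas falling back to full file-based recovery, rather than operations-based recovery, if they're out of sync for longer than the period):

```
PUT device-*/_settings
{
  "index.soft_deletes.retention_lease.period": "1s"
}
```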
Here are the relevant settings in the index template:
{
  "device_all" : {
    "order" : 10000,
    "version" : 30500,
    "index_patterns" : [
      "device-*"
    ],
    "settings" : {
      "index" : {
        "lifecycle" : {
          "name" : "device_rollover",
          "rollover_alias" : "device_all"
        },
        "codec" : "default",
        "allocation" : {
          "max_retries" : "10"
        },
        "mapping" : {
          "total_fields" : {
            "limit" : "200"
          }
        },
        "refresh_interval" : "30s",
        "number_of_shards" : "25",
        "translog" : {
          "sync_interval" : "600s",
          "durability" : "async"
        },
        "soft_deletes" : {
          "retention_lease" : {
            "period" : "1s"
          }
        },
        "query" : {
          "default_field" : [ ]
        },
        "unassigned" : {
          "node_left" : {
            "delayed_timeout" : "15m"
          }
        },
        "number_of_replicas" : "1"
      }
    },
    ...
Cluster stats:
- ES v7.4.1
- 3 Master nodes
- 6 Client nodes
- 80 Data nodes
- Using 26 nodes to bulk-insert 5k docs at a time