New Document Indexing Performance Troubleshooting

So I've been trying to troubleshoot an issue with my Elasticsearch cluster, which is currently used in a production system. Our servers are hosted in AWS and the specs of each node are Standard D16s v3 (16 vCPUs, 64 GiB memory, 1.5 TB of SSD storage), and the cluster has 30 nodes.

We're using NiFi to index new records into ES, and the NiFi PutElasticsearchHTTPRecord processor configuration is:

Until recently, our Elasticsearch cluster was handling about 8 TB of data without much issue. However, in the last two months or so we've taken on more data, bringing the total indexed data to about 18 TB. Since then, the speed at which new records get indexed into ES has dropped from about 4 GB/min to about 20 MB/min, roughly a 200-fold slowdown.

I've changed a few configuration settings to try to free up CPU and memory for indexing the new records. Here are the new settings:

PUT /<all of my data indices>/_settings
{
  "index.blocks.read_only_allow_delete": null,
  "index.translog.sync_interval": "60s",
  "index.refresh_interval": "60s"
}

PUT _cluster/settings
{
  "persistent": {
    "cluster.max_shards_per_node": 1000,
    "cluster.routing.allocation.total_shards_per_node": null,
    "cluster.routing.allocation.enable": "all",
    "cluster.routing.rebalance.enable": "none",
    "cluster.routing.allocation.allow_rebalance": "indices_all_active",
    "cluster.routing.allocation.cluster_concurrent_rebalance": 2,
    "cluster.routing.allocation.node_concurrent_recoveries": 2,
    "cluster.routing.allocation.balance.threshold": 1.0,
    "cluster.routing.allocation.disk.watermark.low": "70%",
    "cluster.routing.allocation.disk.watermark.high": "95%",
    "cluster.routing.allocation.disk.watermark.flood_stage": "98%"
  }
}

My overview monitoring looks mostly normal, aside from the indexing speed being incredibly low with occasional spikes up to the old speed:

Finally, it looks like a few nodes in my cluster have high heap usage, but I'm not seeing much in the way of expensive searches or aggregations being run.

I'm honestly at a bit of a loss for what could be causing such degraded performance, and the random spikes of "normal" performance are even more confusing. Any help or ideas would be greatly appreciated.

Which version of Elasticsearch are you using?

What type of storage are you using? Is it gp3 EBS?

What bulk size is NiFi configured to use?

How many indices and shards are you actively indexing into?

What is the average size of the shards you are indexing into?
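One possible way to gather the shard counts and average sizes is to request `GET _cat/shards?format=json&bytes=b&h=index,shard,prirep,state,store` and summarise the rows per index. The sketch below is only an illustration: the function name and the sample rows are made up, and it assumes the response has already been parsed into a list of dictionaries.

```python
from collections import defaultdict


def summarize_primary_shards(rows):
    """Return {index: (primary_shard_count, average_bytes)} for started primaries.

    `rows` is assumed to be the parsed JSON from
    GET _cat/shards?format=json&bytes=b&h=index,shard,prirep,state,store
    """
    sizes = defaultdict(list)
    for row in rows:
        # Skip replicas and shards that are not fully started.
        if row.get("prirep") != "p" or row.get("state") != "STARTED":
            continue
        store = row.get("store")
        if store is None:  # unassigned shards report no store size
            continue
        sizes[row["index"]].append(int(store))
    return {index: (len(v), sum(v) / len(v)) for index, v in sizes.items()}


# Hypothetical rows, not taken from the cluster discussed in this thread.
sample_rows = [
    {"index": "logs-a", "shard": "0", "prirep": "p", "state": "STARTED", "store": "1000"},
    {"index": "logs-a", "shard": "1", "prirep": "p", "state": "STARTED", "store": "3000"},
    {"index": "logs-a", "shard": "0", "prirep": "r", "state": "STARTED", "store": "1000"},
    {"index": "logs-b", "shard": "0", "prirep": "p", "state": "UNASSIGNED", "store": None},
]
summary = summarize_primary_shards(sample_rows)
print(summary)
```

Restricting the summary to the indices that NiFi is currently writing to would answer both questions at once.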

Are you assigning your own document IDs or letting Elasticsearch generate IDs for you?

Can you share the full output of the cluster stats API?
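If it helps, here is a rough sketch of pulling the cluster stats (`GET _cluster/stats`) and extracting a few headline numbers before pasting the full output. The base URL, the lack of authentication, the helper names, and the sample values are all assumptions for illustration; adjust them to your environment.

```python
import json
import urllib.request


def fetch_cluster_stats(base_url):
    """Fetch GET _cluster/stats; assumes an unauthenticated HTTP endpoint."""
    with urllib.request.urlopen(f"{base_url}/_cluster/stats") as resp:
        return json.load(resp)


def headline_numbers(stats):
    """Pick out the fields most relevant to an indexing slowdown."""
    indices = stats["indices"]
    nodes = stats["nodes"]
    heap = nodes["jvm"]["mem"]
    node_count = nodes["count"]["total"]
    return {
        "versions": nodes["versions"],
        "nodes": node_count,
        "indices": indices["count"],
        "shards_total": indices["shards"]["total"],
        "shards_per_node": indices["shards"]["total"] / node_count,
        "store_bytes": indices["store"]["size_in_bytes"],
        "heap_used_pct": 100 * heap["heap_used_in_bytes"] / heap["heap_max_in_bytes"],
    }


# Hypothetical, heavily trimmed response used only to exercise the helper.
sample_stats = {
    "indices": {
        "count": 400,
        "shards": {"total": 6000},
        "store": {"size_in_bytes": 18_000_000_000_000},
    },
    "nodes": {
        "count": {"total": 30},
        "versions": ["7.10.2"],
        "jvm": {"mem": {"heap_used_in_bytes": 50, "heap_max_in_bytes": 100}},
    },
}
result = headline_numbers(sample_stats)
print(result)

# Against a real cluster (URL is an assumption):
# result = headline_numbers(fetch_cluster_stats("http://localhost:9200"))
```

The full, unedited output is still the most useful thing to share, since it also covers segment, mapping, and JVM details that this summary leaves out.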
