Optimizing Bulk Insertion in Elasticsearch 8.14 on Azure Kubernetes

Hi Elasticsearch Community,

I'm currently using Elasticsearch 8.14 in an Azure Kubernetes cluster with the following specifications:

  • Cluster Resources: 16GB RAM, 4 vCPUs
  • Data: 400GB total, 2.7 billion documents
  • Workflow: Using Argo workflows
  • Insertion Rate: Approximately 200,000 documents per minute for some files

Current Configuration:

  • Explicit Mappings: Defined
  • refresh_interval: -1
  • replicas: 0
  • Document ID: Generated using a hash key
  • Bulk Insertion API: Implemented in Python
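
Roughly, the index setup and bulk loader look like this (a simplified sketch, not the exact code; the index name, mapping, and connection settings are placeholders):

```python
import hashlib
import json

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("https://my-es-host:9200", request_timeout=120)  # placeholder connection

INDEX = "my-index"  # placeholder name

# Explicit mappings, refresh disabled and replicas set to 0 for the bulk load.
es.indices.create(
    index=INDEX,
    settings={"number_of_replicas": 0, "refresh_interval": "-1"},
    mappings={"properties": {"field_a": {"type": "keyword"}, "field_b": {"type": "double"}}},
)

def to_actions(docs):
    for doc in docs:
        # Document ID is a hash of the document content so re-runs and updates hit the same ID.
        doc_id = hashlib.sha1(json.dumps(doc, sort_keys=True).encode()).hexdigest()
        yield {"_op_type": "index", "_index": INDEX, "_id": doc_id, "_source": doc}

def load_file(docs):
    helpers.bulk(es, to_actions(docs), chunk_size=5_000)
```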

Issues:

  • Connection Timeouts: Encountered when initiating ingestion of multiple files in parallel
  • Slow Insertion Rates: Inconsistent insertion speeds

I have tried optimizing the process by adjusting batch sizes and configuring the above settings, but the performance remains suboptimal. Here are some specific questions and additional steps I've considered:

  1. JVM Heap Size:
    Current setting is default. Would increasing the heap size improve performance, or are there recommended settings for my cluster specifications?
  2. Index Sharding:
    I'm using the default settings. Should I consider increasing the number of primary shards? If so, what would be a good starting point?
  3. Bulk Request Size:
    I've experimented with different batch sizes. What is the optimal batch size for bulk requests given my data volume and cluster resources?
  4. Thread Pool Settings:
    Are there specific thread pool settings for bulk or index operations that could enhance performance?
  5. Throttling:
    Would implementing request throttling help manage the load better?
  6. Disk I/O and Network Bandwidth:
    Could the underlying disk I/O or network bandwidth be bottlenecks? Any recommendations for monitoring and improving these?
  7. Monitoring and Logging:
    What are the best practices for monitoring Elasticsearch logs and metrics to identify and diagnose performance issues?
  8. Kubernetes Configuration:
    Are there Kubernetes-specific settings or configurations that could help optimize Elasticsearch performance?

Any advice, best practices, or additional configuration tips to improve the bulk insertion performance and resolve the connection timeout issues would be greatly appreciated.

Thank you in advance for your assistance!

Best regards,
Joshua R

Is this the specification of each node? How many nodes do you have?

What type of storage are you using?

The default should be 50% of available RAM, so there is no need to change this. A larger heap does not improve performance as long as you are not experiencing heap pressure and long and/or frequent GC.

This would depend on the number of nodes in your cluster. If you only have a single node the default is probably fine for now.

A common recommendation is to aim for a bulk request size of around a couple of MB.
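
If you are using the elasticsearch-py bulk helpers, you can cap the physical size of each request rather than only the number of documents. A rough sketch (the chunk_size and max_chunk_bytes values are just starting points to tune, and the client and actions are placeholders):

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("https://my-es-host:9200")  # placeholder

def index_docs(actions):
    # Send a bulk request every 2,000 actions or ~5 MB, whichever comes first.
    for ok, item in helpers.streaming_bulk(
        es,
        actions,
        chunk_size=2_000,
        max_chunk_bytes=5 * 1024 * 1024,
        raise_on_error=False,
    ):
        if not ok:
            print("Failed action:", item)
```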

I assume this refers to Elasticsearch settings. The default settings are usually fine so I would not recommend changing these.

How have you determined that Elasticsearch is the bottleneck?

I have often seen users troubleshoot indexing performance only to realise that they are not indexing with a sufficient level of concurrency to actually saturate the cluster. How many processes/threads do you have indexing data into Elasticsearch?
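
If everything currently goes through a single process and thread, one way to test this is the parallel_bulk helper, which runs several bulk requests concurrently from one client. A sketch (thread_count is something to tune against what the node can absorb):

```python
from collections import deque

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("https://my-es-host:9200")  # placeholder

def index_concurrently(actions):
    results = helpers.parallel_bulk(
        es,
        actions,          # iterable of bulk actions
        thread_count=4,   # concurrent bulk requests
        chunk_size=2_000,
        queue_size=4,
    )
    # parallel_bulk is a lazy generator; consume it to actually send the requests.
    deque(results, maxlen=0)
```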

Do you perform updates of documents and therefore need to have an externally generated ID? If not, be aware that using an external ID generally reduces indexing performance compared to letting Elasticsearch assign the ID, as each insert then needs to be treated as a potential update.
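
To illustrate the difference, compare the two action shapes below (a sketch; the index name and hashing are placeholders for whatever you use):

```python
import hashlib
import json

INDEX = "my-index"  # placeholder

def auto_id_actions(docs):
    # Faster: no _id supplied, so Elasticsearch generates one and every
    # operation is a pure append.
    for doc in docs:
        yield {"_op_type": "index", "_index": INDEX, "_source": doc}

def external_id_actions(docs):
    # Slower: a supplied _id means each insert has to be treated as a
    # potential update of an existing document, but it is what makes
    # later updates of the same record possible.
    for doc in docs:
        doc_id = hashlib.sha1(json.dumps(doc, sort_keys=True).encode()).hexdigest()
        yield {"_op_type": "index", "_index": INDEX, "_id": doc_id, "_source": doc}
```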

That depends on what the issue is.

Storage performance is one of the most common bottlenecks for indexing as it is I/O intensive. This is why I asked about what type of storage you are using earlier.

Given your description, this is what I would check first. Note that the official guide on tuning for indexing speed recommends using local SSDs for a good reason.

Hi @Christian_Dahlqvist,

Thanks for the quick response.

I have one node with the specified resources, and autoscaling is enabled, so the node count will increase depending on the number of jobs running.

I'm using a 2TB Standard SSD (LRS) disk provisioned with higher IOPS.

Yes, we do update records, which is why the default ID generated by Elasticsearch doesn't work here.

Looking forward to any additional tips or suggestions you might have.

Best regards,
Joshua R

I am not an Azure user, but based on what I have seen in the past I would recommend using Premium SSDs for Elasticsearch. Monitor disk I/O and await (e.g. with iostat) to see whether there are signs of storage saturation or high latencies.

Autoscaling is tricky for stateful data stores like Elasticsearch, so I would recommend against using it. Note that if you use autoscaling to add another master-eligible node and then remove it, you are likely to end up with a cluster that cannot be recovered, so make sure you use the snapshot API to back up the cluster frequently.
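
Once a snapshot repository is registered (for example an Azure blob container), taking regular snapshots from Python is only a couple of lines. A sketch (repository name and client details are placeholders):

```python
from datetime import datetime, timezone

from elasticsearch import Elasticsearch

es = Elasticsearch("https://my-es-host:9200")  # placeholder

# Assumes a snapshot repository named "my-backups" is already registered.
snapshot_name = "snap-" + datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
es.snapshot.create(
    repository="my-backups",
    snapshot=snapshot_name,
    wait_for_completion=False,  # let the snapshot run in the background
)
```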


This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.