I have tried optimizing the process by adjusting batch sizes and configuring the above settings, but the performance remains suboptimal. Here are some specific questions and additional steps I've considered:
JVM Heap Size:
Current setting is default. Would increasing the heap size improve performance, or are there recommended settings for my cluster specifications?
Index Sharding:
I'm using the default settings. Should I consider increasing the number of primary shards? If so, what would be a good starting point?
Bulk Request Size:
I've experimented with different batch sizes. What is the optimal batch size for bulk requests given my data volume and cluster resources?
Thread Pool Settings:
Are there specific thread pool settings for bulk or index operations that could enhance performance?
Throttling:
Would implementing request throttling help manage the load better?
Disk I/O and Network Bandwidth:
Could the underlying disk I/O or network bandwidth be bottlenecks? Any recommendations for monitoring and improving these?
Monitoring and Logging:
What are the best practices for monitoring Elasticsearch logs and metrics to identify and diagnose performance issues?
Kubernetes Configuration:
Are there Kubernetes-specific settings or configurations that could help optimize Elasticsearch performance?
Any advice, best practices, or additional configuration tips to improve the bulk insertion performance and resolve the connection timeout issues would be greatly appreciated.
Is this the specification of each node? How many nodes do you have?
What type of storage are you using?
The default should be 50% of available RAM, so there is no need to change this. A larger heap does not improve performance as long as you are not experiencing heap pressure and long and/or frequent GC.
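To check whether you are actually seeing heap pressure or long/frequent GC, here is a minimal sketch using the Python elasticsearch client (the endpoint is a placeholder, and a connected client is assumed):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# Pull per-node JVM stats and look at heap usage and old-generation GC activity.
stats = es.nodes.stats(metric="jvm")
for node in stats["nodes"].values():
    heap_pct = node["jvm"]["mem"]["heap_used_percent"]
    old_gc = node["jvm"]["gc"]["collectors"]["old"]
    print(
        f"{node['name']}: heap {heap_pct}%, "
        f"old GC count={old_gc['collection_count']}, "
        f"time={old_gc['collection_time_in_millis']}ms"
    )
```

If heap usage stays well below ~75% and old-generation GC time is low, increasing the heap is unlikely to help.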
This would depend on the number of nodes in your cluster. If you only have a single node, the default is probably fine for now.
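If you do later decide to set the shard count explicitly, note that it is an index-level setting fixed at creation time. A minimal sketch with the 8.x Python client (index name and values are hypothetical examples, not recommendations):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# Example only: on a small cluster the defaults are usually fine.
es.indices.create(
    index="my-bulk-index",      # hypothetical index name
    settings={
        "number_of_shards": 3,   # e.g. roughly one primary per data node
        "number_of_replicas": 1,
    },
)
```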
A common recommendation is to aim for a bulk request size of around a couple of MB.
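As a rough sketch of how that size cap could be applied with the Python client's bulk helpers (the index name and document generator are hypothetical):

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

def generate_actions():
    # Hypothetical document source; replace with your own data.
    for i in range(100_000):
        yield {"_index": "my-bulk-index", "_source": {"field": f"value-{i}"}}

# streaming_bulk splits the stream into requests of at most chunk_size documents
# or max_chunk_bytes bytes, whichever limit is hit first.
for ok, item in helpers.streaming_bulk(
    es,
    generate_actions(),
    chunk_size=1000,                   # documents per request
    max_chunk_bytes=2 * 1024 * 1024,   # ~2 MB per request
    raise_on_error=False,
):
    if not ok:
        print("Failed action:", item)
```

If you see rejections or timeouts, lowering chunk_size/max_chunk_bytes is usually a better first step than raising them.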
I assume this refers to the Elasticsearch thread pool settings. The defaults are usually fine, so I would not recommend changing them.
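Rather than tuning the pools, it is usually more informative to check whether the write thread pool is rejecting requests. A minimal sketch with the Python client (endpoint is a placeholder):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# Non-zero 'rejected' counts mean the nodes cannot keep up with the incoming
# bulk/index load, which points at a bottleneck elsewhere rather than at the
# thread pool configuration itself.
pools = es.cat.thread_pool(
    thread_pool_patterns="write",
    h="node_name,name,active,queue,rejected",
    format="json",
)
for pool in pools:
    print(pool)
```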
How have you determined that Elasticsearch is the bottleneck?
I have often seen users troubleshoot indexing performance only to realise that they are not indexing with a sufficient level of concurrency to actually saturate the cluster. How many processes/threads do you have indexing data into Elasticsearch?
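If a single indexing process turns out to be the limiting factor, one way to add client-side concurrency is the Python client's parallel_bulk helper. A minimal sketch (thread count, index name and documents are hypothetical):

```python
import collections
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

def generate_actions():
    # Hypothetical document source; replace with your own data.
    for i in range(100_000):
        yield {"_index": "my-bulk-index", "_source": {"field": f"value-{i}"}}

# parallel_bulk sends bulk requests from multiple threads, which is often
# needed to saturate a multi-node cluster from a single client process.
results = helpers.parallel_bulk(
    es,
    generate_actions(),
    thread_count=4,   # example value; tune against CPU usage and rejections
    chunk_size=1000,
)
# The helper is lazy, so the generator must be consumed for indexing to happen.
collections.deque(results, maxlen=0)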
Do you perform updates of documents and therefore need an externally generated ID? If not, be aware that using an external ID generally reduces indexing performance compared to letting Elasticsearch assign the ID, as each insert then needs to be treated as a potential update.
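With the Python bulk helpers the difference is simply whether the action carries an `_id`; a hypothetical illustration:

```python
# Elasticsearch assigns the document ID: a plain append, generally fastest.
auto_id_action = {
    "_index": "my-bulk-index",
    "_source": {"user": "alice", "amount": 42},
}

# Externally supplied ID: each insert must be treated as a potential update,
# since the ID may already exist, which adds extra work per document.
external_id_action = {
    "_index": "my-bulk-index",
    "_id": "order-12345",
    "_source": {"user": "alice", "amount": 42},
}
```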
That depends on what the issue is.
Storage performance is one of the most common bottlenecks for indexing, as indexing is I/O intensive. This is why I asked earlier what type of storage you are using.
I am not an Azure user, but based on what I have seen in the past I would recommend using premium SSDs for Elasticsearch. Monitor disk I/O and the await metric (e.g. via iostat) to see whether there are signs of storage saturation or high latencies.
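Alongside host-level tools such as iostat, here is a minimal sketch for sampling the node filesystem I/O counters from Elasticsearch itself (Python client assumed; the io_stats section is only reported on Linux):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# Node-level filesystem and I/O stats; these are cumulative counters, so
# sample them periodically and look at the deltas.
stats = es.nodes.stats(metric="fs")
for node in stats["nodes"].values():
    io_total = node["fs"].get("io_stats", {}).get("total", {})
    print(
        node["name"],
        "read_ops:", io_total.get("read_operations"),
        "write_ops:", io_total.get("write_operations"),
        "read_kb:", io_total.get("read_kilobytes"),
        "write_kb:", io_total.get("write_kilobytes"),
    )
```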
Autoscaling is tricky for stateful data stores like Elasticsearch, so I would recommend against using it. Note that if you use autoscaling to add another master-eligible node and then remove it, you are likely to end up with a cluster that cannot be recovered, so make sure you use the snapshot API to back up the cluster frequently.
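As a minimal sketch of taking such a backup with the Python client, assuming a snapshot repository (here a hypothetical one named `nightly_backups`) has already been registered:

```python
from datetime import datetime, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# Assumes a repository called "nightly_backups" is already registered,
# e.g. an Azure/S3/GCS or shared filesystem repository.
snapshot_name = "snap-" + datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")

es.snapshot.create(
    repository="nightly_backups",
    snapshot=snapshot_name,
    wait_for_completion=False,  # let the snapshot run in the background
)

# Check on it later.
print(es.snapshot.get(repository="nightly_backups", snapshot=snapshot_name))
```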