I am looking for some advice as to why we are seeing high GC time on our Elasticsearch cluster. On average, we see 5-8% GC time across all the nodes. This is the setup we have:
150 data nodes
1000 primary shards and each has 2 replicas
each data node receives up to 2k indexing calls per second at peak, but the average is usually around 200
avg doc size = 3 kb
refresh_interval = 10 sec
We provide the document ID and an external version, so it is doing a lot of merges
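For reference, each indexing call looks roughly like this (index name, document ID and version number are just placeholders):

PUT my-index/_doc/12345?version=7&version_type=external
{ "some_field": "some value" }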
Is this 2k bulk indexing requests per second? - No, that is the individual document indexing count; we do use bulk indexing. Per node, we see max 2k/sec and avg 300/sec. For the entire cluster, we see max 30k/sec and avg 10k/sec.
Is this a single index or multiple indices? - It is a single index
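For clarity on the bulk question above: the individual calls are batched through the _bulk API, roughly like this (index name, IDs and fields are illustrative):

POST _bulk
{ "index": { "_index": "my-index", "_id": "12345" } }
{ "some_field": "value one" }
{ "index": { "_index": "my-index", "_id": "12346" } }
{ "some_field": "value two" }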
This version is often used with third-party plugins, which naturally can affect heap usage and GC. Do you have any third-party plugins installed?
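If you are not sure exactly what is installed, listing the plugins per node is a quick check:

GET _cat/plugins?v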
What is the size of the gp2 storage per node? Small volumes can have very low IOPS, which can cause performance issues. Given the number of shards in your index, I would expect a lot of small writes to shards even if you use quite large bulk requests, which could add overhead. The fact that you are updating and using your own IDs will also add a lot of reads. What does I/O performance look like if you run e.g. iostat -x on the data nodes?
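For example, a few extended samples at a short interval (the interval and count here are arbitrary):

iostat -x 5 3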
The stats you provided show that you seem to be using a JVM that is not officially supported according to the official support matrix. I am not sure whether this has any impact or requires special configuration considerations.
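You can confirm what each node is actually running with something like:

GET _nodes/jvm?filter_path=nodes.*.jvm.version,nodes.*.jvm.vm_name,nodes.*.jvm.vm_vendor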
Yes, we have a few plugins installed. Is there a way to see what impact they could be having?
30 GB
A few general questions:
Is there a max indexing rate per node/shard per sec?
With our current indexing rate, if we do change some configuration, would it be possible to have a very low refresh interval like 1 second, or is that just too low for our volume? Could adding more data nodes help, or splitting the cluster?
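To be concrete, the change we have in mind is just the dynamic index setting (index name is a placeholder):

PUT my-index/_settings
{ "index": { "refresh_interval": "1s" } }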
TBH the simplest piece of tuning advice we can offer is to use a newer version. 7.10.2 passed EOL a long time ago, and there have been a lot of performance-related improvements in the ~2½ years since it was released, some of which I would expect to reduce the load on the GC.
You could try using something like async profiler to understand where the time is being spent and/or what bits of the system are allocating the most heap objects (which directly contributes to GC pressure). Maybe that will point to problems in a particular plugin.
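For example, a rough invocation on one of the data nodes, assuming async-profiler is unpacked there (the path, duration and PID are placeholders), might be:

./profiler.sh -e alloc -d 30 -f /tmp/es-alloc-flamegraph.html <elasticsearch pid>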
30k docs per sec does seem very low for such a large cluster. It very much depends on the details of the workload, but at least some of the nightly benchmarks achieve several times that ingest rate using just 3 nodes running recent versions.
@DavidTurner @Christian_Dahlqvist Wanted to follow up. We did some testing and analysis. For example, we set up an index with a lot of mapping properties (10-12k) and saw a huge increase in GC, up to 20%. Our heap is 27 GB. Based on a few of these use cases, we have some questions:
Does the number of mapped fields in an index have an impact on GC, or on how the heap is populated?
Is the heap somehow split between writes and reads? We are doing writes on this cluster (just for testing) and see around 50-60% heap used, but GC is still high, so we are wondering if something changed in ES 7.
Could the indexing buffer setting be impacting it somehow? For example, does a 10% indices.memory.index_buffer_size mean we are not flushing documents to disk fast enough, so the heap is getting filled up?
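For context, the two settings we have been looking at are roughly these (the limit value below is only illustrative of our 10-12k field test, and the index name is a placeholder):

indices.memory.index_buffer_size: 10%
(static node setting in elasticsearch.yml; 10% of the heap is the default)

PUT my-index/_settings
{ "index.mapping.total_fields.limit": 12000 }
(dynamic per-index setting; the default limit is 1000)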
We just finished upgrading to 7 a few months back, so we are a bit behind here. As for the async profiler, I will try that next. I was setting up some use cases in our staging environment and had those questions.
@DavidTurner We ran the async profiler for allocations on two clusters. The prod cluster has ~6% GC and the QA cluster has 1% (not bad, but we just wanted to compare). There is no clear outlier, but the topmost caller is the org/apache/lucene/index/DefaultIndexingChain.flush path. Here are some other stats:
Thanks @ktech007, can you share the profiler options you used too? I think we want to see allocation profiles and wall-clock profiles, which are these?
We ran the allocation profile (/opt/async-profiler/profiler.sh -e alloc -d 30) and the wall-clock profile (/opt/async-profiler/profiler.sh -e wall -t -i 5ms) for two clusters that are seeing high GC.
So I have an idea. I have been running some tests.
Let's say an index has 3000 mapped fields and we are indexing documents with ~100 fields each. If we are indexing documents with similar fields, GC is fine. But if I start indexing documents that all have very varied fields, GC starts to go high (there is a sketch of the two patterns below).
Do you know if there is a cache or something we can adjust?
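To make that concrete, here is a sketch of the two patterns (index and field names are made up). Similar fields, where every document reuses the same mapped fields:

POST test-similar/_bulk
{ "index": {} }
{ "title": "doc one", "status": "ok", "count": 1 }
{ "index": {} }
{ "title": "doc two", "status": "ok", "count": 2 }

versus varied fields, where each document introduces field names the others do not use:

POST test-varied/_bulk
{ "index": {} }
{ "title_a": "doc one", "status_a": "ok", "count_a": 1 }
{ "index": {} }
{ "title_b": "doc two", "status_b": "ok", "count_b": 2 }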
OK, thank you. One last question: high GC would basically impact indexing speed, right? We want to reduce our refresh_interval, but one of the concerns was high GC. Do you think it would be OK to reduce it?