High CPU Pause Time

Hi guys,

we have a rather strange problem on one of our Elasticsearch servers: every 20-40 seconds ES stops using CPU resources and goes into a "pause" - see screenshot below.

About the Environment:
- 64 CPU cores, 256 GB RAM, ES 1.7.5 (upgrading is NOT an option at the moment), 4 instances with 30 GB heap each running on the server; the rest is cache for the ZFS filesystem on the SSDs where the data lives.

My first thought was that big GCs were running and pausing the ES application. However, I turned on debug logging and set the GC log thresholds quite low, and there is no crazy 15-second garbage collection running when the CPU pauses appear.
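
For reference, this is roughly how I am watching the collectors directly on one of the instances (<es-java-pid> is just a placeholder for the PID of one of the ES Java processes; it prints heap occupancy and accumulated GC time once per second):

jstat -gcutil <es-java-pid> 1000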

We have 4 hardware-identical servers with completely identical configuration, and I could not explain why this happens on only this one specific server, so I went ahead and reinstalled Debian Jessie on this machine to rule out an OS problem - without any success.

I also checked whether any system process spawns when the phenomenon occurs, but again there is nothing running except the 4 ES instances.

Does anyone have a tip on where to start debugging this?

Thanks a lot

I have also turned on TRACE logging now.

I am far from a specialist in reading these complex logs, but it seems that as soon as the pauses start, a Lucene merge thread gets spawned:

[logstash-2016.06.27][7] elasticsearch[nodename][[logstash-2016.06.27][7]: Lucene Merge Thread #77] TMP: maybe=_3ueo(4.10.4):c66003 _4bxm(4.10.4):c54538 _4cae(4.10.4):c50152 _4cas(4.10.4):c47271 _4ch2(4.10.4):c42140 _4cgl(4.10.4):c36120 _4cbi(4.10.4):c34035 _4c7g(4.10.4):c28552 _4ccg(4.10.4):c27083 _4cep(4.10.4):c24812 score=0.42923177416322084 skew=0.156 nonDelRatio=1.000 tooLarge=false size=614.593 MB
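
To double-check which threads are actually busy during a pause, I am also polling the hot threads API on that node (assuming the default HTTP port 9200):

curl -s 'http://localhost:9200/_nodes/hot_threads?threads=10'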

Is it possible that the merging process triggers the CPU pauses?

Hello,

I would expect merging to use a fair bit of CPU, though it is also heavy on disk I/O. However, I'm not sure it would account for such long CPU pauses. You can find stats for segment merging via curl -XGET 'http://localhost:9200/_nodes/stats' (look under the "merges" section).
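
For example, something along these lines should narrow the output down to the merge counters (assuming the default port; the grep context length is arbitrary):

curl -s 'http://localhost:9200/_nodes/stats?pretty' | grep -A 12 '"merges"'

The "current" and "current_size_in_bytes" fields show how much merging is in flight at that moment.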

What version of ES are you on?

Hi,

thanks for the response.

I am running ES 1.7.5.

If I look into the ongoing merging, I see about 10-20 GB of "open" merges and about 15-25 merge threads, which compared to a daily index size of about 1.5-1.9 TB does not seem like much.

Is there anything in particular I should watch that could help me debug this?

Thanks again!

Can you share the I/O wait stats that you're seeing, especially during these pauses?

Also, do you have ZFS dedup turned on? If so, turn it off.
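
You can check both properties quickly with something like this (replace 'tank' with your actual pool name):

zfs get dedup,compression tank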

We have ZFS compression with LZ4 enabled (small to non-existent overhead), but no dedup.

I see an iowait of 1.2-2.8% when I check with the iostat Linux tool.
During the pauses the iowait rises to 3.8-4.9%.
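
For reference, I am collecting these numbers with extended per-device statistics at a 5-second interval, roughly like this:

iostat -x 5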

I did some more testing and it turns out it is indeed caused by the merging process; it seems like my SSDs can't keep up with the merging when there is heavy indexing load.

I stopped the indexing for about 30 minutes and let the merging process finish (I saw a huge number of "merge finished" debug logs in the ES log).

When I started the indexing process again, the CPU pauses did not come back at first; only once a fair amount of merging accumulates do the pauses appear again.

It seems like I just need to add more nodes to handle the indexing load, or lower the amount of indexing.

I really appreciate your help!

Regards

No problem! Here is a section of the docs with some parameters to try varying for segment merging and throttling: https://www.elastic.co/guide/en/elasticsearch/guide/current/indexing-performance.html#segments-and-merging
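
As a starting point, the store throttle discussed there can be changed dynamically; a minimal sketch, assuming you want to raise the rate for SSDs rather than disable throttling entirely (the 100mb value is only an example, not a recommendation):

curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": {
    "indices.store.throttle.max_bytes_per_sec": "100mb"
  }
}'

Using "transient" means the setting is reset after a full cluster restart; use "persistent" if you want it to stick.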

I'm not sure what your indexing rates are, but yes, adding more nodes will definitely help with indexing, and it may indeed be the case that you've simply overloaded your cluster and need more servers.

I am curious though as to why it is just this one server in your cluster that is experiencing the performance problems. Is there anything different about this server as compared to the others?

Looking at the hardware of the servers, they are 100% identical; however, the one causing problems is the oldest one, so maybe the SSDs are starting to die or something like that.

I did have a look at the link you posted and increased the translog size as well as the translog interval and threshold_ops - we do not have real-time requirements since it is a pure logging use case, and I have enough heap/memory left. It seems to be running a bit smoother now.
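
Concretely, the changes were along these lines, applied per index via the settings API (the values are just what I am trying for our logging use case, not a general recommendation):

curl -XPUT 'http://localhost:9200/logstash-2016.06.27/_settings' -d '{
  "index.translog.flush_threshold_size": "1gb",
  "index.translog.flush_threshold_ops": 50000,
  "index.translog.interval": "10s"
}'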

Thank you for your help