We have a quite strange problem on one of our Elasticsearch servers - every 20-40 seconds ES stops using CPU resources and goes into a "pause" period - see the screenshot below.
About the Environment:
- 64 CPU cores, 256 GB RAM, ES 1.7.5 (upgrading is NOT an option at the moment), 4 instances with a 30 GB heap each running on the server; the rest is cache for the ZFS filesystem on the SSDs where the data lives.
My first thought was that big GCs were running and pausing the ES application, but I turned on debug logging and set the GC log thresholds quite low, and there is no crazy 15-second garbage collection running when the CPU pause times appear.
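For reference, on top of the ES-side GC logging you can also let the JVM itself log every stop-the-world pause. A minimal sketch, assuming the instances run on a Java 7/8 JVM and pick up ES_JAVA_OPTS when started (the log path is just an example):

    # standard HotSpot GC logging flags; use one log file per instance
    export ES_JAVA_OPTS="-XX:+PrintGCDetails -XX:+PrintGCDateStamps \
      -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/log/elasticsearch/gc-node1.log"

If the "Total time for which application threads were stopped" entries in that log stay in the millisecond range while the CPU pauses happen, GC is pretty much ruled out.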
We have 4 hardware-identical servers with completely identical configuration, and I could not explain why this happens on only this one specific server, so I went ahead and reinstalled Debian Jessie on the machine to make sure it is not an OS problem - without any success.
I also checked whether any system process spawns when the phenomenon happens, but again there is nothing running except the 4 ES instances.
Does anyone have a tip on where to start debugging this?
I would expect merging to take a fair bit of CPU, though it is also heavy on disk I/O. However, I'm not sure it would account for such long CPU pauses. You can find stats for segment merging with curl -XGET 'http://localhost:9200/_nodes/stats' (look under the "merges" section).
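For example, restricting the node stats to the indices level keeps the output manageable (a sketch; adjust host and port to your instances):

    curl -XGET 'http://localhost:9200/_nodes/stats/indices?pretty'

The "merges" block shows the number and size of currently running merges per node. It can also be worth hitting http://localhost:9200/_nodes/hot_threads while a pause is happening to see what the busy threads are actually doing.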
If I look at the ongoing merging I see about 10-20 GB of "open" merges and about 15-25 merge threads, which, compared to a daily index size of about 1.5-1.9 TB, does not seem like much.
Is there anything in particular I should watch that could help me debug this?
We have ZFS compression with LZ4 enabled (small to non-existent overhead), but no dedup.
I see an iowait of 1.2-2.8% when I check with the Linux iostat tool.
During the pauses the iowait rises to 3.8-4.9%.
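The aggregate iowait can hide a single saturated device, so it may be worth watching the per-device extended stats while a pause is happening (standard sysstat iostat, with a 5-second interval as an example):

    iostat -x 5

If the %util and await columns for the SSDs spike together with the pauses, the disks are the bottleneck.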
I did some more testing and it turns out it is indeed caused by the merging process; it seems like my SSDs can't keep up with the merging when there is heavy indexing load.
I stopped the indexing for about 30 minutes and let the merging process finish (I saw a huge amount of "merge finished" debug logs in the ES log).
When I started the indexing process again, the CPU pauses did not occur anymore; only once quite a few merges accumulate do the pauses appear again.
It seems like I either need to add more nodes to handle all the indexing load or lower the amount of indexing.
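Before adding hardware I will probably also try raising the store throttle, which I believe still defaults to a fairly low 20mb/s on 1.7 and is tuned for spinning disks rather than SSDs - if merges fall behind because of that limit, ES starts throttling indexing as well. Roughly like this (the 200mb value is just a guess for our SSDs, not a recommendation):

    # 200mb is only an example value
    curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
      "transient": {
        "indices.store.throttle.max_bytes_per_sec": "200mb"
      }
    }'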
I'm not sure what your indexing rates are, but yes, adding more nodes will definitely help with indexing, and indeed it may be the case that you've just overloaded your cluster and need to add more servers.
I am curious, though, as to why it is just this one server in your cluster that is experiencing the performance problems. Is there anything different about this server compared to the others?
Looking at the hardware of the servers, they are 100% identical; however, the one that causes problems is the oldest one, so maybe the SSDs are starting to die or something like that.
I did have a look at the link you posted and increased the translog size as well as the translog interval and the threshold_ops. We do not have real-time requirements since it is a pure logging use case, and I have enough heap/memory left. It seems like it is running a bit smoother now.
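The changes look roughly like this (the index pattern and the values are purely illustrative, not our exact numbers; on 1.x the translog settings can be updated dynamically via the index settings API):

    # index pattern and values are illustrative only
    curl -XPUT 'http://localhost:9200/logstash-*/_settings' -d '{
      "index.translog.flush_threshold_size": "1gb",
      "index.translog.flush_threshold_ops": 100000,
      "index.translog.interval": "10s"
    }'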