Hi all,
One of my 6 dedicated data nodes is showing very high CPU utilization, and I'm hoping for some help figuring out what is causing it.
Here is the output from "top":
top - 17:51:19 up 91 days, 3:38, 1 user, load average: 25.01, 23.68, 17.83
Tasks: 671 total, 1 running, 670 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 40.7%sy, 0.0%ni, 59.2%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 65858520k total, 64972200k used, 886320k free, 233832k buffers
Swap: 8388600k total, 24064k used, 8364536k free, 26568692k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
32206 elastics 20 0 763g 38g 6.4g S 1304.9 61.8 2187:13 java
You can see that the load average is crazy high, and so is the CPU utilization. The problem does not fix itself (I've seen it run for over two days), and all I can do is cycle the JVM.
This is a 16 core machine.
This is the Java version:
root@bdprodes08:[218]:~> java -version
java version "1.7.0_79"
OpenJDK Runtime Environment (rhel-2.5.5.3.el6_6-x86_64 u79-b14)
OpenJDK 64-Bit Server VM (build 24.79-b02, mixed mode)
I'm running ES 1.6.0.
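If it would help narrow things down, I can also grab per-thread CPU and a JVM thread dump the next time this happens, so the busy OS threads can be matched against thread names. This is just a sketch of what I'd run (32206 is the java PID from the top output above, and jstack would need to run as the elasticsearch user):

top -H -p 32206                        # per-thread CPU for the ES java process
jstack 32206 > /tmp/es-threads.txt     # thread dump to map native thread IDs to names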
Here is the (very snipped) output of the hot_threads API for this node:
::: [elasticsearch-bdprodes08][MWIHO5NQSjChyPm196c5ug][bdprodes08][inet[/10.200.116.248:9300]]{master=false}
Hot threads at 2015-10-12T17:51:45.866Z, interval=500ms, busiestThreads=10, ignoreIdleThreads=true:
100.8% (504.2ms out of 500ms) cpu usage by thread 'elasticsearch[elasticsearch-bdprodes08][bulk][T#4]'
9/10 snapshots sharing following 14 elements
org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1526)
org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1252)
org.elasticsearch.index.engine.InternalEngine.innerCreateNoLock(InternalEngine.java:356)
org.elasticsearch.index.engine.InternalEngine.innerCreate(InternalEngine.java:298)
org.elasticsearch.index.engine.InternalEngine.create(InternalEngine.java:269)
org.elasticsearch.index.shard.IndexShard.create(IndexShard.java:483)
100.8% (504.2ms out of 500ms) cpu usage by thread 'elasticsearch[elasticsearch-bdprodes08][[derbysoft-ihg-20151012][0]: Lucene Merge Thread #174]'
9/10 snapshots sharing following 20 elements
java.io.FileOutputStream.writeBytes(Native Method)
java.io.FileOutputStream.write(FileOutputStream.java:345)
org.apache.lucene.store.FSDirectory$FSIndexOutput$1.write(FSDirectory.java:390)
java.util.zip.CheckedOutputStream.write(CheckedOutputStream.java:73)
java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
java.io.BufferedOutputStream.write(BufferedOutputStream.java:95)
org.apache.lucene.store.OutputStreamIndexOutput.writeByte(OutputStreamIndexOutput.java:45)
100.8% (504.1ms out of 500ms) cpu usage by thread 'elasticsearch[elasticsearch-bdprodes08][[derbysoft-ihg-20151012][0]: Lucene Merge Thread #181]'
2/10 snapshots sharing following 21 elements
java.io.FileOutputStream.writeBytes(Native Method)
java.io.FileOutputStream.write(FileOutputStream.java:345)
org.apache.lucene.store.FSDirectory$FSIndexOutput$1.write(FSDirectory.java:390)
83.7% (418.2ms out of 500ms) cpu usage by thread 'elasticsearch[elasticsearch-bdprodes08][[nuke_metrics][1]: Lucene Merge Thread #66]'
unique snapshot
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:476)
16.9% (84.4ms out of 500ms) cpu usage by thread 'elasticsearch[elasticsearch-bdprodes08][[derbysoft-bestwestern-20151012][0]: Lucene Merge Thread #125]'
unique snapshot
org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:476)
2/10 snapshots sharing following 21 elements
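(For reference, I collected that with the standard hot_threads endpoint, something like the following against the local node, and then snipped most of the stack frames:

curl -s 'localhost:9200/_nodes/elasticsearch-bdprodes08/hot_threads?threads=10'
)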
It looks like a combination of merges and bulk inserts, but I'm not sure why this one node in particular is having issues while the rest are not. Right now I'm having to cycle this node about every two or three days, which for a cluster this size is somewhat of a pain.
Here are my cluster settings for merging:
"index.merge.scheduler.max_thread_count": "1",
"index.merge.policy.type": "tiered",
"index.merge.policy.max_merged_segment": "5gb",
"index.merge.policy.segments_per_tier": "10",
"index.merge.policy.max_merge_at_once": "10",
"index.merge.policy.max_merge_at_once_explicit": "10",
"index.translog.flush_threshold_size": "1gb",
The data directory is made up of eight 2 TB SATA disks in a RAID 0 stripe.
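If disk numbers would help, I can also capture utilization on that array during one of these episodes with something like this (assuming sysstat is installed on the box):

iostat -xm 5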
Can anyone see something misconfigured, or suggest something to try to fix this?
Thank you so much for your time and help!
Chris