Elasticsearch 1.5.1 Java process at 100%+ CPU usage

We have an 8-node ELK cluster: 5 data nodes and 3 Logstash nodes.

We have been experiencing extremely high CPU usage, with the Java process consuming 100-350% CPU.

Data nodes are configured as follows (see the heap check sketched just after the list):

4 vCPU
32 GB RAM (16 GB allocated to ES_HEAP_SIZE)
1.4 TB iSCSI volume with 700 GB used
Swap is disabled.
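
A quick way to confirm the heap that is actually in effect on each node (a rough check; assuming _cat/nodes with column selection is available on 1.5):

curl -s 'localhost:9200/_cat/nodes?v&h=name,heap.max,heap.percent,ram.max'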

CPU load average from one of the nodes:
load average: 12.13, 12.21, 10.83

Results from hot threads

curl http://localhost:9200/_nodes/hot_threads

31.9% (159.3ms out of 500ms) cpu usage by thread 'elasticsearch[apples02.atl.ucloud.int][search][T#7]'
4/10 snapshots sharing following 33 elements
org.apache.lucene.codecs.blocktree.SegmentTermsEnum.docs(SegmentTermsEnum.java:999)
org.apache.lucene.index.FilteredTermsEnum.docs(FilteredTermsEnum.java:189)
org.apache.lucene.index.FilteredTermsEnum.docs(FilteredTermsEnum.java:189)
org.elasticsearch.index.fielddata.ordinals.OrdinalsBuilder$3.next(OrdinalsBuilder.java:473)
org.elasticsearch.index.fielddata.plain.PackedArrayIndexFieldData.loadDirect(PackedArrayIndexFieldData.java:109)
org.elasticsearch.index.fielddata.plain.PackedArrayIndexFieldData.loadDirect(PackedArrayIndexFieldData.java:49)
org.elasticsearch.indices.fielddata.cache.IndicesFieldDataCache$IndexFieldCache$1.call(IndicesFieldDataCache.java:180)
org.elasticsearch.indices.fielddata.cache.IndicesFieldDataCache$IndexFieldCache$1.call(IndicesFieldDataCache.java:167)
org.elasticsearch.common.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4742)
org.elasticsearch.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3527)
org.elasticsearch.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2319)
org.elasticsearch.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2282)
org.elasticsearch.common.cache.LocalCache$Segment.get(LocalCache.java:2197)
org.elasticsearch.common.cache.LocalCache.get(LocalCache.java:3937)
org.elasticsearch.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4739)

28.9% (144.7ms out of 500ms) cpu usage by thread 'elasticsearch[apples02.atl.ucloud.int][search][T#6]'
10/10 snapshots sharing following 29 elements
org.elasticsearch.index.fielddata.plain.PackedArrayIndexFieldData.loadDirect(PackedArrayIndexFieldData.java:109)
org.elasticsearch.index.fielddata.plain.PackedArrayIndexFieldData.loadDirect(PackedArrayIndexFieldData.java:49)
org.elasticsearch.indices.fielddata.cache.IndicesFieldDataCache$IndexFieldCache$1.call(IndicesFieldDataCache.java:180)
org.elasticsearch.indices.fielddata.cache.IndicesFieldDataCache$IndexFieldCache$1.call(IndicesFieldDataCache.java:167)
org.elasticsearch.common.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4742)
org.elasticsearch.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3527)
org.elasticsearch.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2319)
org.elasticsearch.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2282)
org.elasticsearch.common.cache.LocalCache$Segment.get(LocalCache.java:2197)
org.elasticsearch.common.cache.LocalCache.get(LocalCache.java:3937)
org.elasticsearch.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4739)
org.elasticsearch.indices.fielddata.cache.IndicesFieldDataCache$IndexFieldCache.load(IndicesFieldDataCache.java:167)
org.elasticsearch.index.fielddata.plain.AbstractIndexFieldData.load(AbstractIndexFieldData.java:74)
org.elasticsearch.search.facet.datehistogram.CountDateHistogramFacetExecutor
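
The traces appear to be dominated by field data being loaded (PackedArrayIndexFieldData via the IndicesFieldDataCache) for date histogram facets. A rough way to see how much field data each node is holding, assuming the _cat/fielddata endpoint on 1.5:

curl -s 'localhost:9200/_cat/fielddata?v'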

Is it GC?
Can you upgrade?

Excuse my ignorance, but what does GC mean? An upgrade is in the plans.

GC = Java garbage collection, something you would see in the logs.
Can you share them, BTW?
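
If it is easier than digging through the logs, heap usage and GC counters are also exposed by the node stats API (a sketch; this should work on 1.x):

curl -s 'localhost:9200/_nodes/stats/jvm?pretty'

Look at jvm.mem.heap_used_percent and the jvm.gc.collectors counters (collection_count, collection_time_in_millis) on the data nodes.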

Something I have seen in the past was in the context of EC2 when using the cloud-aws plugin; it was caused by the leap second as a side effect.

I'm not sure if that's your case; I believe not, since you don't seem to be using EC2 here. But in any case, maybe at least check that all the clocks are aligned.
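
For example, something along these lines from one box (a sketch only; the host names are hypothetical, and it assumes SSH access and that ntpstat is installed):

# Compare wall-clock time and NTP sync status across the nodes
for host in node01 node02 node03; do
  echo "== $host =="
  ssh "$host" 'date -u; ntpstat'
done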

Did you try restarting your cluster, BTW?

Yes, we have restarted the cluster many times. We just added 3 master nodes and dedicated them to that role. Here is a link to the logs located under /var/log/elasticsearch:
https://app.box.com/s/qnosibz2pn4jk7vc42e2iyvdd60szhnm

Link to hot threads
https://app.box.com/s/lsws1vjlr46xg1evod8zakt0srct0elp

Also a view of the thread pools:
curl -s localhost:9200/_cat/thread_poo... Sat Jan 21 20:13:14 2017

apples01 10.100.101.59 3 0 0 0 0 0 12 1000 11344
apples02 10.100.101.60 0 0 0 0 0 0 12 1000 24909
apples03 10.100.101.61 0 0 0 0 0 0 3 1 7385
apples04 10.100.101.113 0 0 0 0 0 0 12 1000 15914
apples05 10.100.101.203 2 0 0 0 0 0 12 881 10762
applesm001 10.100.101.27 0 0 0 0 0 0 0 0 0
applesm002 10.100.101.28 0 0 0 0 0 0 0 0 0
applesm003 10.100.101.29 0 0 0 0 0 0 0 0 0
applogstash01 10.100.101.56 0 0 0 0 0 0 0 0 0
applogstash02 10.100.101.57 0 0 0 0 0 0 0 0 0
applogstash03 10.100.101.58 0 0 0 0 0 0 0 0 0
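
These should be the default _cat/thread_pool columns (bulk, index, and search active/queue/rejected), so the large numbers in the last column look like search rejections on the data nodes. Re-running with ?v prints the header row:

curl -s 'localhost:9200/_cat/thread_pool?v'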

Forgot to mention: this environment is hosted in-house.

What would be the recommended upgrade path from Elasticsearch 1.5.1?

https://www.elastic.co/guide/en/elasticsearch/reference/5.1/setup-upgrade.html will help you.

It turns out that it was Java garbage collection. We had to double the RAM on the data nodes from 32 GB to 64 GB and allocate 31 GB to the Java heap. Thank you.
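
For reference, on 1.x the heap is set through the ES_HEAP_SIZE environment variable; with the RPM/DEB packages that typically lives in /etc/sysconfig/elasticsearch or /etc/default/elasticsearch (a sketch, assuming a package install):

# /etc/sysconfig/elasticsearch (RPM) or /etc/default/elasticsearch (DEB)
# Keep the heap at or just below ~31GB so the JVM can still use compressed object pointers
ES_HEAP_SIZE=31g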
