Elasticsearch is spawning multiple processes and threads, resulting in 100% CPU utilization

Hi Elastic users,

We are using ES 2.4.5 along with Liferay DXP, and we have set it up with the default configuration.

We have 1 node, 1 shard and 0 replicas on ES 2.4.5, running on EC2 (4 CPUs, 16 GB RAM).

On a daily basis we get errors that leave our application hung, with CPU utilization around 100%.

When we checked the ES logs, we found the following errors:

"
[2018-06-30 11:24:36,996][WARN ][cluster.routing.allocation.decider] [EDEL_PROD] high disk watermark [90%] exceeded on [aFU5IsoLSnqE7IRip8vS2Q][EDEL_PROD][/var/lib/elasticsearch/EDEL_PROD_ES/nodes/0] free: 648.1mb[8.1%], shards will be relocated away from this node
[2018-06-30 11:25:07,036][WARN ][cluster.routing.allocation.decider] [EDEL_PROD] high disk watermark [90%] exceeded on [aFU5IsoLSnqE7IRip8vS2Q][EDEL_PROD][/var/lib/elasticsearch/EDEL_PROD_ES/nodes/0] free: 647.8mb[8.1%], shards will be relocated away from this node
[2018-06-30 11:25:07,036][INFO ][cluster.routing.allocation.decider] [EDEL_PROD] rerouting shards: [high disk watermark exceeded on one or more nodes]
[2018-06-30 11:25:23,446][INFO ][node ] [EDEL_PROD] stopping ...
[2018-06-30 11:25:23,548][DEBUG][action.search ] [EDEL_PROD] [liferay-20116][1], node[aFU5IsoLSnqE7IRip8vS2Q], [P], v[2], s[STARTED], a[id=7KUmuNqaQ9uzdPajCJl-NQ]: Failed to execute [org.elasticsearch.action.search.SearchRequest@faa0617] lastShard [true]
TransportException[transport stopped, action: indices:data/read/search[phase/query]]
at org.elasticsearch.transport.TransportService$2.run(TransportService.java:206)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
[2018-06-30 11:25:23,558][DEBUG][action.search ] [EDEL_PROD] [liferay-20116][3], node[aFU5IsoLSnqE7IRip8vS2Q], [P], v[2], s[STARTED], a[id=TM6XUZd7RKu1p4a0gUaKRA]: Failed to execute [org.elasticsearch.action.search.SearchRequest@faa0617] lastShard [true]
TransportException[transport stopped, action: indices:data/read/search[phase/query]]
at org.elasticsearch.transport.TransportService$2.run(TransportService.java:206)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
[2018-06-30 11:25:23,558][DEBUG][action.search ] [EDEL_PROD] [liferay-20116][0], node[aFU5IsoLSnqE7IRip8vS2Q], [P], v[2], s[STARTED], a[id=hYw_B9ELSnqq635LB-cniw]: Failed to execute [org.elasticsearch.action.search.SearchRequest@faa0617] lastShard [true]
TransportException[transport stopped, action: indices:data/read/search[phase/query]]
at org.elasticsearch.transport.TransportService$2.run(TransportService.java:206)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
[2018-06-30 11:25:23,566][DEBUG][action.search ] [EDEL_PROD] [liferay-20116][2], node[aFU5IsoLSnqE7IRip8vS2Q], [P], v[2], s[STARTED], a[id=a1ce9XYQS7eF7DW0TrnB8A]: Failed to execute [org.elasticsearch.action.search.SearchRequest@2b88c1f6] lastShard [true]
TransportException[transport stopped, action: indices:data/read/search[phase/query]]
at org.elasticsearch.transport.TransportService$2.run(TransportService.java:206)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
[2018-06-30 11:25:23,566][DEBUG][action.search ] [EDEL_PROD] [liferay-20116][1], node[aFU5IsoLSnqE7IRip8vS2Q], [P], v[2], s[STARTED], a[id=7KUmuNqaQ9uzdPajCJl-NQ]: Failed to execute [org.elasticsearch.action.search.SearchRequest@2b88c1f6] lastShard [true]
TransportException[transport stopped, action: indices:data/read/search[phase/query]]
at org.elasticsearch.transport.TransportService$2.run(TransportService.java:206)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
[2018-06-30 11:25:23,566][DEBUG][action.search ] [EDEL_PROD] [liferay-20116][2], node[aFU5IsoLSnqE7IRip8vS2Q], [P], v[2], s[STARTED], a[id=a1ce9XYQS7eF7DW0TrnB8A]: Failed to execute [org.elasticsearch.action.search.SearchRequest@faa0617] lastShard [true]
TransportException[transport stopped, action: indices:data/read/search[phase/query]]
at org.elasticsearch.transport.TransportService$2.run(TransportService.java:206)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
"

Any hint will be appreciated.

Thanks in advance.

The reason for these error messages is that you are running low on disk space. What is the output of the cat indices API? How much storage do you have available for the node?
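For reference, something along these lines should show the per-index sizes and the free disk on the node (a minimal sketch; replace localhost:9200 with your node's HTTP address, and the path with your actual data path):

curl -XGET 'http://localhost:9200/_cat/indices?v'
curl -XGET 'http://localhost:9200/_cat/allocation?v'
df -h /var/lib/elasticsearch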

Hi Christian,

Thanks for your quick reply. We have increased the available disk space, and the error "[2018-06-30 11:24:36,996][WARN ][cluster.routing.allocation.decider] [EDEL_PROD] high disk watermark [90%] exceeded on [aFU5IsoLSnqE7IRip8vS2Q][EDEL_PROD]" has gone away.

However, we are still facing the issue where our CPU utilization goes to 100% and makes the site unresponsive.

Please run the hot threads API a couple of times when the CPU is at 100% and post the results here. Is there anything about long or frequent GC in the logs?
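Something like the following should work (a sketch; replace localhost:9200 with the node's HTTP address, and adjust the log path, which is assumed here to be the default location with the cluster name as the file name):

curl -XGET 'http://localhost:9200/_nodes/hot_threads'
grep -i gc /var/log/elasticsearch/EDEL_PROD_ES.log | tail -n 20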

We tried to run the suggested API but got the following error (attached as a screenshot).

Please suggest a workaround to get this sorted.

I cannot read that. Please do not post images of text. Instead, copy and paste the text and make sure you format it properly using the tools in the UI.

As discussed on the call, please find the results of the curl command with different combinations of IP and port for getting the Elasticsearch hot threads.

[centos@ip-10-0-2-109 ~]$ curl -XGET 'localhost:9300/_node/hot_threads?human=true'
curl: (7) couldn't connect to host

[centos@ip-10-0-2-109 ~]$ curl -XGET '10.0.2.109:9300/_node/hot_threads?human=true'
This is not a HTTP port

[centos@ip-10-0-2-109 ~]$ curl -XGET '10.0.2.109:9200/_node/hot_threads?human=true'
{"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"No feature for name

[hot_threads]"}],"type":"illegal_argument_exception","reason":"No feature for name [hot_threads]"},"status":400}

[centos@ip-10-0-2-109 ~]$ curl -XGET '10.0.2.109:80/_node/hot_threads?human=true'

301 Moved Permanently
nginx/1.10.2

What call?

If you look in the docs I linked to, the URL is _nodes/hot_threads, not _node/hot_threads.

I am getting an error while executing the above command,

[centos@ip-10-0-2-109 ~]$ curl -XGET '10.0.2.109:9300/_nodes/hot_threads?human=true'
This is not a HTTP port

Use port 9200, not 9300.
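For example (the -s flag is optional and just suppresses curl's progress meter):

curl -s -XGET 'http://10.0.2.109:9200/_nodes/hot_threads?human=true'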

Please find the response of the curl command:

New Hot threads
::: {EDEL_PROD}{zdCNenBjT2GzgAqzcAlOQw}{10.0.2.109}{10.0.2.109:9300}
Hot threads at 2018-07-16T13:31:01.540Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

0.0% (88.4micros out of 500ms) cpu usage by thread 'elasticsearch[EDEL_PROD][transport_client_timer][T#1]{Hashed wheel timer #1}'
 10/10 snapshots sharing following 5 elements
   java.lang.Thread.sleep(Native Method)
   org.jboss.netty.util.HashedWheelTimer$Worker.waitForNextTick(HashedWheelTimer.java:445)
   org.jboss.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:364)
   org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
   java.lang.Thread.run(Thread.java:748)

New Hot threads
::: {EDEL_PROD}{zdCNenBjT2GzgAqzcAlOQw}{10.0.2.109}{10.0.2.109:9300}
Hot threads at 2018-07-16T13:32:01.161Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

0.0% (86.5micros out of 500ms) cpu usage by thread 'elasticsearch[EDEL_PROD][transport_client_timer][T#1]{Hashed wheel timer #1}'
 10/10 snapshots sharing following 5 elements
   java.lang.Thread.sleep(Native Method)
   org.jboss.netty.util.HashedWheelTimer$Worker.waitForNextTick(HashedWheelTimer.java:445)
   org.jboss.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:364)
   org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
   java.lang.Thread.run(Thread.java:748)

New Hot threads
::: {EDEL_PROD}{zdCNenBjT2GzgAqzcAlOQw}{10.0.2.109}{10.0.2.109:9300}
Hot threads at 2018-07-16T13:33:01.777Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

0.0% (77.4micros out of 500ms) cpu usage by thread 'elasticsearch[EDEL_PROD][transport_client_timer][T#1]{Hashed wheel timer #1}'
 10/10 snapshots sharing following 5 elements
   java.lang.Thread.sleep(Native Method)
   org.jboss.netty.util.HashedWheelTimer$Worker.waitForNextTick(HashedWheelTimer.java:445)
   org.jboss.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:364)
   org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
   java.lang.Thread.run(Thread.java:748)

::: {EDEL_PROD}{zdCNenBjT2GzgAqzcAlOQw}{10.0.2.109}{10.0.2.109:9300}
Hot threads at 2018-07-16T13:34:01.399Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

0.0% (107.3micros out of 500ms) cpu usage by thread 'elasticsearch[EDEL_PROD][transport_client_timer][T#1]{Hashed wheel timer #1}'
 10/10 snapshots sharing following 5 elements
   java.lang.Thread.sleep(Native Method)
   org.jboss.netty.util.HashedWheelTimer$Worker.waitForNextTick(HashedWheelTimer.java:445)
   org.jboss.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:364)
   org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
   java.lang.Thread.run(Thread.java:748)

::: {EDEL_PROD}{zdCNenBjT2GzgAqzcAlOQw}{10.0.2.109}{10.0.2.109:9300}
Hot threads at 2018-07-16T13:35:01.024Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

0.0% (75.9micros out of 500ms) cpu usage by thread 'elasticsearch[EDEL_PROD][transport_client_timer][T#1]{Hashed wheel timer #1}'
 10/10 snapshots sharing following 5 elements
   java.lang.Thread.sleep(Native Method)
   org.jboss.netty.util.HashedWheelTimer$Worker.waitForNextTick(HashedWheelTimer.java:445)
   org.jboss.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:364)
   org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
   java.lang.Thread.run(Thread.java:748)

::: {EDEL_PROD}{zdCNenBjT2GzgAqzcAlOQw}{10.0.2.109}{10.0.2.109:9300}
Hot threads at 2018-07-16T13:36:01.648Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

0.0% (87.8micros out of 500ms) cpu usage by thread 'elasticsearch[EDEL_PROD][transport_client_timer][T#1]{Hashed wheel timer #1}'
 10/10 snapshots sharing following 5 elements
   java.lang.Thread.sleep(Native Method)
   org.jboss.netty.util.HashedWheelTimer$Worker.waitForNextTick(HashedWheelTimer.java:445)
   org.jboss.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:364)
   org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
   java.lang.Thread.run(Thread.java:748)

Based on these dumps it doesn't look like the node is very busy at all. You need to capture the hot threads dumps when the node is using 100% CPU.
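If the spike is hard to catch by hand, a simple loop along these lines (a rough sketch; adjust the host, interval and output file as needed) can keep collecting dumps so you have one from when the CPU was pegged:

while true; do
  date >> hot_threads.log
  curl -s -XGET 'http://10.0.2.109:9200/_nodes/hot_threads?threads=3' >> hot_threads.log
  sleep 60
done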

Please find the response below when the CPU spiked to 100% today,

::: {EDEL_PROD}{zdCNenBjT2GzgAqzcAlOQw}{10.0.2.109}{10.0.2.109:9300}
Hot threads at 2018-07-17T04:51:41.204Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

0.0% (63.5micros out of 500ms) cpu usage by thread 'elasticsearch[EDEL_PROD][transport_client_timer][T#1]{Hashed wheel timer #1}'
 10/10 snapshots sharing following 5 elements
   java.lang.Thread.sleep(Native Method)
   org.jboss.netty.util.HashedWheelTimer$Worker.waitForNextTick(HashedWheelTimer.java:445)
   org.jboss.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:364)
   org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
   java.lang.Thread.run(Thread.java:748)

::: {EDEL_PROD}{zdCNenBjT2GzgAqzcAlOQw}{10.0.2.109}{10.0.2.109:9300}
Hot threads at 2018-07-17T05:07:47.064Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

100.1% (500.5ms out of 500ms) cpu usage by thread 'elasticsearch[EDEL_PROD][search][T#10]'
10/10 snapshots sharing following 34 elements
org.apache.lucene.util.automaton.Operations.determinize(Operations.java:777)
org.apache.lucene.util.automaton.Operations.getCommonSuffixBytesRef(Operations.java:1165)
org.apache.lucene.util.automaton.CompiledAutomaton.<init>(CompiledAutomaton.java:238)
org.apache.lucene.search.AutomatonQuery.<init>(AutomatonQuery.java:103)
org.apache.lucene.search.AutomatonQuery.<init>(AutomatonQuery.java:81)
org.apache.lucene.search.AutomatonQuery.<init>(AutomatonQuery.java:65)
org.apache.lucene.search.WildcardQuery.<init>(WildcardQuery.java:57)
org.elasticsearch.index.query.WildcardQueryParser.parse(WildcardQueryParser.java:104)
org.elasticsearch.index.query.QueryParseContext.parseInnerQuery(QueryParseContext.java:250)
org.elasticsearch.index.query.BoolQueryParser.parse(BoolQueryParser.java:82)
org.elasticsearch.index.query.QueryParseContext.parseInnerQuery(QueryParseContext.java:250)
org.elasticsearch.index.query.BoolQueryParser.parse(BoolQueryParser.java:120)
org.elasticsearch.index.query.QueryParseContext.parseInnerQuery(QueryParseContext.java:250)
org.elasticsearch.index.query.BoolQueryParser.parse(BoolQueryParser.java:76)
org.elasticsearch.index.query.QueryParseContext.parseInnerQuery(QueryParseContext.java:250)
org.elasticsearch.index.query.BoolQueryParser.parse(BoolQueryParser.java:76)
org.elasticsearch.index.query.QueryParseContext.parseInnerQuery(QueryParseContext.java:250)
org.elasticsearch.index.query.IndexQueryParserService.innerParse(IndexQueryParserService.java:324)
org.elasticsearch.index.query.IndexQueryParserService.parse(IndexQueryParserService.java:224)
org.elasticsearch.index.query.IndexQueryParserService.parse(IndexQueryParserService.java:219)
org.elasticsearch.search.query.QueryParseElement.parse(QueryParseElement.java:33)
org.elasticsearch.search.SearchService.parseSource(SearchService.java:856)
org.elasticsearch.search.SearchService.createContext(SearchService.java:667)
org.elasticsearch.search.SearchService.createAndPutContext(SearchService.java:633)
org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:377)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:368)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:365)
org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:77)
org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:378)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
java.lang.Thread.run(Thread.java:748)

100.0% (499.8ms out of 500ms) cpu usage by thread 'elasticsearch[EDEL_PROD][search][T#12]'
2/10 snapshots sharing following 37 elements
java.util.TreeMap.getEntry(TreeMap.java:359)
java.util.TreeMap.get(TreeMap.java:278)
org.apache.lucene.util.automaton.SortedIntSet.incr(SortedIntSet.java:53)
org.apache.lucene.util.automaton.Operations.determinize(Operations.java:777)
org.apache.lucene.util.automaton.Operations.getCommonSuffixBytesRef(Operations.java:1165)
org.apache.lucene.util.automaton.CompiledAutomaton.<init>(CompiledAutomaton.java:238)
org.apache.lucene.search.AutomatonQuery.<init>(AutomatonQuery.java:103)
org.apache.lucene.search.AutomatonQuery.<init>(AutomatonQuery.java:81)
org.apache.lucene.search.AutomatonQuery.<init>(AutomatonQuery.java:65)
org.apache.lucene.search.WildcardQuery.<init>(WildcardQuery.java:57)
org.elasticsearch.index.query.WildcardQueryParser.parse(WildcardQueryParser.java:104)
org.elasticsearch.index.query.QueryParseContext.parseInnerQuery(QueryParseContext.java:250)
org.elasticsearch.index.query.BoolQueryParser.parse(BoolQueryParser.java:82)
org.elasticsearch.index.query.QueryParseContext.parseInnerQuery(QueryParseContext.java:250)
org.elasticsearch.index.query.BoolQueryParser.parse(BoolQueryParser.java:120)
org.elasticsearch.index.query.QueryParseContext.parseInnerQuery(QueryParseContext.java:250)
org.elasticsearch.index.query.BoolQueryParser.parse(BoolQueryParser.java:76)
org.elasticsearch.index.query.QueryParseContext.parseInnerQuery(QueryParseContext.java:250)
org.elasticsearch.index.query.BoolQueryParser.parse(BoolQueryParser.java:76)
org.elasticsearch.index.query.QueryParseContext.parseInnerQuery(QueryParseContext.java:250)
org.elasticsearch.index.query.IndexQueryParserService.innerParse(IndexQueryParserService.java:324)
org.elasticsearch.index.query.IndexQueryParserService.parse(IndexQueryParserService.java:224)
org.elasticsearch.index.query.IndexQueryParserService.parse(IndexQueryParserService.java:219)

It seems to be running some very heavy or expensive searches that are taking up all CPU. What type of queries are you using? What is the output of the cat indices API?
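The stack traces appear to show wildcard queries being compiled, which can be very CPU-intensive. To see exactly what the application is sending, you could enable the search slow log on the index, roughly like this (a sketch; the thresholds are only examples, and liferay-20116 is the index name taken from your logs):

curl -XPUT 'http://10.0.2.109:9200/liferay-20116/_settings' -d '
{
  "index.search.slowlog.threshold.query.warn": "2s",
  "index.search.slowlog.threshold.query.info": "1s"
}'

The matching queries will then show up in the index search slow log file in the logs directory.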

As far as queries are concerned, we haven't modified anything; we simply downloaded the bundle from the Elastic site and did some basic configuration in elasticsearch.yml, as below:

network.host: 10.0.2.109
cluster.name: EDEL_PROD_ES
http.cors.enabled: true
node.name: EDEL_PROD
http.cors.allow-origin : "*"

And for the cat indices API, I will share the result ASAP.

The cat indices API returns the following:

health status index pri rep docs.count docs.deleted store.size pri.store.size

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.