Elasticsearch very high load - 100% CPU

Hi,

Our production cluster is at 100% CPU with a load average of 100.

  • JDK 1.7u55 (64-bit)
  • Elasticsearch 1.7.5
  • 9 nodes, 30 GB heap each
  • CentOS 6.5-6.6

In Elasticsearch: 1700 indices / 3200 shards / 5B docs / 5 TB
For the requests: only Kibana 3 with ~50 dashboards (so no aggregations, only facets)
We have 5-10 requests per second
10,000 indexing operations per second, with an average document size < 1 KB
The global CPU load is around 5% during busy periods
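
(These figures can be cross-checked with the _cat APIs; the host below is a placeholder for one of our nodes.)

    # Cluster-wide shard total, document total, and per-index breakdown
    curl -s 'http://localhost:9200/_cat/health?v'
    curl -s 'http://localhost:9200/_cat/count?v'
    curl -s 'http://localhost:9200/_cat/indices?v'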

In fact the search queue is exhausted and the active search thread pool is full.
The current requests never finish.
Indexing still works fine.
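
For anyone wanting to check the same thing, the search pool can be watched with the _cat API (the host below is a placeholder); a full search.queue with search.active stuck at the pool maximum is exactly the state described above:

    # Per-node view of the bulk, index and search thread pools (active / queue / rejected)
    curl -s 'http://localhost:9200/_cat/thread_pool?v'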

We have to restart the cluster to resolve the problem.
The problem occurs at random: everything worked fine for months and now it crashes every day.

  • Can we set a global request timeout? (See the sketch after this list.)
  • Has anybody already had this problem?
  • What can we do apart from restarting the cluster?
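
On the first question: as far as I can tell, 1.7 has no cluster-wide default search timeout, only the per-request timeout in the search body, and that one is best-effort (it is checked while collecting documents, so I doubt it would interrupt the query parsing visible in the hot_threads below). A minimal sketch, with a placeholder host and index pattern:

    # Best-effort per-request timeout; partial results come back with "timed_out": true
    curl -s 'http://localhost:9200/logstash-*/_search' -d '{
      "timeout": "10s",
      "size": 0,
      "query": { "match_all": {} }
    }'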

Here is an extract of the hot_threads output:

54.2% (270.7ms out of 500ms) cpu usage by thread 'elasticsearch[server1][search][T#14]'
10/10 snapshots sharing following 11 elements
org.elasticsearch.common.xcontent.json.JsonXContentParser.nextToken(JsonXContentParser.java:51)
org.elasticsearch.index.query.IndexQueryParserService.parseQuery(IndexQueryParserService.java:350)
org.elasticsearch.action.count.TransportCountAction.shardOperation(TransportCountAction.java:187)
org.elasticsearch.action.count.TransportCountAction.shardOperation(TransportCountAction.java:66)
org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$ShardTransportHandler.messageReceived(TransportBroadcastOperationAction.java:338)
org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$ShardTransportHandler.messageReceived(TransportBroadcastOperationAction.java:324)
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:279)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)

50.9% (254.2ms out of 500ms) cpu usage by thread 'elasticsearch[server1][search][T#6]'
10/10 snapshots sharing following 11 elements
org.elasticsearch.common.xcontent.json.JsonXContentParser.nextToken(JsonXContentParser.java:51)
org.elasticsearch.index.query.IndexQueryParserService.parseQuery(IndexQueryParserService.java:350)
org.elasticsearch.action.count.TransportCountAction.shardOperation(TransportCountAction.java:187)
org.elasticsearch.action.count.TransportCountAction.shardOperation(TransportCountAction.java:66)
org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$ShardTransportHandler.messageReceived(TransportBroadcastOperationAction.java:338)
org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$ShardTransportHandler.messageReceived(TransportBroadcastOperationAction.java:324)
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:279)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)

50.6% (252.8ms out of 500ms) cpu usage by thread 'elasticsearch[server1][search][T#21]'
10/10 snapshots sharing following 13 elements
org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser._skipWSOrEnd(UTF8StreamJsonParser.java:2728)
org.elasticsearch.common.jackson.core.json.UTF8StreamJsonParser.nextToken(UTF8StreamJsonParser.java:652)
org.elasticsearch.common.xcontent.json.JsonXContentParser.nextToken(JsonXContentParser.java:51)
org.elasticsearch.index.query.IndexQueryParserService.parseQuery(IndexQueryParserService.java:350)
org.elasticsearch.action.count.TransportCountAction.shardOperation(TransportCountAction.java:187)
org.elasticsearch.action.count.TransportCountAction.shardOperation(TransportCountAction.java:66)
org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$ShardTransportHandler.messageReceived(TransportBroadcastOperationAction.java:338)
org.elasticsearch.action.support.broadcast.TransportBroadcastOperationAction$ShardTransportHandler.messageReceived(TransportBroadcastOperationAction.java:324)
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:279)
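
(For reference, the extract above comes from the standard hot_threads API, captured with something like the following; the host is a placeholder.)

    # Sample the hottest threads on every node over a 500ms interval
    curl -s 'http://localhost:9200/_nodes/hot_threads?interval=500ms'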

Thanks for any help.

What type of data do you have in the cluster? How much data do you have in the cluster? How many indices/shards? What are your indexing and query rates? What type of queries do you run?

In Elasticsearch: 1700 indices / 3200 shards / 5B docs / 5 TB
For the requests: only Kibana 3 with ~50 dashboards (so no aggregations, only facets)
We have 5-10 requests per second
10,000 indexing operations per second, with an average document size < 1 KB
The global CPU load is around 5% during busy periods

I'd suggest you have too many shards, which will be adding to heap pressure.
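
If these are time-based indices (I'm assuming a Logstash-style daily pattern; the template name, host and index names below are placeholders), one way to bring the shard count down is an index template for new indices, and closing indices you no longer query releases their heap immediately; a rough sketch:

    # Cap new matching indices at 1 primary shard instead of the 5-shard default
    curl -s -XPUT 'http://localhost:9200/_template/fewer_shards' -d '{
      "template": "logstash-*",
      "settings": {
        "number_of_shards": 1,
        "number_of_replicas": 1
      }
    }'

    # Close old indices that are no longer queried
    curl -s -XPOST 'http://localhost:9200/logstash-2015.01.*/_close'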