Very high CPU usage on one Elasticsearch data node

Hi,
I'm having a problem where one of my serving data nodes occasionally hits 100% CPU usage while the others stay normal.
Restarting it solves the issue for a while.
There was no rise in traffic.

My ES cluster has 8 data nodes with a replication factor of 3 (i.e. each shard exists on 4 different nodes). I have 2 client nodes. There is no data indexing on these nodes; the data arrives by moving shards from the indexing zone to the serving zone, so these nodes are dedicated to serving searches only.
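
(The zone-based shard movement relies on the zone node attribute visible in the hot threads output below. Assuming it is done with standard shard allocation filtering, the relevant setting looks roughly like this, with a placeholder index name:)

    # Sketch only -- index name is a placeholder. Once an index is fully built in
    # the indexing zone, pointing its allocation filter at nodes tagged zone=SERVER
    # makes Elasticsearch relocate its shards to the serving nodes.
    curl -XPUT 'localhost:9200/my_index/_settings' -d '{
      "index.routing.allocation.include.zone": "SERVER"
    }'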

My hot threads (printed after I restarted the overloaded node):

::: [data-srv-rcom-es-node-a3241f0][FTwN65u4Q8eVxhmaJJr5og][data-srv-rcom-es-node-a3241f0.use.dynamicyield.com][inet[/172.30.0.139:9300]]{availability_zone=us-east-1a, aws_availability_zone=us-east-1a, max_local_storage_nodes=1, zone=SERVER, master=false}
   
   48.3% (241.7ms out of 500ms) cpu usage by thread 'elasticsearch[data-srv-rcom-es-node-a3241f0][search][T#18]'
     10/10 snapshots sharing following 10 elements
       sun.misc.Unsafe.park(Native Method)
       java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
       java.util.concurrent.LinkedTransferQueue.awaitMatch(LinkedTransferQueue.java:735)
       java.util.concurrent.LinkedTransferQueue.xfer(LinkedTransferQueue.java:644)
       java.util.concurrent.LinkedTransferQueue.take(LinkedTransferQueue.java:1137)
       org.elasticsearch.common.util.concurrent.SizeBlockingQueue.take(SizeBlockingQueue.java:162)
       java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1075)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1137)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
       java.lang.Thread.run(Thread.java:748)
   
   45.2% (225.8ms out of 500ms) cpu usage by thread 'elasticsearch[data-srv-rcom-es-node-a3241f0][search][T#17]'
     10/10 snapshots sharing following 19 elements
       org.elasticsearch.common.lucene.search.function.FiltersFunctionScoreQuery$CustomBoostFactorScorer.score(FiltersFunctionScoreQuery.java:332)
       org.apache.lucene.search.TopScoreDocCollector$OutOfOrderTopScoreDocCollector.collect(TopScoreDocCollector.java:140)
       org.apache.lucene.search.Weight$DefaultBulkScorer.scoreAll(Weight.java:193)
       org.apache.lucene.search.Weight$DefaultBulkScorer.score(Weight.java:163)
       org.apache.lucene.search.BulkScorer.score(BulkScorer.java:35)
       org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:621)
       org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:191)
       org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:491)
       org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:448)
       org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:281)
       org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:269)
       org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:157)
       org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:362)
       org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryFetchTransportHandler.messageReceived(SearchServiceTransportAction.java:833)
       org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryFetchTransportHandler.messageReceived(SearchServiceTransportAction.java:824)
       org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1152)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
       java.lang.Thread.run(Thread.java:748)
   
    6.2% (30.7ms out of 500ms) cpu usage by thread 'elasticsearch[data-srv-rcom-es-node-a3241f0][search][T#16]'
     10/10 snapshots sharing following 10 elements
       sun.misc.Unsafe.park(Native Method)
       java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
       java.util.concurrent.LinkedTransferQueue.awaitMatch(LinkedTransferQueue.java:735)
       java.util.concurrent.LinkedTransferQueue.xfer(LinkedTransferQueue.java:644)
       java.util.concurrent.LinkedTransferQueue.take(LinkedTransferQueue.java:1137)
       org.elasticsearch.common.util.concurrent.SizeBlockingQueue.take(SizeBlockingQueue.java:162)
       java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1075)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1137)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
       java.lang.Thread.run(Thread.java:748)

The node that had 100% CPU was data-srv-rcom-es-node-a3241f0.use

  1. Any ideas on how I can find out what is causing this?
  2. How can I spread the load better? Adding more data nodes didn't fix this.
  3. Is there a way to verify whether my caching should be more extensive? (See the example after this list.)
  4. Will scaling the data nodes up (bigger machines) do the trick?
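
(For question 3, the per-node cache statistics are available from the node stats API; the memory sizes and eviction counters there are the usual first thing to check:)

    # Rough check: steadily growing eviction counts suggest a cache is too small.
    curl -s 'localhost:9200/_nodes/stats/indices?pretty'
    # ...then look at fielddata.memory_size / fielddata.evictions and
    # filter_cache.memory_size / filter_cache.evictions for each node.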

I have also attached an image showing how the CPU load rises to 100% until I restart the Elasticsearch process on that node.

Our Elasticsearch version is 1.4.3.
The data nodes have 8 CPUs and 30 GB RAM each.
The ES heap size is 20g. I don't see the heap filling up or any GC pressure.

with the following settings:

indices.cache.filter.size: 10%
indices.fielddata.cache.size: 50%
indices.memory.index_buffer_size: 30%
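
(For reference, the heap and GC behaviour per node can be checked via the JVM section of node stats:)

    # Look at jvm.mem.heap_used_percent and the old-generation collector's
    # collection_count / collection_time_in_millis for each node.
    curl -s 'localhost:9200/_nodes/stats/jvm?pretty'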

@Igor_Motov or @Christian_Dahlqvist, maybe one of you guys can help?

Please do not ping people like that unless they are already involved in the thread. This forum is manned by volunteers, so give it a day or two before bumping the thread.


Hi,

Do you run updates on your documents? Do you have indexing running, or only searches? Is this node a master?

bye,
Xavier

There are about 942 indexes in the cluster, and each index has 3 replicas. There are 2 client nodes, 3 data indexing nodes, 8 data serving nodes, and 3 master nodes. Each node has only one purpose: indexing is done on the indexing nodes, and once it finishes, the shards are moved to the serving nodes. So no indexing is done on the data serving nodes, just searches.

1000 indices in an 8-node cluster, I hope they are not too big... First, I'd recommend upgrading your cluster, because 1.4.3 is pretty old.

Have you noticed whether this CPU load appears when indices are copied from the indexing nodes to the rest of the cluster, or when a dump, scroll, or backup is running? Is there any connection to a back-office program? Is it always the same nodes that show the high CPU load?

+1 to what @xavierfacq said. Specifically, if you are using the default settings of 5 shards and 1 replica, you most likely have close to 10,000 shards on 8 nodes, which is probably too much.
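
A quick way to count how many shard copies you actually have (one line per shard, primaries and replicas):

    # Every line of _cat/shards is one shard copy somewhere in the cluster.
    curl -s 'localhost:9200/_cat/shards' | wc -l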

May I suggest you look at the following resource about sizing:

https://www.elastic.co/elasticon/conf/2016/sf/quantitative-cluster-sizing


@dadoonet, we have one shard per index with a replication factor of 3. We have 10 indexes with ~3 million documents (about 4.5 GB in size); most of our other indexes have between 10k and 200k documents. What I can't understand is why all of the CPU utilization lands on one data node when the data is replicated.
None of our search queries use aggregations, but we do use function_score heavily.
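
To give an idea of the general shape of such a query on 1.4 (the index, fields, values and weights below are made up, purely illustrative):

    # Illustrative only -- the hot threads above show FiltersFunctionScoreQuery,
    # which corresponds to function_score with filter-based functions like these.
    curl -XGET 'localhost:9200/my_index/_search' -d '{
      "query": {
        "function_score": {
          "query": { "match": { "title": "some search terms" } },
          "functions": [
            { "filter": { "term": { "category": "a" } }, "weight": 3 },
            { "filter": { "term": { "in_stock": true } }, "weight": 2 }
          ],
          "score_mode": "sum",
          "boost_mode": "multiply"
        }
      }
    }'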

@xavierfacq, basically we move shards between nodes all the time (from indexing to serving). I'll check whether the issue coincides with moving large shards.

Which node(s) is your client application connecting to?

There are 3 serving client nodes, which sit behind an application load balancer. The requests come from HTTP clients, and I verified that the data nodes don't receive any HTTP traffic.

It sounds like you are using the function_score query, which can indeed increase the amount of work per query, but I don't have any idea why a given data node would use more CPU than the others. Maybe this node holds more of the shards that are hit by the query?

FWIW, I'd recommend first decreasing the number of primary shards if your use case allows.
And maybe the number of replicas as well, if you don't really need 4 copies.
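
The replica count can be changed on a live index at any time (the number of primary shards cannot; that requires reindexing), e.g. with a placeholder index name:

    # Drop from 3 replicas to 2, i.e. 3 copies in total, without reindexing.
    curl -XPUT 'localhost:9200/my_index/_settings' -d '{
      "index": { "number_of_replicas": 2 }
    }'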

So just so that I understand:

  1. Are you saying that the number of replicas doesn't improve search query load distribution among nodes?

  2. If for each index in my cluster I have 1 primary shard that is replicated to 3 more copies (4 shards in total, each holding the entire index), does the load get distributed across these copies, or does only one copy per index get queried?

  3. I saw that the slowlog was printing search queries that ran on our larger indexes (2-3 million documents).
    In that case, would splitting each of those indexes into several primary shards distribute the load better?

Thanks

@dadoonet, WDYT?

Is load balancing implemented differently in newer Elasticsearch versions?

I don't know. Maybe @Christian_Dahlqvist has seen that?

I am not aware of any changes in load balancing. I do however have a few additional questions.

Are you by any chance using preference for queries?

If you just look at the larger indices, how are shards distributed across the nodes? Do the nodes with higher CPU usage by any chance have more of these shards?
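
The cat APIs give a quick answer here, e.g. for one of the large indices (placeholder name) and for the per-node totals:

    # Where the shards of a given index live and how big they are.
    curl 'localhost:9200/_cat/shards/my_big_index?v'
    # Shard count and disk usage per node.
    curl 'localhost:9200/_cat/allocation?v'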

I also vaguely recall from previous conversations that you were considering running updates in parallel to the full rebuilding of indices. If this is the case, do the nodes with higher CPU usage have more and/or larger primary shards than the other nodes?

Hi,

Are you by any chance using preference for queries?

We don't use preference for queries, so from your question I gather that ES should randomly distribute the load between the shard copies. Am I right?
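
(For reference, using preference would mean pinning requests like the hypothetical example below; repeated searches with the same preference value always hit the same shard copies, which can create hot nodes. We don't do this.)

    # Hypothetical only -- "some_session_id" is an arbitrary custom string.
    curl 'localhost:9200/my_index/_search?preference=some_session_id' -d '{
      "query": { "match_all": {} }
    }'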

If you just look at the larger indices, how are shards distributed across the nodes? Do the nodes with higher CPU usage by any chance have more of these shards?

The nodes with the higher CPU did have more documents and shards, since we had scaled out our cluster and it took time until shards were moved.

Is there a way to balance load by number of documents?

I also vaguely recall from previous conversations that you were considering running updates in parallel to the full rebuilding of indices. If this is the case, do the nodes with higher CPU usage have more and/or larger primary shards than the other nodes?

We don't run updates in parallel yet.

I saw that our heaviest queries contain function_score with weights that we assign to different fields (not aggregations, but still heavy). Will splitting the index into multiple primary shards improve the performance and load distribution?

Yes

Then I guess it is possible that your function scoring is causing this.

Not that I am aware of.

It might, but you already have a lot of shards, so it could also end up causing more problems. Depending on what your distribution looks like, it might be worthwhile trying a few more primary shards just for the largest indices to see what effect that has. As you are reindexing frequently, you could easily change back.
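
Since you are rebuilding the indices regularly anyway, the next rebuild of a large index can simply be created with more primaries; a sketch with a placeholder name and counts:

    # Placeholder index name and shard counts -- pick values that fit the data size.
    curl -XPUT 'localhost:9200/my_big_index_v2' -d '{
      "settings": {
        "number_of_shards": 4,
        "number_of_replicas": 3
      }
    }'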

Update:
Increasing the number of primary shards reduced the latency by almost 50%, and we haven't encountered a CPU overload since.

The next step is to test reducing the number of replica shards (currently set to 3).
