100% CPU peak on all nodes

Good Morning,

All nodes (11 data nodes) went to 100% CPU due to some query (or queries) and did not return to normal even after 10 minutes.

Monitoring:
https://snapshot.raintank.io/dashboard/snapshot/Vvcn2sxInYPjMoFWtAJv2P0jMg5Bm04R

In the logs we found that some of the queries were spatial.

Using the command "GET _tasks?actions=*search&detailed" we identified more than 500 pending search tasks.

We tried to cancel these tasks, but did not succeed:

POST _tasks/_cancel?nodes=MXdSDhr6TaSz3zyfd4GHNA&actions=*search
POST _tasks/_cancel?actions=*search

We then decided to restart all nodes following the rolling restart procedure.

This took around 1 hour.
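
For reference, roughly the standard per-node rolling restart steps we followed (the allocation settings below are the documented ones; the rest of the procedure is summarized as comments):

# 1. Disable shard allocation before stopping the node
PUT _cluster/settings
{
  "transient": { "cluster.routing.allocation.enable": "none" }
}

# 2. Restart the Elasticsearch process on that node and wait for it to rejoin the cluster

# 3. Re-enable allocation and wait for green before moving to the next node
PUT _cluster/settings
{
  "transient": { "cluster.routing.allocation.enable": "all" }
}

GET _cat/health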

After that everything returned to normal.

Elasticsearch cluster:
version 2.4.1
3 master nodes
11 data nodes

indices (_cat/indices: health status index pri rep docs.count docs.deleted store.size pri.store.size):
green open XXXXXXXXX 20 1 805805 61389 720mb 362.6mb
green open XXXXXXXXX 20 1 225114047 63873085 187.4gb 92.8gb
green open XXXXXXXXX 20 1 241362 0 476.1mb 238mb
green open XXXXXXXXX 20 1 81916 0 26.1mb 13mb
green open XXXXXXXXX 20 0 7746 287 1.1gb 1.1gb
green open XXXXXXXXX 20 1 2817049 494145 2.4gb 1.1gb
green open XXXXXXXXX 20 1 2479279875 1730805009 5.8tb 2.9tb
green open XXXXXXXXX 05 1 13734 11 17.3gb 8.6gb
green open XXXXXXXXX 20 1 63258 458 51.1mb 25.3mb
green open XXXXXXXXX 20 1 975508 169956 963.1mb 484.7mb
green open XXXXXXXXX 20 1 14898463 666475 8.9gb 4.4gb
green open XXXXXXXXX 20 1 1078516 339622 2.3gb 1.1gb
green open XXXXXXXXX 20 1 134646704 23321177 947.6gb 477.1gb
green open XXXXXXXXX 20 1 5898478 1474837 5.6gb 2.8gb
green open XXXXXXXXX 20 1 6036864082 1180357 2.2tb 1.1tb
green open XXXXXXXXX 20 1 1705705965 616230738 2.3tb 1.1tb
green open XXXXXXXXX 20 1 4601595 0 7.5gb 3.7gb
green open XXXXXXXXX 20 1 451058 0 217.8mb 108.9mb

GET /_cluster/settings
{
  "persistent": {},
  "transient": {
    "cluster": {
      "routing": {
        "allocation": {
          "enable": "all"
        }
      }
    },
    "threadpool": {
      "bulk": {
        "queue_size": "500"
      }
    }
  }
}

Could someone help me figure out what could have caused this?
How can we find out?
What should we do next time?

Thanks in advance.

The problem happened again.
We were able to isolate the query that causes all shards to become unusable.

It is a spatial query with more than 60 thousand points.
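
To give an idea, judging by the WithinPrefixTreeQuery frames in the hot threads below, it is a geo_shape query with relation "within". The index name, field name and coordinates here are placeholders (the real polygon has tens of thousands of points), but the general shape is roughly:

GET my_index/_search
{
  "query": {
    "geo_shape": {
      "area": {
        "relation": "within",
        "shape": {
          "type": "polygon",
          "coordinates": [
            [
              [-46.63, -23.55],
              [-46.62, -23.54],
              [-46.61, -23.55],
              [-46.63, -23.55]
            ]
          ]
        }
      }
    }
  }
}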

We can see that there are several hot threads running spatial search code:

::: {es-production-data-ip-10-11-31-214}{u6RzXri_SbWafPxOLz7fhg}{10.11.31.214}{10.11.31.214:9300}{master=false}
Hot threads at 2017-03-03T13:45:07.451Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

99.8% (498.9ms out of 500ms) cpu usage by thread 'elasticsearch[es-production-data-ip-10-11-31-214][search][T#3]'
10/10 snapshots sharing following 33 elements
com.spatial4j.core.shape.jts.JtsGeometry.relate(JtsGeometry.java:234)
com.spatial4j.core.shape.jts.JtsGeometry.relate(JtsGeometry.java:212)
com.spatial4j.core.shape.ShapeCollection.relate(ShapeCollection.java:120)
com.spatial4j.core.shape.impl.RectangleImpl.relate(RectangleImpl.java:165)
org.apache.lucene.spatial.prefix.WithinPrefixTreeQuery$1.visitPrefix(WithinPrefixTreeQuery.java:176)
org.apache.lucene.spatial.prefix.AbstractVisitingPrefixTreeQuery$VisitorTemplate.getDocIdSet(AbstractVisitingPrefixTreeQuery.java:187)
org.apache.lucene.spatial.prefix.WithinPrefixTreeQuery.getDocIdSet(WithinPrefixTreeQuery.java:150)
org.apache.lucene.spatial.prefix.AbstractPrefixTreeQuery$1.scorer(AbstractPrefixTreeQuery.java:85)
org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.scorer(LRUQueryCache.java:628)
org.elasticsearch.indices.cache.query.IndicesQueryCache$CachingWeightWrapper.scorer(IndicesQueryCache.java:263)
org.apache.lucene.search.BooleanWeight.scorer(BooleanWeight.java:389)
org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.scorer(LRUQueryCache.java:628)
org.elasticsearch.indices.cache.query.IndicesQueryCache$CachingWeightWrapper.scorer(IndicesQueryCache.java:263)
org.apache.lucene.search.BooleanWeight.scorer(BooleanWeight.java:389)
org.apache.lucene.search.Weight.bulkScorer(Weight.java:135)
org.apache.lucene.search.BooleanWeight.bulkScorer(BooleanWeight.java:370)
org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.bulkScorer(LRUQueryCache.java:655)
org.elasticsearch.indices.cache.query.IndicesQueryCache$CachingWeightWrapper.bulkScorer(IndicesQueryCache.java:269)
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:818)
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:535)
org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:384)
org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:113)
org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:372)
org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:385)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:368)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:365)
org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:77)
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:293)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)

99.8% (498.7ms out of 500ms) cpu usage by thread 'elasticsearch[es-production-data-ip-10-11-31-214][search][T#1]'
2/10 snapshots sharing following 34 elements
com.spatial4j.core.shape.impl.RectangleImpl.relate(RectangleImpl.java:157)
com.spatial4j.core.shape.jts.JtsGeometry.relate(JtsGeometry.java:234)
com.spatial4j.core.shape.jts.JtsGeometry.relate(JtsGeometry.java:212)
com.spatial4j.core.shape.ShapeCollection.relate(ShapeCollection.java:120)
com.spatial4j.core.shape.impl.RectangleImpl.relate(RectangleImpl.java:165)
org.apache.lucene.spatial.prefix.WithinPrefixTreeQuery$1.visitPrefix(WithinPrefixTreeQuery.java:176)
org.apache.lucene.spatial.prefix.AbstractVisitingPrefixTreeQuery$VisitorTemplate.getDocIdSet(AbstractVisitingPrefixTreeQuery.java:187)
org.apache.lucene.spatial.prefix.WithinPrefixTreeQuery.getDocIdSet(WithinPrefixTreeQuery.java:150)
org.apache.lucene.spatial.prefix.AbstractPrefixTreeQuery$1.scorer(AbstractPrefixTreeQuery.java:85)
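
(The output above was captured with the nodes hot threads API, something like:)

GET _nodes/hot_threads?threads=3&interval=500ms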

You cannot cancel searches at this time.
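
(More recent 5.x releases do let you cancel a running search through the task management API; the task id below is just illustrative:)

GET _tasks?actions=*search&detailed=true
POST _tasks/<task_id>/_cancel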

Are you able to upgrade to 5.X? Geo performance has been significantly improved :smiley:

Thanks Mark for your reply.

If I add a client node could this help?

Would this spatial processing run on the client node or the data node?

This would be a stopgap while we work on the migration to ES5.

These queries are handled like any others, so a client node may help; it really depends on what sort of queries they are.

I did the same spatial search on ES5 and it still took 7 hours.

Checking my mapping, I found that we had set geo_shape with precision = 10m, and per the documentation the value of distance_error_pct then defaults to 0.

I kept the precision value at 10m but changed distance_error_pct to 0.025. Now the same spatial search took only 3 minutes.
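
For reference, the relevant part of the mapping now looks roughly like this (index, type and field names are placeholders):

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "area": {
          "type": "geo_shape",
          "precision": "10m",
          "distance_error_pct": 0.025
        }
      }
    }
  }
}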

It is clear that there is a trade-off between precision and performance.
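
As far as I understand it (please correct me if I am wrong), distance_error_pct is a relative error bound per indexed shape, so the effective resolution scales with the shape's size instead of always going down to the 10m precision. A rough back-of-the-envelope example:

precision = 10m              -> cells as small as ~10m are allowed
distance_error_pct = 0.025   -> accepted error is ~2.5% of the shape's size
shape roughly 20km across    -> accepted error ~ 0.025 * 20km = 500m
                                so the prefix tree stops far above the 10m level for large shapes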

Is there documentation that better explains the relationship between "precision / tree_levels" and "distance_error_pct"? The geo_shape documentation does not explain it well.

Thanks

Yep.

If there's nothing there, then no :wink:
PRs/issues welcome!
