100% CPU peak on all nodes

Good Morning,

All nodes (11 data nodes) went to 100% CPU due to some query (or queries) and did not return to normal even after 10 minutes.

Monitoring:
https://snapshot.raintank.io/dashboard/snapshot/Vvcn2sxInYPjMoFWtAJv2P0jMg5Bm04R

In the logs we found that some of the queries were spatial.

Using the command "GET _tasks?actions=*search&detailed" we identified more than 500 pending search tasks.

We tried to cancel these tasks, but did not succeed:

POST _tasks/_cancel?nodes=MXdSDhr6TaSz3zyfd4GHNA&actions=*search
POST _tasks/_cancel?actions=*search

We then decided to restart all nodes following the rolling restart procedure.

This took around 1 hour.
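
For reference, roughly the standard per-node rolling restart steps we followed (the allocation settings below are the documented ones; the rest of the procedure is summarized as comments):

# 1. Disable shard allocation before stopping the node
PUT _cluster/settings
{
  "transient": { "cluster.routing.allocation.enable": "none" }
}

# 2. Restart the Elasticsearch process on that node and wait for it to rejoin the cluster

# 3. Re-enable allocation and wait for green before moving to the next node
PUT _cluster/settings
{
  "transient": { "cluster.routing.allocation.enable": "all" }
}

GET _cat/health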

After that everything returned to normal.

Elasticsearch cluster:
version 2.4.1
3 master nodes
11 data nodes

indices (_cat/indices: health status index pri rep docs.count docs.deleted store.size pri.store.size):
green open XXXXXXXXX 20 1 805805 61389 720mb 362.6mb
green open XXXXXXXXX 20 1 225114047 63873085 187.4gb 92.8gb
green open XXXXXXXXX 20 1 241362 0 476.1mb 238mb
green open XXXXXXXXX 20 1 81916 0 26.1mb 13mb
green open XXXXXXXXX 20 0 7746 287 1.1gb 1.1gb
green open XXXXXXXXX 20 1 2817049 494145 2.4gb 1.1gb
green open XXXXXXXXX 20 1 2479279875 1730805009 5.8tb 2.9tb
green open XXXXXXXXX 05 1 13734 11 17.3gb 8.6gb
green open XXXXXXXXX 20 1 63258 458 51.1mb 25.3mb
green open XXXXXXXXX 20 1 975508 169956 963.1mb 484.7mb
green open XXXXXXXXX 20 1 14898463 666475 8.9gb 4.4gb
green open XXXXXXXXX 20 1 1078516 339622 2.3gb 1.1gb
green open XXXXXXXXX 20 1 134646704 23321177 947.6gb 477.1gb
green open XXXXXXXXX 20 1 5898478 1474837 5.6gb 2.8gb
green open XXXXXXXXX 20 1 6036864082 1180357 2.2tb 1.1tb
green open XXXXXXXXX 20 1 1705705965 616230738 2.3tb 1.1tb
green open XXXXXXXXX 20 1 4601595 0 7.5gb 3.7gb
green open XXXXXXXXX 20 1 451058 0 217.8mb 108.9mb

GET /_cluster/settings
{
  "persistent": {},
  "transient": {
    "cluster": {
      "routing": {
        "allocation": {
          "enable": "all"
        }
      }
    },
    "threadpool": {
      "bulk": {
        "queue_size": "500"
      }
    }
  }
}

Could someone help me figure out what could have caused this?
How can we find out?
What should we do next time?

Thanks in advance.

The problem happened again.
We were able to isolate the query that causes all shards to become unusable.

It is a spatial query with more than 60 thousand points.
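
To give an idea, judging by the WithinPrefixTreeQuery frames in the hot threads below, it is a geo_shape query with relation "within". The index name, field name and coordinates here are placeholders (the real polygon has tens of thousands of points), but the general shape is roughly:

GET my_index/_search
{
  "query": {
    "geo_shape": {
      "area": {
        "relation": "within",
        "shape": {
          "type": "polygon",
          "coordinates": [
            [
              [-46.63, -23.55],
              [-46.62, -23.54],
              [-46.61, -23.55],
              [-46.63, -23.55]
            ]
          ]
        }
      }
    }
  }
}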

We can see that there are several hot threads running spatial search code:

::: {es-production-data-ip-10-11-31-214}{u6RzXri_SbWafPxOLz7fhg}{10.11.31.214}{10.11.31.214:9300}{master=false}
Hot threads at 2017-03-03T13:45:07.451Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:

99.8% (498.9ms out of 500ms) cpu usage by thread 'elasticsearch[es-production-data-ip-10-11-31-214][search][T#3]'
10/10 snapshots sharing following 33 elements
com.spatial4j.core.shape.jts.JtsGeometry.relate(JtsGeometry.java:234)
com.spatial4j.core.shape.jts.JtsGeometry.relate(JtsGeometry.java:212)
com.spatial4j.core.shape.ShapeCollection.relate(ShapeCollection.java:120)
com.spatial4j.core.shape.impl.RectangleImpl.relate(RectangleImpl.java:165)
org.apache.lucene.spatial.prefix.WithinPrefixTreeQuery$1.visitPrefix(WithinPrefixTreeQuery.java:176)
org.apache.lucene.spatial.prefix.AbstractVisitingPrefixTreeQuery$VisitorTemplate.getDocIdSet(AbstractVisitingPrefixTreeQuery.java:187)
org.apache.lucene.spatial.prefix.WithinPrefixTreeQuery.getDocIdSet(WithinPrefixTreeQuery.java:150)
org.apache.lucene.spatial.prefix.AbstractPrefixTreeQuery$1.scorer(AbstractPrefixTreeQuery.java:85)
org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.scorer(LRUQueryCache.java:628)
org.elasticsearch.indices.cache.query.IndicesQueryCache$CachingWeightWrapper.scorer(IndicesQueryCache.java:263)
org.apache.lucene.search.BooleanWeight.scorer(BooleanWeight.java:389)
org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.scorer(LRUQueryCache.java:628)
org.elasticsearch.indices.cache.query.IndicesQueryCache$CachingWeightWrapper.scorer(IndicesQueryCache.java:263)
org.apache.lucene.search.BooleanWeight.scorer(BooleanWeight.java:389)
org.apache.lucene.search.Weight.bulkScorer(Weight.java:135)
org.apache.lucene.search.BooleanWeight.bulkScorer(BooleanWeight.java:370)
org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.bulkScorer(LRUQueryCache.java:655)
org.elasticsearch.indices.cache.query.IndicesQueryCache$CachingWeightWrapper.bulkScorer(IndicesQueryCache.java:269)
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:818)
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:535)
org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:384)
org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:113)
org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:372)
org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:385)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:368)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:365)
org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:77)
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:293)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)

99.8% (498.7ms out of 500ms) cpu usage by thread 'elasticsearch[es-production-data-ip-10-11-31-214][search][T#1]'
2/10 snapshots sharing following 34 elements
com.spatial4j.core.shape.impl.RectangleImpl.relate(RectangleImpl.java:157)
com.spatial4j.core.shape.jts.JtsGeometry.relate(JtsGeometry.java:234)
com.spatial4j.core.shape.jts.JtsGeometry.relate(JtsGeometry.java:212)
com.spatial4j.core.shape.ShapeCollection.relate(ShapeCollection.java:120)
com.spatial4j.core.shape.impl.RectangleImpl.relate(RectangleImpl.java:165)
org.apache.lucene.spatial.prefix.WithinPrefixTreeQuery$1.visitPrefix(WithinPrefixTreeQuery.java:176)
org.apache.lucene.spatial.prefix.AbstractVisitingPrefixTreeQuery$VisitorTemplate.getDocIdSet(AbstractVisitingPrefixTreeQuery.java:187)
org.apache.lucene.spatial.prefix.WithinPrefixTreeQuery.getDocIdSet(WithinPrefixTreeQuery.java:150)
org.apache.lucene.spatial.prefix.AbstractPrefixTreeQuery$1.scorer(AbstractPrefixTreeQuery.java:85)
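
(The output above was captured with the nodes hot threads API, something like:)

GET _nodes/hot_threads?threads=3&interval=500ms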

You cannot cancel searches at this time.
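
(More recent 5.x releases do let you cancel a running search through the task management API; the task id below is just illustrative:)

GET _tasks?actions=*search&detailed=true
POST _tasks/<task_id>/_cancel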

Are you able to upgrade to 5.X? Geo performance has been significantly improved :smiley:

Thanks Mark for your reply.

If I add a client node could this help?

Would this spatial processing run on the client node or the data node?

This would be a stopgap while we work on the migration to ES5.

These queries are handled like any others, so a client node may help; it really depends on what sort of queries they are.

I did the same spatial search on ES5 and it still took 7 hours.

Checking my mapping, I found that we had set geo_shape with precision = 10m, and per the documentation the value of distance_error_pct then defaults to 0.

I kept the precision value at 10m but changed distance_error_pct to 0.025. Now the same spatial search took only 3 minutes.
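
For reference, the relevant part of the mapping now looks roughly like this (index, type and field names are placeholders):

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "area": {
          "type": "geo_shape",
          "precision": "10m",
          "distance_error_pct": 0.025
        }
      }
    }
  }
}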

It is clear that there is a trade-off between precision and performance.
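
As far as I understand it (please correct me if I am wrong), distance_error_pct is a relative error bound per indexed shape, so the effective resolution scales with the shape's size instead of always going down to the 10m precision. A rough back-of-the-envelope example:

precision = 10m              -> cells as small as ~10m are allowed
distance_error_pct = 0.025   -> accepted error is ~2.5% of the shape's size
shape roughly 20km across    -> accepted error ~ 0.025 * 20km = 500m
                                so the prefix tree stops far above the 10m level for large shapes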

Is there documentation that better explains the relationship between "precision / tree_levels" and "distance_error_pct"? The geo_shape documentation does not explain it well.

Thanks

Yep.

If there's nothing there, then no :wink:
PRs/issues welcome!
