In the logs we found that some of the queries were spatial.
Using the command GET _tasks?actions=*search&detailed, we identified more than 500 pending search tasks.
We tried to cancel these tasks with the following commands, but did not succeed:
POST _tasks/_cancel?nodes=MXdSDhr6TaSz3zyfd4GHNA&actions=*search
POST _tasks/_cancel?actions=*search
We then decided to restart all nodes following the rolling restart procedure, which took around 1 hour. After that, everything returned to normal.
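For anyone reproducing this, the standard rolling restart boils down to disabling shard allocation, restarting the node, then re-enabling allocation and waiting for green before moving to the next node. A minimal sketch (standard commands, nothing specific to our cluster):
PUT _cluster/settings
{ "transient": { "cluster.routing.allocation.enable": "none" } }
(stop the node, restart it, wait for it to rejoin the cluster)
PUT _cluster/settings
{ "transient": { "cluster.routing.allocation.enable": "all" } }
GET _cluster/health?wait_for_status=green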
Elasticsearch cluster:
version 2.4.1
3 master nodes
11 data nodes
indices (_cat/indices output; columns are health, status, index, pri, rep, docs.count, docs.deleted, store.size, pri.store.size):
green open XXXXXXXXX 20 1 805805 61389 720mb 362.6mb
green open XXXXXXXXX 20 1 225114047 63873085 187.4gb 92.8gb
green open XXXXXXXXX 20 1 241362 0 476.1mb 238mb
green open XXXXXXXXX 20 1 81916 0 26.1mb 13mb
green open XXXXXXXXX 20 0 7746 287 1.1gb 1.1gb
green open XXXXXXXXX 20 1 2817049 494145 2.4gb 1.1gb
green open XXXXXXXXX 20 1 2479279875 1730805009 5.8tb 2.9tb
green open XXXXXXXXX 05 1 13734 11 17.3gb 8.6gb
green open XXXXXXXXX 20 1 63258 458 51.1mb 25.3mb
green open XXXXXXXXX 20 1 975508 169956 963.1mb 484.7mb
green open XXXXXXXXX 20 1 14898463 666475 8.9gb 4.4gb
green open XXXXXXXXX 20 1 1078516 339622 2.3gb 1.1gb
green open XXXXXXXXX 20 1 134646704 23321177 947.6gb 477.1gb
green open XXXXXXXXX 20 1 5898478 1474837 5.6gb 2.8gb
green open XXXXXXXXX 20 1 6036864082 1180357 2.2tb 1.1tb
green open XXXXXXXXX 20 1 1705705965 616230738 2.3tb 1.1tb
green open XXXXXXXXX 20 1 4601595 0 7.5gb 3.7gb
green open XXXXXXXXX 20 1 451058 0 217.8mb 108.9mb
The problem happened again. We were able to isolate the query that renders all shards unusable: it is a spatial query with more than 60 thousand points.
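For illustration only, the query has roughly this shape (index and field names are placeholders, and the real polygon has more than 60,000 points instead of the handful shown here):
GET my_index/_search
{
  "query": {
    "bool": {
      "filter": {
        "geo_shape": {
          "location": {
            "relation": "within",
            "shape": {
              "type": "polygon",
              "coordinates": [
                [ [-74.1, 40.6], [-74.1, 40.9], [-73.7, 40.9], [-73.7, 40.6], [-74.1, 40.6] ]
              ]
            }
          }
        }
      }
    }
  }
}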
We can see that there are several hot threads running spatial search code:
::: {es-production-data-ip-10-11-31-214}{u6RzXri_SbWafPxOLz7fhg}{10.11.31.214}{10.11.31.214:9300}{master=false}
Hot threads at 2017-03-03T13:45:07.451Z, interval=500ms, busiestThreads=3, ignoreIdleThreads=true:
99.8% (498.9ms out of 500ms) cpu usage by thread 'elasticsearch[es-production-data-ip-10-11-31-214][search][T#3]'
10/10 snapshots sharing following 33 elements
com.spatial4j.core.shape.jts.JtsGeometry.relate(JtsGeometry.java:234)
com.spatial4j.core.shape.jts.JtsGeometry.relate(JtsGeometry.java:212)
com.spatial4j.core.shape.ShapeCollection.relate(ShapeCollection.java:120)
com.spatial4j.core.shape.impl.RectangleImpl.relate(RectangleImpl.java:165)
org.apache.lucene.spatial.prefix.WithinPrefixTreeQuery$1.visitPrefix(WithinPrefixTreeQuery.java:176)
org.apache.lucene.spatial.prefix.AbstractVisitingPrefixTreeQuery$VisitorTemplate.getDocIdSet(AbstractVisitingPrefixTreeQuery.java:187)
org.apache.lucene.spatial.prefix.WithinPrefixTreeQuery.getDocIdSet(WithinPrefixTreeQuery.java:150)
org.apache.lucene.spatial.prefix.AbstractPrefixTreeQuery$1.scorer(AbstractPrefixTreeQuery.java:85)
org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.scorer(LRUQueryCache.java:628)
org.elasticsearch.indices.cache.query.IndicesQueryCache$CachingWeightWrapper.scorer(IndicesQueryCache.java:263)
org.apache.lucene.search.BooleanWeight.scorer(BooleanWeight.java:389)
org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.scorer(LRUQueryCache.java:628)
org.elasticsearch.indices.cache.query.IndicesQueryCache$CachingWeightWrapper.scorer(IndicesQueryCache.java:263)
org.apache.lucene.search.BooleanWeight.scorer(BooleanWeight.java:389)
org.apache.lucene.search.Weight.bulkScorer(Weight.java:135)
org.apache.lucene.search.BooleanWeight.bulkScorer(BooleanWeight.java:370)
org.apache.lucene.search.LRUQueryCache$CachingWrapperWeight.bulkScorer(LRUQueryCache.java:655)
org.elasticsearch.indices.cache.query.IndicesQueryCache$CachingWeightWrapper.bulkScorer(IndicesQueryCache.java:269)
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:818)
org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:535)
org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:384)
org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:113)
org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:372)
org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:385)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:368)
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:365)
org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:77)
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:293)
org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
99.8% (498.7ms out of 500ms) cpu usage by thread 'elasticsearch[es-production-data-ip-10-11-31-214][search][T#1]'
2/10 snapshots sharing following 34 elements
com.spatial4j.core.shape.impl.RectangleImpl.relate(RectangleImpl.java:157)
com.spatial4j.core.shape.jts.JtsGeometry.relate(JtsGeometry.java:234)
com.spatial4j.core.shape.jts.JtsGeometry.relate(JtsGeometry.java:212)
com.spatial4j.core.shape.ShapeCollection.relate(ShapeCollection.java:120)
com.spatial4j.core.shape.impl.RectangleImpl.relate(RectangleImpl.java:165)
org.apache.lucene.spatial.prefix.WithinPrefixTreeQuery$1.visitPrefix(WithinPrefixTreeQuery.java:176)
org.apache.lucene.spatial.prefix.AbstractVisitingPrefixTreeQuery$VisitorTemplate.getDocIdSet(AbstractVisitingPrefixTreeQuery.java:187)
org.apache.lucene.spatial.prefix.WithinPrefixTreeQuery.getDocIdSet(WithinPrefixTreeQuery.java:150)
org.apache.lucene.spatial.prefix.AbstractPrefixTreeQuery$1.scorer(AbstractPrefixTreeQuery.java:85)
I ran the same spatial search on Elasticsearch 5 and it still took 7 hours.
Checking my mapping, I found that we had set the geo_shape field with precision = 10m, and according to the documentation distance_error_pct then defaults to 0.
I kept the precision at 10m but changed distance_error_pct to 0.025, and the same spatial search now takes only 3 minutes.
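For anyone else hitting this, the change is just the distance_error_pct value in the geo_shape mapping; because precision is set explicitly, distance_error_pct must also be set explicitly or it stays at 0. Index, type and field names below are placeholders:
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "location": {
          "type": "geo_shape",
          "precision": "10m",
          "distance_error_pct": 0.025
        }
      }
    }
  }
}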
It is clear that there is a trade-off between precision and performance. Is there documentation that better explains the relationship between precision / tree_levels and distance_error_pct? The geo_shape documentation does not explain it well.