Sampler aggregation performance vs 2 queries

adam3 · December 20, 2017, 5:52pm

I am trying to improve the performance of an Elasticsearch application.

In this case there is a single Elasticsearch node (non-sharded) with about 7M docs (the index is roughly 40 GB).

We were running one query to get the top docs (around 1000), and then a second query that filters on the ids of those 1000 docs and runs a significant_terms aggregation on a field of those docs, scored with mutual_information.

I had thought that a sampler aggregation would help by doing 1 query instead of 2 and that this would speed things up, but I am not seeing that so far.

Here's the original first query:
{
  "from" : 0,
  "size" : 1000,
  "query" : {
    "bool" : {
      "should" : [
        {
          "match" : {
            "name" : {
              "query" : "workout",
              "operator" : "AND",
              "prefix_length" : 0,
              "max_expansions" : 50,
              "fuzzy_transpositions" : true,
              "lenient" : false,
              "zero_terms_query" : "NONE",
              "boost" : 1.0
            }
          }
        }
      ],
      "disable_coord" : false,
      "adjust_pure_negative" : true,
      "boost" : 1.0
    }
  },
  "min_score" : 7.0,
  "_source" : false,
  "sort" : [
    {
      "_score" : {
        "order" : "desc"
      }
    },
    {
      "mau" : {
        "order" : "desc"
      }
    }
  ]
}

followed by:
{
  "from" : 0,
  "size" : 1000,
  "query" : {
    "bool" : {
      "filter" : [
        {
          "ids" : {
            "type" : [ ],
            "values" : [
              "AWBNMzCn5eVrMnnV89Iw",
              .
              .
              . (lots of these)
            ],
            "boost" : 1.0
          }
        }
      ],
      "disable_coord" : false,
      "adjust_pure_negative" : true,
      "boost" : 1.0
    }
  },
  "sort" : [
    {
      "_score" : {
        "order" : "desc"
      }
    },
    {
      "mau" : {
        "order" : "desc"
      }
    }
  ],
  "aggregations" : {
    "tracks" : {
      "significant_terms" : {
        "field" : "tracks.raw",
        "size" : 1000,
        "min_doc_count" : 2,
        "shard_min_doc_count" : 0,
        "mutual_information" : {
          "include_negatives" : false,
          "background_is_superset" : true
        }
      }
    }
  }
}
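
To make the two-query flow concrete, here is a minimal Python sketch of how the second request body could be assembled from the hits of the first response. The function name `build_ids_aggregation_request` and the second document id are made up for illustration, and default-valued settings (`boost`, `disable_coord`, the sort) are left out; it only assumes the standard search response shape with `hits.hits[]._id`.

```python
def build_ids_aggregation_request(first_response, field="tracks.raw", size=1000):
    """Build the follow-up request body from the first query's hits.

    Hypothetical helper: it mirrors the second query in the post above,
    trimmed to the parts that matter for the aggregation.
    """
    # Collect the _id of every hit returned by the first (top docs) query.
    doc_ids = [hit["_id"] for hit in first_response["hits"]["hits"]]
    return {
        "from": 0,
        "size": size,
        # Restrict the foreground set to exactly those documents.
        "query": {"bool": {"filter": [{"ids": {"values": doc_ids}}]}},
        "aggregations": {
            "tracks": {
                "significant_terms": {
                    "field": field,
                    "size": 1000,
                    "min_doc_count": 2,
                    "shard_min_doc_count": 0,
                    "mutual_information": {
                        "include_negatives": False,
                        "background_is_superset": True,
                    },
                }
            }
        },
    }


# Toy first response; the second id is invented for the example.
first_response = {
    "hits": {
        "hits": [
            {"_id": "AWBNMzCn5eVrMnnV89Iw"},
            {"_id": "hypothetical-second-id"},
        ]
    }
}
second_request = build_ids_aggregation_request(first_response)
```

The resulting dict can be serialized with `json.dumps` and sent as the body of the second search request.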

versus the single query with a sampler aggregation:
{
  "from" : 0,
  "size" : 0,
  "query" : {
    "bool" : {
      "should" : [
        {
          "match" : {
            "name" : {
              "query" : "workout",
              "operator" : "AND",
              "prefix_length" : 0,
              "max_expansions" : 50,
              "fuzzy_transpositions" : true,
              "lenient" : false,
              "zero_terms_query" : "NONE",
              "boost" : 1.0
            }
          }
        }
      ],
      "disable_coord" : false,
      "adjust_pure_negative" : true,
      "boost" : 1.0
    }
  },
  "min_score" : 7.0,
  "_source" : false,
  "sort" : [
    {
      "_score" : {
        "order" : "desc"
      }
    },
    {
      "mau" : {
        "order" : "desc"
      }
    }
  ],
  "aggregations" : {
    "tracks" : {
      "sampler" : {
        "shard_size" : 1000
      },
      "aggregations" : {
        "tracks" : {
          "significant_terms" : {
            "field" : "tracks.raw",
            "size" : 1000,
            "min_doc_count" : 2,
            "shard_min_doc_count" : 2,
            "mutual_information" : {
              "include_negatives" : false,
              "background_is_superset" : true
            }
          }
        }
      }
    }
  }
}

Does that make sense?

Thanks,

Adam

Mark_Harwood · December 20, 2017, 8:55pm

I would have thought the sampler agg should be faster. Maybe the bulk of the time is spent in the expensive loop that is common to both approaches: looking up the background frequency of all the terms found in the matching docs. How many terms are in each doc?
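
As a rough illustration of why that loop can dominate, here is a toy Python model (not Elasticsearch's actual implementation). It counts foreground document frequencies for a small set of matching docs and then does one background lookup per distinct term. The sample docs, the background counts and the function names are all invented for the example.

```python
from collections import Counter


def foreground_and_background(foreground_docs, background_doc_freq):
    """Return {term: (foreground_doc_count, background_doc_count)}.

    Toy model only: one background lookup is made per distinct term
    that appears in the foreground (matching) documents.
    """
    fg_counts = Counter()
    for terms in foreground_docs:
        # Count each term once per document (document frequency).
        fg_counts.update(set(terms))
    return {t: (fg_counts[t], background_doc_freq(t)) for t in fg_counts}


# Invented background statistics for the whole index.
background = {"yoga": 40, "cardio": 500, "hiit": 12}
lookups = []


def lookup(term):
    # Record every background lookup so the cost can be inspected.
    lookups.append(term)
    return background[term]


docs = [["yoga", "cardio"], ["cardio", "hiit"], ["cardio", "yoga"]]
stats = foreground_and_background(docs, lookup)
```

In this model the number of background lookups equals the number of distinct terms in the matching docs, so it is the same whether those docs were selected by an ids filter or by a sampler.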

adam3 · December 20, 2017, 9:44pm

Mark,

There are a bunch of terms per doc.

I guess there are about 65 per document.

Here is some hot threads output, if that helps:

9/10 snapshots sharing following 29 elements
   org.elasticsearch.search.aggregations.bucket.significant.GlobalOrdinalsSignificantTermsAggregator.buildAggregation(GlobalOrdinalsSignificantTermsAggregator.java:104)
   org.elasticsearch.search.aggregations.bucket.significant.GlobalOrdinalsSignificantTermsAggregator$WithHash.buildAggregation(GlobalOrdinalsSignificantTermsAggregator.java:158)
   org.elasticsearch.search.aggregations.AggregatorFactory$MultiBucketAggregatorWrapper.buildAggregation(AggregatorFactory.java:147)
   org.elasticsearch.search.aggregations.bucket.DeferringBucketCollector$WrappedAggregator.buildAggregation(DeferringBucketCollector.java:96)
   org.elasticsearch.search.aggregations.bucket.BucketsAggregator.bucketAggregations(BucketsAggregator.java:116)
   org.elasticsearch.search.aggregations.bucket.sampler.SamplerAggregator.buildAggregation(SamplerAggregator.java:171)
   org.elasticsearch.search.aggregations.AggregationPhase.execute(AggregationPhase.java:139)
   org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:114)
   org.elasticsearch.indices.IndicesService.lambda$loadIntoContext$16(IndicesService.java:1108)
   org.elasticsearch.indices.IndicesService$$Lambda$1799/1783108370.accept(Unknown Source)
   org.elasticsearch.indices.IndicesService.lambda$cacheShardLevelResult$18(IndicesService.java:1189)
   org.elasticsearch.indices.IndicesService$$Lambda$1803/1750113915.get(Unknown Source)
   org.elasticsearch.indices.IndicesRequestCache$Loader.load(IndicesRequestCache.java:160)
   org.elasticsearch.indices.IndicesRequestCache$Loader.load(IndicesRequestCache.java:143)
   org.elasticsearch.common.cache.Cache.computeIfAbsent(Cache.java:398)
   org.elasticsearch.indices.IndicesRequestCache.getOrCompute(IndicesRequestCache.java:116)
   org.elasticsearch.indices.IndicesService.cacheShardLevelResult(IndicesService.java:1195)
   org.elasticsearch.indices.IndicesService.loadIntoContext(IndicesService.java:1107)
   org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:245)
   org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:261)
   org.elasticsearch.action.search.SearchTransportService$6.messageReceived(SearchTransportService.java:331)
   org.elasticsearch.action.search.SearchTransportService$6.messageReceived(SearchTransportService.java:328)
   org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
   org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:618)
   org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:613)
   org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
   java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   java.lang.Thread.run(Thread.java:748)

Thanks so much,

Adam

Mark_Harwood · December 20, 2017, 11:24pm

There are a number of execution modes when it comes to gathering terms. Try the "map" execution hint. See https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html#_execution_hint_2
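
For reference, the hint goes inside the significant_terms body. Below is a minimal Python sketch that rebuilds the sampler aggregation from the first post with `"execution_hint": "map"` added. The helper name `sampled_significant_terms` is made up, and the other settings are copied from the posted query. The linked docs note that Elasticsearch may ignore the hint when it is not applicable, so treat this as something to try and measure.

```python
import json


def sampled_significant_terms(field, shard_size=1000, execution_hint="map"):
    """Sampler + significant_terms aggregation with an execution hint.

    Hypothetical helper mirroring the aggregation from the first post.
    """
    return {
        "tracks": {
            "sampler": {"shard_size": shard_size},
            "aggregations": {
                "tracks": {
                    "significant_terms": {
                        "field": field,
                        "size": 1000,
                        "min_doc_count": 2,
                        "shard_min_doc_count": 2,
                        # Collect terms from the sampled docs directly
                        # instead of going through global ordinals.
                        "execution_hint": execution_hint,
                        "mutual_information": {
                            "include_negatives": False,
                            "background_is_superset": True,
                        },
                    }
                }
            },
        }
    }


aggs = sampled_significant_terms("tracks.raw")
request_body = json.dumps({"size": 0, "aggregations": aggs}, indent=2)
```

The `query`, `min_score` and `sort` sections of the original request are omitted here and would be merged into the same body unchanged.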

adam3 · December 21, 2017, 2:50pm

Mark,

That tweak seems to have had a tremendous positive effect.

Thanks so much!!

Adam

system · January 18, 2018, 2:50pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
