Sampler aggregation performance vs 2 queries


(Adam Chase) #1

I am trying to improve the performance of an Elasticsearch application.

In this case there is a single Elasticsearch node (non-sharded) with about 7M docs; the index is roughly 40 GB.

We were running one query to get the top docs (about 1000), then a second query that filters on the ids of those 1000 docs and aggregates on some fields of those docs, scoring the terms with mutual_information.

I had thought that a sampler aggregation would help by doing 1 query instead of 2, and that this would speed things up, but I am not seeing that currently.

Here's the original query:
{
  "from" : 0,
  "size" : 1000,
  "query" : {
    "bool" : {
      "should" : [
        {
          "match" : {
            "name" : {
              "query" : "workout",
              "operator" : "AND",
              "prefix_length" : 0,
              "max_expansions" : 50,
              "fuzzy_transpositions" : true,
              "lenient" : false,
              "zero_terms_query" : "NONE",
              "boost" : 1.0
            }
          }
        }
      ],
      "disable_coord" : false,
      "adjust_pure_negative" : true,
      "boost" : 1.0
    }
  },
  "min_score" : 7.0,
  "_source" : false,
  "sort" : [
    {
      "_score" : {
        "order" : "desc"
      }
    },
    {
      "mau" : {
        "order" : "desc"
      }
    }
  ]
}

followed by:
{
  "from" : 0,
  "size" : 1000,
  "query" : {
    "bool" : {
      "filter" : [
        {
          "ids" : {
            "type" : [ ],
            "values" : [
              "AWBNMzCn5eVrMnnV89Iw",
              .
              .
              . (lots of these)
            ],
            "boost" : 1.0
          }
        }
      ],
      "disable_coord" : false,
      "adjust_pure_negative" : true,
      "boost" : 1.0
    }
  },
  "sort" : [
    {
      "_score" : {
        "order" : "desc"
      }
    },
    {
      "mau" : {
        "order" : "desc"
      }
    }
  ],
  "aggregations" : {
    "tracks" : {
      "significant_terms" : {
        "field" : "tracks.raw",
        "size" : 1000,
        "min_doc_count" : 2,
        "shard_min_doc_count" : 0,
        "mutual_information" : {
          "include_negatives" : false,
          "background_is_superset" : true
        }
      }
    }
  }
}

versus the single query with the sampler aggregation:
{
  "from" : 0,
  "size" : 0,
  "query" : {
    "bool" : {
      "should" : [
        {
          "match" : {
            "name" : {
              "query" : "workout",
              "operator" : "AND",
              "prefix_length" : 0,
              "max_expansions" : 50,
              "fuzzy_transpositions" : true,
              "lenient" : false,
              "zero_terms_query" : "NONE",
              "boost" : 1.0
            }
          }
        }
      ],
      "disable_coord" : false,
      "adjust_pure_negative" : true,
      "boost" : 1.0
    }
  },
  "min_score" : 7.0,
  "_source" : false,
  "sort" : [
    {
      "_score" : {
        "order" : "desc"
      }
    },
    {
      "mau" : {
        "order" : "desc"
      }
    }
  ],
  "aggregations" : {
    "tracks" : {
      "sampler" : {
        "shard_size" : 1000
      },
      "aggregations" : {
        "tracks" : {
          "significant_terms" : {
            "field" : "tracks.raw",
            "size" : 1000,
            "min_doc_count" : 2,
            "shard_min_doc_count" : 2,
            "mutual_information" : {
              "include_negatives" : false,
              "background_is_superset" : true
            }
          }
        }
      }
    }
  }
}

Does that make sense?

Thanks,

Adam


(Mark Harwood) #2

I would have thought the sampler agg would be faster. Maybe the bulk of the time is spent in the expensive loop that is common to both approaches: looking up the background frequency of every term found in the matching docs. How many terms are in each doc?


(Adam Chase) #3

Mark,

There are a bunch of terms per doc; I'd guess about 65 per document.

I can see some hot threads if that helps...

9/10 snapshots sharing following 29 elements
   org.elasticsearch.search.aggregations.bucket.significant.GlobalOrdinalsSignificantTermsAggregator.buildAggregation(GlobalOrdinalsSignificantTermsAggregator.java:104)
   org.elasticsearch.search.aggregations.bucket.significant.GlobalOrdinalsSignificantTermsAggregator$WithHash.buildAggregation(GlobalOrdinalsSignificantTermsAggregator.java:158)
   org.elasticsearch.search.aggregations.AggregatorFactory$MultiBucketAggregatorWrapper.buildAggregation(AggregatorFactory.java:147)
   org.elasticsearch.search.aggregations.bucket.DeferringBucketCollector$WrappedAggregator.buildAggregation(DeferringBucketCollector.java:96)
   org.elasticsearch.search.aggregations.bucket.BucketsAggregator.bucketAggregations(BucketsAggregator.java:116)
   org.elasticsearch.search.aggregations.bucket.sampler.SamplerAggregator.buildAggregation(SamplerAggregator.java:171)
   org.elasticsearch.search.aggregations.AggregationPhase.execute(AggregationPhase.java:139)
   org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:114)
   org.elasticsearch.indices.IndicesService.lambda$loadIntoContext$16(IndicesService.java:1108)
   org.elasticsearch.indices.IndicesService$$Lambda$1799/1783108370.accept(Unknown Source)
   org.elasticsearch.indices.IndicesService.lambda$cacheShardLevelResult$18(IndicesService.java:1189)
   org.elasticsearch.indices.IndicesService$$Lambda$1803/1750113915.get(Unknown Source)
   org.elasticsearch.indices.IndicesRequestCache$Loader.load(IndicesRequestCache.java:160)
   org.elasticsearch.indices.IndicesRequestCache$Loader.load(IndicesRequestCache.java:143)
   org.elasticsearch.common.cache.Cache.computeIfAbsent(Cache.java:398)
   org.elasticsearch.indices.IndicesRequestCache.getOrCompute(IndicesRequestCache.java:116)
   org.elasticsearch.indices.IndicesService.cacheShardLevelResult(IndicesService.java:1195)
   org.elasticsearch.indices.IndicesService.loadIntoContext(IndicesService.java:1107)
   org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:245)
   org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:261)
   org.elasticsearch.action.search.SearchTransportService$6.messageReceived(SearchTransportService.java:331)
   org.elasticsearch.action.search.SearchTransportService$6.messageReceived(SearchTransportService.java:328)
   org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
   org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:618)
   org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:613)
   org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
   java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   java.lang.Thread.run(Thread.java:748)

Thanks so much,

Adam


(Mark Harwood) #4

There are a number of execution modes when it comes to gathering terms. Try the "map" execution hint. See https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html#_execution_hint_2
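
For reference, a rough sketch of where the hint would go in the sampler request posted above. The only addition is "execution_hint" : "map" inside significant_terms; everything else is copied from that request, except that the query is trimmed to the bare match clause for brevity (an assumption that the builder defaults can be dropped), and the sort is left out since size is 0. This has not been run against the poster's index, so treat it as a starting point rather than a tested fix.

```json
{
  "size" : 0,
  "query" : {
    "match" : {
      "name" : {
        "query" : "workout",
        "operator" : "AND"
      }
    }
  },
  "min_score" : 7.0,
  "aggregations" : {
    "tracks" : {
      "sampler" : {
        "shard_size" : 1000
      },
      "aggregations" : {
        "tracks" : {
          "significant_terms" : {
            "field" : "tracks.raw",
            "size" : 1000,
            "min_doc_count" : 2,
            "shard_min_doc_count" : 2,
            "execution_hint" : "map",
            "mutual_information" : {
              "include_negatives" : false,
              "background_is_superset" : true
            }
          }
        }
      }
    }
  }
}
```

The hot threads output points at GlobalOrdinalsSignificantTermsAggregator, i.e. the global-ordinals mode. With the "map" hint the terms are gathered from the field values of the collected docs instead, which is generally the better fit when only a small set of docs (here at most 1000 from the sampler) reaches the aggregation.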


(Adam Chase) #5

Mark,

That tweak seems to have had a tremendous positive effect.

Thanks so much!!

Adam

