Sampler aggregation fails to optimize queries

eladempow · October 2, 2019, 11:56am

I need to calculate some metrics for a dashboard view of a ~30 GB index.
As I understand it, the sampler aggregation can be used to run faster calculations on a small sample of the data. However, the performance I get is abysmal even with a very small sample size, which defeats the purpose of using the sampler aggregation.

Example (runs for 6 seconds):

GET index_prefix*/_search?size=0
{
  "aggs": {
    "sample": {
      "sampler": {
        "shard_size": 10
      },
      "aggs": {
        "last_month_events": {
          "filter": {
            "range": {
              "@timestamp": {
                "gte": "now-30d"
              }
            }
          }
        }
      }
    }
  }
}

Am I missing something here? Is it possible to achieve good performance for this query?

Thanks

Mark_Harwood · October 4, 2019, 9:45am

The sampler aggregation gets the best-scoring docs. In your example request you have no query so there is no notion of "best" - it just iterates over all docs in the index hoping to find the highest scoring docs (they will all score "1" in your example).
You then filter this sample by your date range.

It would make more sense to let the search index do the work and put your range criteria in the query part of the request. This would mean we'd only iterate over docs that match the criteria.
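
As a rough sketch of that suggestion (an illustration, not a tested or benchmarked request): the range moves out of the filter sub-aggregation and into the query, so the sampler only sees docs from the last 30 days. The index pattern and the @timestamp field are taken from the original request; the sub-aggregation name "latest_event" and its max metric are hypothetical placeholders for whatever metrics the dashboard actually needs.

```
# Sketch only: "latest_event" is a hypothetical placeholder metric
GET index_prefix*/_search?size=0
{
  "query": {
    "range": {
      "@timestamp": {
        "gte": "now-30d"
      }
    }
  },
  "aggs": {
    "sample": {
      "sampler": {
        "shard_size": 10
      },
      "aggs": {
        "latest_event": {
          "max": {
            "field": "@timestamp"
          }
        }
      }
    }
  }
}
```

With the range in the query, the number of matching docs is reported in hits.total rather than in a filter bucket. In 7.x that total is only tracked up to 10,000 by default unless "track_total_hits": true is added to the request body.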

eladempow · October 6, 2019, 8:06am

Sorry, this is the correct query (a uniform sample of documents):

GET index_prefix*/_search?size=0
{
  "query": {
    "function_score": {
      "random_score": {}
    }
  },
  "aggs": {
    "sample": {
      "sampler": {
        "shard_size": 10
      },
      "aggs": {
        "last_month_events": {
          "filter": {
            "range": {
              "@timestamp": {
                "gte": "now-30d"
              }
            }
          }
        }
      }
    }
  }
}

The performance is still bad.

Mark_Harwood · October 6, 2019, 8:30am

My point about putting the date criteria in the query part of the request still stands.
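
One way this could be combined with the random scoring, as an untested sketch: function_score accepts an inner query, so the range can be placed there and random_score is then only computed for docs that match it. The index pattern and field are from the request above; the "latest_event" sub-aggregation is again a hypothetical placeholder.

```
# Sketch only: "latest_event" is a hypothetical placeholder metric
GET index_prefix*/_search?size=0
{
  "query": {
    "function_score": {
      "query": {
        "range": {
          "@timestamp": {
            "gte": "now-30d"
          }
        }
      },
      "random_score": {}
    }
  },
  "aggs": {
    "sample": {
      "sampler": {
        "shard_size": 10
      },
      "aggs": {
        "latest_event": {
          "max": {
            "field": "@timestamp"
          }
        }
      }
    }
  }
}
```

Note that this changes what the sample represents: it is a random sample of the last 30 days of docs, not of the whole index.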

eladempow · October 6, 2019, 8:46am

Using a simple count query by date range is still too slow (a few seconds).

Mark_Harwood · October 7, 2019, 9:58am

What version of elasticsearch are you running?

eladempow · October 7, 2019, 10:34am

ELK 7.2.1
(As far as I understand, the function score is calculated for the entire index during the aggregation, which takes a long time. Sampling a few documents uniformly should be faster.)

Mark_Harwood · October 7, 2019, 10:35am

Thanks.

Can you share this query JSON?

system · November 4, 2019, 10:35am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
