Sampling aggregation with a fixed seed producing unstable results

We would like to use sampling aggregations to improve aggregation performance in some dashboards, but the results must be the same each time to avoid confusing customers. We are therefore setting the seed, since according to the docs, when a seed is set "the random subset of documents is the same between calls".

In practice, we are finding that this doesn't work reliably.

I have reproduced this in this search:

curl -XPOST 'http://localhost:9200/my_index/_search?size=0&pretty' -H 'Content-Type: application/json' -d '
{
  "query": {
    "bool": {
      "filter": [{
        "range": {
          "time": {
            "gte": "2023-04-11",
            "lt": "2023-04-12"
          }
        }
      }]
    }
  },
  "aggs": {
    "sampling": {
      "random_sampler": {
        "probability": 0.1,
        "seed": 1
      },
      "aggs": {
        "allCount": { "sum": { "field": "count" } }
      }
    }
  }
}
'

If I run this repeatedly, most of the time it gives me:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "sampling" : {
      "seed" : 1,
      "probability" : 0.1,
      "doc_count" : 52515,
      "allCount" : {
        "value" : 5164490.0
      }
    }
  }
}

But every so often it gives me a different result:

{
  "took" : 8,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "sampling" : {
      "seed" : 1,
      "probability" : 0.1,
      "doc_count" : 52341,
      "allCount" : {
        "value" : 5173150.0
      }
    }
  }
}

The index in question is green and has no recent write activity, so it shouldn't be changing.

Am I doing something silly or is this potentially a bug?

Our ES is a five-node cluster running 8.5.2.

Thanks,

Geoff


For anybody else who runs into this: the answer appears to be that the result depends on a combination of the seed and the shard, so you need to set a shard preference on the search to get a fully stable result.
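
As a rough sketch of what that can look like: the search API accepts a `preference` query parameter, and a fixed custom string (anything not starting with `_`) should route repeated searches to the same shard copies as long as the cluster state doesn't change. The value `dashboard-sampling` and the function name `run_sampled_search` below are made up for illustration, and I'm assuming the instability comes from requests landing on different copies of the shard. Everything else is the same request as above.

```shell
# Hypothetical preference value: any fixed string that does not start with "_".
PREFERENCE='dashboard-sampling'

# Same endpoint as the original search, with only the preference parameter added.
SEARCH_URL="http://localhost:9200/my_index/_search?size=0&pretty&preference=${PREFERENCE}"

# Wraps the original request so it can be run several times and compared.
run_sampled_search() {
  curl -XPOST "$SEARCH_URL" -H 'Content-Type: application/json' -d '
{
  "query": {
    "bool": {
      "filter": [{
        "range": {
          "time": {
            "gte": "2023-04-11",
            "lt": "2023-04-12"
          }
        }
      }]
    }
  },
  "aggs": {
    "sampling": {
      "random_sampler": {
        "probability": 0.1,
        "seed": 1
      },
      "aggs": {
        "allCount": { "sum": { "field": "count" } }
      }
    }
  }
}
'
}
```

Calling `run_sampled_search` a number of times and comparing `doc_count` and `allCount` between runs is a quick way to check whether the preference is enough on a given cluster. If shards get relocated or the index changes, I'd expect the numbers could still shift.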
