Hi,
In Spark, I want to draw a random sample from a large Elasticsearch index. I tried the following command:
df = spark.read.format("org.elasticsearch.spark.sql") \
    .option("query", myquery) \
    .option("pushdown", "true") \
    .load(spark.conf.get("spark.es.resource")) \
    .limit(mysize)
where myquery is:
{
  "query": {
    "filtered": {
      "query": {
        "function_score": {
          "functions": [
            {
              "random_score": {
                "seed": 1
              }
            }
          ],
          "score_mode": "sum"
        }
      }
    }
  }
}
and mysize is ~10K. The problem is that I don't get the same result each time I run the command, even though the seed is the same. Why? And what is the best way to sample the data without first matching all of it and then calling sample on that huge DataFrame?
Thank you for your help