Hi,
In Spark, I want to draw a random sample from a large Elasticsearch index. I tried the following command:
df = spark.read.format("org.elasticsearch.spark.sql") \
    .option("query", myquery) \
    .option("pushdown", "true") \
    .load(spark.conf.get("spark.es.resource")) \
    .limit(mysize)
where myquery is:
{
  "query": {
    "filtered": {
      "query": {
        "function_score": {
          "functions": [
            {
              "random_score": {
                "seed": 1
              }
            }
          ],
          "score_mode": "sum"
        }
      }
    }
  }
}
and mysize is ~10K. The problem is that I don't get the same result each time I run the command, even though the seed is the same. Why? And what is the best way to sample the data without first matching all of it and then calling sample on that huge DataFrame?
Thank you for your help