Hi,
I am using Elasticsearch 5.4.1, Spark 2.1.1 and Elasticsearch Hadoop 5.4.2. I have a three-node cluster with ~250 indices and ~750 shards (no replication) containing ~1.7 billion documents.
If I try to query one day's worth of data (~5M documents) in PySpark:
import datetime

# Read the indices via the es-hadoop Spark SQL connector with pushdown enabled
docs = spark.read\
    .format('org.elasticsearch.spark.sql')\
    .option('pushdown', True)\
    .option('double.filtering', True)\
    .load('index-name-*/type')

# Filter down to one day of data and count it
docs\
    .filter(docs.timestamp > datetime.datetime(2017, 5, 1, 0, 0))\
    .filter(docs.timestamp <= datetime.datetime(2017, 5, 2, 0, 0))\
    .count() # solely for demonstration
it takes virtually forever (>30 minutes?). The solution I found is to increase "es.scroll.size" from its default of 50 to the default "index.max_result_window" value of 10000. The operation then completes in under 200 seconds!
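For reference, the only change was adding the scroll size option to the same reader (a minimal sketch; the value 10000 simply mirrors the default index.max_result_window):

# Same reader as above, but with es.scroll.size raised to 10000
docs = spark.read\
    .format('org.elasticsearch.spark.sql')\
    .option('pushdown', True)\
    .option('double.filtering', True)\
    .option('es.scroll.size', 10000)\
    .load('index-name-*/type')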
Is there any particular reason for the default of 50? Maybe the documentation could be updated to mention this?