I am using Elasticsearch 5.4.1, Spark 2.1.1, and elasticsearch-hadoop 5.4.2. I have three nodes with ~250 indices and ~750 shards (no replication), containing ~1.7 billion documents.
If I try to query one day's worth of data (~5M documents) in pyspark:
    import datetime

    docs = spark.read \
        .format('org.elasticsearch.spark.sql') \
        .option('pushdown', True) \
        .option('double.filtering', True) \
        .load('index-name-*/type')

    docs \
        .filter(docs.timestamp > datetime.datetime(2017, 5, 1, 0, 0)) \
        .filter(docs.timestamp <= datetime.datetime(2017, 5, 2, 0, 0)) \
        .count()  # solely for demonstration
it takes virtually forever (>30 minutes). The solution I found is to raise "es.scroll.size" from its default value of 50 to 10000, which is the default of "index.max_result_window". The operation then completes in under 200 seconds!
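For what it's worth, the speedup roughly matches the drop in scroll round trips. A small sketch of that arithmetic, using the numbers above (the `scroll_requests` helper is mine, and this ignores that scrolling actually happens per shard/partition, so it is only an order-of-magnitude estimate):

    # Rough count of scroll round trips needed to page through ~5M documents.
    TOTAL_DOCS = 5_000_000

    def scroll_requests(total_docs, scroll_size):
        """Ceiling division: number of scroll requests to fetch total_docs."""
        return -(-total_docs // scroll_size)

    print(scroll_requests(TOTAL_DOCS, 50))      # default es.scroll.size -> 100000
    print(scroll_requests(TOTAL_DOCS, 10_000))  # raised scroll size     -> 500

Going from ~100,000 round trips to ~500 would plausibly explain a >30 minute query dropping to a few minutes.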
Is there any particular reason for the default of 50? And maybe the documentation could call out "es.scroll.size" as a tuning knob for this?