Correct setting of "es.scroll.size" for optimal Spark read performance

Hi,

I am using Elasticsearch 5.4.1, Spark 2.1.1, and Elasticsearch Hadoop 5.4.2. I have three nodes hosting ~250 indices with ~750 shards (no replication), containing ~1.7 billion documents.

If I try to query one day's worth of data (~5m documents) in pyspark

import datetime

# pushdown translates the Spark SQL filters below into an Elasticsearch query
docs = spark.read\
    .format('org.elasticsearch.spark.sql')\
    .option('pushdown', True)\
    .option('double.filtering', True)\
    .load('index-name-*/type')
docs\
    .filter(docs.timestamp > datetime.datetime(2017, 5, 1, 0, 0))\
    .filter(docs.timestamp <= datetime.datetime(2017, 5, 2, 0, 0))\
    .count()  # solely for demonstration

it takes virtually forever (>30 min?). The solution I found is to increase "es.scroll.size" from its default value of 50 to 10000, the default value of "index.max_result_window". The operation then completes in < 200 s!
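For reference, a minimal sketch of how the larger scroll size can be passed as a read option (same placeholder index/type names as above; the value 10000 simply mirrors "index.max_result_window"):

# same read as before, but with a larger per-request scroll size
docs = spark.read\
    .format('org.elasticsearch.spark.sql')\
    .option('pushdown', True)\
    .option('double.filtering', True)\
    .option('es.scroll.size', 10000)\
    .load('index-name-*/type')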

Is there any particular reason for the default setting of 50? Maybe this could be updated in the documentation?

Glad to hear you were able to fix your performance issue! 50 is a reasonable starting number for most users. In many cases you don't want the scroll size to default to 10000 because of the increased memory pressure on whatever is reading that much data at once.
