Correct setting of "es.scroll.size" with for optimal Spark read performance

ntim · June 28, 2017, 10:07am

Hi,

I am using ElasticSearch 5.4.1, Spark 2.1.1 and Elasticsearch Hadoop 5.4.2. I have three nodes with ~250 indices with ~750 shards (no replication) containing ~1.7 billion documents.

If I try to query one day worth of data (~5m documents) in pyspark

docs = spark.read\
    .format('org.elasticsearch.spark.sql')\
    .option('pushdown', True)\
    .option('double.filtering', True)\
    .load('index-name-*/type')
docs\
    .filter(docs.timestamp > datetime.datetime(2017, 5, 1, 0, 0))\
    .filter(docs.timestamp <= datetime.datetime(2017, 5, 2, 0, 0))\
    .count() # solely for demonstration

it takes virtually forever (>30m ?). The solution I found is to increase the "es.scroll.size" from its default value of 50 to the default setting of "index.max_result_window" of 10000. The operation then is completed in < 200s!

Is there any particular reason for the default setting of 50? Maybe this could be updated in the documentation?

james.baiera · June 29, 2017, 2:12pm

Glad to hear you were able to fix your performance issue! 50 is a reasonable starting number for most users. In many cases you don't want the scroll size to default to 10000 due to the increased memory pressure on anything that is reading that much data all at once.

system · July 27, 2017, 2:12pm

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Spark Reading/Writing Performance Elasticsearch es-hadoop	2	1618	July 6, 2017
ES ignores queries through Spark Elasticsearch	3	821	July 6, 2017
Reading from Elasticsearch index using spark ( es-hadoop ) connectors Elasticsearch es-hadoop	2	1708	March 22, 2022
EsHadoopInvalidRequest Batch size is too large Elasticsearch es-hadoop	2	1185	October 26, 2017
Scan/scroll - optimal "size" parameter Elasticsearch	2	1126	July 5, 2017

Correct setting of "es.scroll.size" with for optimal Spark read performance

Related topics