I am using Spark 2.1.1 (PySpark) and Elasticsearch 5.4.2. With the default values (double filtering enabled, strict mode disabled), the following code snippet misses up to 66 % of the 33M documents stored in the indexes, depending on the width of the time window:
docs = spark.read.format('org.elasticsearch.spark.sql')\
    .option('pushdown', True)\
    .option('double.filtering', True)\
    .option('strict', False)\
    .load('index-*/type')

# Start and stop time
start = datetime.datetime(2017, 5, 1, 0, 0)
stop = datetime.datetime(2017, 5, 5, 0, 0)

# Loop through the time window in steps of e.g. 3 hours
step = datetime.timedelta(hours=3)
now = start
while now < stop:
    # Remember the counts to compute the sum of all retrieved documents...
    count = docs\
        .filter(docs.timestamp >= now)\
        .filter(docs.timestamp < now + step)\
        .count()
    now += step
The timestamp filter is pushed down correctly:
17/07/25 07:51:46 DEBUG DataSource: Pushing down filters [IsNotNull(timestamp),GreaterThanOrEqual(timestamp,2017-05-01 00:00:00.0),LessThan(timestamp,2017-05-01 03:00:00.0)]
Setting "double.filtering" to "False" consistently reduces the fraction of missed documents to 1.5 % across several time windows, which might point to an internal timezone conversion issue between Python and Elasticsearch.
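To illustrate the suspected conversion problem with a stdlib-only sketch (this is not the connector's actual code path, just the ambiguity I mean): the snippet above builds naive datetimes, so whatever serializes the pushed-down filter has to guess a timezone for them.

```python
import datetime

# The loop above builds naive datetimes (no tzinfo). If the component that
# serializes the pushed-down filter interprets them in the local timezone
# while Elasticsearch stores timestamps in UTC, every window boundary
# shifts by the local UTC offset, dropping documents near the edges.
naive = datetime.datetime(2017, 5, 1, 0, 0)
aware = datetime.datetime(2017, 5, 1, 0, 0, tzinfo=datetime.timezone.utc)

assert naive.tzinfo is None                        # ambiguous: local or UTC?
assert aware.utcoffset() == datetime.timedelta(0)  # unambiguously UTC

# Example: the same wall-clock boundary read as UTC+2 instead of UTC lands
# two hours earlier on the UTC axis.
as_utc_plus_2 = naive.replace(
    tzinfo=datetime.timezone(datetime.timedelta(hours=2)))
shift = aware - as_utc_plus_2
assert shift == datetime.timedelta(hours=2)
```

A two-hour shift per boundary would be enough to drop or double-count every document within two hours of a window edge, which is consistent with the loss growing as the windows get narrower.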
Attaching a timezone to the datetime objects does not improve the situation.
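For reference, this is how I would build the window boundaries as timezone-aware UTC datetimes and convert them to epoch milliseconds, which is the representation Elasticsearch uses internally for date fields. Comparing against plain integers would sidestep any datetime serialization on the Python side, though I have not confirmed whether this changes the pushdown behaviour (Python 3 stdlib only; the commented DataFrame usage is hypothetical):

```python
import datetime

def to_epoch_millis(dt):
    # Epoch milliseconds, the canonical storage format for an Elasticsearch
    # date field; requires a timezone-aware datetime (Python 3).
    return int(dt.timestamp() * 1000)

start = datetime.datetime(2017, 5, 1, 0, 0, tzinfo=datetime.timezone.utc)
step = datetime.timedelta(hours=3)

lower = to_epoch_millis(start)         # 1493596800000
upper = to_epoch_millis(start + step)  # 1493607600000

# Hypothetical use with the DataFrame from the snippet above, assuming the
# mapping accepts numeric comparisons on the date field:
# count = docs.filter(docs.timestamp >= lower)\
#             .filter(docs.timestamp < upper)\
#             .count()
```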