Timezone issue with default double filtering of timestamps in Spark


I am using Spark 2.1.1 (PySpark) and Elasticsearch 5.4.2. Using the default values (double filtering enabled, strict mode disabled), the following code snippet misses up to 66 % of the 33M documents stored in the indexes, depending on the width of the time window:

import datetime

docs = spark.read.format('org.elasticsearch.spark.sql')\
    .option('pushdown', True)\
    .option('double.filtering', True)\
    .option('strict', False)\
    .load(index_name)  # index name omitted
# Start and stop time
start = datetime.datetime(2017, 5, 1, 0, 0)
stop = datetime.datetime(2017, 5, 5, 0, 0)
# Loop through the time window in steps of e.g. 3 hours
step = datetime.timedelta(hours=3)
now = start
total = 0
while now < stop:
    # Remember the counts to compute the sum of all retrieved documents...
    count = docs\
        .filter(docs.timestamp >= now)\
        .filter(docs.timestamp < now + step)\
        .count()
    total += count
    now += step

The timestamp filter is pushed down correctly:

17/07/25 07:51:46 DEBUG DataSource: Pushing down filters [IsNotNull(timestamp),GreaterThanOrEqual(timestamp,2017-05-01 00:00:00.0),LessThan(timestamp,2017-05-01 03:00:00.0)]
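Note that the pushed-down filter strings carry no timezone offset, so the same wall-clock value can denote different instants depending on which zone each side assumes. A minimal, self-contained illustration of that ambiguity (plain Python, no Spark or Elasticsearch involved):

```python
import datetime

naive = datetime.datetime(2017, 5, 1, 0, 0)
print(str(naive))  # '2017-05-01 00:00:00' -- no offset information

# The same wall-clock string denotes different instants in different zones:
utc = naive.replace(tzinfo=datetime.timezone.utc)
cest = naive.replace(tzinfo=datetime.timezone(datetime.timedelta(hours=2)))
print((utc - cest).total_seconds())  # 7200.0 -- two hours apart
```

If Elasticsearch interprets the filter string in one zone and the second (Spark-side) filtering pass interprets it in another, documents near the window edges would be dropped, which is consistent with the missed fraction growing for narrower windows.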

Setting "double.filtering" to "False" reduces the fraction of missed documents to 1.5 % consistently across several time windows, which might point to an internal timezone conversion issue between Python and Elasticsearch.

Attaching a timezone to the datetime objects does not improve the situation either.
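One way to take the ambiguity out of the comparison values entirely is to filter on epoch milliseconds rather than datetime objects. This is only a sketch: it assumes the timestamps are stored as UTC instants and that a numeric millisecond representation of the field (called `timestamp_ms` here, purely for illustration) is available to filter on:

```python
import datetime

def epoch_ms(dt):
    # Interpret a naive datetime as UTC and convert to epoch milliseconds.
    aware = dt.replace(tzinfo=datetime.timezone.utc)
    return int(aware.timestamp() * 1000)

start = datetime.datetime(2017, 5, 1, 0, 0)
print(epoch_ms(start))  # 1493596800000
```

The loop's filters would then become something like `docs.filter(docs.timestamp_ms >= epoch_ms(now)).filter(docs.timestamp_ms < epoch_ms(now + step))`, so both the pushed-down query and the Spark-side pass compare plain integers with no timezone interpretation involved.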
