Hi,
I am using Spark 2.1.1 (PySpark) and Elasticsearch 5.4.2. With the default settings (double filtering enabled, strict mode disabled), the following code snippet misses up to 66% of the 33M documents stored in the indexes, depending on the width of the time window:
import datetime

docs = spark.read.format('org.elasticsearch.spark.sql') \
    .option('pushdown', True) \
    .option('double.filtering', True) \
    .option('strict', False) \
    .load('index-*/type')

# Start and stop of the time range
start = datetime.datetime(2017, 5, 1, 0, 0)
stop = datetime.datetime(2017, 5, 5, 0, 0)

# Walk through the time range in steps of e.g. 3 hours
step = datetime.timedelta(hours=3)
now = start
while now < stop:
    # Remember the counts to compute the sum of all retrieved documents...
    count = docs \
        .filter(docs.timestamp >= now) \
        .filter(docs.timestamp < now + step) \
        .count()
    now += step
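For completeness, the missed fraction quoted above comes from summing the per-window counts and comparing the sum against the unfiltered total. Roughly like this (the accumulator variable and the print are just my bookkeeping, not part of the job itself):

# Sum the per-window counts and compare against the full index count
total_windowed = 0
now = start
while now < stop:
    total_windowed += docs \
        .filter(docs.timestamp >= now) \
        .filter(docs.timestamp < now + step) \
        .count()
    now += step

missed = 1.0 - total_windowed / float(docs.count())
print('missed fraction: {:.1%}'.format(missed))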
The timestamp filter is pushed down correctly:
17/07/25 07:51:46 DEBUG DataSource: Pushing down filters [IsNotNull(timestamp),GreaterThanOrEqual(timestamp,2017-05-01 00:00:00.0),LessThan(timestamp,2017-05-01 03:00:00.0)]
Setting "double.filtering" to "False" reduces the fraction of missed documents to 1.5 % consistently for several time windows which might point to a internal timezone conversion issue between python and elasticsearch.
Attaching a timezone to the datetime objects does not improve the situation.
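For example, making the window boundaries timezone-aware like this (UTC via Python 3's datetime.timezone is just one of the variants I tried) yields the same missing documents:

import datetime

# Attach an explicit UTC timezone to the window boundaries
start = datetime.datetime(2017, 5, 1, 0, 0, tzinfo=datetime.timezone.utc)
stop = datetime.datetime(2017, 5, 5, 0, 0, tzinfo=datetime.timezone.utc)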