Hi,
I am using Spark 2.1.1 (PySpark) and Elasticsearch 5.4.2. With the default settings (double filtering enabled, strict mode disabled), the following code snippet misses up to 66% of the 33M documents stored in the indexes, depending on the width of the time window:
import datetime

docs = spark.read.format('org.elasticsearch.spark.sql') \
    .option('pushdown', True) \
    .option('double.filtering', True) \
    .option('strict', False) \
    .load('index-*/type')

# Start and stop of the time range
start = datetime.datetime(2017, 5, 1, 0, 0)
stop = datetime.datetime(2017, 5, 5, 0, 0)

# Walk through the time range in steps of e.g. 3 hours
step = datetime.timedelta(hours=3)
now = start
while now < stop:
    # Remember the counts to compute the sum of all retrieved documents...
    count = docs \
        .filter(docs.timestamp >= now) \
        .filter(docs.timestamp < now + step) \
        .count()
    now += step
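For completeness, the missed fraction quoted above comes from summing the per-window counts and comparing the sum against the unfiltered total. Roughly like this (the accumulator variable and the print are just my bookkeeping, not part of the job itself):

# Sum the per-window counts and compare against the full index count
total_windowed = 0
now = start
while now < stop:
    total_windowed += docs \
        .filter(docs.timestamp >= now) \
        .filter(docs.timestamp < now + step) \
        .count()
    now += step

missed = 1.0 - total_windowed / float(docs.count())
print('missed fraction: {:.1%}'.format(missed))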
The timestamp filter is pushed down correctly:
17/07/25 07:51:46 DEBUG DataSource: Pushing down filters [IsNotNull(timestamp),GreaterThanOrEqual(timestamp,2017-05-01 00:00:00.0),LessThan(timestamp,2017-05-01 03:00:00.0)]
Setting "double.filtering" to "False" reduces the fraction of missed documents to 1.5 % consistently for several time windows which might point to a internal timezone conversion issue between python and elasticsearch.
Attaching a timezone to the datetime objects does not improve the situation.
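For example, making the window boundaries timezone-aware like this (UTC via Python 3's datetime.timezone is just one of the variants I tried) yields the same missing documents:

import datetime

# Attach an explicit UTC timezone to the window boundaries
start = datetime.datetime(2017, 5, 1, 0, 0, tzinfo=datetime.timezone.utc)
stop = datetime.datetime(2017, 5, 5, 0, 0, tzinfo=datetime.timezone.utc)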