Spark SQL : How to specify a date range in WHERE clause?

(Preeti Raj - Buchhada) #1


I've got some success working with Spark SQL CLI to access our ES data.
ES: 1.3.2
es-hadoop: elasticsearch-hadoop-2.2.0
Spark: spark-1.4.1-bin-hadoop2.6

The pushdown feature is working great!

However, I'm stuck with how to specify a date range in the WHERE clause, so that it gets pushed down to ES?

I've tried:

SELECT count(member_id),member_id 
FROM data 
WHERE ( response_timestamp > CAST('2015-03-01' AS date) AND response_timestamp < CAST('2015-03-31' AS date) ) GROUP BY member_id ;

but it doesn't push down the range filter to ES.

Is this possible with the current es-hadoop version in the first place?
Please help!

Hive queries to read ES data taking too long | Need suggestions to improve
(Costin Leau) #2

ES-Hadoop implements all the filters/pushdown hooks available in Spark SQL. If Spark SQL doesn't push the operation, ES-Hadoop has no chance of doing the translation.
Likely in your case, the CAST is the one to blame.
Further more, I would recommend upgrading the Spark 1.6.1 as it has significant new features in Spark SQL.

(Chris Jones) #3

I'm having the same issue. In the original example, I don't see how the CAST would be to blame, as the CAST should be executed once per constant string, the string converted to a date and the constant date pushed down into ElasticSearch.

So, is there a hook that is needed? Is there a better way to formulate the query so that it does pushdown the (very simple) date filter into ES.

I have an index with millions of rows. I want to filter on date. But if the date filter is not pushed down, the millions of rows are loaded into Spark before the filter occurs, which means the query takes foreever. Running a date filter in Kibana or in Elastic DSL doesn't take nearly as long.

(James Baiera) #4

The best route would be to examine how the Catalyst optimizers react to the given query above. All hooks that are available are implemented at this time, and Date is definitely included in them.

(Chris Jones) #5

Dear James,Thanks for your response.
I'm not a developer of the elasticsearch-hadoop project , so I don't know how to
begin to evaluate the catalyst optimizers.
I'm very surprised that this very simple, innocent query below doesn't push down
into ES, nor has anyone found or suggested a reworking of the query to make the
pushdown work.

SELECT * FROM table1 WHERE createdAt=DATE(’3/6/2017’);
My instance of table1 has 20 million records, and the query processor loads the
20 million records into Spark in order to do a simple date filter.

(system) #6