Hello,
I've had some success using the Spark SQL CLI to access our ES data.
Environment:
ES: 1.3.2
es-hadoop: elasticsearch-hadoop-2.2.0
Spark: spark-1.4.1-bin-hadoop2.6
The pushdown feature is working great!
However, I noticed that for queries like:
SELECT count(member_id),member_id FROM data WHERE ( member_id > 2049510 AND member_id < 2049520 ) GROUP BY member_id ;
es-hadoop ends up reading all matching documents from ES rather than computing the aggregation server-side, so these queries take far too long. In my case this one took 44 minutes!
The four groups returned (count, member_id):
5383   2049512
875911 2049513
363823 2049516
840819 2049519
Time taken: 2644.187 seconds, Fetched 4 row(s)
Filter pushdown was working, as confirmed by the logs:
2016-04-13 10:13:57 TRACE CommonsHttpTransport:463 - Tx [POST]@[x.x.x.x:9201][index/data/_search?search_type=scan&scroll=5m&size=50&_source=member_id&preference=_shards:1;_only_node:ShzMilJKRU6rByUEQpbs0g] w/ payload [{"query":{"filtered":{ "query":{"query_string":{ "query":"datatype:dal" }},"filter": { "and" : [ {"range":{"member_id":{"gt" :2049510}}},{"range":{"member_id":{"lt" :2049520}}} ] } }}}]
Is there a better way to get counts grouped by some field with the current es-hadoop version?
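For comparison, here is roughly what I'd hope would run on the ES side: a terms aggregation over member_id with the same range filter, returning only the per-value counts instead of every document. This is just a sketch against ES 1.x query DSL, using the index/type names from the log above and omitting the `datatype:dal` query_string that appears there; the aggregation name `counts_by_member` is made up.

```
POST /index/data/_search
{
  "size": 0,
  "query": {
    "filtered": {
      "filter": {
        "and": [
          { "range": { "member_id": { "gt": 2049510 } } },
          { "range": { "member_id": { "lt": 2049520 } } }
        ]
      }
    }
  },
  "aggs": {
    "counts_by_member": {
      "terms": { "field": "member_id" }
    }
  }
}
```

Issuing this directly (e.g. via curl) returns in well under a second on our cluster, which is why I'm wondering whether es-hadoop can be made to push the GROUP BY down too, or whether filters are the only thing pushed down in 2.2.0.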