Spark SQL: count(*) fetches all data

(Preeti Raj - Buchhada) #1


I've got some success working with Spark SQL CLI to access our ES data.
ES: 1.3.2
es-hadoop: elasticsearch-hadoop-2.2.0
Spark: spark-1.4.1-bin-hadoop2.6

The pushdown feature is working great!

However, I noticed that for queries like:

SELECT count(member_id),member_id FROM data WHERE ( member_id > 2049510 AND member_id < 2049520 ) GROUP BY member_id ;

es-hadoop ends up reading all data from ES.

And hence these queries are taking way too long. In my case it took 44 minutes!

5383 2049512
875911 2049513
363823 2049516
840819 2049519
Time taken: 2644.187 seconds, Fetched 4 row(s)

Pushdown was working as confirmed from the logs:

2016-04-13 10:13:57 TRACE CommonsHttpTransport:463 - Tx [POST]@[x.x.x.x:9201][index/data/_search?search_type=scan&scroll=5m&size=50&_source=member_id&preference=_shards:1;_only_node:ShzMilJKRU6rByUEQpbs0g] w/ payload [{"query":{"filtered":{ "query":{"query_string":{ "query":"datatype:dal" }},"filter": { "and" : [ {"range":{"member_id":{"gt" :2049510}}},{"range":{"member_id":{"lt" :2049520}}} ] } }}}]

Is there a better way to get counts grouped by some field with the current es-hadoop version?

Hive queries to read ES data taking too long | Need suggestions to improve
(Costin Leau) #2

Not sure what you are asking and why you think es-hadoop reads all the data. It's the Spark SQL planner that decides what queries to make which in turn use ES-Hadoop to get the data from ES; in some cases by pushing down, in others by requesting the data.
Outside enabling logging in ES-Hadoop to see the queries that it makes, check out Spark SQL and its performance tuning section in the docs.

(system) #3