Spark SQL: count(*) fetches all data

Preeti_Raj_Buchhada · April 13, 2016, 11:37am

Hello,

I've got some success working with Spark SQL CLI to access our ES data.
Environment:
ES: 1.3.2
es-hadoop: elasticsearch-hadoop-2.2.0
Spark: spark-1.4.1-bin-hadoop2.6

The pushdown feature is working great!

However, I noticed that for queries like:

SELECT count(member_id),member_id FROM data WHERE ( member_id > 2049510 AND member_id < 2049520 ) GROUP BY member_id ;

es-hadoop ends up reading all data from ES.

And hence these queries are taking way too long. In my case it took 44 minutes!

5383 2049512
875911 2049513
363823 2049516
840819 2049519
Time taken: 2644.187 seconds, Fetched 4 row(s)

Pushdown was working as confirmed from the logs:

2016-04-13 10:13:57 TRACE CommonsHttpTransport:463 - Tx [POST]@[x.x.x.x:9201][index/data/_search?search_type=scan&scroll=5m&size=50&_source=member_id&preference=_shards:1;_only_node:ShzMilJKRU6rByUEQpbs0g] w/ payload [{"query":{"filtered":{ "query":{"query_string":{ "query":"datatype:dal" }},"filter": { "and" : [ {"range":{"member_id":{"gt" :2049510}}},{"range":{"member_id":{"lt" :2049520}}} ] } }}}]

Is there a better way to get counts grouped by some field with the current es-hadoop version?

costin · April 13, 2016, 11:59am

Not sure what you are asking and why you think es-hadoop reads all the data. It's the Spark SQL planner that decides what queries to make which in turn use ES-Hadoop to get the data from ES; in some cases by pushing down, in others by requesting the data.
Outside enabling logging in ES-Hadoop to see the queries that it makes, check out Spark SQL and its performance tuning section in the docs.

Topic		Replies	Views
Hive queries to read ES data taking too long \| Need suggestions to improve Elasticsearch es-hadoop	4	2252	July 6, 2017
Aggregation running very slow Elasticsearch es-hadoop	8	2999	July 6, 2017
Slow Performance of Elastic Search with Spark Elasticsearch es-hadoop	4	1535	July 29, 2021
Performance Challenge Elasticsearch es-hadoop	6	1081	April 28, 2017
Spark SQL advices for performance Elasticsearch es-hadoop	3	1441	July 6, 2017

Spark SQL: count(*) fetches all data

Related topics