ES Hadoop push down aggregations or not?

priamai · July 1, 2021, 6:24am

I am on ELK 7.13.0 and I have a super very slow query:

events_df = es_reader.load("priam_unified_host-{0}/unified-host".format(date))

counts_df = events_df.where('EventID == 4688').groupBy(col("LogHost"),col("ProcessName")).count()
counts_df.explain(extended=True)

This is the execution plan:


== Physical Plan ==
*(2) HashAggregate(keys=[LogHost#1347, ProcessName#1354], functions=[count(1)], output=[LogHost#1347, ProcessName#1354, count#1408L])
+- Exchange hashpartitioning(LogHost#1347, ProcessName#1354, 200), ENSURE_REQUIREMENTS, [id=#258]
   +- *(1) HashAggregate(keys=[LogHost#1347, ProcessName#1354], functions=[partial_count(1)], output=[LogHost#1347, ProcessName#1354, count#1413L])
      +- *(1) Project [LogHost#1347, ProcessName#1354]
         +- *(1) Filter (isnotnull(EventID#1345L) AND (EventID#1345L = 4688))
            +- *(1) Scan ElasticsearchRelation(Map(es.inferschema -> false, es.net.http.auth.user -> elastic, es.net.ssl.cert.allow.self.signed -> true, es.net.http.auth.pass -> 123456, es.read.field.as.array.include -> tags, es.resource -> priam_unified_host-2021-05-03/unified-host, es.nodes -> elasticsearch:9200, es.net.ssl -> true),org.apache.spark.sql.SQLContext@515d109c,None) [LogHost#1347,ProcessName#1354,EventID#1345L] PushedFilters: [IsNotNull(EventID), EqualTo(EventID,4688)], ReadSchema: struct<LogHost:string,ProcessName:string,EventID:bigint>

Does it mean that it is effectively scanning the index (via scrolling???) and not leveraging the Aggregates constructs in ES?

system · July 29, 2021, 6:25am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Keith_Massey · December 8, 2021, 2:09pm

You are correct. Es-hadoop supports pushing down most filter clauses but not aggregations/group bys. This functionality has just recently been added to spark and has not made it into es-hadoop yet -- Support Spark's Datasource V2 API · Issue #1801 · elastic/elasticsearch-hadoop · GitHub.

Topic		Replies	Views
Aggregation running very slow Elasticsearch es-hadoop	8	2998	July 6, 2017
Performance Challenge Elasticsearch es-hadoop	6	1081	April 28, 2017
Slow Performance of Elastic Search with Spark Elasticsearch es-hadoop	4	1535	July 29, 2021
Pushdown aggregations spark sql Elasticsearch es-hadoop	2	1391	July 6, 2017
Elastic cluster slow down afre a few weeks of uptime(cluster recommendations) Elasticsearch	17	865	January 17, 2020

ES Hadoop push down aggregations or not?

Related topics