ES Hadoop push down aggregations or not?

I am on ELK 7.13.0 and I have a super very slow query:

events_df = es_reader.load("priam_unified_host-{0}/unified-host".format(date))

counts_df = events_df.where('EventID == 4688').groupBy(col("LogHost"),col("ProcessName")).count()
counts_df.explain(extended=True)

This is the execution plan:


== Physical Plan ==
*(2) HashAggregate(keys=[LogHost#1347, ProcessName#1354], functions=[count(1)], output=[LogHost#1347, ProcessName#1354, count#1408L])
+- Exchange hashpartitioning(LogHost#1347, ProcessName#1354, 200), ENSURE_REQUIREMENTS, [id=#258]
   +- *(1) HashAggregate(keys=[LogHost#1347, ProcessName#1354], functions=[partial_count(1)], output=[LogHost#1347, ProcessName#1354, count#1413L])
      +- *(1) Project [LogHost#1347, ProcessName#1354]
         +- *(1) Filter (isnotnull(EventID#1345L) AND (EventID#1345L = 4688))
            +- *(1) Scan ElasticsearchRelation(Map(es.inferschema -> false, es.net.http.auth.user -> elastic, es.net.ssl.cert.allow.self.signed -> true, es.net.http.auth.pass -> 123456, es.read.field.as.array.include -> tags, es.resource -> priam_unified_host-2021-05-03/unified-host, es.nodes -> elasticsearch:9200, es.net.ssl -> true),org.apache.spark.sql.SQLContext@515d109c,None) [LogHost#1347,ProcessName#1354,EventID#1345L] PushedFilters: [IsNotNull(EventID), EqualTo(EventID,4688)], ReadSchema: struct<LogHost:string,ProcessName:string,EventID:bigint>

Does it mean that it is effectively scanning the index (via scrolling???) and not leveraging the Aggregates constructs in ES?

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

You are correct. Es-hadoop supports pushing down most filter clauses but not aggregations/group bys. This functionality has just recently been added to spark and has not made it into es-hadoop yet -- Support Spark's Datasource V2 API · Issue #1801 · elastic/elasticsearch-hadoop · GitHub.