I am on ELK 7.13.0 and I have a super very slow query:
events_df = es_reader.load("priam_unified_host-{0}/unified-host".format(date))
counts_df = events_df.where('EventID == 4688').groupBy(col("LogHost"),col("ProcessName")).count()
counts_df.explain(extended=True)
This is the execution plan:
== Physical Plan ==
*(2) HashAggregate(keys=[LogHost#1347, ProcessName#1354], functions=[count(1)], output=[LogHost#1347, ProcessName#1354, count#1408L])
+- Exchange hashpartitioning(LogHost#1347, ProcessName#1354, 200), ENSURE_REQUIREMENTS, [id=#258]
+- *(1) HashAggregate(keys=[LogHost#1347, ProcessName#1354], functions=[partial_count(1)], output=[LogHost#1347, ProcessName#1354, count#1413L])
+- *(1) Project [LogHost#1347, ProcessName#1354]
+- *(1) Filter (isnotnull(EventID#1345L) AND (EventID#1345L = 4688))
+- *(1) Scan ElasticsearchRelation(Map(es.inferschema -> false, es.net.http.auth.user -> elastic, es.net.ssl.cert.allow.self.signed -> true, es.net.http.auth.pass -> 123456, es.read.field.as.array.include -> tags, es.resource -> priam_unified_host-2021-05-03/unified-host, es.nodes -> elasticsearch:9200, es.net.ssl -> true),org.apache.spark.sql.SQLContext@515d109c,None) [LogHost#1347,ProcessName#1354,EventID#1345L] PushedFilters: [IsNotNull(EventID), EqualTo(EventID,4688)], ReadSchema: struct<LogHost:string,ProcessName:string,EventID:bigint>
Does it mean that it is effectively scanning the index (via scrolling???) and not leveraging the Aggregates constructs in ES?