I'd like to run a query that takes the result of aggregations and returns it as a DataFrame in Spark. I don't know if this is feasible over all of the data, because after reading the es-hadoop sources it looks like es.resource.read is restricted to a single index and type. Is it possible to get the aggregation data back in Spark without going through the REST API?
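For context, this is roughly how I'm reading today; a minimal sketch with made-up index/type names:

```scala
import org.apache.spark.sql.SQLContext
import org.elasticsearch.spark.sql._

// "events/signup" is a made-up index/type; es.resource / es.resource.read
// points at exactly one such pair, which is the restriction I'm referring to.
val sqlContext = new SQLContext(sc)   // sc: an existing SparkContext (e.g. from spark-shell)
val signups = sqlContext.esDF("events/signup")
signups.printSchema()
```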
What do you mean by the following?
I mean that with the Elasticsearch REST API I can easily get the results of aggregations over all Elasticsearch types in an index. For instance, I'd like to create a DataFrame out of all the buckets in a bucket aggregation. I'd also like the query that filters the docs before aggregating to span multiple Elasticsearch types. I could manually create a DataFrame from the Elasticsearch REST API response, but that would not take advantage of the performance that comes from the native Spark integration in elasticsearch-hadoop.
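For example, the manual fallback looks roughly like this (a sketch only; the aggregation name, field names and the json4s parsing are assumptions on my part):

```scala
import org.apache.spark.sql.SQLContext
import org.json4s._
import org.json4s.jackson.JsonMethods._

// Sketch only: `responseJson` is assumed to hold the body of an Elasticsearch
// _search response with size=0 and a terms aggregation named "by_user".
def bucketsToDF(sqlContext: SQLContext, responseJson: String) = {
  implicit val formats = DefaultFormats
  val buckets = (parse(responseJson) \ "aggregations" \ "by_user" \ "buckets").children
  val rows = buckets.map { b =>
    ((b \ "key").extract[String], (b \ "doc_count").extract[Long])
  }
  // Built on the driver from the REST response -- none of es-hadoop's parallel reads apply here.
  sqlContext.createDataFrame(rows).toDF("userId", "doc_count")
}
```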
@costin do you have any input on this?
There is no native Spark integration inside Elasticsearch.
The connector allows you to pull data from Elasticsearch into Spark and push it back into Elasticsearch if needed. They are two separate entities, two separate computing frameworks.
You can nevertheless build the same aggregations in Spark over your DataFrame as in Elasticsearch, and you can even express pipeline ("buckets") aggregations in Spark, which are not available in Elasticsearch prior to version 2.0 (still in beta, I believe, as I'm writing this answer).
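For example, something along these lines (a rough sketch; the index/type and field names are made up) pulls the documents in through the connector, reproduces a terms-style aggregation in Spark, and writes the result back to Elasticsearch:

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._
import org.elasticsearch.spark.sql._

// Assumed index/type and field names; adjust to your mapping.
// `sc` is an existing SparkContext (e.g. the spark-shell one).
val sqlContext = new SQLContext(sc)
val purchases = sqlContext.esDF("events/purchase")

// The Spark-side equivalent of a terms aggregation on userId
// with an avg sub-aggregation on latency.
val byUser = purchases
  .groupBy("userId")
  .agg(count(lit(1)).as("doc_count"), avg("latency").as("avg_latency"))

// Push the computed result back into Elasticsearch to serve it from there.
byUser.saveToEs("reports/by_user")
```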
It's overkill to use Elasticsearch for computing when you put Spark next to it.
Using Spark over Elasticsearch comes in handy in a situation like the following:
Heavy batch computing in Spark, with Elasticsearch as a serving layer, mimicking a pseudo-lambda architecture.
Aggregation support is part of es-spark/hadoop 2.2. Basically, it will allow the results (whether multiple buckets or just one) to be returned to the consuming party.
If by native Spark integration you mean running Spark inside Elasticsearch, that's not possible, nor does it make a lot of sense, simply because ES is not a generic computational platform.
The aggregation support is great. But what if I have multiple types I want to query over at once and then aggregate the results?
Let's say I have two types, Signup and Purchase. Both doc types have a userId field and the latency of the request. I would like to run a terms aggregation that buckets by userId and then, for each user, computes the percentiles of latency as a sub-aggregation.
With Elasticsearch's REST API this is easy because I can simply choose not to restrict the query to a single document type. But to read with es-hadoop it looks like I must use the resource pattern "<index>/<doc_type>", which means I can run the aggregation on just the Signup documents or just the Purchase documents, and then I have to work out how to merge the aggregation results. That is nontrivial, and I imagine in many cases mathematically impossible, without dumping all of the Elasticsearch hits into DataFrames and reimplementing in Spark the aggregations that Elasticsearch was already capable of doing.
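To make the workaround concrete, this is roughly what I'd have to write (a sketch with made-up index/type and field names; it assumes percentile_approx is available, e.g. via a HiveContext on Spark 1.x or as a built-in from Spark 2.0):

```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.hive.HiveContext
import org.elasticsearch.spark.sql._

// Hypothetical index/type names; each es-hadoop read is pinned to a single type.
val sqlContext = new HiveContext(sc)
val signups   = sqlContext.esDF("events/signup").select("userId", "latency")
val purchases = sqlContext.esDF("events/purchase").select("userId", "latency")

// Re-implement in Spark what a single multi-type request does in Elasticsearch:
// a terms bucket on userId with latency percentiles as a sub-aggregation.
val byUser = signups.unionAll(purchases)
  .groupBy("userId")
  .agg(
    count(lit(1)).as("doc_count"),
    expr("percentile_approx(latency, 0.50)").as("p50_latency"),
    expr("percentile_approx(latency, 0.95)").as("p95_latency")
  )

byUser.show()
```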
Does that make the use case clearer?