I'd like to run a query that takes the result of aggregations and returns it as a DataFrame in Spark. I don't know if this is feasible over all of the data, because after reading the es-hadoop sources it looks like es.resource.read is restricted to a single index and type. Is it possible to get the aggregation data back in Spark without going through the REST API?
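For context, this is roughly how I'm reading today; a minimal sketch with made-up index/type names:

```scala
import org.apache.spark.sql.SQLContext
import org.elasticsearch.spark.sql._

// "events/signup" is a made-up index/type; es.resource / es.resource.read
// points at exactly one such pair, which is the restriction I'm referring to.
val sqlContext = new SQLContext(sc)   // sc: an existing SparkContext (e.g. from spark-shell)
val signups = sqlContext.esDF("events/signup")
signups.printSchema()
```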
What do you mean by the following?
I mean that with the Elasticsearch REST API I can easily get the results of aggregations over all Elasticsearch types in an index. For instance, I'd like to create a DataFrame out of all the buckets in a bucket aggregation. I'd also like the query that filters the docs before aggregating to span multiple Elasticsearch types. I could manually create a DataFrame from the Elasticsearch REST API response, but that would not take advantage of the performance that comes from the native Spark integration in elasticsearch-hadoop.
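For example, the manual fallback looks roughly like this (a sketch only; the aggregation name, field names and the json4s parsing are assumptions on my part):

```scala
import org.apache.spark.sql.SQLContext
import org.json4s._
import org.json4s.jackson.JsonMethods._

// Sketch only: `responseJson` is assumed to hold the body of an Elasticsearch
// _search response with size=0 and a terms aggregation named "by_user".
def bucketsToDF(sqlContext: SQLContext, responseJson: String) = {
  implicit val formats = DefaultFormats
  val buckets = (parse(responseJson) \ "aggregations" \ "by_user" \ "buckets").children
  val rows = buckets.map { b =>
    ((b \ "key").extract[String], (b \ "doc_count").extract[Long])
  }
  // Built on the driver from the REST response -- none of es-hadoop's parallel reads apply here.
  sqlContext.createDataFrame(rows).toDF("userId", "doc_count")
}
```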
@costin do you have any input on this?
There is no native Spark integration inside Elasticsearch.
The connector allows you to pull data from Elasticsearch into Spark and push it back into Elasticsearch if needed. They are two separate entities, two separate computing frameworks.
You can nevertheless build the same aggregations in Spark over your DataFrame as in Elasticsearch, and you can even express pipeline ("buckets") aggregations in Spark, which are not available in Elasticsearch prior to version 2.0 (still in beta, I believe, as I'm writing this answer).
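For example, something along these lines (a rough sketch; the index/type and field names are made up) pulls the documents in through the connector, reproduces a terms-style aggregation in Spark, and writes the result back to Elasticsearch:

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._
import org.elasticsearch.spark.sql._

// Assumed index/type and field names; adjust to your mapping.
// `sc` is an existing SparkContext (e.g. the spark-shell one).
val sqlContext = new SQLContext(sc)
val purchases = sqlContext.esDF("events/purchase")

// The Spark-side equivalent of a terms aggregation on userId
// with an avg sub-aggregation on latency.
val byUser = purchases
  .groupBy("userId")
  .agg(count(lit(1)).as("doc_count"), avg("latency").as("avg_latency"))

// Push the computed result back into Elasticsearch to serve it from there.
byUser.saveToEs("reports/by_user")
```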
It's overkill to use Elasticsearch for computing when you put Spark next to it.
Using Spark over Elasticsearch comes in handy in a situation like the following:
Heavy batch computing in Spark, with Elasticsearch as a serving layer, mimicking a pseudo-lambda architecture.
Aggregation support is part of es-spark/hadoop 2.2. Basically, it will allow the results (whether multiple buckets or just one) to be returned to the consuming party.
If by native Spark integration you mean running Spark inside Elasticsearch, that's not possible, nor does it make a lot of sense, simply because ES is not a generic computational platform.
The aggregation support is great. But what if I have multiple types I want to query over at once and then aggregate the results?
Let's say I have two types, Signup and Purchase. Both doc types have a userId field and the latency of the request. I would like to run a terms aggregation that buckets by userId and then, for each user, computes the percentiles of latency as a sub-aggregation.
With Elasticsearch's REST API this is easy because I can simply choose not to restrict the query to a single document type. But to read with es-hadoop it looks like I must use the resource pattern "<index>/<doc_type>", which means I can run the aggregation on just the Signup documents or just the Purchase documents, and then I have to work out how to merge the aggregation results. That is nontrivial, and I imagine in many cases mathematically impossible, without dumping all of the Elasticsearch hits into DataFrames and reimplementing in Spark the aggregations that Elasticsearch was already capable of doing.
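To make the workaround concrete, this is roughly what I'd have to write (a sketch with made-up index/type and field names; it assumes percentile_approx is available, e.g. via a HiveContext on Spark 1.x or as a built-in from Spark 2.0):

```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.hive.HiveContext
import org.elasticsearch.spark.sql._

// Hypothetical index/type names; each es-hadoop read is pinned to a single type.
val sqlContext = new HiveContext(sc)
val signups   = sqlContext.esDF("events/signup").select("userId", "latency")
val purchases = sqlContext.esDF("events/purchase").select("userId", "latency")

// Re-implement in Spark what a single multi-type request does in Elasticsearch:
// a terms bucket on userId with latency percentiles as a sub-aggregation.
val byUser = signups.unionAll(purchases)
  .groupBy("userId")
  .agg(
    count(lit(1)).as("doc_count"),
    expr("percentile_approx(latency, 0.50)").as("p50_latency"),
    expr("percentile_approx(latency, 0.95)").as("p95_latency")
  )

byUser.show()
```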
Does that make the use case clearer?