Top K results in RDD (beginner question)

(tal) #1

I just started using ES and Spark, and I'm trying to get only the K best results by score for a really simple query on simple documents (just an id and content text).
How can I populate the RDD with only the top K results for that query? (I tried using the size option, but since it only sets the page size, and the RDD creation seems to run over all pages, I still get all the results.)

It would be great if anyone could point me in the right direction.
Is it top_hits? (Even though I don't need any bucketing or anything like that?)


(Costin Leau) #2

Aggregations (like top_hits) are not yet supported by the Elasticsearch Hadoop connector. Most likely they will be added in 2.2 as the 2.1 version already contains plenty of features and is on its way out.
Potentially one can do the equivalent of top_hits in Spark, but inefficiently, since it will pull all the data from Elasticsearch first.
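
For what it's worth, the "pull everything, then keep the best K" approach can be sketched in plain Python. This is only an illustration, not connector code: the lists stand in for RDD partitions, the (score, doc_id) pairs are made-up data, the helper names `top_k_per_partition` and `top_k` are hypothetical, and it assumes each record already carries a score of some kind (for example one computed in the Spark job itself).

```python
import heapq

def top_k_per_partition(partition, k):
    """Keep only the k highest-scoring (score, doc_id) pairs of one partition."""
    return heapq.nlargest(k, partition)

def top_k(partitions, k):
    """Reduce every partition to its local top k, then merge the survivors."""
    local_winners = [top_k_per_partition(p, k) for p in partitions]
    merged = [pair for winners in local_winners for pair in winners]
    return heapq.nlargest(k, merged)

# Made-up data: three "partitions" of (score, doc_id) pairs.
partitions = [
    [(0.2, "a"), (1.7, "b"), (0.9, "c")],
    [(2.4, "d"), (0.1, "e")],
    [(1.1, "f"), (3.0, "g"), (0.5, "h")],
]

best = top_k(partitions, 3)
```

Spark's own `RDD.top(k)` follows the same local-then-merge pattern, so the selection step is cheap. The expensive part is still that every matching document gets read from Elasticsearch before anything is thrown away.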

(tal) #3

I see, thanks for the answer!

(tal) #4

Hey, a quick follow-up:
I'd like to sort by score in my Spark code. For that I need the scores in my RDD, but I get the results without metadata. How can I specify in the query that I would like to keep the _score field in my results?


(Costin Leau) #5

The connector doesn't return any relevant score since it relies on scan-and-scroll [1]; that is, it pulls the results as they appear from each shard. Having a score means the results need to be globally sorted, which kills parallelism.
This will be addressed with the aggregation part where, instead of "fan"-ing out the call, the connector will make only one global one.
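
To make the parallelism point a bit more concrete, here is a toy model in plain Python. It is not how the connector is implemented: the shard contents are invented, and `scan_read` and `score_sorted_read` are hypothetical names. The idea is that scan-style reads can be handed to one task per shard with no coordination, whereas a score-ordered result needs a single consumer that looks at every shard's stream.

```python
import heapq

# Made-up hits per shard as (score, doc_id) pairs.
shards = {
    0: [(0.4, "a"), (2.1, "b")],
    1: [(1.3, "c")],
    2: [(3.2, "d"), (0.7, "e")],
}

def scan_read(shard_id):
    """Scan-style read: one task per shard, hits in shard order, score dropped."""
    return [doc_id for _score, doc_id in shards[shard_id]]

def score_sorted_read():
    """Global ordering: a single consumer merges every shard's sorted stream."""
    streams = [sorted(hits, reverse=True) for hits in shards.values()]
    return [doc_id for _score, doc_id in heapq.merge(*streams, reverse=True)]

# Each shard can be read independently (one Spark partition per shard).
per_task = {shard_id: scan_read(shard_id) for shard_id in shards}

# A ranked list only exists after one place has seen all shards.
ranked = score_sorted_read()
```

In this sketch `per_task` can be built by three workers that never talk to each other, while `ranked` cannot be produced until the merge has access to all three streams.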

