Top K results in RDD (beginner question)

tal · May 19, 2015, 3:20pm

Hey,
I just started using ES and spark, and i'm trying to get only the k best results by score, for a really simple query on simple documents (just id and content text)
How can i populate the RDD with only the top K results for that query? (i tried using the size option but since it's just page size and it seems that the rdd creation runs over all pages i still get all the results)

it would be great if anyone can point me in the right direction,
is it top_hits ? (even though i don't need any bucketing or anything like that?)

thanks!

costin · May 26, 2015, 5:12am

Aggregations (like top_hits) are not yet supported by the Elasticsearch Hadoop connector. Most likely they will be added in 2.2 as the 2.1 version already contains plenty of features and is on its way out.
Potentially one can do top_hits in Spark but in an inefficient way as it will pull all the data from Elasticsearch first.

tal · May 26, 2015, 11:55am

I see, thanks for the answer!

tal · May 31, 2015, 2:10pm

Hey - a quick follow up:
I'd like to sort by score in my spark code, for that i need the scores in my rdd, but I get it without metadata, how can i specify in the query that i would like to keep the _score field in my results?

thanks!

costin · June 1, 2015, 8:45am

The connector doesn't return any relevant score since the it relies on scan-and-scroll [1], that is it pulls the results as they appear from each shard. Having a score means the results need to be globally sorted which kills parallelism.
This will be addressed with the aggregation part where instead of "fan"-ing out the call, the connector will only make one global one.

[1] https://www.elastic.co/guide/en/elasticsearch/guide/master/scan-scroll.html

Topic		Replies	Views
(Spark) Top K results in esRDD (beginner question) Elasticsearch	1	338	July 6, 2017
Elastisearch-Hadoop how to do a bulk search in spark program Elasticsearch es-hadoop	2	975	October 10, 2017
Top hit aggregation with _score sorting Elasticsearch	3	1689	July 5, 2017
Add Es Spark Accumulators Elasticsearch es-hadoop	3	277	December 19, 2023
Sub queries with Spark hadoop Elasticsearch es-hadoop	3	891	March 5, 2017

Top K results in RDD (beginner question)

Related topics