Sort MLT query results: spark + scala

nobre0 · June 21, 2022, 4:35pm

Hi,

I'm trying to run a 'More Like This' (MLT) query using the Apache Spark connector.
The problem is that the results are not sorted by the computed MLT score. I think this is related to the sort=_doc parameter that the connector's query builder adds to the request.

The code is the following:

import org.apache.spark.sql.SparkSession

val localSpark = SparkSession
    .builder()
    .appName("teste")
    // Elasticsearch connection settings
    .config("spark.es.nodes", "localhost")
    .config("spark.es.port", "9200")
    .config("es.mapping.id", "id")
    .config("es.write.operation", "upsert")
    .config("spark.es.nodes.wan.only", "true")
    // number of hits fetched per scroll request
    .config("es.scroll.size", 15)
    .master("local").getOrCreate()

val query = """{"query" : {"more_like_this": { "fields": ["text"], "like": [{"_index": "documents", "_id": "1234"}]}}}"""

val df = localSpark.read
    .format("org.elasticsearch.spark.sql")
    .option("query", query)
    .option("pushdown", "true")
    .load("documents")

setup:

java: 1.8
spark : 3.1.0
scala: 2.12.12
"elasticsearch-spark-30" % "8.2.2"

Keith_Massey · June 22, 2022, 7:34pm

I was able to reproduce this problem. The default sort of _doc makes sense, since that is the most efficient way for a scroll to pull back data. But I thought that maybe adding a sort field to it like this would work:

val query = """{"sort":"_score","query" : {"more_like_this": { "fields": ["text"], "like": [{"_index": "documents", "_id": "1234"}],"min_term_freq": 1,"min_doc_freq": 1}}}"""

Unfortunately, that sort appears to be silently ignored, and the results are still ordered by _doc. It looks like a bug. You can probably sort the results by _score on the Spark side, but that will not perform as well if you have a very large amount of data.
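
For reference, a Spark-side sort might look roughly like the sketch below. It is untested against this setup and rests on two assumptions: that enabling the connector's es.read.metadata option exposes each hit's metadata in a _metadata map column, and that the _score entry in that map is actually populated when the connector scrolls with sort=_doc. Check both on a small result set before relying on it. The names spark, score and byScore are only illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Sketch only: assumes a local cluster and the "documents" index from the question.
val spark = SparkSession.builder()
  .appName("mlt-sort-sketch")
  .master("local")
  .config("spark.es.nodes", "localhost")
  .config("spark.es.port", "9200")
  .config("spark.es.nodes.wan.only", "true")
  .getOrCreate()

val query = """{"query": {"more_like_this": {"fields": ["text"], "like": [{"_index": "documents", "_id": "1234"}], "min_term_freq": 1, "min_doc_freq": 1}}}"""

val df = spark.read
  .format("org.elasticsearch.spark.sql")
  .option("query", query)
  .option("es.read.metadata", "true") // ask the connector to add a _metadata column
  .load("documents")

// Assumption: _metadata is a string-keyed map that carries _score.
// The cast yields null if the score is missing or not numeric.
val byScore = df
  .withColumn("score", col("_metadata").getItem("_score").cast("double"))
  .orderBy(col("score").desc)

byScore.show(15, false)
```

The orderBy runs in Spark after every matching document has been pulled through the scroll, which is why this does not scale well for large result sets.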

nobre0 · June 22, 2022, 8:59pm

In my case, fetching all the results and then sorting by _score is impractical.
I will open an issue, since it is probably a bug.

Thanks for your response.

