Sort MLT query results: Spark + Scala

Hi,

I'm trying to run a 'More Like This' (MLT) query through the Elasticsearch connector for Apache Spark.
The problem is that the results are not sorted by the computed MLT score. I think this is related to the sort=_doc parameter that is added in the query builder.

The code is the following:

val localSpark = SparkSession
    .builder()
    .appName("teste")
    .config("spark.es.nodes", "localhost")
    .config("spark.es.port", "9200")
    .config("es.mapping.id", "id")
    .config("es.write.operation", "upsert")
    .config("spark.es.nodes.wan.only", "true") 
    .config("es.scroll.size", 15)
    .master("local").getOrCreate()

val query = """{"query" : {"more_like_this": { "fields": ["text"], "like": [{"_index": "documents", "_id": "1234"}]}}}"""

val df = localSpark.read.format("org.elasticsearch.spark.sql").option("query", query).option("pushdown", "true").load("documents")

Setup:

  • java: 1.8
  • spark: 3.1.0
  • scala: 2.12.12
  • "elasticsearch-spark-30" % "8.2.2"

I was able to reproduce this problem. The default sort of _doc makes sense, since that is the most efficient way for a scroll to pull back data. But I thought that adding a sort field to the query like this might work:

val query = """{"sort":"_score","query" : {"more_like_this": { "fields": ["text"], "like": [{"_index": "documents", "_id": "1234"}],"min_term_freq": 1,"min_doc_freq": 1}}}"""

Unfortunately, it looks like that sort is silently ignored and the results are still ordered by _doc. It looks like a bug. You can probably sort the results by _score on the Spark side, but that will not perform as well if you have a very large amount of data.
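For reference, a Spark-side sort could look roughly like the sketch below. It is untested against a real cluster and rests on assumptions: that the read is done with the connector's es.read.metadata option set to "true", that the metadata then arrives as a map column named _metadata (the default for es.read.metadata.field), and that this map actually carries a usable _score. Since the scroll sorts by _doc, the score may not be populated, so check the values before relying on it. The sortByScore helper and the sample rows are made up for illustration; they stand in for the DataFrame returned by the connector.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, desc}

object SortByScoreSketch {
  // Hypothetical helper: pulls "_score" out of the metadata map column,
  // casts it to a double, and sorts the rows by it in descending order.
  // Assumes the metadata column is a map with string keys.
  def sortByScore(df: DataFrame, metadataField: String = "_metadata"): DataFrame =
    df.withColumn("score", col(metadataField).getItem("_score").cast("double"))
      .orderBy(desc("score"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sort-by-score-sketch")
      .master("local[1]")
      .getOrCreate()
    import spark.implicits._

    // Stand-in rows shaped like a connector read with metadata enabled.
    // The ids and scores are invented for this example.
    val sample = Seq(
      ("a", Map("_id" -> "a", "_score" -> "1.5")),
      ("b", Map("_id" -> "b", "_score" -> "7.25")),
      ("c", Map("_id" -> "c", "_score" -> "3.0"))
    ).toDF("id", "_metadata")

    sortByScore(sample).select("id", "score").show()
    spark.stop()
  }
}
```

With the reader from the original post, this would mean adding .option("es.read.metadata", "true") before .load("documents") and passing the resulting DataFrame to sortByScore. Note that this still pulls every matching document into Spark before sorting, which is the performance concern mentioned above.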

In my case, fetching all the results and then sorting them by _score is impractical.
I will open an issue, since this is probably a bug.

Thanks for your response.
