Inconsistent scoring starting w/ 7.0.0 (worse in 7.4.0)

workmanw · March 28, 2023, 2:21am

Hello,

I'm in the process of upgrading from 6.x to 7.x. Along the way I discovered an unexpected issue with document updates and relevancy scoring. In my application's integration test suite we have a series of very basic tests that make simple updates and assert basic changes to scoring relevancy. Starting with 7.0.0 these tests have become unstable (fine with 6.8.23).

Here is an extremely basic example using the Python SDK. This snippet creates two documents, then repeatedly updates a value on one of them and performs a search.

Edit: This is not related to the Python SDK in any way. The same happens with Java and Postman.

from elasticsearch import Elasticsearch
es = Elasticsearch(['http://localhost:9800'])

# Create index and seed data
es.indices.create(index='test-index', body={
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "name": { "type": "text" },
      "num": { "type": "integer" }
    }
  }
})
es.index(index="test-index", id=1, doc_type="_doc", body={ 'name': 'hello', 'num': 0 })
es.index(index="test-index", id=2, doc_type="_doc", body={ 'name': 'hello', 'num': 0 })

# Update document, search and repeat
for i in range(1, 10):
  es.update(index="test-index", id=2, body={ 'doc': { 'num': i } }, refresh='wait_for')

  res = es.search(index="test-index", body={ 
    "query": {
      "match_phrase": {
        "name": { "query": "hello" }
      }
    }
  })

  results = sorted([(h["_id"], h["_score"]) for h in res["hits"]["hits"]], key=lambda x: x[0])
  print("[%d]: %s" %(i, results))

With 7.0.0, the above example will produce the following output:

[1]: [('1', 0.13353139), ('2', 0.13353139)]
[2]: [('1', 0.10536051), ('2', 0.10536051)]
[3]: [('1', 0.08701137), ('2', 0.08701137)]
[4]: [('1', 0.074107975), ('2', 0.074107975)]
[5]: [('1', 0.06453852), ('2', 0.06453852)]
[6]: [('1', 0.05715841), ('2', 0.05715841)]
[7]: [('1', 0.05129329), ('2', 0.05129329)]
[8]: [('1', 0.046520013), ('2', 0.046520013)]
[9]: [('1', 0.042559613), ('2', 0.042559613)]

Several things are surprising about this. First, doc-1 is not being updated at all, yet its relevancy is affected by doc-2's update. Second, the updated value in doc-2 is unrelated to the query. I do understand this is likely related to segments and TF-IDF.

I added a forcemerge call to my loop, and that seems to yield the expected results through 7.3.2:
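As a rough check of the segments/IDF guess, the numbers above can be reproduced with the BM25 inverse document frequency formula, ln(1 + (N - n + 0.5) / (n + 0.5)). This is only a sketch under stated assumptions: that the index uses the default BM25 similarity, that each update leaves one deleted copy of doc-2 which is still counted in both the document count N and the document frequency n until it is merged away, and that for a single-term field of equal length in every document the term-frequency factor and the boost cancel out, so the score equals the IDF. The helper names are made up for illustration.

```python
import math

def bm25_idf(doc_count, doc_freq):
    # BM25 inverse document frequency as used by Lucene's BM25 similarity.
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

def expected_score(update_round, live_docs=2):
    # Assumption: every update adds one not-yet-merged deleted copy that
    # still contains "hello", so it inflates both N and n by one.
    counted = live_docs + update_round
    return bm25_idf(counted, counted)

for i in range(1, 10):
    print("[%d]: %.8f" % (i, expected_score(i)))
```

Under those assumptions the printed values line up with the scores reported for each round, which suggests the drift comes from deleted documents still contributing to term statistics rather than from the updated field itself.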

[1]: [('1', 0.13353139), ('2', 0.13353139)]
[2]: [('1', 0.13353139), ('2', 0.13353139)]
[3]: [('1', 0.13353139), ('2', 0.13353139)]
[4]: [('1', 0.13353139), ('2', 0.13353139)]
[5]: [('1', 0.13353139), ('2', 0.13353139)]
[6]: [('1', 0.13353139), ('2', 0.13353139)]
[7]: [('1', 0.13353139), ('2', 0.13353139)]
[8]: [('1', 0.13353139), ('2', 0.13353139)]
[9]: [('1', 0.13353139), ('2', 0.13353139)]
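The exact forcemerge call is not shown in this post. One plausible form of the loop body, written as a helper that takes the client as an argument, might look like the sketch below. The function name `update_merge_search`, the choice of `max_num_segments=1`, and the explicit refresh after the merge are assumptions for illustration, not necessarily what was run.

```python
def update_merge_search(es, i, index="test-index"):
    # Update doc 2 and wait until the change is visible to search.
    es.update(index=index, id=2, body={"doc": {"num": i}}, refresh="wait_for")

    # Assumption: merge down to a single segment, then refresh so the
    # merged segment is the one being searched.
    es.indices.forcemerge(index=index, max_num_segments=1)
    es.indices.refresh(index=index)

    res = es.search(index=index, body={
        "query": {"match_phrase": {"name": {"query": "hello"}}}
    })
    # Return (id, score) pairs sorted by id, as in the original snippet.
    return sorted((h["_id"], h["_score"]) for h in res["hits"]["hits"])
```

Calling `update_merge_search(es, i)` inside `for i in range(1, 10):` with the `es` client from the snippet above would replace the update and search steps.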

However, starting with 7.4.0 this problem returns. I'm guessing the forcemerge is no longer effective in the same way it was before.

Any advice would be greatly appreciated. I'm just trying to get my basic integration test suite in working order, not alter the behavior of the production runtime.

system · April 25, 2023, 2:22am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
