Inconsistent scoring starting w/ 7.0.0 (worse in 7.4.0)

Hello,

I'm in the process of upgrading from 6.x to 7.x. Along the way I discovered an unexpected issue with document updates and relevancy scoring. In my application's integration test suite we have a series of very basic tests that make simple updates and assert basic changes to scoring relevancy. Starting with 7.0.0 these tests have become unstable (fine with 6.8.23).

Here is an extremely basic example using the Python SDK. This snippet creates two documents, then repeatedly updates a value on one of them and performs a search.

Edit: This is not related to the Python SDK in any way. The same happens with Java and Postman.

from elasticsearch import Elasticsearch
es = Elasticsearch(['http://localhost:9800'])

# Create index and seed data
es.indices.create(index='test-index', body={
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "name": { "type": "text" },
      "num": { "type": "integer" }
    }
  }
})
es.index(index="test-index", id=1, doc_type="_doc", body={ 'name': 'hello', 'num': 0 })
es.index(index="test-index", id=2, doc_type="_doc", body={ 'name': 'hello', 'num': 0 })

# Update document, search and repeat
for i in range(1, 10):
  es.update(index="test-index", id=2, body={ 'doc': { 'num': i } }, refresh='wait_for')

  res = es.search(index="test-index", body={ 
    "query": {
      "match_phrase": {
        "name": { "query": "hello" }
      }
    }
  })

  results = sorted([(h["_id"], h["_score"]) for h in res["hits"]["hits"]], key=lambda x: x[0])
  print("[%d]: %s" %(i, results))

With 7.0.0, the above example will produce the following output:

[1]: [('1', 0.13353139), ('2', 0.13353139)]
[2]: [('1', 0.10536051), ('2', 0.10536051)]
[3]: [('1', 0.08701137), ('2', 0.08701137)]
[4]: [('1', 0.074107975), ('2', 0.074107975)]
[5]: [('1', 0.06453852), ('2', 0.06453852)]
[6]: [('1', 0.05715841), ('2', 0.05715841)]
[7]: [('1', 0.05129329), ('2', 0.05129329)]
[8]: [('1', 0.046520013), ('2', 0.046520013)]
[9]: [('1', 0.042559613), ('2', 0.042559613)]

There are several surprising things about this. First, doc-1 is not being updated at all, but its relevancy is being impacted by doc-2's update. Second, the updated value in doc-2 is unrelated to the query. I do understand this is likely related to segments and TF-IDF.
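As a rough check of the segment theory, the scores above can be reproduced from the IDF term alone. This is only a sketch under stated assumptions: the default BM25 similarity is in use, the term-frequency part of the score works out to 1 (one occurrence of "hello", all `name` fields the same length), and every update leaves behind one stale copy of doc-2 that is still counted in the index statistics. The helper names `bm25_idf` and `predicted_score` are made up for illustration.

```python
import math

def bm25_idf(doc_count, doc_freq):
    """IDF term of BM25: ln(1 + (N - n + 0.5) / (n + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

def predicted_score(iteration):
    # Assumption: 2 live docs plus one stale, still-counted copy per update so far.
    counted_docs = 2 + iteration
    # Every counted copy contains "hello", so doc_freq equals doc_count.
    return bm25_idf(counted_docs, counted_docs)

for i in range(1, 10):
    print("[%d]: %.8f" % (i, predicted_score(i)))
```

Under those assumptions the first iteration comes out at about 0.13353139 and the ninth at about 0.04255961, which lines up with the observed scores. That would suggest the old versions of doc-2 keep inflating the document count until they are merged away.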

I added a forcemerge call to my loop, and that seems to yield the expected results through 7.3.2:

[1]: [('1', 0.13353139), ('2', 0.13353139)]
[2]: [('1', 0.13353139), ('2', 0.13353139)]
[3]: [('1', 0.13353139), ('2', 0.13353139)]
[4]: [('1', 0.13353139), ('2', 0.13353139)]
[5]: [('1', 0.13353139), ('2', 0.13353139)]
[6]: [('1', 0.13353139), ('2', 0.13353139)]
[7]: [('1', 0.13353139), ('2', 0.13353139)]
[8]: [('1', 0.13353139), ('2', 0.13353139)]
[9]: [('1', 0.13353139), ('2', 0.13353139)]
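A minimal sketch of what such a merge step can look like in the loop. `update_and_merge` is a hypothetical helper name, and passing `max_num_segments=1` is an assumption for illustration, not necessarily the exact call used to produce the output above.

```python
def update_and_merge(es, index, doc_id, num):
    """Update one document, then force-merge the index (sketch only)."""
    # Partial update; wait until the change is visible to search.
    es.update(index=index, id=doc_id, body={"doc": {"num": num}}, refresh="wait_for")
    # Ask for a merge down to one segment, which is meant to drop deleted copies.
    es.indices.forcemerge(index=index, max_num_segments=1)
```

Inside the loop this would replace the bare `es.update(...)` call, for example `update_and_merge(es, "test-index", 2, i)`.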

However, starting with 7.4.0 the problem returns even with the forcemerge call in place. I'm guessing forcemerge is no longer effective in the same way it was before.

Any advice would be greatly appreciated. I'm just trying to get my basic integration test suite in working order, not alter the behavior of the production runtime.
