Repeated kNN searches with the same dense_vector request return inconsistent results

On an index with no replicas and no ongoing writes, repeating the same kNN request sometimes returns different results.
The issue is reproducible on versions 8.13.4, 8.15.1, and 8.17.0, but not on 8.7.0.

  1. Create the index.
curl --location --request PUT 'http://elasticsearch:9200/vector_test' \
--header 'Content-Type: application/json' \
--data '{
    "mappings": {
        "dynamic": "strict",
        "properties": {
            "vector": {
                "type": "dense_vector",
                "dims": 1024,
                "index": true,
                "similarity": "cosine",
                "index_options": {
                    "type": "hnsw",
                    "m": 16,
                    "ef_construction": 100
                }
            }
        }
    },
    "settings": {
        "index": {
            "routing": {
                "allocation": {
                    "include": {
                        "_tier_preference": "data_content"
                    }
                }
            },
            "refresh_interval": "30s",
            "number_of_shards": "1",
            "number_of_replicas": "0"
        }
    }
}'
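To double-check that the index was created as intended, the mapping and settings can be read back with a standard GET on the index (same placeholder host as above):

curl --location --request GET 'http://elasticsearch:9200/vector_test?pretty'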
  2. Write 10,000 random vectors, then force a _refresh.
# -*- coding:utf-8 -*-

import json
import time

import numpy as np
import requests

REFRESH_URL = 'http://elasticsearch:9200/vector_test/_refresh'
BULK_URL = 'http://elasticsearch:9200/vector_test/_bulk'

request = requests.session()

# Return a single random float drawn uniformly from [min_value, max_value)
def float32_uniform(min_value, max_value):
    random_float = np.random.uniform(min_value, max_value)
    return float(random_float)


def write():
    tmp_str = ''
    count = 0
    for doc_id in range(10000):
        # Build a 1024-dimensional vector with components uniform in [-1, 1)
        vector = [float32_uniform(-1, 1) for _ in range(1024)]
        data = {'vector': vector}
        tmp_str += '{"index":{"_id":"' + str(doc_id) + '"}}\n' + json.dumps(data) + '\n'
        count += 1
        if count == 1000:
            # Flush a bulk batch of exactly 1000 documents
            res = request.post(url=BULK_URL, headers={"Content-Type": "application/x-ndjson"}, data=tmp_str)
            print(res.text)
            tmp_str = ''
            count = 0
            time.sleep(0.2)
    if count != 0 and tmp_str != '':
        # Flush the final partial batch, if any
        print(request.post(url=BULK_URL, headers={"Content-Type": "application/x-ndjson"}, data=tmp_str).json())
    request.post(REFRESH_URL)
    print("write success.")


if __name__ == '__main__':
    write()
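Once the script finishes, the document count can be verified with the standard _count API; after the _refresh it should report 10,000 documents:

curl --location --request GET 'http://elasticsearch:9200/vector_test/_count'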

  3. Run the reproduction test. The experiment is repeated 100 times: each iteration builds a random query vector and issues the same search 100 times.
# -*- coding:utf-8 -*-

import json

import numpy as np
import requests

SEARCH_URL = 'http://elasticsearch:9200/vector_test/_search'

request = requests.session()


# Return a single random float drawn uniformly from [min_value, max_value)
def float32_uniform(min_value, max_value):
    random_float = np.random.uniform(min_value, max_value)
    return float(random_float)


# Issue the same kNN search loop_count times and count how often each distinct hit list occurs
def request_test(loop_count, k, num_candidates):
    vector = [float32_uniform(-1, 1) for _ in range(1024)]
    body = {"from": 0, "size": 10,
            "knn": {"field": "vector", "query_vector": vector, "k": k, "num_candidates": num_candidates},
            "_source": False}
    result_dict = {}
    for i in range(loop_count):
        response = request.post(url=SEARCH_URL, json=body).json()
        hits = response['hits']['hits']
        hits_str = json.dumps(hits, ensure_ascii=False)
        result_dict[hits_str] = result_dict.get(hits_str, 0) + 1

    # Treat the most frequent result as the baseline; any other result is a mismatch
    base_count = max(result_dict.values())
    error_count = loop_count - base_count
    print('{}/{}'.format(base_count, error_count))
    return base_count, error_count


if __name__ == '__main__':
    success = total = 0
    for i in range(100):
        base, error_count = request_test(loop_count=100, k=10, num_candidates=20)
        success += base
        total += base + error_count
    print('{}/{}'.format(success, total))
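Because the behavior appears to depend on segment layout (see the force merge note below), it may help to record the segment count alongside each test run; the standard _cat API shows it:

curl --location --request GET 'http://elasticsearch:9200/_cat/segments/vector_test?v'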

Below are the test results from version 8.17.0, which show the consistency issue; versions 8.13.4 and 8.15.1 show the same problem.

The following are the test results from version 8.7.0, where consistency was 100%.

If the index is force merged into a single segment with _forcemerge, the results become stable again.
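For reference, the force merge I mean is the standard _forcemerge API, merging down to one segment:

curl --location --request POST 'http://elasticsearch:9200/vector_test/_forcemerge?max_num_segments=1'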

I've filed an issue on GitHub, but no one has responded. Could someone take the time to verify whether my conclusion is correct?