I am not sure whether anyone here is familiar with the Elasticsearch source code and could help me with an issue I raised on GitHub. At present I cannot get stable results for the same query conditions, and the cause appears to be that the refresh thread pool of one data node is full.
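For reference, this is roughly how I look at the refresh thread pool backlog. It is a minimal sketch using the low-level Java REST client (elasticsearch-rest-client); the host and port are placeholders for our environment:

```java
import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class RefreshThreadPoolCheck {
    public static void main(String[] args) throws Exception {
        // Host and port are placeholders; adjust for your cluster.
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // Per-node active/queued/rejected counts for the refresh thread pool.
            Request request = new Request("GET", "/_cat/thread_pool/refresh");
            request.addParameter("v", "true");
            request.addParameter("h", "node_name,name,active,queue,rejected,completed");
            Response response = client.performRequest(request);
            System.out.println(EntityUtils.toString(response.getEntity()));
        }
    }
}
```

In our case the queue value on one data node never decreases, which is what I mean by "full".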
@DavidTurner
Unfortunately, this question has been open in this forum for almost a month and has received no attention. - ,-
Also, the fact that no other users have reported this problem does not mean this version is free of it. Perhaps the ARM CPU architecture and operating system I am using are also worth looking at?
Regarding the thread stacks, I used jstack to collect a full dump when the problem occurred. Please give me some guidance if you have time.
https://github.com/killersteps/jstack-dump/blob/main/jstack-2.txt
1. Swap is not turned off on our node servers.
2. Each node uses a lot of VIRT, but memory is relatively plentiful (in other words, I do not believe Elasticsearch's logic here would deadlock just because memory is tight). A sketch of how we check the memory-lock status follows this list.
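Here is a minimal sketch of how the memory-lock status can be verified per node, again using the low-level Java REST client; the host and port are placeholders:

```java
import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class MemoryLockCheck {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // "mlockall" is true only on nodes where bootstrap.memory_lock took effect.
            Request request = new Request("GET", "/_nodes/process");
            request.addParameter("filter_path", "nodes.*.name,nodes.*.process.mlockall");
            Response response = client.performRequest(request);
            System.out.println(EntityUtils.toString(response.getEntity()));
        }
    }
}
```

If mlockall is false on a node, the heap is not locked and can be swapped out, which matches point 1 above.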
I saw on the GitHub issue that you have not followed the best practices of disabling swap and making sure Elasticsearch has full access to the memory of the host/VM.
As you seem to have an issue with refreshing, it would also help if you could provide information about the specification of the cluster, especially the type of storage used. You say that the document count does not change, but the thread mentions refreshes. Are you actively updating documents in the index? Some information about the use case might help provide context.
@Christian_Dahlqvist
Thank you for your reply. Indeed, not disabling swap and not enabling memory locking is a deployment mistake on our side. However, as we discussed before, we can suspect this is related, but without solid evidence we cannot conclude it is the cause.
The basic information of my environment is:
Elasticsearch version: 7.17.1
Cpu architecture: aarch64
All disk types are SSD
java version "1.8.0_333"
Java(TM) SE Runtime Environment (build 1.8.0_333-b02)
Java HotSpot(TM) 64-Bit Server VM (build 25.333-b02, mixed mode)
Cluster overview: 9 nodes, including 6 data nodes and 3 master nodes (server and JVM resources are absolutely sufficient).
Index (my_index) overview: 9 primary shards, 1 replica.
Each index contains about 10 million documents. I do not call the _flush or _refresh API to manually intervene in the cluster's default behaviour when writing documents. Frankly, the cluster configuration is almost identical to the factory defaults.
To be honest, what I suspect most at the moment is a problem at the operating-system level, because I used strace to trace memory page-fault activity, and it seems to be related to code generated by the libjvm.so library.
strace -o strace.txt -fp es-datanode-pid -e trace=stat,open,unlink,close
I don't know if there is a deadlock with the warmer thread here.
In org.elasticsearch.common.cache.Cache#get(K, long, boolean), the promote method is called under the else branch (i.e. on a cache hit), and promote acquires a lock.
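To make the pattern I am describing concrete, here is a heavily simplified sketch of a get() that promotes a hit under a lock. This is not the actual Elasticsearch Cache source; the names and structure are illustrative only:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Heavily simplified sketch of the get-then-promote pattern described above.
// This is NOT the Elasticsearch Cache source; names and structure are illustrative only.
class LruCacheSketch<K, V> {

    private static final class Entry<V> {
        final V value;
        long lastAccess;
        Entry(V value, long now) { this.value = value; this.lastAccess = now; }
    }

    private final Map<K, Entry<V>> map = new ConcurrentHashMap<>();
    private final ReentrantLock lruLock = new ReentrantLock();

    void put(K key, V value, long now) {
        map.put(key, new Entry<>(value, now));
    }

    V get(K key, long now) {
        Entry<V> entry = map.get(key);
        if (entry == null) {
            return null;            // miss: the caller would load the value and put it
        } else {
            promote(entry, now);    // hit (the "else" branch): LRU bookkeeping under a lock
            return entry.value;
        }
    }

    private void promote(Entry<V> entry, long now) {
        // If one thread stalls while holding this lock, every other reader that hits
        // the cache queues up behind it; this is why I wonder about the warmer thread.
        lruLock.lock();
        try {
            entry.lastAccess = now; // a real cache would also re-link the entry in an LRU list
        } finally {
            lruLock.unlock();
        }
    }

    public static void main(String[] args) {
        LruCacheSketch<String, String> cache = new LruCacheSketch<>();
        cache.put("k", "v", System.nanoTime());
        System.out.println(cache.get("k", System.nanoTime())); // prints "v"
    }
}
```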
Fixing this and checking if it has any impact on the issue would allow us to confirm or rule out whether it is a contributing factor.
Is this local SSD or some kind of networked SSD based storage?
What resources are assigned to the different node types? Is this an on-prem or cloud deployment?
Based on what you have said I assume you are updating but not adding new data to each index, is that correct? If so, how frequently are you updating? Are you using parent-child or nested documents?
@Christian_Dahlqvist
I agree with your point. Swap and memory locking will be a key direction for us in troubleshooting this problem. I am sorry I did not describe my cluster clearly and accurately; please allow me to explain:
The disks used in our cluster are all local devices rather than network-attached storage such as NAS. As mentioned before, they are all SSDs installed in physical servers.
As for the deployment architecture, we use local physical servers throughout, without any cloud resources (otherwise I would pull in the supplier to help us troubleshoot at the operating-system level, haha).
As for the indices, the situation is more complicated. The business creates a new index every 7 days, and each new index receives roughly 10 million docs during its 7-day window. In the index for the current 7 days I am constantly performing write, update, and delete operations; once the 7 days are over, the operations switch to the next index. This rotation is controlled entirely by the client.
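For illustration, this is roughly the kind of client-side rotation involved. It is a hypothetical sketch; the naming scheme ("my_index-<year>-w<week>") is a made-up example, not our actual one:

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.temporal.IsoFields;

// Hypothetical sketch of client-side weekly index rotation as described above.
public class WeeklyIndexName {
    static String currentIndexName() {
        LocalDate today = LocalDate.now(ZoneOffset.UTC);
        int week = today.get(IsoFields.WEEK_OF_WEEK_BASED_YEAR);
        int year = today.get(IsoFields.WEEK_BASED_YEAR);
        return String.format("my_index-%d-w%02d", year, week);
    }

    public static void main(String[] args) {
        // All writes, updates and deletes for the current 7-day window target this name;
        // once the week rolls over, the client switches to the next index automatically.
        System.out.println(currentIndexName());
    }
}
```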
I have allocated 32 GB of heap to each data node, and the physical servers they run on have 64-core CPUs. One thing I must point out: no node in our cluster acts as both master node and data node. However, the cluster does have a mixed deployment architecture, which has not caused problems so far; I am not sure whether it is related to this issue.
Finally, do you have a chat tool such as Telegram? I would like to share information with you in real time to solve the problem faster (and I am happy to pay for this).
Excellent.
I am assuming the inconsistency only applies to the index/indices being actively written to? Do you experience the issue if you only target indices that have refreshed and are no longer written to?
It is important to note that refreshes are not synchronised across replicas, so primary and replica shards will make changes available for search at different times. As you are adding as well as deleting documents, you will likely see inconsistencies if some changes have refreshed on the primary but not the replica, and vice versa. If you want to improve consistency across multiple queries I would recommend you use preference so the same set of shards is involved in each query.
I am just a volunteer, so I only log in from time to time.
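For example, a minimal sketch with the low-level Java REST client; the index name, preference value, host and port are placeholders:

```java
import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class PreferenceSearch {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            Request request = new Request("GET", "/my_index/_search");
            // Repeated searches with the same preference string are routed to the same
            // set of shard copies, so un-synchronised refreshes between primary and
            // replica no longer flip the results from call to call.
            request.addParameter("preference", "session-1234");
            request.setJsonEntity("{ \"query\": { \"match_all\": {} }, \"track_total_hits\": true }");
            Response response = client.performRequest(request);
            System.out.println(EntityUtils.toString(response.getEntity()));
        }
    }
}
```

Any stable string works as the preference value, as long as the same client keeps using the same value for the queries it wants to be mutually consistent.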
@Christian_Dahlqvist
The abnormal situation we found is this: when we run the same query conditions against several of last month's indices (from a business perspective I can guarantee that the docs in them no longer change, i.e. they are not updated, added, or deleted), we get different results. On investigation we found that the refresh thread pool queue is full. I also suspected this was related to the ongoing writes and deletes, but those operations actually target only the latest index for the current 7 days, and I do not know what impact they could have on the historical indices (personally I think none, because I saw that the searchable status of some segments of those historical indices is false; maybe the cluster had a problem at that time?).
As I shared a lot of information on GitHub: we do specify preference on the client, and in the search scenario we use hash(id) as its value. But I think that is only a means of stabilizing the results returned for the same query conditions; it is not the reason my refresh thread pool fills up (I believe there is a deadlock, because the number of queued tasks never decreases). In other words, even without specifying preference, do you think track_total_hits can return inconsistent totals on an index whose data will never change again? I would think not.
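For completeness, this is how I check the searchable status of the segments. It is a minimal sketch with the low-level Java REST client; my_index stands for one of the historical indices, and the host and port are placeholders:

```java
import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class SegmentSearchableCheck {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // "searchable" is false for segments that have been written to disk
            // but have not yet been opened for search by a refresh.
            Request request = new Request("GET", "/_cat/segments/my_index");
            request.addParameter("v", "true");
            request.addParameter("h", "index,shard,prirep,segment,docs.count,searchable");
            Response response = client.performRequest(request);
            System.out.println(EntityUtils.toString(response.getEntity()));
        }
    }
}
```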
OK, I still hope we can communicate in real time, because I am not very familiar with Elasticsearch. - ,- @Christian_Dahlqvist