I am not sure whether anyone here is familiar with the Elasticsearch source code and could help me with an issue I raised on GitHub. At present I cannot get stable results for the same query conditions, and the cause appears to be that the refresh thread pool of one data node is full.
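For reference, this is roughly how I look at the refresh thread pool backlog. It is a minimal sketch using the low-level Java REST client (elasticsearch-rest-client); the host and port are placeholders for our environment:

```java
import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class RefreshThreadPoolCheck {
    public static void main(String[] args) throws Exception {
        // Host and port are placeholders; adjust for your cluster.
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // Per-node active/queued/rejected counts for the refresh thread pool.
            Request request = new Request("GET", "/_cat/thread_pool/refresh");
            request.addParameter("v", "true");
            request.addParameter("h", "node_name,name,active,queue,rejected,completed");
            Response response = client.performRequest(request);
            System.out.println(EntityUtils.toString(response.getEntity()));
        }
    }
}
```

In our case the queue value on one data node never decreases, which is what I mean by "full".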
@DavidTurner
Unfortunately, this question has been open in this forum for almost a month and has received no attention. - ,-
Also, the fact that no other users have reported this problem does not mean this version is free of it. Perhaps the ARM CPU architecture and operating system I am using are also worth looking at?
Regarding the thread stacks, I used jstack to collect a full dump when the problem occurred. Please give me some guidance if you have time.
https://github.com/killersteps/jstack-dump/blob/main/jstack-2.txt
1. Swap is not turned off on our node servers.
2. Each node uses a lot of VIRT, but memory is relatively plentiful (in other words, I do not believe Elasticsearch's logic here would deadlock just because memory is tight). A sketch of how we check the memory-lock status follows this list.
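Here is a minimal sketch of how the memory-lock status can be verified per node, again using the low-level Java REST client; the host and port are placeholders:

```java
import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class MemoryLockCheck {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // "mlockall" is true only on nodes where bootstrap.memory_lock took effect.
            Request request = new Request("GET", "/_nodes/process");
            request.addParameter("filter_path", "nodes.*.name,nodes.*.process.mlockall");
            Response response = client.performRequest(request);
            System.out.println(EntityUtils.toString(response.getEntity()));
        }
    }
}
```

If mlockall is false on a node, the heap is not locked and can be swapped out, which matches point 1 above.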
I saw on the GitHub issue that you have not followed the best practices of disabling swap and making sure Elasticsearch has full access to the memory of the host/VM.
As you seem to have an issue with refreshing, it would also help if you could provide information about the specification of the cluster, especially the type of storage used. You say that the document count does not change, but the thread mentions refreshes. Are you actively updating documents in the index? Some information about the use case might help provide context.
@Christian_Dahlqvist
Thank you for your reply. Indeed, not disabling swap and not enabling memory locking is a deployment mistake on our side. However, as we discussed before, we can suspect this is related, but without solid evidence we cannot conclude it is the cause.
The basic information of my environment is:
Elasticsearch version: 7.17.1
Cpu architecture: aarch64
All disk types are SSD
java version "1.8.0_333"
Java(TM) SE Runtime Environment (build 1.8.0_333-b02)
Java HotSpot(TM) 64-Bit Server VM (build 25.333-b02, mixed mode)
Cluster overview: 9 nodes, including 6 data nodes and 3 master nodes (server and JVM resources are absolutely sufficient).
Index (my_index) overview: 9 primary shards, 1 replica.
Each index contains about 10 million documents. I do not call the _flush or _refresh API to manually intervene in the cluster's default behaviour when writing documents. Frankly, the cluster configuration is almost identical to the factory defaults.
To be honest, what I suspect most at the moment is a problem at the operating-system level, because I used strace to trace memory page-fault activity, and it seems to be related to code generated by the libjvm.so library.
strace -o strace.txt -fp es-datanode-pid -e trace=stat,open,unlink,close
I don't know if there is a deadlock with the warmer thread here.
In org.elasticsearch.common.cache.Cache#get(K, long, boolean), the promote method is called under the else branch (i.e. on a cache hit), and promote acquires a lock.
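To make the pattern I am describing concrete, here is a heavily simplified sketch of a get() that promotes a hit under a lock. This is not the actual Elasticsearch Cache source; the names and structure are illustrative only:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Heavily simplified sketch of the get-then-promote pattern described above.
// This is NOT the Elasticsearch Cache source; names and structure are illustrative only.
class LruCacheSketch<K, V> {

    private static final class Entry<V> {
        final V value;
        long lastAccess;
        Entry(V value, long now) { this.value = value; this.lastAccess = now; }
    }

    private final Map<K, Entry<V>> map = new ConcurrentHashMap<>();
    private final ReentrantLock lruLock = new ReentrantLock();

    void put(K key, V value, long now) {
        map.put(key, new Entry<>(value, now));
    }

    V get(K key, long now) {
        Entry<V> entry = map.get(key);
        if (entry == null) {
            return null;            // miss: the caller would load the value and put it
        } else {
            promote(entry, now);    // hit (the "else" branch): LRU bookkeeping under a lock
            return entry.value;
        }
    }

    private void promote(Entry<V> entry, long now) {
        // If one thread stalls while holding this lock, every other reader that hits
        // the cache queues up behind it; this is why I wonder about the warmer thread.
        lruLock.lock();
        try {
            entry.lastAccess = now; // a real cache would also re-link the entry in an LRU list
        } finally {
            lruLock.unlock();
        }
    }

    public static void main(String[] args) {
        LruCacheSketch<String, String> cache = new LruCacheSketch<>();
        cache.put("k", "v", System.nanoTime());
        System.out.println(cache.get("k", System.nanoTime())); // prints "v"
    }
}
```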
Fixing this and checking if it has any impact on the issue would allow us to confirm or rule out whether it is a contributing factor.
Is this local SSD or some kind of networked SSD based storage?
What resources are assigned to the different node types? Is this an on-prem or cloud deployment?
Based on what you have said I assume you are updating but not adding new data to each index, is that correct? If so, how frequently are you updating? Are you using parent-child or nested documents?
@Christian_Dahlqvist
I agree with your point. Swap and memory locking will be a key direction for us in troubleshooting this problem. I am sorry I did not describe my cluster clearly and accurately; please allow me to explain:
The disks used in our cluster are all local devices rather than network-attached storage such as NAS. As mentioned before, they are all SSDs installed in physical servers.
As for the deployment architecture, we use local physical servers throughout, without any cloud resources (otherwise I would pull in the supplier to help us troubleshoot at the operating-system level, haha).
As for the indices, the situation is more complicated. The business creates a new index every 7 days, and each new index receives roughly 10 million docs during its 7-day window. In the index for the current 7 days I am constantly performing write, update, and delete operations; once the 7 days are over, the operations switch to the next index. This rotation is controlled entirely by the client.
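For illustration, this is roughly the kind of client-side rotation involved. It is a hypothetical sketch; the naming scheme ("my_index-<year>-w<week>") is a made-up example, not our actual one:

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.temporal.IsoFields;

// Hypothetical sketch of client-side weekly index rotation as described above.
public class WeeklyIndexName {
    static String currentIndexName() {
        LocalDate today = LocalDate.now(ZoneOffset.UTC);
        int week = today.get(IsoFields.WEEK_OF_WEEK_BASED_YEAR);
        int year = today.get(IsoFields.WEEK_BASED_YEAR);
        return String.format("my_index-%d-w%02d", year, week);
    }

    public static void main(String[] args) {
        // All writes, updates and deletes for the current 7-day window target this name;
        // once the week rolls over, the client switches to the next index automatically.
        System.out.println(currentIndexName());
    }
}
```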
I have allocated 32 GB of heap to each data node, and the physical servers they run on have 64-core CPUs. One thing I must point out: no node in our cluster acts as both master node and data node. However, the cluster does have a mixed deployment architecture, which has not caused problems so far; I am not sure whether it is related to this issue.
Finally, do you have a chat tool such as Telegram? I would like to share information with you in real time to solve the problem faster (and I am happy to pay for this).
Excellent.
I am assuming the inconsistency only applies to the index/indices being actively written to? Do you experience the issue if you only target indices that have refreshed and are no longer written to?
It is important to note that refreshes are not synchronised across replicas, so primary and replica shards will make changes available for search at different times. As you are adding as well as deleting documents, you will likely see inconsistencies if some changes have refreshed on the primary but not the replica, and vice versa. If you want to improve consistency across multiple queries I would recommend you use preference so the same set of shards is involved in each query.
I am just a volunteer, so I only log in from time to time.
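For example, a minimal sketch with the low-level Java REST client; the index name, preference value, host and port are placeholders:

```java
import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class PreferenceSearch {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            Request request = new Request("GET", "/my_index/_search");
            // Repeated searches with the same preference string are routed to the same
            // set of shard copies, so un-synchronised refreshes between primary and
            // replica no longer flip the results from call to call.
            request.addParameter("preference", "session-1234");
            request.setJsonEntity("{ \"query\": { \"match_all\": {} }, \"track_total_hits\": true }");
            Response response = client.performRequest(request);
            System.out.println(EntityUtils.toString(response.getEntity()));
        }
    }
}
```

Any stable string works as the preference value, as long as the same client keeps using the same value for the queries it wants to be mutually consistent.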
@Christian_Dahlqvist
The abnormal situation we found is this: when we run the same query conditions against several of last month's indices (from a business perspective I can guarantee that the docs in them no longer change, i.e. they are not updated, added, or deleted), we get different results. On investigation we found that the refresh thread pool queue is full. I also suspected this was related to the ongoing writes and deletes, but those operations actually target only the latest index for the current 7 days, and I do not know what impact they could have on the historical indices (personally I think none, because I saw that the searchable status of some segments of those historical indices is false; maybe the cluster had a problem at that time?).
As I shared a lot of information on GitHub: we do specify preference on the client, and in the search scenario we use hash(id) as its value. But I think that is only a means of stabilizing the results returned for the same query conditions; it is not the reason my refresh thread pool fills up (I believe there is a deadlock, because the number of queued tasks never decreases). In other words, even without specifying preference, do you think track_total_hits can return inconsistent totals on an index whose data will never change again? I would think not.
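For completeness, this is how I check the searchable status of the segments. It is a minimal sketch with the low-level Java REST client; my_index stands for one of the historical indices, and the host and port are placeholders:

```java
import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class SegmentSearchableCheck {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // "searchable" is false for segments that have been written to disk
            // but have not yet been opened for search by a refresh.
            Request request = new Request("GET", "/_cat/segments/my_index");
            request.addParameter("v", "true");
            request.addParameter("h", "index,shard,prirep,segment,docs.count,searchable");
            Response response = client.performRequest(request);
            System.out.println(EntityUtils.toString(response.getEntity()));
        }
    }
}
```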
OK, I still hope we can communicate in real time, because I am not very familiar with Elasticsearch. - ,- @Christian_Dahlqvist