Hello, everyone. This cluster has been running for a long time. Today a large number of pending tasks piled up. I checked the master log and found the following WARN message. After sending ClusterState, the master waits for a response from the bdes225 node until the wait times out.
Thanks, I've found the cause of the problem. But in this case, the thread dies or the JVM exits.
These errors were reported for several hours and the pending tasks piled up into the tens of thousands, seriously affecting the production environment.
I don't think it should affect the whole cluster, so I opened a PR.
Metadata operations are particularly frequent on our cluster. They are executed serially on a single thread and finally pass through the allocation module. I optimized that part of the logic, modified the source code, and compiled it.
The problem occurred because I replaced the original jar but, for special reasons, did not restart the process on this node. I reproduced the problem in the test environment. This is not a normal problem, sorry.
Right, so as suspected you've modified Elasticsearch and broken it. We cannot support that.
The correct behavior for Elasticsearch, when something happens that we cannot recover from, is for the process to die. That is what the uncaught exception handler does. A NoClassDefFoundError is unrecoverable.
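To illustrate the idea of dying on an unrecoverable error, here is a minimal sketch of an uncaught exception handler in plain Java. It is not Elasticsearch's actual handler: the class name FatalErrorHandler, the injectable exit action, and the exit status 1 are all assumptions made for this example.

```java
import java.util.function.IntConsumer;

// Hypothetical sketch, not Elasticsearch's implementation.
final class FatalErrorHandler implements Thread.UncaughtExceptionHandler {
    // How to terminate the process; injected so the handler can be tested
    // without actually killing the JVM (an assumption of this sketch).
    private final IntConsumer exit;

    FatalErrorHandler(IntConsumer exit) {
        this.exit = exit;
    }

    // Treat every java.lang.Error as unrecoverable. NoClassDefFoundError
    // is a LinkageError, which is a subclass of Error.
    static boolean isFatal(Throwable t) {
        return t instanceof Error;
    }

    @Override
    public void uncaughtException(Thread thread, Throwable t) {
        if (isFatal(t)) {
            System.err.println("fatal error in thread [" + thread.getName() + "], exiting: " + t);
            exit.accept(1); // status 1 is an arbitrary choice for this sketch
        } else {
            // Ordinary exceptions are only logged; the process keeps running.
            System.err.println("uncaught exception in thread [" + thread.getName() + "]: " + t);
        }
    }
}
```

In a real process the handler could be installed with `Thread.setDefaultUncaughtExceptionHandler(new FatalErrorHandler(Runtime.getRuntime()::halt))`, so that a NoClassDefFoundError escaping any thread stops the whole JVM instead of leaving a half-broken node in the cluster.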