A week ago, we had to recreate one of the 3 nodes of our ELK cluster due to an exception when trying to stop the service (Docker container). To solve it, I had to kill the container process and then recreate the node. To let the node work with the same data directory, I had to delete the lock files. I know that these practices are not recommended, but at the time they were the only way I found to solve the issue.
After all this, the great majority of the shards were assigned, but some replica shards were not. One week later, we are still fighting with some of these replica shards. 70% of them are from indices with a size of about 100 GB. From time to time relocation fails and no reason is shown in "explanation".
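For anyone wanting to look at the same information, here is a minimal sketch of how the unassigned replica shards and their allocation explanations can be listed. It assumes the cluster is reachable at `http://localhost:9200` without authentication (both are assumptions, not our real setup), and the helper names are only illustrative. It uses the `_cat/shards` and `_cluster/allocation/explain` APIs.

```python
import json
import urllib.request

# Assumption: adjust to your own cluster address (and add auth if needed).
ES_URL = "http://localhost:9200"


def unassigned_replicas(cat_shards):
    """Keep only unassigned replica rows from GET _cat/shards?format=json."""
    return [
        row for row in cat_shards
        if row.get("prirep") == "r" and row.get("state") == "UNASSIGNED"
    ]


def explain_body(row):
    """Build the request body for _cluster/allocation/explain for one replica."""
    return {"index": row["index"], "shard": int(row["shard"]), "primary": False}


def get_json(path, body=None):
    """Send a GET (no body) or POST (with JSON body) and decode the JSON reply."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(
        ES_URL + path, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def main():
    # List all shards, then ask the cluster why each unassigned replica is stuck.
    rows = get_json(
        "/_cat/shards?format=json&h=index,shard,prirep,state,unassigned.reason"
    )
    for row in unassigned_replicas(rows):
        explanation = get_json("/_cluster/allocation/explain", explain_body(row))
        print(json.dumps(explanation, indent=2))
```

Calling `main()` prints one explanation per unassigned replica. The filtering is kept separate from the HTTP calls so it can be checked without a running cluster.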
Any advice?
Thanks in advance.