I am currently managing a 12-node ES cluster (version 2.1.1). Yesterday, one node came back from a restart without any shards: the Marvel plugin lists 0 shards for that node on the “Nodes” page, and piping _cat/shards through grep for the node name matches zero lines.
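For reference, the _cat/shards check can be sketched in a few lines of Python. This is only an illustration: it assumes the output was requested with `?h=index,shard,prirep,state,node`, and the index and node names in the sample are made up.

```python
from collections import Counter


def count_shards_per_node(cat_output):
    """Count shards per node from `_cat/shards?h=index,shard,prirep,state,node` text."""
    counts = Counter()
    for line in cat_output.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # blank or malformed line
        state = fields[3]
        # Node names may contain spaces; a relocating shard is shown as
        # "source -> target...", so only the part before the arrow is kept.
        node = " ".join(fields[4:]).split(" -> ")[0]
        if state == "UNASSIGNED" or not node:
            node = "(unassigned)"
        counts[node] += 1
    return counts


# Hypothetical sample output; index and node names are invented.
SAMPLE = """\
logs-2016.01 0 p STARTED node-a
logs-2016.01 0 r STARTED node-b
logs-2016.01 1 p STARTED node-b
logs-2016.01 1 r UNASSIGNED
"""

if __name__ == "__main__":
    for node, count in sorted(count_shards_per_node(SAMPLE).items()):
        print(node, count)
```

A node that holds no shards simply does not appear in the result, so looking it up in the returned counter gives 0.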
I set the log level to DEBUG but could not find anything about the physical data storage that helped fix the problem, or even diagnose what the problem actually is.
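As a sketch of what raising the log level for selected loggers could look like: in ES 2.x, logger levels can be changed at runtime through the cluster settings API using the `logger.` prefix. The logger names below are assumptions (package names under org.elasticsearch that look storage-related), not a verified list, and the helper function is hypothetical.

```python
import json


def debug_logging_body(loggers, level="DEBUG"):
    """Build a cluster-settings body that sets each given logger to `level`."""
    return {"transient": {"logger." + name: level for name in loggers}}


# Assumed logger names: org.elasticsearch.* packages with the prefix dropped.
STORAGE_LOGGERS = ["gateway", "env", "indices.store", "index.shard"]

if __name__ == "__main__":
    # The printed JSON would be sent as the body of: PUT /_cluster/settings
    print(json.dumps(debug_logging_body(STORAGE_LOGGERS), indent=2))
```

Transient settings are lost on a full cluster restart, so this does not leave the cluster permanently noisy.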
The data on disk looks okay: the directory structure matches what’s on the other nodes, file sizes seem to be in the right range (no empty or overfull directories), and permissions are the same everywhere.
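The on-disk comparison can be made less manual with a small script. The sketch below assumes the ES 2.x layout `<path.data>/<cluster_name>/nodes/0/indices/<index>/<shard>/` and that each shard directory normally contains a `_state` subdirectory; both are assumptions about the layout, and the function name is made up.

```python
import os


def summarize_shard_dirs(indices_dir):
    """Return {(index, shard): {"files", "bytes", "has_state"}} for one node."""
    summary = {}
    for index in sorted(os.listdir(indices_dir)):
        index_path = os.path.join(indices_dir, index)
        if not os.path.isdir(index_path):
            continue
        for shard in sorted(os.listdir(index_path)):
            shard_path = os.path.join(index_path, shard)
            # Shard directories are numeric; the index-level _state is skipped.
            if not (shard.isdigit() and os.path.isdir(shard_path)):
                continue
            files = 0
            size = 0
            for root, _dirs, names in os.walk(shard_path):
                for name in names:
                    files += 1
                    size += os.path.getsize(os.path.join(root, name))
            summary[(index, int(shard))] = {
                "files": files,
                "bytes": size,
                "has_state": os.path.isdir(os.path.join(shard_path, "_state")),
            }
    return summary
```

Running this on a healthy node and on the affected node, then comparing the two results, would flag shard directories that are empty or lack `_state`. It says nothing about whether the files themselves are intact.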
The “solution” for this node was to simply remove the data directory and let ES rebalance itself. (Luckily I finally got around to setting every index to at least one replica some hours earlier.)
However, this morning a second node did the exact same thing: after a restart it would not have any shards. This time somebody else in the company started to manually assign the unassigned shards to that node so I can’t diagnose this any further now. Then again, there are still some nodes left that I need to restart to update some configuration parameters… and I have kind of a bad feeling here.
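For context, manually assigning shards in ES 2.x goes through the cluster reroute API with `allocate` commands. The following is a rough sketch of building such a request body, not a record of what was actually run here; the helper, index name, and node name are hypothetical.

```python
import json


def allocate_commands(unassigned, node, allow_primary=False):
    """Build a reroute body assigning each (index, shard) pair to `node`."""
    return {
        "commands": [
            {
                "allocate": {
                    "index": index,
                    "shard": shard,
                    "node": node,
                    # Caution: with allow_primary=True, ES may create an
                    # empty primary if no copy of the shard is available,
                    # which discards that shard's data.
                    "allow_primary": allow_primary,
                }
            }
            for index, shard in unassigned
        ]
    }


if __name__ == "__main__":
    # The printed JSON would be sent as the body of: POST /_cluster/reroute
    body = allocate_commands([("logs-2016.01", 1)], "node-c")
    print(json.dumps(body, indent=2))
```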
Now, what exactly is going on with a node that suddenly forgets its shards? How can I make ES use the data that exists on disk and seems to be in pristine condition? And how can I make ES log what it does with the storage when it loads shard metadata, and what fails (and hopefully why)?