We have a 2-node development ES cluster running 0.18.2. The two nodes
were restarted last week because the host VM needed a reboot. Both
instances were terminated cleanly using the wrapper and then restarted.
Discovered earlier today that the cluster did not come up cleanly. One
issue was that the ulimit setting had not been persisted, so rebooting
the server reverted the open-files limit to the default of 1024; "too
many open files" errors were common in the log files. Another issue was
the log files themselves: the repeated errors eventually consumed all
of the server's disk space.
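For what it's worth, a sketch of how the limit can be made persistent
on a typical Linux host (PAM-based logins assumed; the "elasticsearch"
user name is an assumption, substitute whatever user the wrapper runs
ES as):

    # Verify what the running JVM actually got (PID lookup is illustrative)
    cat /proc/$(pgrep -f elasticsearch | head -n1)/limits | grep 'open files'

    # /etc/security/limits.conf: survives reboots, applies to new sessions
    # format: <user> <soft|hard> nofile <value>
    elasticsearch  soft  nofile  32000
    elasticsearch  hard  nofile  32000

The hprof files also suggest the JVM is running with
-XX:+HeapDumpOnOutOfMemoryError; pointing -XX:HeapDumpPath at a
partition with spare room keeps those dumps from competing with the
data and logs for disk.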
Deleted all the log files and various hprof files and restarted the
cluster. The servers come up cleanly individually, but fail once both
are started and discover (via unicast) each other. The failures are
shard recovery errors saying the shard "should exists, but doesn't"
(quoting the log verbatim).
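While both nodes are up and failing, it may help to capture the
cluster's own view of the shards. A minimal sketch against the REST
API (default host/port assumed):

    # Overall health: status color plus counts of unassigned shards
    curl -s 'http://localhost:9200/_cluster/health?pretty=true'

    # Full cluster state: the routing table shows per-shard state
    # (UNASSIGNED / INITIALIZING / STARTED) and which node holds it
    curl -s 'http://localhost:9200/_cluster/state?pretty=true'

Comparing the routing table against the actual exceptions should show
whether the same shards fail every time or the failures move around.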
There are several indices present. Most have 5 shards; some have 1
replica, some have 0. The index in the gist with the failure is set
for 1 replica. There was another problematic index that was set for 0
replicas; one of its shards held most (all?) of the documents in the
index, about 85 GB. That index was deleted. Prior to the deletion
there were various other errors (MasterNotFound), but now the errors
are consistent.
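One thing that sometimes lets a cluster in this state settle: drop the
failing index to 0 replicas so only the primaries need to recover,
then raise it back once the cluster is green. A sketch using the
update-settings API ("myindex" is a placeholder for the index in the
gist):

    # Hypothetical index name; reduce to primaries only
    curl -XPUT 'http://localhost:9200/myindex/_settings' -d '{
      "index" : { "number_of_replicas" : 0 }
    }'

This is only a workaround for replica allocation trouble; it won't
bring back a primary whose files are actually gone.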
What else can be done to recover the shards or discover where exactly
the problem lies?
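One way to narrow it down is to compare what the gateway expects
against what is physically on disk. A sketch assuming the default
local gateway layout of that era,
{path.data}/{cluster.name}/nodes/{ordinal}/indices/{index}/{shard}
(the /var/lib/elasticsearch prefix and the cluster name "mycluster"
are assumptions):

    # Which index/shard directories exist on this node?
    ls -l /var/lib/elasticsearch/mycluster/nodes/0/indices/

    # How big is each shard? A shard the cluster says "should exists,
    # but doesn't" that has an empty index/ dir really is missing here
    du -sh /var/lib/elasticsearch/mycluster/nodes/0/indices/*/*

Running that on both nodes would show whether the "missing" shard data
survived the reboot anywhere, which in turn says whether this is a
recovery/allocation problem or genuine data loss.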