Error restarting cluster


(Ivan Brusic) #1

We have a two-node development ES cluster running 0.18.2. The two nodes
were restarted last week because the host VM needed a reboot. Both
instances were terminated cleanly using the wrapper and then restarted
on boot.

Discovered earlier today that the cluster did not come up cleanly. One
issue was that the ulimit setting was not persisted, so restarting the
server caused the ulimit to revert to the default limit of 1024.
"Too many open files" errors were common in the log files. Another issue
was the log files themselves: the repeated errors eventually consumed
all the disk space on the server.
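One way to keep the limit from reverting after a reboot is to persist it and verify it at startup. This is only a sketch: the `elasticsearch` user name and the use of `/etc/security/limits.conf` are assumptions about this particular setup.

```shell
# Check the current open-files limit for this shell; 1024 is the usual default.
current=$(ulimit -n)
echo "current open-files limit: $current"

# Persisting a higher limit typically means adding lines like these to
# /etc/security/limits.conf (as root; "elasticsearch" is the assumed
# service user for this setup):
#
#   elasticsearch  soft  nofile  32000
#   elasticsearch  hard  nofile  32000
#
# The wrapper/init script can also raise the limit itself before starting
# the JVM, so a forgotten PAM setting does not silently revert it:
#   ulimit -n 32000 || echo "could not raise limit; check the hard limit" >&2
```

Raising the limit in the startup script as well as in `limits.conf` guards against exactly the failure mode described above, where a reboot quietly drops the limit back to 1024.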

Deleted all the log files and various hprof files and restarted the
cluster. The servers come up cleanly individually, but fail once both
are started and discover each other (via unicast). The failures are
shards that "should exists, but doesn't" (quoting the error verbatim):

https://gist.github.com/1930264

There are several indices present. Most have 5 shards; some have 1
replica, some have 0. The index in the gist with the failure is set
for 1 replica. There was another problematic index that was set for 0
replicas. One of its shards held most (all?) of the items in the index:
85 GB. That index was deleted. Prior to the deletion, there were various
other errors (MasterNotFound), but now the errors are consistent.

What else can be done to recover the shards or discover where exactly
the problem lies?
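For diagnosing this kind of failure, the cluster-state APIs (available in the 0.18.x line) show which shard copies are unassigned and how the cluster sees each index. The node address `localhost:9200` is an assumption; substitute one of the two nodes.

```shell
# Assumed node address for this two-node dev cluster; adjust as needed.
ES=${ES:-http://localhost:9200}

# Only query if a node is actually reachable, so the script degrades
# gracefully when the cluster is down.
if command -v curl >/dev/null && curl -s --max-time 2 "$ES" >/dev/null 2>&1; then
  # Overall health: red/yellow/green plus the unassigned-shard count.
  curl -s "$ES/_cluster/health?pretty=true"
  # Per-index health, to narrow the problem down to specific indices.
  curl -s "$ES/_cluster/health?level=indices&pretty=true"
  # Full routing table: shows which shard copies are UNASSIGNED.
  curl -s "$ES/_cluster/state?pretty=true"
else
  echo "no cluster reachable at $ES"
fi
```

Comparing the routing table against the on-disk shard directories under the data path is one way to see whether the shard data the master expects is actually missing.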

Cheers,

Ivan


(Shay Banon) #2

There were some bugs related to that which were fixed in newer versions.

On Tuesday, February 28, 2012 at 9:09 AM, Ivan Brusic wrote:



(Ivan Brusic) #3

Thanks. I saw there was a fix in a 0.17.x release, but nothing after
0.18.2. I did notice that 0.19 has updated logic for shard metadata
storage.

0.19.0.RC3 did not work with the existing data directory (as
expected). The data was not important, but this scenario would not be
ideal for a production environment. Is there any way to save the data?
What was the root cause of the issue to begin with? Starting it with
the wrong ulimit?

--
Ivan

On Wed, Feb 29, 2012 at 5:48 AM, Shay Banon kimchy@gmail.com wrote:



(Shay Banon) #4

Can you check whether it happens with 0.19? It should be simple to recreate: simply start it with a low ulimit and trigger the problems mentioned. Somehow the shard data got removed, and when the cluster tried to allocate the shard, the data was not there.
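The reproduction recipe above can be sketched as a small script. This only illustrates the sequence (start a node under a deliberately low limit after data has been indexed); the install path is a placeholder for whatever layout is actually in use, and `-f` is the foreground flag in 0.19-era startup scripts.

```shell
# Placeholder install location; adjust for the actual layout.
ES_HOME=${ES_HOME:-/opt/elasticsearch-0.19.0.RC3}

if [ -x "$ES_HOME/bin/elasticsearch" ]; then
  # Run the node in a subshell so the lowered limit applies only to it,
  # not to the invoking shell.
  (
    ulimit -n 256   # deliberately too low, to trigger "Too many open files"
    "$ES_HOME/bin/elasticsearch" -f   # runs in the foreground
  )
else
  echo "no elasticsearch install at $ES_HOME"
fi
```

After the node fails under the low limit, restarting it with a proper limit shows whether the previously indexed shard data survived.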

On Wednesday, February 29, 2012 at 8:13 PM, Ivan Brusic wrote:



(Shay Banon) #5

I just ran several tests trying to recreate it (with 0.19). I basically started a cluster and indexed data into it, then restarted the cluster with a low ulimit value. It failed, as expected, but no data was lost.

On Thursday, March 1, 2012 at 2:31 PM, Shay Banon wrote:


