I've noticed that if I restart a cluster with existing data, ES-HEAD and now also MARVEL report the cluster status as "red", with only one node and far more of what I think are shards than were reported when the cluster was running before the shutdown.
Despite the "red" status, ES-HEAD still reports the individual nodes in the cluster as healthy and working.
I'm speculating that this might be because each node's ID changes when it starts up anew, although I've fixed the node name (label).
Is there a solution to this, or does data have to be fed in anew whenever a cluster is restarted?
A corollary question: is it advisable to purge old cluster data when the cluster is started up anew, and is recovery based on transaction logs or something else?
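For reference, the same status and counts that ES-HEAD and Marvel display can be pulled straight from the REST API - a rough sketch, assuming the default HTTP port 9200 and a version recent enough to have the _cat API:

curl -s 'localhost:9200/_cluster/health?pretty'   # status, number_of_nodes, unassigned_shards
curl -s 'localhost:9200/_cat/shards?v'            # per-shard state; look for UNASSIGNED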
Normal cluster restarts don't cause a red status. If all the individual nodes are reporting green, then there isn't data loss. Make sure you aren't suffering from a split-brain problem or accidentally connecting to an unrelated cluster in the same data center.
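A quick way to rule both of those out - a sketch, where node1 and node2 stand in for your actual hosts and the default port is assumed:

# every node should report the same elected master
curl -s 'node1:9200/_cat/master?v'
curl -s 'node2:9200/_cat/master?v'
# and the same cluster_name
curl -s 'node1:9200/_cluster/health?pretty' | grep cluster_name
curl -s 'node2:9200/_cluster/health?pretty' | grep cluster_name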
OK,
Thx. If it's not normal then I'll inspect my steps more closely. I was able to replicate the scenario once (beyond when it was first observed).
The way I have it set up, there shouldn't be any possibility of a split-brain (are you referring to name resolution?) or an unrelated cluster.
Note I'm not talking about simply restarting ES nodes immediately, which could mean re-using PIDs and cgroup IDs; I'm talking about a complete shutdown and restart of the cluster, which I'm speculating likely creates new node IDs.
Tony
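For what it's worth, the generated node id versus the fixed node.name can be compared across restarts with something like this (default port assumed; the keys under "nodes" in the response are the generated ids):

# run before and after a full restart and compare
curl -s 'localhost:9200/_nodes?pretty'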
Tony,
You're not doing a kill -9 during shutdown, I hope. If so, that would result in a large window of opportunity for index corruption.
Just something to check for...
We always do a normal kill to the pid within the pid file to shut down an
ES instance before shutting down the machine itself, or before upgrading
the software. And we have never seen any issues with the cluster coming back
up in the same (usable, usually yellow or green) state that it was before
the shutdown.
On two occasions we have had machines power off due to thermal overload in
the server room. This is a drastic event that is usually as dangerous (to
disk data integrity) as a kill -9, but in these cases there wasn't any load
on the machine and we experienced no data loss nor did we see the cluster
as anything but green once the machine came back up and the node restarted.
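Concretely, something like this - the pid file path is just an example, adjust to your install:

# send the default SIGTERM, never SIGKILL (-9), and let ES exit on its own
kill "$(cat /var/run/elasticsearch/elasticsearch.pid)"
# or, where ES is installed as a service:
# sudo service elasticsearch stop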
Thx for the input.
Nope, ES is being shut down "normally", usually by simply stopping the configured ES service, and only after it has fully completed its shutdown.
Tony
A couple of points:
1) If you bring down a whole cluster and start it back up, it may be that during the start-up process the cluster is red. The reason is that until all nodes have rejoined, some data may not (yet) be available for searching. This should resolve as soon as all the nodes are back (potentially earlier, depending on your replication settings); a small sketch for waiting on that is below.
2) Though not recommended - kill -9 should not result in data loss. If it does, it's a bug and should be reported.
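If you script the restart, something like this waits for recovery instead of polling the plugins - a sketch, default port assumed:

# blocks until the cluster reaches at least yellow (all primaries assigned),
# or gives up after 60 seconds
curl -s 'localhost:9200/_cluster/health?wait_for_status=yellow&timeout=60s&pretty'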
I've restarted the cluster a couple of times since and not seen what I saw before.
Been reading more of the documentation; am going to set minimum_master_nodes to 3, which is the suggested value for a 5-node cluster.
Currently speculating that, although I thought I'd been very careful to start the master significantly before any other node, something may have happened that time that caused the persistently red status.
Questions related to this general topic (restarting a cluster):
Q - Once a cluster has started up with a given node as the master, is there persistence in continuing to assign that role to that node, or is it completely arbitrary on every startup (i.e. what attributes is an election based on)? (A quick way to peek at the current master is sketched below.)
Q - If a cluster has started up with the wrong nodes in the master role, is it possible or advisable to try to modify their roles while the cluster is running, or is it advisable to shut down the cluster, re-configure, and start up again?
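For reference, the currently elected master can be seen with something like this - a sketch, default port assumed:

curl -s 'localhost:9200/_cat/master?v'
curl -s 'localhost:9200/_cat/nodes?v'   # the master column should mark the elected master with *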
Thx,
Tony
It's good you're going to use the minimum_master_nodes setting. When this number of master-eligible nodes has started (more on this in a second), one will be picked randomly as master, and that will stay so until the elected master becomes unreachable (e.g. is shut down).
If you want to control which nodes can become master, you can use the node.master setting in elasticsearch.yml and set it to false. Only nodes that have this set to true can become master. True is the default, which makes all nodes eligible. It is important to note that the minimum master nodes setting relates to the number of nodes in the cluster which have node.master set to true, not to all the nodes in the cluster - so adjust it accordingly.
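To make that concrete, a sketch only - the config path is an example, and it assumes the 5-node case above with all nodes left master-eligible:

# on each node, e.g. appended to /etc/elasticsearch/elasticsearch.yml
cat >> /etc/elasticsearch/elasticsearch.yml <<'EOF'
node.master: true                        # set to false on nodes that must never be elected
discovery.zen.minimum_master_nodes: 3    # quorum of the 5 master-eligible nodes
EOF

# minimum_master_nodes can also be changed on a running cluster:
curl -XPUT 'localhost:9200/_cluster/settings' -d '
{ "transient": { "discovery.zen.minimum_master_nodes": 3 } }'

node.master itself is read at startup, so changing it means editing the config and restarting that node.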
*2) Though not recommended - kill -9 should not result in data loss. If
so it's a bug and should be reported.*
It should not, but it may. A kill -9 ends a process without allowing it to flush any unwritten buffers to disk, close any open files, or even finish writing what it started. No process can detect or catch it; therefore no process can perform any cleanup, shutdown, or completion.
So file all the bugs you wish, but there is no code change that can be made to detect or handle a kill -9 - nothing in the Java code, and nothing in the underlying JVM that is the process itself - unless ES is redesigned so that any given disk block can be written or not and the entire index still remains fully consistent. (While the process cannot detect a kill -9, the OS does wait until it returns from any kernel call before ripping the rug out from under it.)
Kill -9 is just dangerous. No, it's not a guarantee of disaster, but the same could be said about walking blindfolded across the Autobahn.
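One small thing that narrows the window for a planned shutdown: flushing first forces a Lucene commit and empties the translog, so there is little or nothing left sitting in unwritten buffers - a sketch, default port assumed:

curl -XPOST 'localhost:9200/_flush'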
Good stuff about data integrity if a fool or disaster strikes.
Maybe down the road it would be worth documenting the atomicity of ES transactions (I understand there are likely higher priorities now, and that ensuring integrity needs to come before documenting it).
Tony