Ensuring Gateway integrity

Hey,
Things are going great with ES for us, but I wanted to bring up a
few dark spots we've encountered where there may be room for
improvement within ES or on our side.

We've hit some recurring issues in non-production environments that are
really making us think long and hard about how we are backing up and
monitoring our system. When we hit these cases, it seems that all bets
are off when it comes to gateway data integrity. I am not suggesting
that ES needs to gracefully handle all these cases, but these are
things others will run into, as well.

Things that should be easily preventable:

  • Max file handles
    This shouldn't happen on a system that is properly configured (most
    distros do need to bump up the limit from the 1024 default). We
    haven't had a problem with this since we've got it set appropriately,
    but we have corrupted our gateway and local nodes because of it in
    the past. I'd think it'd be pretty easy for ES to check the ulimit on
    startup and either loudly log that the value is too low or refuse to
    start up. This would save some people some pain. I believe I read in
    another thread that there is a Lucene fix to prevent index corruption
    when this occurs, but it isn't in a 3.x release yet. (A sketch of
    such a startup check, along with the disk-space check below, follows
    this list.)

  • Gateway disk space fills up
    We hit this when we'd made a backup copy of the gateway on the same
    drive, forgot about it, and loaded a lot more content. We've added
    monitoring to our systems to alarm when the gateway's free space
    drops below 10%.
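
Here's a rough sketch of the two checks I have in mind, in plain Python
with only the standard library (the threshold values and the gateway
path are placeholders for whatever you actually configure):

    import os
    import resource
    import sys

    MIN_OPEN_FILES = 32000            # placeholder threshold; pick what you consider safe
    MIN_FREE_FRACTION = 0.10          # alarm when gateway free space drops below 10%
    GATEWAY_PATH = "/mnt/es-gateway"  # hypothetical mount point of the fs gateway

    def check_open_files_limit():
        """Warn (or refuse to start) if the open-files ulimit is too low."""
        soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
        if soft < MIN_OPEN_FILES:
            sys.stderr.write("open files limit is %d (soft) / %d (hard); "
                             "want at least %d\n" % (soft, hard, MIN_OPEN_FILES))
            return False
        return True

    def check_gateway_free_space(path=GATEWAY_PATH):
        """Alarm when the gateway volume drops below the free-space threshold."""
        st = os.statvfs(path)
        free_fraction = float(st.f_bavail) / st.f_blocks
        if free_fraction < MIN_FREE_FRACTION:
            sys.stderr.write("gateway at %s has only %.1f%% free space\n"
                             % (path, free_fraction * 100))
            return False
        return True

    if __name__ == "__main__":
        # Run both checks so we report every problem, not just the first one.
        ok = all([check_open_files_limit(), check_gateway_free_space()])
        sys.exit(0 if ok else 1)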

Not as easily preventable:

  • Network saturation
    Running a couple of nodes on the same ESX host and another on a
    different one, we ended up completely saturating the single GigE
    connection, which caused a network partition and split brain. This
    also seemed to lead to gateway corruption. Both halves of the
    cluster were receiving documents to index and were both working off
    the same gateway. I believe that this led to two masters for a
    single shard, which then did not play nicely with the gateway and
    ended up corrupting it. We've ended up isolating internal ES traffic
    to its own NIC and all external traffic to the cluster for indexing
    and searching to a separate interface.

However, are there any safeguards in place to prevent two nodes from
working on the same gateway data? I believe that setting the minimum
number of nodes may help with this. So in a 3-node cluster, if you
always required 2 nodes, this would guarantee that you never end up
with two split clusters.

We do have code in place that attempts to detect the split cluster
condition by iterating over the nodes in the cluster, using the
transport client to get each node's view of the node count, and
comparing that to the node count reported by all the other servers.
Not sure if there is a better way to do this.
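
For reference, the gist of that check, sketched in Python against the
HTTP API rather than the Java transport client (the /_cluster/nodes
endpoint and the shape of its response are from memory, so treat this
as illustrative rather than exact):

    import json
    from urllib.request import urlopen

    # Hosts we expect in the cluster, taken from our own config, not from ES.
    EXPECTED_HOSTS = ["es1:9200", "es2:9200", "es3:9200"]

    def node_count_seen_by(host):
        """Ask one node how many nodes it currently sees in the cluster."""
        with urlopen("http://%s/_cluster/nodes" % host, timeout=5) as resp:
            data = json.load(resp)
        return len(data.get("nodes", {}))

    def detect_split_cluster(hosts=EXPECTED_HOSTS):
        counts = {}
        for host in hosts:
            try:
                counts[host] = node_count_seen_by(host)
            except OSError:
                counts[host] = None  # unreachable; worth alarming on as well
        reachable = [c for c in counts.values() if c is not None]
        # If any two nodes disagree on the cluster size, or anyone sees fewer
        # nodes than we expect, assume a partition and raise an alarm.
        split = len(set(reachable)) > 1 or any(c < len(hosts) for c in reachable)
        return split, counts

    if __name__ == "__main__":
        split, counts = detect_split_cluster()
        print("split suspected: %s, counts: %s" % (split, counts))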

Gateway Safeguards:
In order to make sure that we can quickly recover from any of these
scenarios, we are looking at having redundancy in the gateway, with a
main gateway that rsyncs to a backup gateway at certain intervals,
probably daily. This would hopefully allow us to switch to the backup
gateway and re-run any docs since the last rsync, bringing recovery
time down to around an hour vs. a day to rebuild ~100GB of content (ES
is not the bottleneck in the content rebuild; our backend storage is).
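
Concretely, the backup step would just be a periodic rsync of the
gateway directory run from cron; something along these lines (the
paths are placeholders for our layout):

    import subprocess
    import sys

    GATEWAY_DIR = "/mnt/es-gateway/"        # placeholder: the live fs gateway
    BACKUP_DIR = "/mnt/es-gateway-backup/"  # placeholder: a separate drive/host

    def backup_gateway(src=GATEWAY_DIR, dst=BACKUP_DIR):
        # -a preserves file metadata, --delete keeps the copy an exact mirror
        # of the source (the trailing slash on src copies its contents).
        result = subprocess.call(["rsync", "-a", "--delete", src, dst])
        if result != 0:
            sys.stderr.write("gateway backup failed, rsync exit code %d\n" % result)
        return result == 0

    if __name__ == "__main__":
        sys.exit(0 if backup_gateway() else 1)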

I read in a previous thread that you can take a hot copy of the
gateway and it will always be valid due to the usage of commit points
and append-only operations. Is this always 100% guaranteed, even if it
takes a long time? For instance, what if a merge occurs while the copy
is running, or a new commit point is written while the copy is
running?

Are there any other recommended safeguards to help ensure gateway
integrity?

Thanks!
Paul

Hi Paul

good points all

I read in a previous thread that you can take a hot copy of the
gateway and it will always be valid due to the usage of commit points
and append-only operations. Is this always 100% guaranteed, even if it
takes a long time? For instance, what if a merge occurs while the copy
is running, or a new commit point is written while the copy is
running?

What I'd add to this is the ability to add an index from this backup
without having to edit the metadata and restart the cluster.

We can already have 'closed' indices, but the cluster still needs to
know about the metadata.

I was hit by a corruption today (probably elasticsearch issue #466,
"Possible (rare) shard index corruption / different doc count on
recovery (gateway / shard)") and it would have been nice to be able to
recover my backup on the fly, bring it up to date, then switch aliases
- all without downtime.

clint

The other item that Paul and I have hit that can cause network
partitioning and potential index corruption is significant paging. We
had a cluster running with insufficient memory (or too much allocated
to ES) and under load it started paging. As the system became less
responsive, the zen node discovery timed out and a partition occurred.

We ended up tuning our memory allocations for ES down to 50% of
available RAM. Just remember to run realistic load tests with a wide
range of queries and to run them for a sustained period of time. Of
course when it happened to us, we were running 20x the average daily
query volume (and response times were still ~60ms).
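
For reference, the arithmetic behind that rule of thumb is simply half
of physical RAM for the JVM heap, leaving the rest to the OS page
cache. A trivial Linux-only sketch (reads /proc/meminfo; feed the
result into the -Xmx setting for the ES process):

    def suggested_heap_mb(meminfo_path="/proc/meminfo", fraction=0.5):
        """Suggest a JVM heap size (-Xmx) of roughly half of physical RAM."""
        with open(meminfo_path) as f:
            for line in f:
                if line.startswith("MemTotal:"):
                    total_kb = int(line.split()[1])  # MemTotal is reported in kB
                    return int(total_kb * fraction / 1024)
        raise RuntimeError("MemTotal not found in %s" % meminfo_path)

    if __name__ == "__main__":
        print("suggested -Xmx: %dm" % suggested_heap_mb())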

This week we began migrating search traffic from our existing search
infrastructure to ES and we've been extremely pleased with
performance.

David

Hi Paul, good questions, answers below:

On Fri, Nov 5, 2010 at 7:57 PM, Paul ppearcy@gmail.com wrote:

Hey,
Things are going great with ES for us, but I wanted to bring up a
few dark spots we've encountered where there may be room for
improvement within ES or on our side.

We've hit some recurring issues in non-production environments that are
really making us think long and hard about how we are backing up and
monitoring our system. When we hit these cases, it seems that all bets
are off when it comes to gateway data integrity. I am not suggesting
that ES needs to gracefully handle all these cases, but these are
things others will run into, as well.

Things that should be easily preventable:

  • Max file handles
    This shouldn't happen on a system that is properly configured (most
    distros do need to bump up the limit from the 1024 default). We
    haven't had a problem with this since we've got it set appropriately,
    but we have corrupted our gateway and local nodes because of it in
    the past. I'd think it'd be pretty easy for ES to check the ulimit on
    startup and either loudly log that the value is too low or refuse to
    start up. This would save some people some pain. I believe I read in
    another thread that there is a Lucene fix to prevent index corruption
    when this occurs, but it isn't in a 3.x release yet.

Hopefully a Lucene patch release will happen soon to help with this. I pinged
the Lucene mailing list and it seems like it's being worked on. The other
cause of corruption is addressed by the fix I pushed to master (Clinton
pointed to it). I would love to be able to log a warning if the limit is too
low; that probably requires executing a script (or maybe doing it in the
elasticsearch startup script). Ideas are welcome, as it's tricky to get at...

  • Gateway disk space fills up
    We hit this when we'd made a backup copy of the gateway on the same
    drive, forgot about it, and loaded a lot more content. We've added
    monitoring to our systems to alarm when the gateway's free space
    drops below 10%.

If the gateway storage fills up, the snapshot operation will fail and the
failure will be logged. Once space is cleared, the next snapshot will succeed
and catch up with all the changes.

Not as easily preventable:

  • Network saturation
    Running a couple of nodes on the same ESX host and another on a
    different one, we ended up completely saturating the single GigE
    connection, which caused a network partition and split brain. This
    also seemed to lead to gateway corruption. Both halves of the
    cluster were receiving documents to index and were both working off
    the same gateway. I believe that this led to two masters for a
    single shard, which then did not play nicely with the gateway and
    ended up corrupting it. We've ended up isolating internal ES traffic
    to its own NIC and all external traffic to the cluster for indexing
    and searching to a separate interface.

However, are there any safeguards in place to prevent two nodes from
working on the same gateway data? I believe that setting the minimum
number of nodes may help with this. So in a 3-node cluster, if you
always required 2 nodes, this would guarantee that you never end up
with two split clusters.

We do have code in place that attempts to detect the split cluster
condition by iterating over the nodes in the cluster, using the
transport client to get each node's view of the node count, and
comparing that to the node count reported by all the other servers.
Not sure if there is a better way to do this.

I am currently working on split brains and better handling of them. There
are many different aspects to split brain; one of them is having two nodes
writing to the same gateway. The local gateway does not suffer from this
problem, by the way. On the file system gateway, I can add a native lock so
two processes will not be able to write to the same one.
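
The concept is simply an exclusive, OS-level lock on a file inside the
gateway location, so that a second writer fails fast. A Python sketch
of the idea (the lock file name and path are illustrative only; the
real implementation would live inside ES):

    import fcntl
    import os

    GATEWAY_DIR = "/mnt/es-gateway"  # illustrative gateway location

    def acquire_gateway_lock(gateway_dir=GATEWAY_DIR):
        """Take an exclusive, non-blocking lock on a file inside the gateway."""
        lock_path = os.path.join(gateway_dir, "write.lock")
        fd = os.open(lock_path, os.O_CREAT | os.O_RDWR)
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            os.close(fd)
            raise RuntimeError("another process already holds the gateway lock")
        return fd  # keep this descriptor open for as long as the gateway is owned

    if __name__ == "__main__":
        acquire_gateway_lock()
        print("gateway lock acquired")

How well a lock like this behaves on a shared NFS mount is its own
question, of course.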

Regarding the split brain aspect, I am slowly adding features to better
handle it. Some are already in, others are pending :). Of course, when it
comes down to it, it's all about providing the right dials and being able
to tune them.

For example, you can now start master nodes that hold no data (set
node.master to true, and node.data to false). This means that those nodes
are more lightweight, and less affected by "data" related actions (like
query / indexing overload, which can, for example, cause heavy GC).

Another is the ability to control the write consistency level when indexing
(in master), requiring that a quorum of shards be available for an index /
delete operation to succeed.
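
To make "quorum" concrete: it means a strict majority of the copies of
a shard (the primary plus its replicas). Roughly:

    def quorum(total_copies):
        """Strict majority of the shard copies (primary + replicas)."""
        return total_copies // 2 + 1

    # With 1 primary and 2 replicas per shard, a write needs 2 of the 3
    # copies to be available before it is accepted.
    for replicas in (1, 2, 3):
        copies = 1 + replicas
        print("replicas=%d -> need %d of %d copies" % (replicas, quorum(copies), copies))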

There is still work left. Some aspects that I want to add are:

  1. The ability to define "minimum nodes", "minimum data nodes", and "minimum
    master nodes" in the cluster. If a node does not see the minimum set, it
    will go into a disconnected mode and try to rejoin the cluster until the
    minimum set is found. This allows creating different topologies, playing
    with different numbers of master nodes, data nodes, or master+data nodes.
    For example, for a 100-node cluster, it might make sense to have 5 master
    nodes, with minimum master nodes set to 3. Those dials can be set
    differently for different cluster sizes.

  2. Automatically detect split brain, and resolve it, with possible loss of
    docs.

Gateway Safeguards:
In order to make sure that we can quickly recover from any of these
scenarios, we are looking at having redundancy in the gateway, with a
main gateway that rsyncs to a backup gateway at certain intervals,
probably daily. This would hopefully allow us to switch to the backup
gateway and re-run any docs since the last rsync, bringing recovery
time down to around an hour vs. a day to rebuild ~100GB of content (ES
is not the bottleneck in the content rebuild; our backend storage is).

I read in a previous thread that you can take a hot copy of the
gateway and it will always be valid due to the usage of commit points
and append-only operations. Is this always 100% guaranteed, even if it
takes a long time? For instance, what if a merge occurs while the copy
is running, or a new commit point is written while the copy is
running?

Yes, you can safely do that. Or, you can move to using the local gateway
with a replica factor of 2.

Are there any other recommended safeguards to help ensure gateway
integrity?

I can add that lock aspect for the file system gateway; this should, at the
very least, help prevent two shards from writing to the same gateway.

Thanks!
Paul

Thanks for the detailed replies. A couple of comments in-line.

Thanks!

On Nov 5, 2:25 pm, Shay Banon shay.ba...@elasticsearch.com wrote:

Hi Paul, good questions, answers below:

On Fri, Nov 5, 2010 at 7:57 PM, Paul ppea...@gmail.com wrote:

Hey,
Things are going great with ES for us, but I wanted to bring up a
few dark spots we've encountered where there may be room for
improvement within ES or on our side.

We've hit some recurring issues in non-production environments that are
really making us think long and hard about how we are backing up and
monitoring our system. When we hit these cases, it seems that all bets
are off when it comes to gateway data integrity. I am not suggesting
that ES needs to gracefully handle all these cases, but these are
things others will run into, as well.

Things that should be easily preventable:

  • Max file handles
    This shouldn't happen on a system that is properly configured (most
    distros do need to bump up the limit from the 1024 default). We
    haven't had a problem with this since we've got it set appropriately,
    but we have corrupted our gateway and local nodes because of it in
    the past. I'd think it'd be pretty easy for ES to check the ulimit on
    startup and either loudly log that the value is too low or refuse to
    start up. This would save some people some pain. I believe I read in
    another thread that there is a Lucene fix to prevent index corruption
    when this occurs, but it isn't in a 3.x release yet.

Hopefully a Lucene patch release will happen soon to help with this. I pinged
the Lucene mailing list and it seems like it's being worked on. The other
cause of corruption is addressed by the fix I pushed to master (Clinton
pointed to it). I would love to be able to log a warning if the limit is too
low; that probably requires executing a script (or maybe doing it in the
elasticsearch startup script). Ideas are welcome, as it's tricky to get at...

Good to hear about the patch. I think it'd be easy to add the ulimit check
to the wrapper startup script. We're running as root, so we actually just
set it explicitly there. Maybe it makes sense to just set a good default
there? Or you can take the approach I used in a verification script for our
servers... just keep opening file handles until failure and then you've got
your number. I'm sure many would frown on this, though :)
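
Roughly what that verification hack boils down to (the count is
approximate, since the interpreter already has a few descriptors open
before it starts; don't run it with an unlimited ulimit):

    import tempfile

    def measure_open_file_limit():
        """Open throwaway temp files until the OS refuses, then report the count."""
        handles = []
        try:
            while True:
                handles.append(tempfile.TemporaryFile())
        except OSError:
            count = len(handles)
        finally:
            for h in handles:
                h.close()
        return count

    if __name__ == "__main__":
        print("opened %d files before hitting the limit" % measure_open_file_limit())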

  • Gateway disk space fills up
    We hit this when we'd made a backup copy of the gateway on the same
    drive, forgot about it, and loaded a lot more content. We've added
    monitoring to our systems to alarm when the gateway's free space
    drops below 10%.

If the gateway storage fills up, the snapshot operation will fail and the
failure will be logged. Once space is cleared, the next snapshot will succeed
and catch up with all the changes.

Interesting, so most likely the gateway filling triggered something
else to go bad. When this was hit, it appeared we instantly went split
cluster (not sure why), so that is probably the culprit.

Not as easily preventable:

  • Network saturation
    Running a couple of nodes on the same ESX host and another on a
    different one, we ended up completely saturating the single GigE
    connection, which caused a network partition and split brain. This
    also seemed to lead to gateway corruption. Both halves of the
    cluster were receiving documents to index and were both working off
    the same gateway. I believe that this led to two masters for a
    single shard, which then did not play nicely with the gateway and
    ended up corrupting it. We've ended up isolating internal ES traffic
    to its own NIC and all external traffic to the cluster for indexing
    and searching to a separate interface.

However, are there any safeguards in place to prevent two nodes from
working on the same gateway data? I believe that setting the minimum
number of nodes may help with this. So in a 3-node cluster, if you
always required 2 nodes, this would guarantee that you never end up
with two split clusters.

We do have code in place that attempts to detect the split cluster
condition by iterating over the nodes in the cluster, using the
transport client to get each node's view of the node count, and
comparing that to the node count reported by all the other servers.
Not sure if there is a better way to do this.

I am currently working on split brains and better handling of them. There
are many different aspects to split brain; one of them is having two nodes
writing to the same gateway. The local gateway does not suffer from this
problem, by the way. On the file system gateway, I can add a native lock so
two processes will not be able to write to the same one.

I think that makes a lot of sense. I opened a request:

Regarding the split brain aspect, I am slowly adding features to better
handle it. Some are already in, others are pending :). Of course, when it
comes down to it, it's all about providing the right dials and being able
to tune them.

For example, you can now start master nodes that hold no data (set
node.master to true, and node.data to false). This means that those nodes
are more lightweight, and less affected by "data" related actions (like
query / indexing overload, which can, for example, cause heavy GC).

Another is the ability to control the write consistency level when indexing
(in master), requiring that a quorum of shards be available for an index /
delete operation to succeed.

There is still work left. Some aspects that I want to add are:

  1. The ability to define "minimum nodes", "minimum data nodes", and "minimum
    master nodes" in the cluster. If a node does not see the minimum set, it
    will go into a disconnected mode and try to rejoin the cluster until the
    minimum set is found. This allows creating different topologies, playing
    with different numbers of master nodes, data nodes, or master+data nodes.
    For example, for a 100-node cluster, it might make sense to have 5 master
    nodes, with minimum master nodes set to 3. Those dials can be set
    differently for different cluster sizes.

  2. Automatically detect split brain, and resolve it, with possible loss of
    docs.

Those all sound like great features. Looking forward to them.

Gateway Safeguards:
In order to make sure that we can quickly recover from any of these
scenarios, we are looking at having redundancy in the gateway, with a
main gateway that rsyncs to a backup gateway at certain intervals,
probably daily. This would hopefully allow us to switch to the backup
gateway and re-run any docs since the last rsync, bringing recovery
time down to around an hour vs. a day to rebuild ~100GB of content (ES
is not the bottleneck in the content rebuild; our backend storage is).

I read in a previous thread that you can take a hot copy of the
gateway and it will always be valid due to the usage of commit points
and append-only operations. Is this always 100% guaranteed, even if it
takes a long time? For instance, what if a merge occurs while the copy
is running, or a new commit point is written while the copy is
running?

Yes, you can safely do that. Or, you can move to using the local gateway
with a replica factor of 2.

Cool. My concern around the local gateway is that there is not a good
way to back up the cluster state as a whole to allow a clean restore
if something goes horribly wrong. We could probably back up each node
at an interval, though.

Are there any other recommended safeguards to help ensure gateway
integrity?

I can add that lock aspect for the file system gateway; this should, at the
very least, help prevent two shards from writing to the same gateway.

Thanks!
Paul