Things are going great with ES for us, but I wanted to bring up a
few dark spots we've encountered where there may be room for
improvement within ES or on our side.
We've hit some recurring issues in non-production environments that are
really making us think long and hard about how we are backing up and
monitoring our system. When we hit these cases, it seems that all bets
are off when it comes to gateway data integrity. I am not suggesting
that ES needs to gracefully handle all these cases, but these are
things others will run into, as well.
Things that should be easily preventable:
Max file handles
This shouldn't happen on a system that is properly configured (most
distros do need to bump the limit up from the default of 1024). We
haven't had a problem with this since we set it appropriately, but we
did corrupt our gateway and local nodes because of it. I'd think it'd
be pretty easy for ES to check the ulimit on startup and either loudly
log that the value is too low or refuse to start. That would save some
people some pain. I believe I read in another thread that there is a
Lucene fix to prevent index corruption when this occurs, but it isn't
in a 3.x release yet.
Gateway disk space fills up
We hit this when we'd made a backup copy of the gateway on the same
drive, forgot about it, and loaded a lot more content. We've added
monitoring to our systems that starts alarming when the gateway free
space drops below a threshold.
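In case it's useful to anyone, the check is a one-liner in Java 6+;
something along these lines, where the path and the 20GB threshold are
made up and should be sized to your own indexing rate:

    import java.io.File;

    public class GatewayDiskCheck {
        public static void main(String[] args) {
            // Path to the shared gateway mount; hypothetical here.
            File gateway = new File("/mnt/es-gateway");
            long freeGb = gateway.getUsableSpace() / (1024L * 1024L * 1024L);
            if (freeGb < 20) {
                System.err.println("ALARM: gateway free space down to " + freeGb + " GB");
            }
        }
    }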
Not as easily preventable:
Network saturation
Running a couple of nodes on the same ESX host and another on a
different one, we ended up completely saturating the single GigE
connection, which caused a network partition and split brain. This
also seemed to lead to gateway corruption. Both halves of the cluster
were receiving documents to index and were both working off the same
gateway. I believe this led to two masters for a single shard, which
then did not play nicely with the gateway and ended up corrupting it.
We've since isolated internal ES traffic on its own NIC and moved all
external indexing and search traffic to the cluster onto a separate
interface.
However, are there any safeguards in place to prevent two nodes from
working on the same gateway data? I believe that setting the minimum
number of nodes may help with this: in a 3-node cluster, if you always
required 2 nodes, you would be guaranteed never to get two split
clusters.
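If my reasoning is right, the guarantee is simple quorum arithmetic:
require floor(N/2) + 1 nodes before a side may operate. With N = 3
that minimum is 2, and since two disjoint partitions would need
2 + 2 = 4 nodes out of 3, at most one side can ever satisfy it.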
We do have code in place that attempts to detect the split-cluster
condition by iterating over the nodes in the cluster, using the
transport client to get each node's view of the node count, and
comparing that count across all servers. Not sure if there is a better
way to do this.
In order to make sure that we can quickly recover from any of these
scenarios, we are looking at adding redundancy in the gateway: a main
gateway that rsyncs to a backup gateway at certain intervals, probably
daily. This would hopefully allow us to switch to the backup gateway
and re-run any docs indexed since the last rsync, bringing recovery
time down to around an hour vs. a day to rebuild ~100GB of content (ES
is not the bottleneck in the content rebuild; our backend storage is).
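Concretely, the nightly sync we have in mind is just an rsync of the
gateway directory. A minimal sketch, assuming standard rsync and with
hypothetical paths (we'd trigger this from cron or a scheduler):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    public class GatewayBackup {
        public static void main(String[] args) throws Exception {
            // -a preserves permissions and timestamps; --delete keeps the
            // backup an exact mirror of the main gateway.
            ProcessBuilder pb = new ProcessBuilder(
                    "rsync", "-a", "--delete",
                    "/mnt/es-gateway/", "/mnt/es-gateway-backup/");
            pb.redirectErrorStream(true);
            Process p = pb.start();
            BufferedReader out = new BufferedReader(
                    new InputStreamReader(p.getInputStream()));
            for (String line; (line = out.readLine()) != null; ) {
                System.out.println(line);
            }
            if (p.waitFor() != 0) {
                throw new RuntimeException("gateway rsync failed");
            }
        }
    }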
I read in a previous thread that you can take a hot copy of the
gateway and it will always be valid due to the use of commit points
and append-only operations. Is this always 100% guaranteed, even if
the copy takes a long time? For instance, what if a merge occurs or a
new commit point is written while the copy is in progress?
Are there any other recommended safeguards to help ensure gateway
integrity?