Data Loss

I'm sure it isn't the case for everyone who is having data/shard problems,
but I had some real trouble doing a full cluster restart on an 18-node
cluster. It was kind of a nightmare, actually: shards failing all over the
place, data lost because of lost shards, and so on.
I finally realized that the gateway.recover_after_nodes,
gateway.expected_nodes, and gateway.recover_after_time config properties
were critical to avoiding that situation. Before those gateway settings
were in place, it would take literally hours and a lot of work to get
everything back to green, and we dreaded a full cluster restart.
With them in place, a full cluster restart, from restarting the service on
every system to full green, takes anywhere from 2-10 minutes total. The
root cause in my case was the first few nodes coming up, seeing a severely
degraded cluster state, and trying to "fix" everything, which turned into
chaos as more nodes came up.
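For reference, the relevant lines in elasticsearch.yml look roughly like
this (as I understand these settings; the numbers are just an illustration
for an 18-node cluster, not a recommendation):

    gateway.recover_after_nodes: 16
    gateway.expected_nodes: 18
    gateway.recover_after_time: 5m

With something like that, recovery starts as soon as all 18 expected nodes
have joined, and otherwise waits until at least 16 are present (plus the
5 minute grace period) before reallocating anything, instead of the first
nodes up trying to rebuild the whole cluster on their own.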
Hopefully this is helpful to someone!

-Josh

On Wednesday, February 12, 2014 11:49:29 AM UTC-8, Tony Su wrote:

IMO, evaluating this issue starts with the CAP theorem, which in summary
states that a networked cluster with multiple nodes can offer only two of
the following three desirable properties:

Consistency
Availability
Partition tolerance (continuing to operate when nodes are cut off from each other)

ES clearly provides the last two, so in theory it cannot guarantee the first.
Of course, a "guarantee" is not the same as "best effort," which, as
expected, is what ES delivers. And this theorem applies to any multi-node
cluster technology, of which ES is one.

Tony

On Wednesday, February 12, 2014 8:09:58 AM UTC-8, Brad Lhotsky wrote:

Appreciated, but keep in mind that large installations can’t just constantly
upgrade. And if ES is being used in critical infrastructure, upgrading may
mean many hours of recertification work with auditors and assessors. The
project is still relatively young, but "just upgrade" isn’t always
practical. On my logging cluster it takes over 2 hours to get back to green
when a single node restarts. I have 15 nodes now, which means a safe
rolling upgrade may take literally a working week, and that assumes I can
run nodes with different versions in the same cluster. Otherwise I have to
lose data while I restart the whole cluster, and a full cluster restart
also takes ~4 hours.

--
Brad Lhotsky

On 12 Feb 2014 at 16:07:20, Binh Ly (bi...@hibalo.com) wrote:

FYI, ES has very frequent releases to fix bugs discovered by the
community. If you find a data loss problem in your current install (and
assuming it is indeed an ES problem), please try the latest build and see
if it fixes it. Chances are it has already been discovered and fixed in the
latest release.

