Data Loss


(Mohit Anchlia) #1

I've read some blogs and mailing-list threads where users have reported data
loss; in some cases they were able to recover by re-indexing from the source.
I'm wondering: what are the common ways this can happen due to an ES software
issue, assuming there are 2+ replicas and multiple nodes available?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAOT3TWr%2BgDoo_gsUbDe59-%3DpxirRpnvYgQCeD4t_9Fqqg9tidQ%40mail.gmail.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Mark Walkom) #2

Split brain would be one of the main ones I can think of.

I also know some people have had issues with primary shards not
initialising, though I am not sure what would cause that.
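(For reference, the usual guard against split brain in ES of this era is the zen discovery quorum setting. The snippet below is an illustrative sketch for a cluster with 3 master-eligible nodes, not a configuration anyone in this thread posted.)

```yaml
# elasticsearch.yml -- illustrative values, assuming 3 master-eligible nodes.
# Require a quorum of master-eligible nodes before a master can be elected,
# so a partitioned minority cannot form its own cluster (split brain).
discovery.zen.minimum_master_nodes: 2   # floor(3 / 2) + 1
```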

Regards,
Mark Walkom

Infrastructure Engineer
Campaign Monitor
email: markw@campaignmonitor.com
web: www.campaignmonitor.com



(Brad Lhotsky-2) #3

There’s also an edge case with shard allocation during cluster restarts that can result in data loss if a shard is being re-allocated.

I saw this behaviour in 0.90.1; I recently upgraded to 0.90.10 and haven’t hit a failure like this yet. My use case is Logstash-style daily indices for logging data:

Start:

Node 1 [0] [1]
Node 2 [1] [2]
Node 3 [2] [0]

Node 3 goes away, with allocation enabled; parentheses () mark initializing shards:

Node 1 [0] [1] (2)
Node 2 [1] [2] (0)

Now Node 3 comes back; since those shards are being reallocated, it drops its copies.

Node 1 [0] [1] (2)
Node 2 [1] [2] (0)
Node 3 —

Node 2 goes away after that, allocation still enabled:

Node 1 [0] [1] (2)
Node 3 (0) (1) (2)

There is now no full copy of shard 2, just two initialising copies which are incomplete and corrupt.

When Node 2 comes back into the cluster, it will discard its complete, uncorrupted copy of shard 2, and the entire index is lost.

I personally do not consider Elasticsearch a primary data store, so given the power it provides this isn’t a deal-breaker for me. I never expect Elasticsearch to be a fully reliable data store, and I think that’s OK. It has no built-in ACLs, which means that even if this and the rest of the allocation-related edge cases were solved, anyone with access to port 9200 can CRUD your data.
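(A common mitigation for this class of restart problem, sketched here for the 0.90.x line as an assumption rather than something Brad describes doing, is to switch allocation off before taking nodes down, so the cluster never starts rebuilding shards from incomplete copies.)

```shell
# Sketch for ES 0.90.x, assuming a node reachable at localhost:9200.
# Disable shard allocation before a planned restart:
curl -XPUT localhost:9200/_cluster/settings -d '{
  "transient": { "cluster.routing.allocation.disable_allocation": true }
}'
# ... restart nodes and wait for all of them to rejoin ...
# Then re-enable allocation:
curl -XPUT localhost:9200/_cluster/settings -d '{
  "transient": { "cluster.routing.allocation.disable_allocation": false }
}'
```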

--
Brad Lhotsky



(Binh Ly) #4

FYI, ES has very frequent releases to fix bugs discovered by the community.
If you find a data loss problem in your current install (and assuming it is
indeed an ES problem), please try the latest build and see if it fixes it.
Chances are it has already been discovered and fixed in the latest release.



(Brad Lhotsky-2) #5

Appreciated, but keep in mind that large installations can’t just constantly upgrade, and if ES is used in critical infrastructure, upgrading may mean many hours of recertification work with auditors and assessors. The project is still relatively young, but "just upgrade" isn’t always feasible. On my logging cluster it takes over 2 hours to get back to green when a single node restarts. I have 15 nodes now, which means a safe rolling upgrade could literally take a working week, and that assumes I can run nodes with different versions in the same cluster. Otherwise I have to accept losing data while I restart the whole cluster, which itself takes roughly 4 hours.

--
Brad Lhotsky


(Tony Su) #6

IMO, evaluating this issue starts with the CAP theorem, which in summary
states that a networked cluster with multiple nodes can offer only 2 of the
following 3 desirable properties:

Consistency
Availability
Partition tolerance (data distributed across nodes)

ES clearly provides the last two, so in theory it cannot guarantee the first.
Of course, a "guarantee" is not the same as "best effort," which ES does
deliver, as expected. And this theorem applies to any multi-node cluster
technology, of which ES is one.

Tony



(Josh Harrison) #7

I'm sure it isn't the case for everyone who is having data/shard problems,
but I had some real trouble doing a full cluster restart on an 18-node
cluster. Kinda nightmarish, actually: shards failing all over the place,
data lost because of lost shards, etc.
I finally realized that the gateway.recover_after_nodes,
gateway.expected_nodes and gateway.recover_after_time config properties
were critical to avoiding my situation. Before that gateway configuration
was in place, it would take literally hours and a lot of work to get
everything back to green; we dreaded a full cluster restart.
With the gateway configuration, a full cluster restart, from service
restart on all systems to full green, takes anywhere from 2-10 minutes
total. The root cause in my situation was a few nodes coming up, seeing a
severely degraded cluster state, and trying to "fix" everything, resulting
in chaos as more nodes came up.
Hopefully this is helpful to someone!
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-gateway.html
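(A minimal sketch of the gateway settings described above; the node counts and timeout below are illustrative assumptions for a hypothetical 18-node cluster, not Josh's actual values.)

```yaml
# elasticsearch.yml -- illustrative values for an 18-node cluster.
gateway.recover_after_nodes: 15   # don't start recovery until most nodes are present
gateway.expected_nodes: 18        # recover immediately once every node has joined
gateway.recover_after_time: 5m    # otherwise wait this long before recovering
```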
-Josh



(Mohit Anchlia) #8

Thanks for sharing this info; it's really helpful. In any case, data loss
shouldn't be acceptable to anyone, especially index corruption with no way
to recover at all. I also think one shouldn't confuse consistency with data
loss, as suggested in this thread. It's good to hear that most of the bugs
around this issue are being addressed. Please let us know your findings with
the new ES release.

I'll look at the gateway settings too. Do you also leave shard allocation
disabled?



(Tony Su) #9

Josh,

Your experience of recovering in only about 10 minutes is very interesting,
because my little cluster (5 nodes, 15 GB of data, 3500 indices) takes about
an hour to recover, and I know the bottleneck is the disk subsystem I'm
currently on.

I am curious:

  • What is the total size of the data in your cluster?
  • How many indices?
  • Are the shard numbers fairly typical (5 shards per index, 1 replica per shard)?
  • Are you storing your data on a SAN, a SCSI array, or something else, and are the disks SSDs?

Thx,
Tony



(Jörg Prante) #10

I use replica level 1 and always run the latest ES version. I have never had
data loss, which is also due to the fact that I have access to dedicated
physical servers in our DC just a few meters away; there are no servers in
cloud server farms with unknown and unstable network environments.

I always update when the ES release notes say the update is recommended.
This must be taken seriously! Lucene has had some nasty wipe-everything
bugs, for example, and no software is free of bugs.

Updates within minor versions can take place in minutes by taking the whole
cluster down and starting it again, but this means downtime if that cluster
is the only one. I have tuned the nodes to recover fast; for my
requirements, a downtime of 15 minutes is acceptable.

For production, you always need a staging environment. Test your updates
before applying them to production systems, with the same data and the same
configuration.

Do not let ES take care of your data alone. Make backups so you can fall
back to a configuration that is known to work, or make sure you can reindex
instantly from the data source.

I also have a round-robin technique that works without downtime: any
operator can pull a power cable from a single server in the DC and ES will
keep on running, so I can update or service the hardware (e.g. adding RAM).

For updating the OS and the JVM the situation is a bit more challenging
because a rolling update is not possible, but I can set up two clusters for
this.

Jörg



(Jörg Prante) #11

You should check index.shard.recovery.concurrent_streams, which defaults
to 5 per node.

If you have 3500 indices but only that small a data volume, you can turn
this value up very high and your recovery will be lightning fast. But the
default value was chosen carefully so that a heavy query load always takes
precedence over background recovery and queries do not time out too easily;
user comfort and admin patience must be kept in balance.
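(As a sketch, the setting above goes in elasticsearch.yml; the value 20 below is just an arbitrary example of "turning it up very high", not a recommendation from this thread.)

```yaml
# elasticsearch.yml -- default is 5 per node. Raising it speeds up recovery
# of many small shards, at the cost of more background I/O competing with
# live queries during recovery.
index.shard.recovery.concurrent_streams: 20
```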

Jörg



(Josh Harrison) #12

This particular cluster is 16 data nodes with SSD RAIDs, connected to each
other and to the two master nodes with InfiniBand.
Under 100 indexes, and usually 3 shards per index with 1 replica. Overall
data volume is in the 1 TB range.
I haven't tweaked the shard allocation settings from the defaults.
-Josh



(Tony Su) #13

Jörg,
Thx for the suggestion.

My data is nothing uncommon, just Apache logs, and some documentation advises creating a separate index for each day's logs.

Tony



(Mohit Anchlia) #14

On Wed, Feb 12, 2014 at 1:58 PM, joergprante@gmail.com <
joergprante@gmail.com> wrote:

I use replica shard level 1, and always use latest ES version. I never had
data loss, and that is also due to the fact I have access to dedicated real
servers in our DC just a few meters away, and there are no servers at cloud
server farms with unknown and unstable network environment.

Do you have special settings for the "disable shard allocation" and
"gateway" modules? Would you be able to share them? There are a large number
of settings, and I am trying to understand which ones make the most sense.

How are you taking backups of your data today? I believe a new incremental
backup feature is slated for 1.0.

Any other best practices you can share would also be helpful.



(Ivan Brusic) #15

On Wed, Feb 12, 2014 at 1:58 PM, joergprante@gmail.com <
joergprante@gmail.com> wrote:

For my requirements, downtime of 15 min is acceptable.

I can only wish! I run an ecommerce site, so my requirement is no downtime.
Ever.

--
Ivan



(Tony Su) #16

Hi Mo,
I've been experimenting with those settings on 1.0.
The "disable shard allocation" setting doesn't seem to do anything for me;
shards still re-allocate. That might explain why I didn't see a
pre-configured setting in elasticsearch.yml, so I created my own entry.

But I did find the "gateway" settings effective.
After some experimentation, I decided to back gateway.recover_after_nodes
off from 5 to 4 on my 5-node cluster. Although requiring all nodes to be
present worked the first few times, if one node failed to join I didn't want
the cluster to just sit there waiting for it, or for me to intervene
manually, since there is no visual indication of the progress of nodes
joining or of any issue the cluster can't resolve on its own (except
indirectly via es-hq, and only if you happen to be pointing at a node that
has successfully joined). Since it seems shard re-allocation cannot be
avoided entirely, I felt the time saved by allowing the cluster to start
without all of its nodes outweighed trying to ensure that all primary copies
of the shards are available.
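(One possible explanation for the yml setting appearing to do nothing: on the 1.x line, allocation is usually disabled as a transient cluster setting over the API rather than in elasticsearch.yml. A hedged sketch, assuming a node at localhost:9200:)

```shell
# Sketch for ES 1.x: disable allocation dynamically via the cluster
# settings API. "none" disables all shard allocation; "all" restores
# the default behaviour.
curl -XPUT localhost:9200/_cluster/settings -d '{
  "transient": { "cluster.routing.allocation.enable": "none" }
}'
```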

Tony

On Wednesday, February 12, 2014 2:27:21 PM UTC-8, Mo wrote:

On Wed, Feb 12, 2014 at 1:58 PM, joerg...@gmail.com <javascript:> <
joerg...@gmail.com <javascript:>> wrote:

I use replica shard level 1, and always use latest ES version. I never
had data loss, and that is also due to the fact I have access to dedicated
real servers in our DC just a few meters away, and there are no servers at
cloud server farms with unknown and unstable network environment.

Do you have special settings for "disable shard allocation" and "gateway"
modules? Would you be able to share those settings? There are numerous
number of settings and I am trying to understand which ones make most sense?

How are you taking backups of data today? I believe new incremental backup
feature is slated in 1.0.

Are there other best practices you can share would be helpful.

I always update when the ES release notes say the update is recommended.
This must be taken seriously! Lucene has had some nasty wipe-all bugs, for
example, and no software is free of bugs.

Updates within minor versions can take place in minutes by taking the
whole cluster down and starting it again - but this means downtime, if
that cluster is the only cluster. I have tuned the nodes to recover fast.
For my requirements, a downtime of 15 minutes is acceptable.

For production, you always need a staging environment. Test your updates
before applying to production systems, with the same data and the same
configuration.

Do not let ES take care of your data alone. Make backups so you can fall
back to a configuration that is known to work. Or make sure you can reindex
instantly from the data source.

I also have a round-robin technique that works without downtime. Any
operator can pull a power cable from a single server in the DC and ES will
keep on running, so I can update or service the hardware (e.g. adding RAM).

For updating the OS and the JVM, the situation is a bit more challenging
because a rolling update is not possible, but I can set up two clusters
for this.

Jörg

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearc...@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/CAKdsXoEQLWzi9yYQUhb8s2jOpswWAUDHH0DxZz9e19mfBfL9aw%40mail.gmail.com
.

For more options, visit https://groups.google.com/groups/opt_out.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/e5a631c1-f776-48e7-b7f3-631ab02b7cf4%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Mark Walkom) #17

There are changes with 1.0.0 -
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-cluster.html
cluster.routing.allocation.disable_allocation has been deprecated in
favour of cluster.routing.allocation.enable
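
The replacement setting is applied through the cluster settings API rather
than elasticsearch.yml. A minimal sketch against a local node ("none"
halts shard allocation, "all" restores the default behaviour):

```shell
# Disable shard allocation cluster-wide, e.g. before a rolling restart.
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.enable": "none" }
}'
```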

Regards,
Mark Walkom

Infrastructure Engineer
Campaign Monitor
email: markw@campaignmonitor.com
web: www.campaignmonitor.com

On 14 February 2014 03:54, Tony Su tonysu999@gmail.com wrote:

Hi Mo,
I've been experimenting with those settings on 1.0.
The "disable shard allocation" setting doesn't seem to do anything for me;
shards still re-allocate. That might explain why I didn't see a
pre-configured setting in elasticsearch.yml, so I created my own setting.

But I did find the "gateway" settings effective.
After some experimentation, I decided to back off
"gateway.recover_after_nodes" from 5 to 4 on my 5-node cluster. Although
requiring all nodes to be functional worked the first few times, if one
node didn't join I didn't want my cluster to just sit there waiting for the
node to eventually join, or for me to intervene manually, since there is no
visual indication of the progress of cluster nodes joining and, if there is
an issue, the cluster can't resolve it on its own (except indirectly via
es-hq, and that only works if you happen to be pointing at a node that has
successfully joined). Since it seems that shard re-allocation cannot be
avoided entirely, I felt the time saved by allowing the cluster to start
without all its nodes outweighed trying to ensure that all primary copies
of the shards are available.

Tony

On Wednesday, February 12, 2014 2:27:21 PM UTC-8, Mo wrote:

On Wed, Feb 12, 2014 at 1:58 PM, joerg...@gmail.com wrote:

I use replica level 1, and always use the latest ES version. I have never
had data loss, and that is also due to the fact that I have access to
dedicated real servers in our DC just a few meters away, rather than
servers at cloud server farms with unknown and unstable network
environments.

Do you have special settings for the "disable shard allocation" and
"gateway" modules? Would you be able to share those settings? There are
numerous settings and I am trying to understand which ones make the most
sense.

How are you taking backups of data today? I believe the new incremental
backup feature is slated for 1.0.

Any other best practices you can share would also be helpful.

I always update when the ES release notes say the update is recommended.
This must be taken seriously! Lucene has had some nasty wipe-all bugs, for
example, and no software is free of bugs.

Updates within minor versions can take place in minutes by taking the
whole cluster down and starting it again - but this means downtime, if
that cluster is the only cluster. I have tuned the nodes to recover fast.
For my requirements, a downtime of 15 minutes is acceptable.

For production, you always need a staging environment. Test your updates
before applying to production systems, with the same data and the same
configuration.

Do not let ES take care of your data alone. Make backups so you can fall
back to a configuration that is known to work. Or make sure you can reindex
instantly from the data source.

I also have a round-robin technique that works without downtime. Any
operator can pull a power cable from a single server in the DC and ES will
keep on running, so I can update or service the hardware (e.g. adding RAM).

For updating the OS and the JVM, the situation is a bit more challenging
because a rolling update is not possible, but I can set up two clusters
for this.

Jörg

--
You received this message because you are subscribed to the Google
Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send
an email to elasticsearc...@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/CAKdsXoEQLWzi9yYQUhb8s2jOpswWAUDHH0DxZz9e19mfBfL9aw%40mail.gmail.com.

For more options, visit https://groups.google.com/groups/opt_out.

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/e5a631c1-f776-48e7-b7f3-631ab02b7cf4%40googlegroups.com
.

For more options, visit https://groups.google.com/groups/opt_out.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAEM624bg30MvKHUuyY%3DR4vpiwyDgED%3DDuhjdmj16_fCc5FkWtA%40mail.gmail.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Jörg Prante) #18

I keep my shards small (1-5 GB) and I do not disable shard allocation;
recovery is fast enough (a few minutes).

Gateway is local, the default.

I keep backups on the file system, with plain cp -a or rsync, but also on
the data source side - I can reindex everything in a few hours, as I have
only ~100 million docs to index.

Because I don't operate a shared file system, I cannot use ES 1.0
snapshot/restore yet.
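
A file-system backup of the kind described can be as simple as the
following sketch (the paths are illustrative; flushing first brings the
on-disk state up to date before copying):

```shell
# Flush so in-memory segments are committed to disk.
curl -XPOST 'http://localhost:9200/_flush'
# Incrementally copy the data directory to a backup host.
rsync -a /var/lib/elasticsearch/ backup-host:/backups/elasticsearch/
```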

Jörg

On Wed, Feb 12, 2014 at 11:27 PM, Mohit Anchlia mohitanchlia@gmail.com wrote:

Do you have special settings for the "disable shard allocation" and
"gateway" modules? Would you be able to share those settings? There are
numerous settings and I am trying to understand which ones make the most
sense.

How are you taking backups of data today? I believe the new incremental
backup feature is slated for 1.0.

Any other best practices you can share would also be helpful.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAKdsXoEE-XtuDbDnv8CLrJ_tDxDLB7spZ-_cfcpjaAa4_qdB4Q%40mail.gmail.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Jörg Prante) #19

Personally this is what I am after, because I know how frustrating it is
for the user. For library catalog search, we do not lose (too much) money
if one of our systems goes down for a few minutes, but users lose time and
get annoyed. It is hard to convince the supervisors to set up 24/7
operation mode and SLAs - this is more expensive than occasional short
system outages. And even with 24/7, the search engine is only one part of
a whole complex network.

Jörg

On Thu, Feb 13, 2014 at 5:56 AM, Ivan Brusic ivan@brusic.com wrote:

On Wed, Feb 12, 2014 at 1:58 PM, joergprante@gmail.com <
joergprante@gmail.com> wrote:

For my requirements, downtime of 15 min is acceptable.

I can only wish! I run an ecommerce site, so my requirement is no
downtime. Ever.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAKdsXoF-hF%3D6q%3DcjihZ6qksCV6sEmQ8nTONkQhLjW1AAR7YkHw%40mail.gmail.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Mohit Anchlia) #20

This is interesting; I was thinking that by having more replicas we can
work around issues when a few nodes go down. Is this more challenging than
that? I am trying to understand where I should put more focus to get an HA
search solution. We have large data sets, so I also worry about too much
index movement causing performance issues. Is there a downside to
completely disabling shard allocation?
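
On the replica question: each replica adds one extra copy of every shard,
so an index can survive as many simultaneous node losses as it has
replicas, at the cost of extra disk space and indexing work. The count can
be changed per index at runtime; a sketch against a local node ("myindex"
is a placeholder):

```shell
# Raise the replica count to 2 (3 copies of each shard in total).
curl -XPUT 'http://localhost:9200/myindex/_settings' -d '{
  "index": { "number_of_replicas": 2 }
}'
```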

On Thu, Feb 13, 2014 at 2:26 PM, joergprante@gmail.com <
joergprante@gmail.com> wrote:

Personally this is what I am after, because I know how frustrating it is
for the user. For library catalog search, we do not lose (too much) money
if one of our systems goes down for a few minutes, but users lose time and
get annoyed. It is hard to convince the supervisors to set up 24/7
operation mode and SLAs - this is more expensive than occasional short
system outages. And even with 24/7, the search engine is only one part of
a whole complex network.

Jörg

On Thu, Feb 13, 2014 at 5:56 AM, Ivan Brusic ivan@brusic.com wrote:

On Wed, Feb 12, 2014 at 1:58 PM, joergprante@gmail.com <
joergprante@gmail.com> wrote:

For my requirements, downtime of 15 min is acceptable.

I can only wish! I run an ecommerce site, so my requirement is no
downtime. Ever.

--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/CAKdsXoF-hF%3D6q%3DcjihZ6qksCV6sEmQ8nTONkQhLjW1AAR7YkHw%40mail.gmail.com
.

For more options, visit https://groups.google.com/groups/opt_out.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAOT3TWqUcHo3sqb-aUXwp6zJ5-fy-PUE1hy_r6pgoO2Rh1w51w%40mail.gmail.com.
For more options, visit https://groups.google.com/groups/opt_out.