Avoid rebalancing upon failure


(Filirom1) #1

Is it possible to avoid automatic rebalancing when a node fails?
It would avoid cascading failures.

Thanks in advance

Romain


(Ivan Brusic) #2

You can disable allocation:

http://www.elasticsearch.org/guide/reference/api/admin-cluster-update-settings.html

Look for cluster.routing.allocation.disable_allocation
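
For anyone finding this later, here is a minimal sketch of what that call might look like. It assumes a node listening on localhost:9200 (an assumption, adjust it to your cluster) and uses a transient setting. The helper name `build_disable_allocation_request` is made up for illustration, and the request is only built here, not sent.

```python
import json
import urllib.request

# Assumed address of any node in the cluster; not taken from this thread.
ES_URL = "http://localhost:9200"


def build_disable_allocation_request(disable=True, base_url=ES_URL):
    """Build (but do not send) a cluster update settings request.

    Hypothetical helper: it toggles the setting named above,
    cluster.routing.allocation.disable_allocation, as a transient setting.
    """
    body = {
        "transient": {
            "cluster.routing.allocation.disable_allocation": disable
        }
    }
    return urllib.request.Request(
        url=base_url + "/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


request = build_disable_allocation_request(True)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To actually apply it: urllib.request.urlopen(request)
```

Building the same request with `False` would turn allocation back on once the failed node has rejoined.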

Cheers,

Ivan

On Tue, Jul 31, 2012 at 5:10 AM, Romain filirom1@gmail.com wrote:

Is it possible to avoid automatic rebalancing when a node fails?
It would avoid cascading failures.

Thanks in advance

Romain


(Filirom1) #3

Thank you,

I missed this page in the documentation.

2012/7/31 Ivan Brusic ivan@brusic.com

You can disable allocation:

http://www.elasticsearch.org/guide/reference/api/admin-cluster-update-settings.html

Look for cluster.routing.allocation.disable_allocation

Cheers,

Ivan



(phill) #4

The page referenced provides no information about what these parameters
mean, but I see some explanation on
http://www.elasticsearch.org/guide/reference/modules/cluster.html

There is another parameter setting there:

"c.r.a.allow_rebalance: Allow to control when rebalancing will happen
based on the total state of all the indices shards in the cluster.
always, indices_primaries_active, and indices_all_active are
allowed, defaulting to indices_all_active to reduce chatter during
initial recovery."
(where c.r.a means cluster.routing.allocation)

But it only defines when rebalancing occurs. It is not what Romain's
question needs, because it does not have a "none" value.

That page goes on to list the parameter setting
cluster.routing.allocation.disable_allocation:

"c.r.a.disable_allocation: Allows to disable either primary or replica
allocation. Note, a replica will still be promoted to primary if one
does not exists. This setting really make sense when dynamically
updating it using the cluster update settings API."
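
To make the difference between the two parameters concrete, here is a rough sketch. The helper `allocation_settings` is hypothetical and only builds a plain settings dictionary. Whether each key can be changed dynamically in a given version is something to check against the docs. The point is that allow_rebalance accepts only the three documented values, so "none" is rejected, while disable_allocation is a simple on/off switch.

```python
import json

# The three values listed on the cluster module page quoted above.
ALLOW_REBALANCE_VALUES = (
    "always",
    "indices_primaries_active",
    "indices_all_active",  # documented default
)


def allocation_settings(allow_rebalance="indices_all_active",
                        disable_allocation=False):
    """Return a flat settings dict for the two parameters discussed.

    Hypothetical helper for illustration; it validates and builds a dict only.
    """
    if allow_rebalance not in ALLOW_REBALANCE_VALUES:
        raise ValueError(
            "allow_rebalance must be one of %s, got %r"
            % (", ".join(ALLOW_REBALANCE_VALUES), allow_rebalance)
        )
    return {
        "cluster.routing.allocation.allow_rebalance": allow_rebalance,
        "cluster.routing.allocation.disable_allocation": disable_allocation,
    }


print(json.dumps(allocation_settings(disable_allocation=True), indent=2))

try:
    allocation_settings(allow_rebalance="none")
except ValueError as error:
    print("rejected:", error)
```
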
Related to the explanation of disable_allocation: is there anywhere on
the site that describes the process of failover and promotion, where a
replica is used as a primary for queries, other than the line quoted
above, "Note, a replica will still be promoted to primary if one does
not exists."?

Conversely, is there anywhere on the site that describes what happens
when a primary comes back to the cluster? For example, what happens if
documents have been added while a replica shard was acting as the
primary shard?

(I hope someone will direct me to existing information, because I'm not
trying to hijack this thread for an entire explanation of rebalancing,
promotion, demotion, etc. If such information is missing, a simple
solution might be a few glossary entries about 'promotion', 'recovery'
and whatever other states a shard might be in.)

-Paul

On 8/1/2012 5:53 AM, Romain wrote:

Thank you,

I missed this page in the documentation.



(Filirom1) #5

Hi

I'll try to answer your questions, but correct me if I get it wrong.

2012/8/16 P. Hill parehill1@gmail.com

Related to the explanation of disable_allocation: Is there anywhere on
the site that describes the process of failover and promotion where a
replica is used as a primary for queries?

I found this thread on the mailing list:
http://elasticsearch-users.115913.n3.nabble.com/How-does-a-recovering-node-validate-any-shard-information-data-during-recover-td3215028.html

Other than the line in the above "Note, a replica will still be promoted
to primary if one does not exists."
Conversely, is there anywhere on the site that describes the process that
occurs if a primary comes back to the cluster?

As I understand it, it is the role of the master node to decide which shard
copy is the primary. When the primary shard dies, the master node elects a
replica to be the new primary. When the previous primary shard comes back,
it gets its state from the master node, which tells it that it is now a
replica.

[About the master node](http://www.elasticsearch.org/guide/reference/modules/discovery/)

For example: what happens if documents have been added while a replica
shard was acting as a primary shard?

The old replica is now the primary shard. The old primary shard becomes a
replica and is synchronized like any other replica.
http://es-cn.medcl.net/guide/concepts/scaling-lucene/transaction-log/index.html
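
As a toy illustration of that sequence (my own simplified model, not Elasticsearch's actual implementation; the names `ShardGroup`, `node_left` and `node_joined` are invented), the bookkeeping on the master side could be pictured like this:

```python
from dataclasses import dataclass, field


@dataclass
class ShardGroup:
    """Toy model of one shard: the node holding the primary and the replica nodes."""
    primary: str
    replicas: list = field(default_factory=list)


def node_left(group, node):
    """Simplified master reaction when a node drops out of the cluster."""
    if node == group.primary:
        if not group.replicas:
            raise RuntimeError("no replica left to promote")
        # Promote the first surviving replica to primary.
        group.primary = group.replicas.pop(0)
    elif node in group.replicas:
        group.replicas.remove(node)


def node_joined(group, node):
    """A returning node is registered as a replica; it never reclaims the primary."""
    if node != group.primary and node not in group.replicas:
        group.replicas.append(node)


shard = ShardGroup(primary="node1", replicas=["node2"])
node_left(shard, "node1")    # node2 is promoted to primary
node_joined(shard, "node1")  # node1 comes back as a replica
print(shard)
```

In this sketch node2 keeps the primary role after node1 returns, which is the behaviour described above. The data synchronization itself is not modelled.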

(I hope someone will direct me to existing information, because I'm not
trying to hijack this thread for an entire explanation of rebalancing,
promotion, demotion... etc.
If such information is missing, a simple solution might be a few glossary
entries about 'promotion', 'recovery' and whatever other states a shard
might be in)

Indeed, I would love to see some documentation about the design concepts.
I found these links helpful (they still need to be translated into English):

* http://es-cn.medcl.net/guide/concepts/scaling-lucene/building-blocks/index.html
* http://es-cn.medcl.net/guide/concepts/scaling-lucene/partitioning/index.html
* http://es-cn.medcl.net/guide/concepts/scaling-lucene/replication/index.html
* http://es-cn.medcl.net/guide/concepts/scaling-lucene/transaction-log/index.html

-Paul

Romain



(phill) #6

On 8/20/2012 2:07 AM, Romain wrote:


Indeed, I would love to see some documentation about the design concepts.
I found these links helpful (they still need to be translated into English):

Thanks,

I see the problem: on the ES website there is nothing under "Concepts"
as there is on your mixed Chinese/English site, so there are no pages
that give an overview of the role of the transaction log, the states a
shard might be in, or the role of the master node at various points.
Your concept pages, taken from "Road to a Distributed Search Engine",
are useful.

Does your worthy attempt at describing recovery need one last step?
You said:

"The old replica is now the primary shard. The old primary shard becomes
a replica and is synchronized like any other replica."

Presumably the particular ES index was designed to have its primary
shards well distributed on start-up (possibly through some shard
allocation filtering, another incompletely described concept on the
website), so I assume that the final steps are:

  1. Once synchronization is complete and both shards are in the same state:
    a. the replica (old primary) switches to become the primary again.
    b. the current primary (old replica) switches back to its original
    role as a replica.

Does anyone have any insight to confirm that this final step of switching
back to the original roles actually occurs?
-Paul



(Filirom1) #7

As I understand it, ES is designed to distribute all shards evenly, not
only the primaries. I don't think the final step you describe happens.

Correct me if I am missing something.

Romain

2012/8/23 P. Hill parehill1@gmail.com

Does your worthy attempt at describing recovery need one last step?
You said:

"The old replica is now the primary shard. The old primary shard becomes a
replica and is synchronized like any other replica."

Presumably the particular ES index was designed to have its primary shards
well distributed on start-up (possibly through some shard allocation
filtering, another incompletely described concept on the website), so I
assume that the final steps are:

  1. Once synchronization is complete and both shards are in the same state:
    a. the replica (old primary) switches to become the primary again.
    b. the current primary (old replica) switches back to its original
    role as a replica.

Does anyone have any insight to confirm that this final step of switching
back to the original roles actually occurs?
-Paul



(phill) #8

But if you had nodes A, B and C holding the primaries of shards a, b and c
respectively, and B went out, b's primary would now be on A or C.
If C then went out, c's primary would be on A or B, and so on. I guess the
attempt at balancing would not keep picking the replicas on A to become
primaries.

A a
B b
C c

B goes off-line:

A a
C c, b

B comes back. Now I imagine that, because of this very dynamic ability to
rebalance (and not as a final step in bringing things back on-line), it is
going to move one or another shard to B, so we might now have:

A a
B c
C b

But then I start to think about explicit routing. Rereading what I can
find, it doesn't actually say that a routing value sticks with a node, but
with a shard (the value is stored in the shard, if I understand the info).
So I suppose my last step is not part of re-syncing a shard (or whatever it
is called) so much as part of keeping things balanced. Things end up a
little different, and primary shards might drift around but not become
concentrated.
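
The picture above can be played through with a toy sketch. This is only a guess at the general shape of count-based balancing, not how Elasticsearch really decides. The function names are invented, the labels stand for primaries only, and the tie-breaking rules are arbitrary choices made so that the output matches the picture.

```python
def fail_node(placement, node):
    """Remove a node and hand its shards to the least loaded survivor."""
    orphaned = placement.pop(node)
    for shard in orphaned:
        # Ties are broken arbitrarily (here: last node name alphabetically).
        candidates = sorted(placement, reverse=True)
        target = min(candidates, key=lambda name: len(placement[name]))
        placement[target].append(shard)


def rejoin_and_rebalance(placement, node):
    """Add an empty node, then move shards until counts differ by at most one."""
    placement[node] = []
    while True:
        names = sorted(placement)
        busiest = max(names, key=lambda name: len(placement[name]))
        idlest = min(names, key=lambda name: len(placement[name]))
        if len(placement[busiest]) - len(placement[idlest]) <= 1:
            return
        # Which shard moves is arbitrary; this takes the first one listed.
        placement[idlest].append(placement[busiest].pop(0))


placement = {"A": ["a"], "B": ["b"], "C": ["c"]}
fail_node(placement, "B")
print(placement)
rejoin_and_rebalance(placement, "B")
print(placement)
```

With these choices B ends up holding c and C holds b: every node has one primary again, but not the one it started with.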

Thanks for the discussion,

-Paul

On 8/24/2012 12:26 AM, Romain wrote:

As I understand it, ES is designed to distribute all shards evenly, not
only the primaries. I don't think the final step you describe happens.

Correct me if I am missing something.

Romain


