Inconsistent state of replica-primary shards after node disconnection


(mjo77) #1

Hello,

I am working on a platform that has 4 nodes. node1 is master only, with no
data. node2, node3 and node4 are data-only nodes (not master). However,
node4 has poorer network connectivity to the other 3 nodes than they have
among themselves.
I have only one index, with 2 shards and 2 replicas (so all data nodes,
node2, node3 and node4, have some shards). The primary shards are only on
node4, as it is where most of the queries, and also the inserts, will take
place. Inserts are done in replication=async mode, so for a while data that
is written to node4 is not synchronized with the replica shards.
Here is my problem. I ran some performance experiments for our particular
use case. I wanted to know how long it would take for the replicas to reach
a consistent state after some inserts. So here is what I did: after writing
about 10000 documents to node4, I waited for them to be replicated to node2
and node3, and I measured the time. This replication does happen and takes
roughly the expected time. However, in some cases, when the primary shards'
node (node4 in this case) gets disconnected from the other data nodes,
synchronization just stops working. I know that this will happen whenever I
see the following in the log of node4 (the primary shards' node):

[2013-12-17 10:27:40,599][DEBUG][transport.netty ] [node4] disconnected from [[node3][0rOpsTurSKGWXOaGYVN90g][inet[/1x1.17.2x0.1x7:9302]]{tag=node3, max_local_storage_nodes=1, master=false}], channel closed event <<<<< From now on node3 won't be synchronized
[2013-12-17 10:27:43,347][DEBUG][indices.memory ] [node4] marking shard [laundry1s][0] as active indexing wise
[2013-12-17 10:27:43,347][DEBUG][indices.memory ] [node4] marking shard [laundry1s][1] as active indexing wise
[2013-12-17 10:27:43,347][DEBUG][indices.memory ] [node4] recalculating shard indexing buffer (reason=active/inactive[true] created/deleted[false]), total is [203.1mb] with [2] active shards, each shard set to indexing=[101.5mb], translog=[64kb]
[2013-12-17 10:27:43,347][DEBUG][index.engine.robin ] [node4] [laundry1s][0] updating index_buffer_size from [500kb] to [101.5mb]
[2013-12-17 10:27:43,347][DEBUG][index.engine.robin ] [node4] [laundry1s][1] updating index_buffer_size from [500kb] to [101.5mb]
[2013-12-17 10:27:47,984][DEBUG][transport.netty ] [node4] connected to node [[node3][0rOpsTurSKGWXOaGYVN90g][inet[/1x1.17.2x0.1x7:9302]]{tag=node3, max_local_storage_nodes=1, master=false}]
[2013-12-17 10:27:49,530][DEBUG][transport.netty ] [node4] disconnected from [[node2][cVrCQo_bQBqM2V6ku-FvAQ][inet[/1x1.17.2x0.1x7:9301]]{tag=node2, max_local_storage_nodes=1, master=false}], channel closed event <<<<< From now on node2 won't be synchronized
[2013-12-17 10:28:01,288][DEBUG][transport.netty ] [node4] connected to node [[node2][cVrCQo_bQBqM2V6ku-FvAQ][inet[/1x1.17.2x0.1x7:9301]]{tag=node2, max_local_storage_nodes=1, master=false}]
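
To see how short these drops are, the disconnect/reconnect pairs in a log like the one above can be timed with a small script. This is only a sketch: the `outage_durations` helper and the regular expression are assumptions about the log layout (timestamp, level, logger, local node, then `disconnected from` or `connected to node` followed by the peer name), and the sample lines are the ones above joined onto one line each and truncated after the peer id.

```python
import re
from datetime import datetime

# Assumed layout: [timestamp][DEBUG][transport.netty ] [local node] <event> [[peer]...
EVENT = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\[DEBUG\]\[transport\.netty\s*\]\s+\[[^\]]+\]\s+"
    r"(?P<event>disconnected from|connected to node)\s+\[\[(?P<peer>[^\]]+)\]"
)


def outage_durations(lines):
    """Return (peer, seconds) for each disconnect that is followed by a reconnect."""
    down_since = {}
    outages = []
    for line in lines:
        match = EVENT.match(line)
        if not match:
            continue  # not a transport connect/disconnect entry
        ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
        peer = match.group("peer")
        if match.group("event") == "disconnected from":
            down_since[peer] = ts
        elif peer in down_since:
            outages.append((peer, (ts - down_since.pop(peer)).total_seconds()))
    return outages


# Entries from the log above, truncated after the peer id.
sample = [
    "[2013-12-17 10:27:40,599][DEBUG][transport.netty ] [node4] disconnected from [[node3][0rOpsTurSKGWXOaGYVN90g]",
    "[2013-12-17 10:27:43,347][DEBUG][indices.memory ] [node4] marking shard [laundry1s][0] as active indexing wise",
    "[2013-12-17 10:27:47,984][DEBUG][transport.netty ] [node4] connected to node [[node3][0rOpsTurSKGWXOaGYVN90g]",
    "[2013-12-17 10:27:49,530][DEBUG][transport.netty ] [node4] disconnected from [[node2][cVrCQo_bQBqM2V6ku-FvAQ]",
    "[2013-12-17 10:28:01,288][DEBUG][transport.netty ] [node4] connected to node [[node2][cVrCQo_bQBqM2V6ku-FvAQ]",
]

for peer, seconds in outage_durations(sample):
    print(peer, seconds)
```

For the entries above, node3 is unreachable for roughly 7.4 seconds and node2 for roughly 11.8 seconds.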

This is reflected in the number of documents each shard has:

https://gist.github.com/mauriciojost/8003895

I waited for about 2 hours, and no synchronization took place.
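
The per-shard comparison described above (document count of each replica against its primary) can be sketched as follows. The input shape (a dict from shard number to a list of `(node, is_primary, doc_count)` tuples) and the `lagging_replicas` name are hypothetical, and the counts in the example are made up for illustration; they are not the real numbers from this cluster.

```python
def lagging_replicas(shards):
    """Return {shard: [(node, missing_docs), ...]} for replicas behind their primary."""
    report = {}
    for shard, copies in shards.items():
        primaries = [docs for _, is_primary, docs in copies if is_primary]
        if len(primaries) != 1:
            raise ValueError("shard %r needs exactly one primary copy" % (shard,))
        primary_docs = primaries[0]
        behind = [
            (node, primary_docs - docs)
            for node, is_primary, docs in copies
            if not is_primary and docs < primary_docs
        ]
        if behind:
            report[shard] = behind
    return report


# Made-up counts, for illustration only.
example = {
    0: [("node4", True, 5000), ("node2", False, 5000), ("node3", False, 4200)],
    1: [("node4", True, 5000), ("node2", False, 4100), ("node3", False, 5000)],
}
print(lagging_replicas(example))
```

An empty result would mean that every replica has caught up with its primary, at least as far as document counts can tell.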

My question: is there a way to prevent such disconnections from happening?
I would like to make my ES cluster more tolerant of short-term network
irregularities. Timeout configurations?

Thanks,

Mauricio

P.S.

./elasticsearch -v
Version: 0.90.5, Build: c8714e8/2013-09-17T12:50:20Z, JVM: 1.6.0_43

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/b6af6650-fccd-401d-8e9a-9a7302d13127%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Mark Walkom) #2

It might not help, but you should upgrade to Java 1.7; 1.6 has a fair
few known issues.

You should also take a look at this page:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-discovery-zen.html#fault-detection

Regards,
Mark Walkom

Infrastructure Engineer
Campaign Monitor
email: markw@campaignmonitor.com
web: www.campaignmonitor.com



(mjo77) #3

Thanks Mark,

I changed the JVM but I still have the same problem. I am now using:

$ java -version
java version "1.7.0_45"
Java(TM) SE Runtime Environment (build 1.7.0_45-b18)
Java HotSpot(TM) 64-Bit Server VM (build 24.45-b08, mixed mode)

I am running the nodes on Windows 2008 R2 VMs with 2 cores and 3.5 GB of
RAM, and 2 GB of heap for each node, just in case it matters.
As for the fault-detection parameters, I have already tried playing with
them. The ones I am currently using:

discovery.zen.fd.ping_interval: 5s
discovery.zen.fd.ping_timeout: 30s
discovery.zen.fd.ping_retries: 10
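
A rough estimate of how long settings like these should let an unresponsive node go before it is considered failed can be sketched as below. This assumes that every attempt waits for the full `ping_timeout` and that a node is dropped after `ping_retries` consecutive failed pings; that behaviour is an assumption, not something checked against the Elasticsearch source, and `worst_case_detection_seconds` is a made-up name.

```python
def worst_case_detection_seconds(ping_interval, ping_timeout, ping_retries):
    """Rough upper bound, in seconds, before a silent node is declared failed."""
    # Up to one interval may pass before the first ping goes out; each attempt
    # is then assumed to wait for the full timeout before it counts as failed.
    return ping_interval + ping_retries * ping_timeout


# Values from the settings above: 5s interval, 30s timeout, 10 retries.
print(worst_case_detection_seconds(5, 30, 10))
```

Under these assumptions the settings above allow about 305 seconds, far longer than the drops of a few seconds visible in the node4 log.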

However, from what I see, the error occurs at a lower level than node
failure detection, since I see that "channel closed event" debug message,
but I am just guessing. We are trying this in Azure, and in some cases, even
with only the 3 well-connected VMs (node1, node2 and node3, as master and
data nodes this time), we experienced the same problem: we insert data into
some nodes with async replication, then such a disconnection happens and
the nodes are left with inconsistent primary and replica shards.

Are there any other connection timeout parameters I can try, to make my
system more tolerant of irregular networks?

Mauricio


--
Mauricio Jost


