Hello,
I am working on a platform that has 4 nodes. node1 is a master-only node and
holds no data. node2, node3 and node4 are data-only nodes (not
master-eligible). However, node4 has poorer network connectivity to the other
three nodes than they have among themselves.
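For reference, the node roles are configured roughly like this in
elasticsearch.yml (simplified; the tag attribute just mirrors the node name):

    # node1 (master only, no data)
    node.master: true
    node.data: false
    node.tag: node1

    # node2, node3, node4 (data only, not master-eligible)
    node.master: false
    node.data: true
    node.tag: node2   # node3 / node4 accordingly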
I have only one index, with 2 shards and 2 replicas, so all the data nodes
(node2, node3 and node4) hold some shards. The primary shards are all on
node4, since that is where most of the queries, and also the inserts, take
place.
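The index was created with something like the following (host name and exact
command are just reconstructed here for illustration):

    curl -XPUT 'http://node1:9200/laundry1s' -d '{
      "settings": {
        "number_of_shards": 2,
        "number_of_replicas": 2
      }
    }'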
Inserts are done with replication=async, so for a while the data written to
node4 is not yet synchronized to the replica shards.
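An insert looks roughly like this (the type name and document body are just
placeholders):

    curl -XPOST 'http://node4:9200/laundry1s/doc?replication=async' -d '{
      "field": "value"
    }'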
Here is my problem. I did some performance experiments for our particular use
case. I wanted to know how long it takes for the replicas to reach a
consistent state after some inserts. Here is what I did: after writing about
10000 documents to node4, I waited for them to be replicated to node2 and
node3 and measured the time. The replication does happen and takes roughly
the time I expected. However, in some cases, when the node holding the
primary shards (node4 here) gets disconnected from the other data nodes,
synchronization just stops working. I know this will always happen when I see
the following in node4's log (the primary shards' node):
[2013-12-17 10:27:40,599][DEBUG][transport.netty ] [node4] disconnected from [[node3][0rOpsTurSKGWXOaGYVN90g][inet[/1x1.17.2x0.1x7:9302]]{tag=node3, max_local_storage_nodes=1, master=false}], channel closed event   <<<<< From now on node3 won't be synchronized
[2013-12-17 10:27:43,347][DEBUG][indices.memory ] [node4] marking shard [laundry1s][0] as active indexing wise
[2013-12-17 10:27:43,347][DEBUG][indices.memory ] [node4] marking shard [laundry1s][1] as active indexing wise
[2013-12-17 10:27:43,347][DEBUG][indices.memory ] [node4] recalculating shard indexing buffer (reason=active/inactive[true] created/deleted[false]), total is [203.1mb] with [2] active shards, each shard set to indexing=[101.5mb], translog=[64kb]
[2013-12-17 10:27:43,347][DEBUG][index.engine.robin ] [node4] [laundry1s][0] updating index_buffer_size from [500kb] to [101.5mb]
[2013-12-17 10:27:43,347][DEBUG][index.engine.robin ] [node4] [laundry1s][1] updating index_buffer_size from [500kb] to [101.5mb]
[2013-12-17 10:27:47,984][DEBUG][transport.netty ] [node4] connected to node [[node3][0rOpsTurSKGWXOaGYVN90g][inet[/1x1.17.2x0.1x7:9302]]{tag=node3, max_local_storage_nodes=1, master=false}]
[2013-12-17 10:27:49,530][DEBUG][transport.netty ] [node4] disconnected from [[node2][cVrCQo_bQBqM2V6ku-FvAQ][inet[/1x1.17.2x0.1x7:9301]]{tag=node2, max_local_storage_nodes=1, master=false}], channel closed event   <<<<< From now on node2 won't be synchronized
[2013-12-17 10:28:01,288][DEBUG][transport.netty ] [node4] connected to node [[node2][cVrCQo_bQBqM2V6ku-FvAQ][inet[/1x1.17.2x0.1x7:9301]]{tag=node2, max_local_storage_nodes=1, master=false}]
This is reflected in the number of documents each shard holds: I waited for
about 2 hours, and no synchronization took place.
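(One way to see this is to run a count against each data node with
preference=_local, so the hits come from that node's local shard copies; the
host names here are just for illustration:)

    curl 'http://node2:9200/laundry1s/_search?search_type=count&preference=_local'
    curl 'http://node3:9200/laundry1s/_search?search_type=count&preference=_local'
    curl 'http://node4:9200/laundry1s/_search?search_type=count&preference=_local'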
My question: is there a way to prevent such disconnections from happening? I
would like to make my ES cluster more tolerant of short-term network
irregularities. Are there timeout configurations for this?
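These are the kind of settings I was thinking of tuning in elasticsearch.yml,
though I am not sure they are the right ones for this case (values below are
only examples):

    # fault detection pings: ping less aggressively and retry more
    # before giving up on a node
    discovery.zen.fd.ping_interval: 5s
    discovery.zen.fd.ping_timeout: 60s
    discovery.zen.fd.ping_retries: 6

    # give the transport layer more time to (re)establish connections
    transport.tcp.connect_timeout: 60s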
Thanks,
Mauricio
P.S.
./elasticsearch -v
Version: 0.90.5, Build: c8714e8/2013-09-17T12:50:20Z, JVM: 1.6.0_43