Cannot relocate shard because of unstable network


(Rzulf) #1

I have an ES cluster in a data center in France, and I want to attach a node to
the cluster from a data center in Poland. The network statistics between the two
servers are as follows: ping ~35 ms, transfer ~80 Mbit/s.

The problem is that while relocating a shard (120 GB), at a random point I get
closed channel exceptions and timeouts, even though the connection between the
hosts seems to be stable (ping works, and so do other apps). Suddenly the
connection with the node from which I am copying the shard is lost, and the
connection with the cluster clients is lost as well. What is strange is that the
other nodes with which I lost the connection are not aware of it and continue
working as if nothing had happened. I have seen that this problem was fixed in
issue https://github.com/elasticsearch/elasticsearch/issues/2733, but would
that help in my situation? If a network error occurs, is the shard relocation
resumed or restarted from scratch? Relocating the shard takes about 3 hours, so
if my network connection cannot sustain such a long transfer, is there any
point in attaching the node to the cluster?
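
As a rough sanity check on the figures above, the transfer time can be estimated from the shard size and the measured throughput. This is only a back-of-the-envelope sketch: it assumes decimal units (1 GB = 1000 MB) and that the full ~80 Mbit/s is available for the whole copy. The function name `transfer_hours` is made up for illustration.

```python
def transfer_hours(size_gb: float, rate_mbit_s: float) -> float:
    """Hours needed to copy size_gb gigabytes at rate_mbit_s megabits per second."""
    # Assumption: decimal units, 1 GB = 1000 MB and 1 byte = 8 bits.
    size_megabits = size_gb * 1000 * 8
    seconds = size_megabits / rate_mbit_s
    return seconds / 3600


# Figures from the post: a 120 GB shard over a ~80 Mbit/s link.
print(f"{transfer_hours(120, 80):.2f} h")
```

The result is a little over 3 hours, which is consistent with the relocation time reported above.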

I use ES version 0.20.4.

Regards
Michał

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


(Rzulf) #2

Anyone? What happens when the connection is temporarily lost during relocation?

Michał

(In reply to Michał's original post of Thursday, 31 October 2013, 14:13:35 UTC+1.)



(Alexander Reelsen) #3

Hey,

It is not recommended to run a cluster across data centers at the moment.
Also, the Elasticsearch version you are using is really old; you should
upgrade. There are a couple of workarounds for the cross data center problem
that avoid restarting something like a relocation over and over again because
of a permanently unstable network connection:

  1. Application-level replication. You simply index your data into both
    clusters (Paris and Poland). Using something like an MQ mechanism may
    make sense, so that indexing does not have to wait on the distant data
    center.

  2. If there is no need to be real-time, the new Snapshot/Restore API, which
    will come with Elasticsearch 1.0, might be worth keeping in mind:
    https://github.com/elasticsearch/elasticsearch/issues/3826
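
The first workaround (index into both clusters, with a queue in front of the distant one) could be sketched roughly as below. This is a minimal in-process illustration, not a real MQ setup: `DualWriter`, `index_local` and `index_remote` are hypothetical names, the two callables stand in for whatever client code sends a document to each cluster, and a production setup would more likely use a durable message queue with proper back-off.

```python
import queue
import threading
from typing import Callable, Dict, List

Document = Dict[str, object]
IndexFn = Callable[[Document], None]  # hypothetical: sends one document to one cluster


class DualWriter:
    """Index into the nearby cluster directly and into the distant one via a queue."""

    def __init__(self, index_local: IndexFn, index_remote: IndexFn,
                 max_attempts: int = 3) -> None:
        self._index_local = index_local
        self._index_remote = index_remote
        self._max_attempts = max_attempts
        self.failed: List[Document] = []  # documents that never reached the remote side
        self._pending: queue.Queue = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def index(self, doc: Document) -> None:
        # The caller only waits for the local cluster; the remote write is deferred.
        self._index_local(doc)
        self._pending.put(doc)

    def _drain(self) -> None:
        while True:
            doc = self._pending.get()
            if doc is None:  # sentinel queued by close()
                return
            for _attempt in range(self._max_attempts):
                try:
                    self._index_remote(doc)
                    break
                except Exception:
                    continue  # retry immediately; a real system would back off
            else:
                self.failed.append(doc)

    def close(self) -> None:
        # Flush everything queued so far, then stop the worker thread.
        self._pending.put(None)
        self._worker.join()


# Usage with in-memory lists standing in for the two clusters.
local_docs: List[Document] = []
remote_docs: List[Document] = []
writer = DualWriter(local_docs.append, remote_docs.append)
writer.index({"id": 1, "msg": "hello"})
writer.close()
print(local_docs == remote_docs)
```

The point of the queue is that a slow or briefly unreachable remote data center delays only the background worker, not the indexing request itself.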

--Alex

(In reply to Michał's message of Mon, Nov 4, 2013, 7:16 PM.)



(Rzulf) #4

Thanks for the reply. I managed to get the cluster working by setting
network.tcp.keep_alive to true and lowering the system keepalive time from
2 hours to 30 minutes (/proc/sys/net/ipv4/tcp_keepalive_time). The cluster has
now been working fine for a few days, with no noticeable loss in performance.
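
For anyone wanting to check the same two settings, here is a small sketch. It reads the kernel keepalive time from the path mentioned above (Linux only) and shows what enabling keepalive looks like on a plain TCP socket. It assumes that network.tcp.keep_alive corresponds to the standard SO_KEEPALIVE socket option; the helper names are made up for illustration, and the code does not touch Elasticsearch itself.

```python
import socket
from pathlib import Path

# Path from the post; it exists only on Linux.
KEEPALIVE_TIME_PATH = Path("/proc/sys/net/ipv4/tcp_keepalive_time")


def read_keepalive_seconds(path: Path = KEEPALIVE_TIME_PATH) -> int:
    """Return the idle time, in seconds, before the kernel sends the first probe."""
    return int(path.read_text().strip())


def open_keepalive_socket() -> socket.socket:
    """Create an unconnected TCP socket with SO_KEEPALIVE switched on."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    return sock


if KEEPALIVE_TIME_PATH.exists():
    # 7200 corresponds to the 2-hour value, 1800 to the 30 minutes used in the post.
    print("tcp_keepalive_time:", read_keepalive_seconds(), "s")

sock = open_keepalive_socket()
enabled = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
print("SO_KEEPALIVE enabled:", enabled)
sock.close()
```

Changing the kernel value requires root, for example with `sysctl -w net.ipv4.tcp_keepalive_time=1800`, and it only affects sockets that have keepalive enabled.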

Michał

(In reply to Alexander Reelsen's message of 2013/11/14.)


