Frequent disconnects between nodes

Hi.

We have a setup with embedded ES (0.20.6) with 7 nodes. On a regular basis,
one or more nodes loses its connection to the master and thus (since the
minimum_master_nodes setting is 1) creates a split-brain scenario. The
split-brain issue is OK for now, but the frequency of disconnects is not.

There are "no indications" that the network is causing this, but one can of
course never be sure, although it is a bit of a hassle to get the service
provider to accept that without more evidence.

The cluster uses unicast to communicate, and in an attempt to avoid these
regular node fallouts we have set discovery.zen.fd.ping_timeout = 240s,
but it doesn't seem to make things any better. Also, the failing nodes are
just as often located within the same network as on a different subnet.
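
For reference, this is roughly how the setting is passed to the embedded node
(a sketch using the 0.20.x ImmutableSettings builder; ping_interval and
ping_retries are shown at their documented defaults of 1s and 3, only the
timeout is overridden):

// Sketch only: the zen fault-detection knobs as they appear in the node settings.
Settings fdSettings = ImmutableSettings.settingsBuilder()
    .put( "discovery.zen.fd.ping_interval", "1s" )  // default: how often pings are sent
    .put( "discovery.zen.fd.ping_retries", 3 )      // default: failed pings before a node is declared gone
    .put( "discovery.zen.fd.ping_timeout", "240s" ) // our override (the default is 30s)
    .build();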

Here is a view of the log from the node falling out of the cluster:

Log from node falling out of cluster · GitHub

I'm a bit stuck on how to proceed to find the reason for these constant
disconnects (a couple of nodes will reliably fail within a day or two).
Any pointers would be appreciated.

greetings

Runar Myklebust


Another thing to point out:

I see some mentions of 'transport.ping_timeout' as a client setting. How
does this compare to / when does it come into play versus
'discovery.zen.fd.ping_timeout'? Is tuning 'transport.ping_timeout'
worth a shot?

For our embedded client, we are not setting any specific options at all for
now; we initialize the client like this:

public void start()
{
    // Obtain a client from the already-running embedded node;
    // no client-specific settings are applied here.
    this.client = this.node.client();
}

The node is initialized with a lot of settings though, through:

this.node = NodeBuilder.nodeBuilder().settings( settings ).build();

Will it be sufficient to add the "transport.ping_timeout" to the node
settings?
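
For illustration, what I have in mind is roughly this (same
ImmutableSettings/NodeBuilder API as above, untested; the
'transport.ping_timeout' name is the one mentioned above, not something I
have verified against 0.20.x):

// Sketch: the two timeouts added alongside the settings we already pass.
Settings settings = ImmutableSettings.settingsBuilder()
    .put( "transport.ping_timeout", "30s" )         // transport-level ping (is the process reachable?)
    .put( "discovery.zen.fd.ping_timeout", "240s" ) // zen fault detection (cluster membership)
    // ... plus everything we already configure today
    .build();

this.node = NodeBuilder.nodeBuilder().settings( settings ).build();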


From what I understand, you have a split brain situation. A second master
wants to rejoin, the state difference is detected, but the rejoin request
fails. To fix this, my only idea is to isolate the second master and remove
state and data from that node before rejoining.
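
A more permanent guard against the split brain itself is to require a quorum
of master-eligible nodes; with 7 nodes that is (7 / 2) + 1 = 4. As a sketch,
in the node settings (shown with the Java settings builder, same as your
embedded setup):

// Quorum of 4 out of 7 master-eligible nodes: a partitioned minority can no longer elect its own master.
Settings settings = ImmutableSettings.settingsBuilder()
    .put( "discovery.zen.minimum_master_nodes", 4 )
    .build();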

Increasing timeouts may be a workaround for the moment, but if your network
is not reliable, you should diagnose and fix that. Network monitoring tools
like tcpdump, ntop, or Wireshark may help, to mention only a few.

Jörg


Hi Jörg, thanks for answering.
The split-brain is "ok", in the sense that I know it will happen with the
current setup if nodes drop out.
The reason for the disconnects, on the other hand, is more of a concern
right now. I'm a bit stuck on finding out why exactly the disconnects
happen, since I have very limited access to the system and have to ask for
assistance on what to check for explicitly.
At the moment, the log level for discovery and transport has been raised
to "finest", so that will hopefully reveal some more information.


Just to add one more thought: network disconnects may only be a symptom, and
examining the network subsystem may not reveal anything useful.

You said you have limited access and ES is embedded. It could also be that
ES is starved of necessary resources (CPU, memory, disk) by other software
or activity on the system, and you only see the network disconnects, not
the cause of the problem.

Jörg


There are two levels of timeouts. The first checks whether the process is
running and accepting connections. The second checks whether the machine
itself is actually running.

If the process itself is slow, the first timeout will occur and might have
little to do with the network. As Jörg mentioned, if Elasticsearch is
resource-constrained, it might not be responding in a timely manner. If you
are limited in the amount of memory you have, you might be experiencing
large garbage collections. Upgrading to 0.90.x will improve memory usage
and reduce the frequency and duration of GC.
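
If attaching monitoring tools is not an option, a small sketch like the one
below (plain JDK management beans, nothing Elasticsearch-specific, class name
made up) can be dropped into the embedded process or run standalone to log
cumulative GC counts and pause times, so a long collection can be matched
against the time of a disconnect:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcWatcher {

    // Prints cumulative collection counts and accumulated pause time for every collector, once a minute.
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                System.out.println(System.currentTimeMillis() + " " + gc.getName()
                        + " collections=" + gc.getCollectionCount()
                        + " totalPauseMs=" + gc.getCollectionTime());
            }
            Thread.sleep(60000L);
        }
    }
}

Enabling the JVM's own GC log (-verbose:gc or -Xloggc:<file>) gives the same
information with less effort, if you can change the JVM options.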

Cheers,

Ivan


Ok, so the "transport.ping_timeout" checks if the process is running then?

Now, after a run with tracing, I can see a lot of these in the logs:

2013-08-14 14:47:19,224 DEBUG [org.elasticsearch.transport.netty]
(elasticsearch[local][transport_client_worker][T#2]{New I/O worker #2})
[local] disconnected from
[[local][VbxjXeqGRIyNFzvK-1JCIw][inet[/NODE24:8800]]{local=false}],
channel closed event
2013-08-14 14:47:24,080 DEBUG [org.elasticsearch.transport.netty]
(elasticsearch[local][transport_client_worker][T#3]{New I/O worker #3})
[local] disconnected from
[[local][V7FXnZiLR-GVIyZ2DOwV2w][inet[/NODE26:8800]]{local=false}],
channel closed event

After a while, the node that the current master gets the "channel closed
event" from loses its connection to the master, starts a master election,
and makes all kinds of trouble. As far as I can see, no GC or other
slowdowns happened on NODE24 or NODE26, so this seems to be a network
issue, right?


The "channel closed event" notice is a normal event that is happening
during connection pool activity and just an informational message on
"DEBUG" log level.

It seems unrelated to the "unable to rejoin cluster" situations you have
posted.

Do you get failure/error messages in the log?

Jörg


Hmmm...

Here are the first traces that something is visibly going wrong, around 19:10:

Node47:
2013-08-14 19:09:49,243 DEBUG [org.elasticsearch.transport.netty]
(elasticsearch[local][transport_client_worker][T#3]{New I/O worker #3})
[local] disconnected from
[[local][VbxjXeqGRIyNFzvK-1JCIw][inet[/NODE24:8800]]{local=false}],
channel closed event
2013-08-14 19:09:54,109 DEBUG [org.elasticsearch.transport.netty]
(elasticsearch[local][transport_client_worker][T#3]{New I/O worker #3})
[local] disconnected from
[[local][V7FXnZiLR-GVIyZ2DOwV2w][inet[/NODE26:8800]]{local=false}],
channel closed event
2013-08-14 19:10:06,008 DEBUG [org.elasticsearch.transport.netty]
(elasticsearch[local][transport_client_worker][T#4]{New I/O worker #4})
[local] disconnected from
[[local][da-T28GDRtWgadrkCvxS-w][inet[/NODE25:8800]]{local=false}],
channel closed event
2013-08-14 19:10:34,253 TRACE [org.elasticsearch.discovery.zen.fd]
(elasticsearch[local][generic][T#19]) [local] [node ]
[[local][VbxjXeqGRIyNFzvK-1JCIw][inet[/NODE24:8800]]{local=false}]
transport disconnected (with verified connect)
2013-08-14 19:10:34,259 DEBUG [org.elasticsearch.transport.netty]
(elasticsearch[local][generic][T#24]) [local] connected to node
[[local][V7FXnZiLR-GVIyZ2DOwV2w][inet[/NODE26:8800]]{local=false}]
2013-08-14 19:10:34,259 DEBUG [org.elasticsearch.transport.netty]
(elasticsearch[local][generic][T#25]) [local] connected to node
[[local][da-T28GDRtWgadrkCvxS-w][inet[/NODE25:8800]]{local=false}]
2013-08-14 19:10:34,273 DEBUG [org.elasticsearch.transport.netty]
(elasticsearch[local][generic][T#26]) [local] connected to node
[[local][VbxjXeqGRIyNFzvK-1JCIw][inet[/NODE24:8800]]{local=false}]
2013-08-14 19:10:34,290 DEBUG [org.elasticsearch.transport.netty]
(elasticsearch[local][generic][T#27]) [local] disconnected from
[[local][VbxjXeqGRIyNFzvK-1JCIw][inet[/NODE24:8800]]{local=false}]

Node24:
2013-08-14 19:10:35,167 DEBUG [org.elasticsearch.discovery.zen.fd]
(elasticsearch[local][transport_client_worker][T#4]{New I/O worker #4})
[local] [master] pinging a master
[local][Y01TgbUzRg-JIIpQ7NqlZg][inet[/NODE47:8800]]{local=false} but we
do not exists on it, act as if its master failure
2013-08-14 19:10:35,170 DEBUG [org.elasticsearch.discovery.zen.fd]
(elasticsearch[local][transport_client_worker][T#4]{New I/O worker #4})
[local] [master] stopping fault detection against master
[[local][Y01TgbUzRg-JIIpQ7NqlZg][inet[/NODE47:8800]]{local=false}],
reason [master failure, do not exists on master, act as master failure]
2013-08-14 19:10:35,171 INFO [org.elasticsearch.discovery.zen]
(elasticsearch[local][generic][T#1]) [local] master_left
[[local][Y01TgbUzRg-JIIpQ7NqlZg][inet[/NODE47:8800]]{local=false}],
reason [do not exists on master, act as master failure]
2013-08-14 19:10:35,174 DEBUG [org.elasticsearch.discovery.zen.fd]
(elasticsearch[local][clusterService#updateTask][T#1]) [local] [master]
restarting fault detection against master
[[local][JrRrD5Y8R8WHn1ZAkjYNBw][inet[/NODE45:8800]]{local=false}],
reason [possible elected master since master left (reason = do not exists
on master, act as master failure)]
2013-08-14 19:10:35,181 DEBUG [org.elasticsearch.transport.netty]
(elasticsearch[local][generic][T#1]) [local] disconnected from
[[local][Y01TgbUzRg-JIIpQ7NqlZg][inet[/NODE47:8800]]{local=false}]
2013-08-14 19:10:36,233 DEBUG [org.elasticsearch.discovery.zen.fd]
(elasticsearch[local][transport_client_worker][T#4]{New I/O worker #4})
[local] [master] pinging a master
[local][JrRrD5Y8R8WHn1ZAkjYNBw][inet[/NODE45:8800]]{local=false} that
is no longer a master
2013-08-14 19:10:36,235 INFO [org.elasticsearch.discovery.zen]
(elasticsearch[local][generic][T#5]) [local] master_left
[[local][JrRrD5Y8R8WHn1ZAkjYNBw][inet[/NODE45:8800]]{local=false}],
reason [no longer master]
2013-08-14 19:10:36,235 DEBUG [org.elasticsearch.discovery.zen.fd]
(elasticsearch[local][transport_client_worker][T#4]{New I/O worker #4})
[local] [master] stopping fault detection against master
[[local][JrRrD5Y8R8WHn1ZAkjYNBw][inet[/NODE45:8800]]{local=false}],
reason [master failure, no longer master]
2013-08-14 19:10:36,241 DEBUG [org.elasticsearch.discovery.zen.fd]
(elasticsearch[local][clusterService#updateTask][T#1]) [local] [master]
restarting fault detection against master
[[local][V7FXnZiLR-GVIyZ2DOwV2w][inet[/NODE26:8800]]{local=false}],
reason [possible elected master since master left (reason = no longer
master)]
2013-08-14 19:10:36,245 DEBUG [org.elasticsearch.transport.netty]
(elasticsearch[local][generic][T#5]) [local] disconnected from
[[local][JrRrD5Y8R8WHn1ZAkjYNBw][inet[/NODE45:8800]]{local=false}]
2013-08-14 19:10:37,359 DEBUG [org.elasticsearch.discovery.zen.fd]
(elasticsearch[local][transport_client_worker][T#3]{New I/O worker #3})
[local] [master] pinging a master
[local][V7FXnZiLR-GVIyZ2DOwV2w][inet[/NODE26:8800]]{local=false} that
is no longer a master
2013-08-14 19:10:37,361 INFO [org.elasticsearch.discovery.zen]
(elasticsearch[local][generic][T#10]) [local] master_left
[[local][V7FXnZiLR-GVIyZ2DOwV2w][inet[/NODE26:8800]]{local=false}],
reason [no longer master]
2013-08-14 19:10:37,363 DEBUG [org.elasticsearch.discovery.zen.fd]
(elasticsearch[local][transport_client_worker][T#3]{New I/O worker #3})
[local] [master] stopping fault detection against master
[[local][V7FXnZiLR-GVIyZ2DOwV2w][inet[/NODE26:8800]]{local=false}],
reason [master failure, no longer master]
2013-08-14 19:10:37,393 DEBUG [org.elasticsearch.transport.netty]
(elasticsearch[local][generic][T#10]) [local] disconnected from
[[local][V7FXnZiLR-GVIyZ2DOwV2w][inet[/NODE26:8800]]{local=false}]

Node25:
2013-08-14 19:10:34,309 DEBUG [org.elasticsearch.transport.netty]
(elasticsearch[local][generic][T#1]) [local] disconnected from
[[local][VbxjXeqGRIyNFzvK-1JCIw][inet[/NODE24:8800]]{local=false}]
2013-08-14 19:10:37,387 DEBUG [org.elasticsearch.discovery.zen.fd]
(elasticsearch[local][clusterService#updateTask][T#1]) [local] [master]
restarting fault detection against master
[[local][VbxjXeqGRIyNFzvK-1JCIw][inet[/NODE24:8800]]{local=false}],
reason [new cluster stare received and we monitor the wrong master
[[local][Y01TgbUzRg-JIIpQ7NqlZg][inet[/NODE47:8800]]{local=false}]]
2013-08-14 19:10:37,397 DEBUG [org.elasticsearch.transport.netty]
(elasticsearch[local][clusterService#updateTask][T#1]) [local] connected to
node [[local][VbxjXeqGRIyNFzvK-1JCIw][inet[/NODE24:8800]]{local=false}]
2013-08-14 19:10:37,405 DEBUG [org.elasticsearch.transport.netty]
(elasticsearch[local][generic][T#10]) [local] disconnected from
[[local][Y01TgbUzRg-JIIpQ7NqlZg][inet[/NODE47:8800]]{local=false}]
2013-08-14 19:10:37,410 DEBUG [org.elasticsearch.transport.netty]
(elasticsearch[local][generic][T#10]) [local] disconnected from
[[local][V7FXnZiLR-GVIyZ2DOwV2w][inet[/NODE26:8800]]{local=false}]
2013-08-14 19:10:37,424 DEBUG [org.elasticsearch.transport.netty]
(elasticsearch[local][generic][T#10]) [local] disconnected from
[[local][JrRrD5Y8R8WHn1ZAkjYNBw][inet[/NODE45:8800]]{local=false}]


You have a split brain with another master, and you should repair your
broken node real soon.

Elasticsearch does not accept the node's connection attempts because the
node tries to announce itself as a master.

Jörg

On 16.08.2013 11:32, Runar Myklebust wrote:

2013-08-14 19:10:37,387 DEBUG [org.elasticsearch.discovery.zen.fd]
(elasticsearch[local][clusterService#updateTask][T#1]) [local]
[master] restarting fault detection against master
[[local][VbxjXeqGRIyNFzvK-1JCIw][inet[/NODE24:8800]]{local=false}], reason
[new cluster stare received and we monitor the wrong master
[[local][Y01TgbUzRg-JIIpQ7NqlZg][inet[/NODE47:8800]]{local=false}]]


Hi again Jörg.
The split-brain is "expected", but the reason for the disconnect is still
unclear. The strange thing is that I see no traces of zen discovery ping
attempts, and no indication that anything is wrong before the "master
failure" message.
And NODE47 is seemingly working normally both before and after the disconnect.


Well, that is the reason for the disconnect. Your second master node has a
defective internal cluster state. This is detected by the other master when
it tries to reconnect, and the node is disconnected so it can't join the
cluster again.

Have you cleared the defective node of any data?

Jörg


As far as I can read from the logs, this is what's happening:

At 19:09:49,243 a "channel closed event" is received on NODE47 (the master)
for NODE24, and NODE24 is disconnected.
At 19:10:34,273 a connection to NODE24 is made, then
at 19:10:34,290 we get a "disconnected" from NODE24.

At 19:10:35,167 NODE24 pings the master (NODE47), but the master does not
have NODE24 in its list of nodes, and treats this as a master failure.

This scenario always happens after a while, also when the nodes are
initialized by deleting index files etc. And it happens only on certain
nodes, that is: between nodes in different networks.

For the other nodes, NODE47 still works as the master, and a restart of
NODE24 fixes the issue. I still don't understand why the disconnect of
NODE24 from NODE47 happens in the first place. Everything happens within a
second or two, so any timeout seems unlikely.
