ELK nodes leave and rejoin the cluster every hour

Something is not right, and I would like some help figuring out what and why.

I have 3 ELK nodes (Elasticsearch 2.1.0; Logstash 2.1.1), and every hour the KOPF plugin shows them leaving the cluster and rejoining after a few seconds.

They don't leave at the same time; there is an interval between them.

Currently the uptimes for my nodes are:

elastic-node-01 : 13 minutes
elastic-node-02 : 28 minutes
elastic-node-03 : 15 minutes

The file /var/log/logstash/logstash.err shows the following:

Dez 15, 2015 12:02:02 AM org.apache.http.impl.execchain.RetryExec execute
INFORMAÇÕES: I/O exception (java.net.SocketException) caught when processing request to {}->http://127.0.0.1:9200: Broken pipe

Dez 15, 2015 12:02:02 AM org.apache.http.impl.execchain.RetryExec execute
INFORMAÇÕES: Retrying request to {}->http://127.0.0.1:9200

Dez 15, 2015 1:01:54 AM org.apache.http.impl.execchain.RetryExec execute
INFORMAÇÕES: I/O exception (java.net.SocketException) caught when processing request to {}->http://127.0.0.1:9200: Broken pipe

Dez 15, 2015 1:01:54 AM org.apache.http.impl.execchain.RetryExec execute
INFORMAÇÕES: Retrying request to {}->http://127.0.0.1:9200

and /var/log/elasticsearch/myclustername.log says:

2015-12-15 00:00:56,530[INFO ][discovery.zen ] [elastic-node-01] master_left [{elastic-node-03}{Va5P4dq8QkmbYPLjb2skCw}{10.20.30.163}{10.20.30.163:9300}], reason [shut_down]

2015-12-15 00:00:56,546[WARN ][discovery.zen ] [elastic-node-01] master left (reason = shut_down), current nodes: {{elastic-node-02}{nkZ50XnqTcOo4niZuchMyw}{10.20.30.162}{10.20.30.162:9300},{elastic-node-01}{xMOAjXAGRnucrXe8IDnwww}{10.20.30.161}{10.20.30.161:9300},}

2015-12-15 00:00:56,548[INFO ][cluster.service ] [elastic-node-01] removed {{elastic-node-03}{Va5P4dq8QkmbYPLjb2skCw}{10.20.30.163}{10.20.30.163:9300},}, reason: zen-disco-master_failed ({elastic-node-03}{Va5P4dq8QkmbYPLjb2skCw}{10.20.30.163}{10.20.30.163:9300})

2015-12-15 00:00:56,552[WARN ][discovery.zen.ping.unicast] [elastic-node-01] failed to send ping to [{elastic-node-03}{Va5P4dq8QkmbYPLjb2skCw}{10.20.30.163}{10.20.30.163:9300}]

RemoteTransportException[[elastic-node-03][10.20.30.163:9300][internal:discovery/zen/unicast]]; nested: IllegalStateException[received ping request while not started];

followed by a Java exception:

Caused by: java.lang.IllegalStateException: received ping request while not started
at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing.handlePingRequest(UnicastZenPing.java:497)
at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing.access$2400(UnicastZenPing.java:83)
at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing$UnicastPingRequestHandler.messageReceived(UnicastZenPing.java:522)
at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing$UnicastPingRequestHandler.messageReceived(UnicastZenPing.java:518)
at org.elasticsearch.transport.netty.MessageChannelHandler.handleRequest(MessageChannelHandler.java:244)
at org.elasticsearch.transport.netty.MessageChannelHandler.messageReceived(MessageChannelHandler.java:114)
at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296)
at org.jboss.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:462)
at org.jboss.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:443)
at org.jboss.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
at org.elasticsearch.common.netty.OpenChannelsHandler.handleUpstream(OpenChannelsHandler.java:75)
at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)

Any hints?

Found the problem, and it's not related to Logstash.

Now I can't find a way to delete this topic...

What was it? It may be helpful for others in the future!

It was a problem with Puppet.

Before the update, the scripts subdirectory did not exist in /etc/elasticsearch/.

I had configured Puppet not to delete the scripts directory, but it was still triggering a refresh of the elasticsearch service.

Adding the new directory to the Puppet class on the puppetmaster solved the problem.
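
For reference, a minimal sketch of what the fix amounts to, assuming a Puppet class that manages /etc/elasticsearch recursively with purge enabled and notifies the service on changes; the class name and resource layout here are hypothetical, not my actual manifest:

class profile::elasticsearch {
  # Config dir managed recursively: with purge + notify, any unmanaged
  # file or directory (such as the new scripts/ subdir introduced by the
  # update) is removed on every agent run and the service is refreshed
  # each time.
  file { '/etc/elasticsearch':
    ensure  => directory,
    recurse => true,
    purge   => true,
    notify  => Service['elasticsearch'],
  }

  # Declaring the new directory explicitly stops Puppet from treating it
  # as unmanaged content, so the scheduled agent runs no longer restart
  # Elasticsearch.
  file { '/etc/elasticsearch/scripts':
    ensure => directory,
  }

  service { 'elasticsearch':
    ensure => running,
    enable => true,
  }
}

With a setup like that, each Puppet agent run restarts Elasticsearch, which presumably explains the master_left ... reason [shut_down] messages and the "received ping request while not started" exception: the other nodes were pinging a node that was in the middle of restarting.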