Indexes deleted (empty) on cluster restart


(Michel Conrad) #1

Hi,

Sometimes, when doing a full restart of my cluster, some indices
turn up empty (0 docs and green). All data in these indices seems to
be lost!

My configuration is the following:
version 0.16.4 (snapshot); this also occurred in a snapshot build of 0.16.3
40 nodes, ~100 indices.
I use 1 shard per index and 1 replica.
I am using the local gateway with unicast discovery.
I am using rivers to index new data into the database, and they are
automatically started while the recovery process is still running.
Indices are also created on the fly.

When I start my cluster, the nodes get added one by one. Once all
40 nodes are present, recovery starts. All the indices turn yellow,
and then slowly green.
The problem is that some indices quickly turn green having lost all
their data. Grepping for the affected indices in the log files does
not reveal anything.

By looking at the data directory, I found that some servers still
contain data in /elasticsearch/search/nodes/0/indices/index001858.
The size of this directory per server is:
Server A) 4.0KB
Server B) 1.4GB
Server C) 36KB
Server D) 564MB
Server E) 4.0KB
Server F) 946MB
Server G) 28KB

The cluster state has the index allocated on servers C and G, holding 36KB
and 28KB respectively, and not, for instance, on server B holding 1.4GB.
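For reference, the per-server numbers above are just directory sizes; a listing like the following (path as in my layout) is enough to compare the copies on each server:

```shell
# Compare the on-disk size of one index's data across servers
# (run on each server; path as shown above)
du -sh /elasticsearch/search/nodes/0/indices/index001858
```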

I was wondering why the data available on disk is not being
recovered, and why an empty index is being recovered instead.

On startup I am also getting some warning messages about multicast,
although I am using unicast discovery:

discovery:
  zen.ping:
    multicast:
      enabled: false
    unicast:
      hosts: 192.168.5.1[9300], 192.168.5.2[9300], 192.168.5.3[9300], 192.168.5.4[9300]
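The same settings can also be written as flat dotted keys (equivalent, and less fragile when mail reformatting strips indentation; note that YAML requires a space after the colon in enabled: false):

```yaml
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: 192.168.5.1[9300], 192.168.5.2[9300], 192.168.5.3[9300], 192.168.5.4[9300]
```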

The warning messages I am getting:

[2011-07-14 17:16:54,481][WARN ][transport.netty ] [Wagner,
Kurt] Exception caught on netty layer [[id: 0x1ebe99f8]]
java.nio.channels.UnresolvedAddressException
at sun.nio.ch.Net.checkAddress(Net.java:30)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:480)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink.connect(NioClientSocketPipelineSink.java:140)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink.eventSunk(NioClientSocketPipelineSink.java:103)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:555)
at org.elasticsearch.common.netty.channel.Channels.connect(Channels.java:541)
at org.elasticsearch.common.netty.channel.AbstractChannel.connect(AbstractChannel.java:218)
at org.elasticsearch.common.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:227)
at org.elasticsearch.common.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:188)
at org.elasticsearch.transport.netty.NettyTransport.connectToChannels(NettyTransport.java:504)
at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:475)
at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:126)
at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing$3.run(UnicastZenPing.java:198)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
[2011-07-14 17:16:54,482][WARN ][transport.netty ] [Wagner,
Kurt] Exception caught on netty layer [[id: 0x736e788c]]
java.nio.channels.UnresolvedAddressException
at sun.nio.ch.Net.checkAddress(Net.java:30)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:480)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink.connect(NioClientSocketPipelineSink.java:140)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink.eventSunk(NioClientSocketPipelineSink.java:103)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:555)
at org.elasticsearch.common.netty.channel.Channels.connect(Channels.java:541)
at org.elasticsearch.common.netty.channel.AbstractChannel.connect(AbstractChannel.java:218)
at org.elasticsearch.common.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:227)
at org.elasticsearch.common.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:188)
at org.elasticsearch.transport.netty.NettyTransport.connectToChannels(NettyTransport.java:507)
at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:475)
at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:126)
at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing$3.run(UnicastZenPing.java:198)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
[2011-07-14 17:16:54,625][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:54,676][INFO ][cluster.service ] [Wagner,
Kurt] detected_master [Cage,
Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]], added {[Kine,
Benedict][0gST1OnzQbSrk5VMoUVmTw][inet[/192.168.6.9:9300]],[Matador][K2cbli_9QRyUO4fRwr21Xg][inet[/192.168.5.2:9300]],[Death-Stalker][6X5mT0NAQ2aoHellP-5U0A][inet[/192.168.5.8:9300]],[Briquette][GaKPrDx-RaqBudM2ZOuj1Q][inet[/192.168.5.10:9300]],[Grenade][wDpt3uSVReuPePjgiMdMXg][inet[/192.168.5.4:9300]],[Cage,
Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]],[Red
Guardian][HhnTAOKbRdatVMGBrBEKTw][inet[/192.168.6.19:9300]],[Dragonwing][UZxd5EIuRMCy1wT6sExdfQ][inet[/192.168.6.20:9300]],[Hobgoblin][GpfT4JseQpWvrTYS62KsEw][inet[/192.168.5.12:9300]],[Tom
Cassidy][LscBtoZVQ9m5PF8yM5c1Fw][inet[/192.168.5.14:9300]],[Bob][nmw1mj92SKCLIxJ3ZxQ4ww][inet[/192.168.6.18:9300]],[Sligguth][UoqEO4uAR_ORTnIbx-508Q][inet[/192.168.5.15:9300]],[Invisible
Woman][kMRXVw57QgKA_UkE5_OfjQ][inet[/192.168.5.7:9300]],[Nathaniel
Richards][fmd7wh52Sc6FhB9zTxCj3Q][inet[/192.168.6.1:9300]],[Xavin][_oDz7oXhT8SiwTlZkTqZBg][inet[/192.168.5.1:9300]],},
reason: zen-disco-receive(from master [[Cage,
Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]]])
[2011-07-14 17:16:55,061][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:55,365][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:55,635][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:55,943][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:55,953][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:55,961][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:56,176][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:56,537][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:56,798][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:56,888][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:56,891][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:57,000][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:57,048][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:57,246][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:57,630][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:57,797][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:57,801][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:57,991][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:58,263][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:58,372][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:58,768][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:58,789][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:58,983][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:59,186][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:59,824][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:59,839][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:17:00,059][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:17:00,253][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:17:00,522][INFO ][cluster.service ] [Wagner,
Kurt] added {[Mangle][DZCqZ0yIR8mb7_HC1Kbouw][inet[/192.168.6.11:9300]],},
reason: zen-disco-receive(from master [[Cage,
Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]]])
[2011-07-14 17:17:00,524][INFO ][cluster.service ] [Wagner,
Kurt] added {[Kala][cV775zgaRke6KCz1HZe-fA][inet[/192.168.6.14:9300]],},
reason: zen-disco-receive(from master [[Cage,
Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]]])
[2011-07-14 17:17:00,525][INFO ][cluster.service ] [Wagner,
Kurt] added {[Tiboldt,
Maynard][orYVQrKyQdiDK-FdB0TaFw][inet[/192.168.6.10:9300]],}, reason:
zen-disco-receive(from master [[Cage,
Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]]])

I also observe that on each cluster restart, Elasticsearch moves all
replicas to different nodes, causing a lot of traffic. Is this
normal behavior, and could it cause the error above?
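One related knob, assuming the local-gateway settings of this version: recovery can be told to wait until enough nodes have joined, so shards are not allocated from whichever copies happen to come up first. A sketch with illustrative values, not my actual config:

```yaml
gateway:
  recover_after_nodes: 38    # don't begin state recovery before 38 nodes have joined
  recover_after_time: 5m     # once that is met, still wait up to 5 minutes...
  expected_nodes: 40         # ...unless all 40 expected nodes are already present
```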

Best,
Michel


(Shay Banon) #2

Heya,

First, let's try and get discovery working, since disabling multicast should not produce those failures. Can you gist your config (it gets mangled in the mail)?

I just ran a quick 40-node test with 200 indices and it seems to always recover the data... How do you shut down the cluster?

-shay.banon

On Thursday, July 14, 2011 at 7:16 PM, Michel Conrad wrote:



(Michel Conrad) #3

Hi Shay,

I gisted the configuration file: https://gist.github.com/1084292

The reason I am using unicast is that I want elasticsearch to listen
on two interfaces, binding inter-node communication to the local
interface only; the other interface is for the webserver.

To shut down the cluster I am using "curl -XPOST
'http://localhost:9200/_shutdown'", followed by a script which calls
"pgrep -f /software/elasticsearch/lib/elasticsearch | xargs kill -9"
on every server.
At first I was not using the shutdown API and was killing the
nodes sequentially with kill -9; maybe it's here that things got
messed up. Could it be that my indices got lost and the cluster
state became inconsistent when the cluster tried to relocate these
indices while the nodes were being killed? I am also using multiple
rivers that create new indices on the fly; they start during
cluster startup and are still running during cluster shutdown.
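kill -9 gives a node no chance to finish what it is doing; a gentler per-server sequence (a sketch; graceful_kill is a hypothetical helper, not something elasticsearch ships) would send a plain TERM first and only escalate if the process lingers:

```shell
#!/bin/sh
# Sketch: polite shutdown of a process, SIGKILL only as a last resort.
graceful_kill() {
  pid=$1
  kill "$pid" 2>/dev/null || return 0     # TERM; already gone -> done
  for _ in 1 2 3 4 5 6 7 8 9 10; do      # wait up to ~10s for a clean exit
    kill -0 "$pid" 2>/dev/null || return 0
    sleep 1
  done
  kill -9 "$pid" 2>/dev/null              # still alive: force it
}

# Usage against the elasticsearch processes found via pgrep:
# pgrep -f /software/elasticsearch/lib/elasticsearch | while read pid; do
#   graceful_kill "$pid"
# done
```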

Best,
Michel

On Fri, Jul 15, 2011 at 12:49 AM, Shay Banon
shay.banon@elasticsearch.com wrote:

Heya,
First, lest try and get the discovery working, since disabling multicast
should not give those failures. Can you gist your config (it gets messed up
in the mail)?
Just ran a quick 40 nodes test with 200 indices and it seems to always
recover the data... . How do you shutdown the cluster?
-shay.banon

On Thursday, July 14, 2011 at 7:16 PM, Michel Conrad wrote:

Hi,

Sometimes, when doing a full restart of my cluster, some indices
turned up empty (0 docs and green). All data in these indices seems to
be lost!

My configuration is the following:
version 0.16.4 (snapshot), also occured in a snapshoted version of 0.16.3
40 nodes, +-100 indices.
I use 1 shard / index and 1 replica.
I am using the local gateway with unicast discovery.
I am using rivers to index new data into the database, and they are
automatically already being started during the recovery process.
Indices are also created on the fly.

When I start my cluster, the nodes get added to the cluster. When the
40 nodes are
reached, the recovery is starting. All the indices turn to yellow, and
then slowly to green.
The problem is, some indices quickly turned to green and lost all
their data. Greping
for the concerned indices over the log file does not reveal anything.

By looking at the data directory, I found some servers containing
still data in /elasticsearch/search/nodes/0/indices/index001858.
The size of this directory is the following:
Server A) 4.0KB
Server B) 1.4GB
Server C) 36KB
Server D) 564MB
Server E) 4.0K
Server F) 946MB
Server G) 28KB

The cluster state has the index allocated on server C + G, having 36KB
and 28KB, and not f.i. on server B holding 1.4GB.

I was wondering why the data available on the hdd is not being
recovered, and why an empty index is being recovered.

On startup I am also getting some warning messages about multicast,
altough I am using unicast discovery:
discovery:
zen.ping:
multicast:
enabled:false
unicast:
hosts: 192.168.5.1[9300], 192.168.5.2[9300], 192.168.5.3[9300],
192.168.5.4[9300]

the warning messages I am getting:

[2011-07-14 17:16:54,481][WARN ][transport.netty ] [Wagner,
Kurt] Exception caught on netty layer [[id: 0x1ebe99f8]]
java.nio.channels.UnresolvedAddressException
at sun.nio.ch.Net.checkAddress(Net.java:30)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:480)
at
org.elasticsearch.common.netty.channel.socket.nio.NioClientSockettPipelineSink.connect(NioClientSocketPipelineSink.java:140)
at
org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink.eventSunk(NioClientSocketPipelineSink.java:103)
at
org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:555)
at
org.elasticsearch.common.netty.channel.Channels.connect(Channels.java:541)
at
org.elasticsearch.common.netty.channel.AbstractChannel.connect(AbstractChannel.java:218)
at
org.elasticsearch.common.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:227)
at
org.elasticsearch.common.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:188)
at
org.elasticsearch.transport.netty.NettyTransport.connectToChanneels(NettyTransport.java:504)
at
org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:475)
at
org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:126)
at
org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing$3.run(UnicastZenPing.java:198)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
[2011-07-14 17:16:54,482][WARN ][transport.netty ] [Wagner,
Kurt] Exception caught on netty layer [[id: 0x736e788c]]
java.nio.channels.UnresolvedAddressException
at sun.nio.ch.Net.checkAddress(Net.java:30)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:480)
at
org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink.connect(NioClientSocketPipelineSink.java:140)
at
org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink.eventSunk(NioClientSocketPipelineSink.java:103)
at
org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:555)
at
org.elasticsearch.common.netty.channel.Channels.connect(Channels.java:541)
at
org.elasticsearch.common.netty.channel.AbstractChannel.connect(AbstractChannel.java:218)
at
org.elasticsearch.common.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:227)
at
org.elasticsearch.common.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:188)
at
org.elasticsearch.transport.netty.NettyTransport.connectToChannels(NettyTransport.java:507)
at
org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:475)
at
org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:126)
at
org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing$$3.run(UnicastZenPing.java:198)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
[2011-07-14 17:16:54,625][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:54,676][INFO ][cluster.service ] [Wagner,
Kurt] detected_master [Cage,
Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]], added {[Kine,
Benedict][0gST1OnzQbSrk5VMoUVmTw][inet[/192.168.6.9:9300]],[Matador][K2cbli_9QRyUO4fRwr21Xg][inet[/192.168.5.2:9300]],[Death-Stalker][6X5mT0NAQ2aoHellP-5U0A][inet[/192.168.5.8:9300]],[Briquette][GaKPrDx-RaqBudM2ZOuj1Q][inet[/192.168.5.10:9300]],[Grenade][wDpt3uSVReuPePjgiMdMXg][inet[/192.168.5.4:9300]],[Cage,
Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]],[Red
Guardian][HhnTAOKbRdatVMGBrBEKTw][inet[/192.168.6.19:9300]],[Dragonwing][UZxd5EIuRMCy1wT6sExdfQ][inet[/192.168.6.20:9300]],[Hobgoblin][GpfT4JseQpWvrTYS62KsEw][inet[/192.168.5.12:9300]],[Tom
Cassidy][LscBtoZVQ9m5PF8yM5c1Fw][inet[/192.168.5.14:9300]],[Bob][nmw1mj92SKCLIxJ3ZxQ4ww][inet[/192.168.6.18:9300]],[Sligguth][UoqEO4uAR_ORTnIbx-508Q][inet[/192.168.5.15:9300]],[Invisible
Woman][kMRXVw57QgKA_UkE5_OfjQ][inet[/192.168.5.7:9300]],[Nathaniel
Richards][fmd7wh52Sc6FhB9zTxCj3Q][inet[/192.168.6.1:9300]],[Xavin][_oDz7oXhT8SiwTlZkTqZBg][inet[/192.168.5.1:9300]],},
reason: zen-disco-receive(from master [[Cage,
Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]]])
[2011-07-14 17:16:55,061][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:55,365][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:55,635][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:55,943][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:55,953][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:55,961][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:56,176][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:56,537][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:56,798][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:56,888][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:56,891][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:57,000][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:57,048][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:57,246][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:57,630][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:57,797][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:57,801][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:57,991][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:58,263][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:58,372][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:58,768][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:58,789][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:58,983][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:59,186][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:59,824][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:59,839][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:17:00,059][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:17:00,253][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:17:00,522][INFO ][cluster.service ] [Wagner,
Kurt] added {[Mangle][DZCqZ0yIR8mb7_HC1Kbouw][inet[/192.168.6.11:9300]],},
reason: zen-disco-receive(from master [[Cage,
Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]]])
[2011-07-14 17:17:00,524][INFO ][cluster.service ] [Wagner,
Kurt] added {[Kala][cV775zgaRke6KCz1HZe-fA][inet[/192.168.6.14:9300]],},
reason: zen-disco-receive(from master [[Cage,
Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]]])
[2011-07-14 17:17:00,525][INFO ][cluster.service ] [Wagner,
Kurt] added {[Tiboldt,
Maynard][orYVQrKyQdiDK-FdB0TaFw][inet[/192.168.6.10:9300]],}, reason:
zen-disco-receive(from master [[Cage,
Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]]])

I also observe that on each cluster restart, Elasticsearch moves all
replicas to different nodes, causing a lot of traffic. Is this normal
behavior, and could it cause the error above?

Best,
Michel


(Michel Conrad) #4

Hi,

I think I may have found the source of the problem, although it
doesn't explain why I am getting these multicast errors.

I have been running elasticsearch on the different servers in a loop,
so that if a node got killed, it would be immediately restarted.
When I shut the cluster down via "http://localhost:9200/_shutdown",
the nodes would exit and immediately be relaunched. So the cluster
was in a state where it was shutting down and starting up at the same
time. During this unwanted startup, I killed the individual nodes
using kill -9, exiting the loop.
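One way to avoid that shutdown/restart race would be to have the supervising loop check a stop marker before relaunching, and create the marker just before calling the _shutdown API. This is only a hypothetical sketch (the marker path, `ES_CMD` stand-in, and file name are all made up for illustration, not taken from the setup above):

```shell
#!/bin/sh
# Hypothetical supervisor loop: relaunch the node whenever it dies,
# but stop relaunching once a stop marker exists.
STOP_FILE=/tmp/es.stop      # made-up marker path
ES_CMD="sleep 1"            # stand-in for the real foreground elasticsearch command

# Before a full-cluster shutdown, this marker would be created on every
# server (simulated here), and only then the _shutdown API called.
touch "$STOP_FILE"

while [ ! -f "$STOP_FILE" ]; do
    $ES_CMD                 # blocks until the node process exits
done
echo "supervisor: stop marker present, not relaunching"
```

With the marker in place the loop exits instead of relaunching a node that is in the middle of a cluster-wide shutdown.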

Unfortunately I have been losing further indices, even when doing a
correct cluster restart.
I also pasted some further log information below, showing some nodes
leaving and immediately rejoining the cluster.

Best,
Michel

[2011-07-15 12:40:35,910][INFO ][cluster.service ] [Blade] added {[Ryder][eBF3XoX4SuqabvbzvleOrg][inet[/192.168.6.5:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:40:35,912][INFO ][cluster.service ] [Blade] added {[Phimster, Ellie][9GdlKjo6R6W0m_UPiZpgpQ][inet[/192.168.6.3:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:40:37,224][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:37,225][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:37,577][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:38,676][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:40,246][INFO ][cluster.service ] [Blade] added {[Siryn][LKwg8eL7SyuvXhgqu70Vxg][inet[/192.168.5.6:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:40:44,294][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:44,295][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:51,338][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:51,344][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:54,317][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:54,317][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:46:04,810][INFO ][cluster.service ] [Blade] removed {[Mephisto][eUWVtEr1QnCkEz1bU7O6lA][inet[/192.168.6.2:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:46:07,920][INFO ][cluster.service ] [Blade] added {[Mephisto][eUWVtEr1QnCkEz1bU7O6lA][inet[/192.168.6.2:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:47:35,375][INFO ][cluster.service ] [Blade] removed {[Advisor][J4FU6zSGQmaBrbR0J-Rzxg][inet[/192.168.5.11:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:47:35,869][INFO ][cluster.service ] [Blade] added {[Advisor][J4FU6zSGQmaBrbR0J-Rzxg][inet[/192.168.5.11:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:54:03,654][INFO ][cluster.service ] [Blade] removed {[Silver Fox][gjFx44gMQfqOIdh99hcdAw][inet[/192.168.5.15:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:54:04,893][INFO ][cluster.service ] [Blade] added {[Silver Fox][gjFx44gMQfqOIdh99hcdAw][inet[/192.168.5.15:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:54:41,639][INFO ][cluster.service ] [Blade] removed {[Morbius][dQqqzmK_RR2TQIC03wTd5w][inet[/192.168.5.19:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:54:43,805][INFO ][cluster.service ] [Blade] added {[Morbius][dQqqzmK_RR2TQIC03wTd5w][inet[/192.168.5.19:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:56:24,793][INFO ][cluster.service ] [Blade] removed {[Silver Fox][gjFx44gMQfqOIdh99hcdAw][inet[/192.168.5.15:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:56:28,903][INFO ][cluster.service ] [Blade] added {[Silver Fox][gjFx44gMQfqOIdh99hcdAw][inet[/192.168.5.15:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 13:00:26,160][INFO ][cluster.service ] [Blade] removed {[Blind Faith][HP9hnGZhR72XnpEJbHNBxw][inet[/192.168.6.16:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 13:00:32,395][INFO ][cluster.service ] [Blade] added {[Blind Faith][HP9hnGZhR72XnpEJbHNBxw][inet[/192.168.6.16:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 13:16:26,690][WARN ][index.engine.robin ] [Blade] [dsearch_de_00a858c20000][0] failed to flush after setting shard to inactive
org.elasticsearch.index.engine.EngineClosedException:
[dsearch_de_00a858c20000][0] CurrentState[CLOSED]
at org.elasticsearch.index.engine.robin.RobinEngine.flush(RobinEngine.java:657)
at org.elasticsearch.index.engine.robin.RobinEngine.updateIndexingBufferSize(RobinEngine.java:207)
at org.elasticsearch.indices.memory.IndexingMemoryBufferController$ShardsIndicesStatusChecker.run(IndexingMemoryBufferController.java:147)
at org.elasticsearch.threadpool.ThreadPool$LoggingRunnable.run(ThreadPool.java:201)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
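As a side note, the shard-to-node mapping that was reconstructed above from the data directories can also be read from the cluster APIs that already exist in 0.16. A minimal sketch, assuming a node reachable on localhost:9200 (the `|| true` just keeps the sketch runnable when no cluster is up):

```shell
#!/bin/sh
# Cluster-wide recovery status: green/yellow/red, plus counts of
# initializing and relocating shards.
curl -s 'http://localhost:9200/_cluster/health?pretty=true' || true

# Full cluster state, including the routing table: shows which node
# each shard (e.g. of index001858) is currently allocated to.
curl -s 'http://localhost:9200/_cluster/state?pretty=true' || true
```

Comparing the routing table against the on-disk directory sizes before a restart would show whether the master picked the copies with data or the near-empty ones.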

On Fri, Jul 15, 2011 at 10:46 AM, Michel Conrad
michel.conrad@trendiction.com wrote:

Hi Shay,

I gisted the configuration file: https://gist.github.com/1084292

The reason I am using unicast is that I want elasticsearch to
listen on two interfaces and to bind the communication between
the nodes to the local interface only; the other interface is for the webserver.

To shut down the cluster I am using "curl -XPOST
'http://localhost:9200/_shutdown'", followed by a script which calls
"pgrep -f /software/elasticsearch/lib/elasticsearch | xargs kill -9"
on every server.
At first I was not using the shutdown API and was killing the
nodes sequentially by calling kill -9; maybe it's here that things got
messed up. Could it be that my indices got lost and the cluster
state became inconsistent when the cluster tried to relocate these
indices while the nodes were being killed? I am also using multiple
rivers creating new indices on the fly, which start during
cluster startup and are still running during cluster shutdown.

Best,
Michel

On Fri, Jul 15, 2011 at 12:49 AM, Shay Banon
shay.banon@elasticsearch.com wrote:

Heya,
First, let's try to get discovery working, since disabling multicast
should not give those failures. Can you gist your config (it gets messed up
in the mail)?
I just ran a quick 40-node test with 200 indices and it seems to always
recover the data... How do you shut down the cluster?
-shay.banon



(Shay Banon) #5

Your configuration looks good. Can you maybe start a simple single server, set discovery to TRACE in the logging.yml file, and gist the output? Multicast should be disabled.

I got a bit confused by the last mail. Are things working fine now if you don't have the nodes immediately restarted and then killed after shutdown?
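For reference, the TRACE setting would go into logging.yml roughly like this (a sketch based on the stock logging.yml layout of that era; verify against your own file):

```yaml
logger:
  # verbose discovery logging, as suggested above
  discovery: TRACE
```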

On Friday, July 15, 2011 at 4:00 PM, Michel Conrad wrote:

Hi,

I think I possibly found the source of the problem, although this
doesn't explain why I am getting these multicast errors.

I have been running elasticsearch on the different servers in a loop,
in a way that if it would get killed, it would be immediately
restarted.
By shutting down the cluster using 'http://localhost:9200/_shutdown"
the nodes would exit, and be immediately relaunched.
So the cluster was in a state where at the same time, it was shutting
down and starting up. During this unwanted startup, I
have been killing the individual nodes using kill -9, exiting the loop.

Unfortunately I have been loosing further indices, even doing a
correct cluster restart.
I also pasted some further log information with some nodes leaving and
immediately rejoining the cluster.

Best,
Michel

[2011-07-15 12:40:35,910][ESC[32mINFO ESC[0m][cluster.service
] [Blade] added
{[Ryder][eBF3XoX4SuqabvbzvleOrg][inet[/192.168.6.5:9300]],}, reason:
zen-disco-receive(from master [[Jumbo
Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/19
2.168.5.1:9300]]])
[2011-07-15 12:40:35,912][ESC[32mINFO ESC[0m][cluster.service
] [Blade] added {[Phimster,
Ellie][9GdlKjo6R6W0m_UPiZpgpQ][inet[/192.168.6.3:9300]],}, reason:
zen-disco-receive(from master [[Jumbo
Carnation][5axELonGSWSw3h9Jt8g_0g
][inet[/192.168.5.1:9300]]])
[2011-07-15 12:40:37,224][ESC[33mWARN
ESC[0m][discovery.zen.ping.multicast] [Blade] received ping response
with no matching id [1]
[2011-07-15 12:40:37,225][ESC[33mWARN
ESC[0m][discovery.zen.ping.multicast] [Blade] received ping response
with no matching id [1]
[2011-07-15 12:40:37,577][ESC[33mWARN
ESC[0m][discovery.zen.ping.multicast] [Blade] received ping response
with no matching id [1]
[2011-07-15 12:40:38,676][ESC[33mWARN
ESC[0m][discovery.zen.ping.multicast] [Blade] received ping response
with no matching id [1]
[2011-07-15 12:40:40,246][ESC[32mINFO ESC[0m][cluster.service
] [Blade] added
{[Siryn][LKwg8eL7SyuvXhgqu70Vxg][inet[/192.168.5.6:9300]],}, reason:
zen-disco-receive(from master [[Jumbo
Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/19
2.168.5.1:9300]]])
[2011-07-15 12:40:44,294][ESC[33mWARN
ESC[0m][discovery.zen.ping.multicast] [Blade] received ping response
with no matching id [1]
[2011-07-15 12:40:44,295][ESC[33mWARN
ESC[0m][discovery.zen.ping.multicast] [Blade] received ping response
with no matching id [1]
[2011-07-15 12:40:51,338][ESC[33mWARN
ESC[0m][discovery.zen.ping.multicast] [Blade] received ping response
with no matching id [1]
[2011-07-15 12:40:51,344][ESC[33mWARN
ESC[0m][discovery.zen.ping.multicast] [Blade] received ping response
with no matching id [1]
[2011-07-15 12:40:54,317][ESC[33mWARN
ESC[0m][discovery.zen.ping.multicast] [Blade] received ping response
with no matching id [1]
[2011-07-15 12:40:54,317][ESC[33mWARN
ESC[0m][discovery.zen.ping.multicast] [Blade] received ping response
with no matching id [1]
[2011-07-15 12:46:04,810][ESC[32mINFO ESC[0m][cluster.service
] [Blade] removed
{[Mephisto][eUWVtEr1QnCkEz1bU7O6lA][inet[/192.168.6.2:9300]],},
reason: zen-disco-receive(from master [[Jumbo
Carnation][5axELonGSWSw3h9Jt8g_0g][ine
t[/192.168.5.1:9300]]])
[2011-07-15 12:46:07,920][ESC[32mINFO ESC[0m][cluster.service
] [Blade] added
{[Mephisto][eUWVtEr1QnCkEz1bU7O6lA][inet[/192.168.6.2:9300]],},
reason: zen-disco-receive(from master [[Jumbo
Carnation][5axELonGSWSw3h9Jt8g_0g][inet[
/192.168.5.1:9300]]])
[2011-07-15 12:47:35,375][ESC[32mINFO ESC[0m][cluster.service
] [Blade] removed
{[Advisor][J4FU6zSGQmaBrbR0J-Rzxg][inet[/192.168.5.11:9300]],},
reason: zen-disco-receive(from master [[Jumbo
Carnation][5axELonGSWSw3h9Jt8g_0g][ine
t[/192.168.5.1:9300]]])
[2011-07-15 12:47:35,869][ESC[32mINFO ESC[0m][cluster.service
] [Blade] added
{[Advisor][J4FU6zSGQmaBrbR0J-Rzxg][inet[/192.168.5.11:9300]],},
reason: zen-disco-receive(from master [[Jumbo
Carnation][5axELonGSWSw3h9Jt8g_0g][inet[
/192.168.5.1:9300]]])
[2011-07-15 12:54:03,654][ESC[32mINFO ESC[0m][cluster.service
] [Blade] removed {[Silver
Fox][gjFx44gMQfqOIdh99hcdAw][inet[/192.168.5.15:9300]],}, reason:
zen-disco-receive(from master [[Jumbo
Carnation][5axELonGSWSw3h9Jt8g_0g][
inet[/192.168.5.1:9300]]])
[2011-07-15 12:54:04,893][ESC[32mINFO ESC[0m][cluster.service
] [Blade] added {[Silver
Fox][gjFx44gMQfqOIdh99hcdAw][inet[/192.168.5.15:9300]],}, reason:
zen-disco-receive(from master [[Jumbo
Carnation][5axELonGSWSw3h9Jt8g_0g][in
et[/192.168.5.1:9300]]])
[2011-07-15 12:54:41,639][ESC[32mINFO ESC[0m][cluster.service
] [Blade] removed
{[Morbius][dQqqzmK_RR2TQIC03wTd5w][inet[/192.168.5.19:9300]],},
reason: zen-disco-receive(from master [[Jumbo
Carnation][5axELonGSWSw3h9Jt8g_0g][ine
t[/192.168.5.1:9300]]])
[2011-07-15 12:54:43,805][ESC[32mINFO ESC[0m][cluster.service
] [Blade] added
{[Morbius][dQqqzmK_RR2TQIC03wTd5w][inet[/192.168.5.19:9300]],},
reason: zen-disco-receive(from master [[Jumbo
Carnation][5axELonGSWSw3h9Jt8g_0g][inet[
/192.168.5.1:9300]]])
[2011-07-15 12:56:24,793][ESC[32mINFO ESC[0m][cluster.service
] [Blade] removed {[Silver
Fox][gjFx44gMQfqOIdh99hcdAw][inet[/192.168.5.15:9300]],}, reason:
zen-disco-receive(from master [[Jumbo
Carnation][5axELonGSWSw3h9Jt8g_0g][
inet[/192.168.5.1:9300]]])
[2011-07-15 12:56:28,903][ESC[32mINFO ESC[0m][cluster.service
] [Blade] added {[Silver
Fox][gjFx44gMQfqOIdh99hcdAw][inet[/192.168.5.15:9300]],}, reason:
zen-disco-receive(from master [[Jumbo
Carnation][5axELonGSWSw3h9Jt8g_0g][in
et[/192.168.5.1:9300]]])
[2011-07-15 13:00:26,160][ESC[32mINFO ESC[0m][cluster.service
] [Blade] removed {[Blind
Faith][HP9hnGZhR72XnpEJbHNBxw][inet[/192.168.6.16:9300]],}, reason:
zen-disco-receive(from master [[Jumbo
Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 13:00:32,395][ESC[32mINFO ESC[0m][cluster.service
] [Blade] added {[Blind
Faith][HP9hnGZhR72XnpEJbHNBxw][inet[/192.168.6.16:9300]],}, reason:
zen-disco-receive(from master [[Jumbo
Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 13:16:26,690][ESC[33mWARN ESC[0m][index.engine.robin
] [Blade] [dsearch_de_00a858c20000][0] failed to flush after setting
shard to inactive
org.elasticsearch.index.engine.EngineClosedException:
[dsearch_de_00a858c20000][0] CurrentState[CLOSED]
at org.elasticsearch.index.engine.robin.RobinEngine.flush(RobinEngine.java:657)
at org.elasticsearch.index.engine.robin.RobinEngine.updateIndexingBufferSize(RobinEngine.java:207)
at org.elasticsearch.indices.memory.IndexingMemoryBufferController$ShardsIndicesStatusChecker.run(IndexingMemoryBufferController.java:147)
at org.elasticsearch.threadpool.ThreadPool$LoggingRunnable.run(ThreadPool.java:201)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

On Fri, Jul 15, 2011 at 10:46 AM, Michel Conrad
<michel.conrad@trendiction.com (mailto:michel.conrad@trendiction.com)> wrote:

Hi Shay,

I gisted the configuration file: https://gist.github.com/1084292

The reason I am using unicast, is because I want to have elasticsearch
listen on two interfaces, and only bind the communication between
the nodes to the local interface, the other interface is for the webserver.

To shutdown the cluster I am using "curl -XPOST
'http://localhost:9200/_shutdown", followed by a script which calls
"pgrep -f /software/elasticsearch/lib/elasticsearch | xargs kill -9"
on every server.
At first I have not been using the shutdown API and was killing the
nodes sequentially by calling kill -9, maybe its here that things got
messed up. Could it be that my indices got lost and that the cluster
state got inconsistent, when the cluster tries to relocate these
indices, during the killing of the nodes? I am also using multiple
rivers creating new indices on the fly, which are starting during
cluster startup and still running during cluster shutdown.

Best,
Michel

On Fri, Jul 15, 2011 at 12:49 AM, Shay Banon
<shay.banon@elasticsearch.com (mailto:shay.banon@elasticsearch.com)> wrote:

Heya,
First, lest try and get the discovery working, since disabling multicast
should not give those failures. Can you gist your config (it gets messed up
in the mail)?
Just ran a quick 40 nodes test with 200 indices and it seems to always
recover the data... . How do you shutdown the cluster?
-shay.banon

On Thursday, July 14, 2011 at 7:16 PM, Michel Conrad wrote:

Hi,

Sometimes, when doing a full restart of my cluster, some indices
turned up empty (0 docs and green). All data in these indices seems to
be lost!

My configuration is the following:
version 0.16.4 (snapshot), also occured in a snapshoted version of 0.16.3
40 nodes, +-100 indices.
I use 1 shard / index and 1 replica.
I am using the local gateway with unicast discovery.
I am using rivers to index new data into the database, and they are
automatically already being started during the recovery process.
Indices are also created on the fly.

When I start my cluster, the nodes get added to the cluster. When the
40 nodes are
reached, the recovery is starting. All the indices turn to yellow, and
then slowly to green.
The problem is, some indices quickly turned to green and lost all
their data. Greping
for the concerned indices over the log file does not reveal anything.

By looking at the data directory, I found some servers containing
still data in /elasticsearch/search/nodes/0/indices/index001858.
The size of this directory is the following:
Server A) 4.0KB
Server B) 1.4GB
Server C) 36KB
Server D) 564MB
Server E) 4.0K
Server F) 946MB
Server G) 28KB

The cluster state has the index allocated on server C + G, having 36KB
and 28KB, and not f.i. on server B holding 1.4GB.

I was wondering why the data available on the hard disk is not being
recovered, and why an empty index is being recovered instead.

On startup I am also getting some warning messages about multicast,
although I am using unicast discovery:

discovery:
  zen.ping:
    multicast:
      enabled: false
    unicast:
      hosts: 192.168.5.1[9300], 192.168.5.2[9300], 192.168.5.3[9300], 192.168.5.4[9300]

The warning messages I am getting:

[2011-07-14 17:16:54,481][WARN ][transport.netty] [Wagner, Kurt] Exception caught on netty layer [[id: 0x1ebe99f8]]
java.nio.channels.UnresolvedAddressException
at sun.nio.ch.Net.checkAddress(Net.java:30)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:480)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink.connect(NioClientSocketPipelineSink.java:140)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink.eventSunk(NioClientSocketPipelineSink.java:103)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:555)
at org.elasticsearch.common.netty.channel.Channels.connect(Channels.java:541)
at org.elasticsearch.common.netty.channel.AbstractChannel.connect(AbstractChannel.java:218)
at org.elasticsearch.common.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:227)
at org.elasticsearch.common.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:188)
at org.elasticsearch.transport.netty.NettyTransport.connectToChannels(NettyTransport.java:504)
at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:475)
at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:126)
at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing$3.run(UnicastZenPing.java:198)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
[2011-07-14 17:16:54,482][WARN ][transport.netty] [Wagner, Kurt] Exception caught on netty layer [[id: 0x736e788c]]
java.nio.channels.UnresolvedAddressException
at sun.nio.ch.Net.checkAddress(Net.java:30)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:480)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink.connect(NioClientSocketPipelineSink.java:140)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink.eventSunk(NioClientSocketPipelineSink.java:103)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:555)
at org.elasticsearch.common.netty.channel.Channels.connect(Channels.java:541)
at org.elasticsearch.common.netty.channel.AbstractChannel.connect(AbstractChannel.java:218)
at org.elasticsearch.common.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:227)
at org.elasticsearch.common.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:188)
at org.elasticsearch.transport.netty.NettyTransport.connectToChannels(NettyTransport.java:507)
at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:475)
at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:126)
at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing$3.run(UnicastZenPing.java:198)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
[2011-07-14 17:16:54,625][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:54,676][INFO ][cluster.service ] [Wagner,
Kurt] detected_master [Cage,
Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]], added {[Kine,
Benedict][0gST1OnzQbSrk5VMoUVmTw][inet[/192.168.6.9:9300]],[Matador][K2cbli_9QRyUO4fRwr21Xg][inet[/192.168.5.2:9300]],[Death-Stalker][6X5mT0NAQ2aoHellP-5U0A][inet[/192.168.5.8:9300]],[Briquette][GaKPrDx-RaqBudM2ZOuj1Q][inet[/192.168.5.10:9300]],[Grenade][wDpt3uSVReuPePjgiMdMXg][inet[/192.168.5.4:9300]],[Cage,
Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]],[Red
Guardian][HhnTAOKbRdatVMGBrBEKTw][inet[/192.168.6.19:9300]],[Dragonwing][UZxd5EIuRMCy1wT6sExdfQ][inet[/192.168.6.20:9300]],[Hobgoblin][GpfT4JseQpWvrTYS62KsEw][inet[/192.168.5.12:9300]],[Tom
Cassidy][LscBtoZVQ9m5PF8yM5c1Fw][inet[/192.168.5.14:9300]],[Bob][nmw1mj92SKCLIxJ3ZxQ4ww][inet[/192.168.6.18:9300]],[Sligguth][UoqEO4uAR_ORTnIbx-508Q][inet[/192.168.5.15:9300]],[Invisible
Woman][kMRXVw57QgKA_UkE5_OfjQ][inet[/192.168.5.7:9300]],[Nathaniel
Richards][fmd7wh52Sc6FhB9zTxCj3Q][inet[/192.168.6.1:9300]],[Xavin][_oDz7oXhT8SiwTlZkTqZBg][inet[/192.168.5.1:9300]],},
reason: zen-disco-receive(from master [[Cage,
Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]]])
[2011-07-14 17:16:55,061][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:55,365][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:55,635][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:55,943][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:55,953][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:55,961][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:56,176][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:56,537][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:56,798][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:56,888][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:56,891][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:57,000][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:57,048][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:57,246][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:57,630][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:57,797][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:57,801][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:57,991][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:58,263][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:58,372][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:58,768][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:58,789][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:58,983][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:59,186][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:59,824][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:59,839][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:17:00,059][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:17:00,253][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:17:00,522][INFO ][cluster.service ] [Wagner,
Kurt] added {[Mangle][DZCqZ0yIR8mb7_HC1Kbouw][inet[/192.168.6.11:9300]],},
reason: zen-disco-receive(from master [[Cage,
Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]]])
[2011-07-14 17:17:00,524][INFO ][cluster.service ] [Wagner,
Kurt] added {[Kala][cV775zgaRke6KCz1HZe-fA][inet[/192.168.6.14:9300]],},
reason: zen-disco-receive(from master [[Cage,
Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]]])
[2011-07-14 17:17:00,525][INFO ][cluster.service ] [Wagner,
Kurt] added {[Tiboldt,
Maynard][orYVQrKyQdiDK-FdB0TaFw][inet[/192.168.6.10:9300]],}, reason:
zen-disco-receive(from master [[Cage,
Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]]])

I also observe that on each cluster restart, elasticsearch moves all
replicas to different nodes, causing a lot of traffic. Is this
normal behavior, and could it cause the error above?

Best,
Michel


(Michel Conrad) #6

Hi,
over the last two days I have restarted the cluster twice, and
no indices have been lost. The issue did not reappear.

As for the multicast errors, I gisted a log file with discovery set
to TRACE: https://gist.github.com/1089067

Best,
Michel

On Sat, Jul 16, 2011 at 1:23 AM, Shay Banon
shay.banon@elasticsearch.com wrote:

Your configuration looks good. Can you start a simple single server,
set discovery to TRACE in the logging.yml file, and gist the output? Multicast
should be disabled.
I got a bit confused by the last mail. Are things working fine now if the
nodes are not immediately restarted and then killed after shutdown?

On Friday, July 15, 2011 at 4:00 PM, Michel Conrad wrote:

Hi,

I think I possibly found the source of the problem, although this
doesn't explain why I am getting these multicast errors.

I have been running elasticsearch on the different servers in a loop,
so that if a node got killed, it would be immediately restarted.
By shutting down the cluster using 'http://localhost:9200/_shutdown',
the nodes would exit and immediately be relaunched.
So the cluster was in a state where it was shutting down and starting
up at the same time. During this unwanted startup, I
killed the individual nodes using kill -9, exiting the loop.
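The race described above (a supervisor loop resurrecting nodes in the middle of a cluster-wide shutdown) can be avoided by having the loop check a stop marker before relaunching. A minimal sketch, assuming a marker-file convention; the marker path and the start command are invented, not from the poster's setup:

```shell
#!/bin/sh
# Hypothetical supervisor loop: relaunch elasticsearch when it dies,
# but stop relaunching once a marker file signals an intentional shutdown.
STOP_MARKER=/tmp/es.stop

should_restart() {
  # Restart only while no stop marker is present.
  [ ! -f "$STOP_MARKER" ]
}

run_loop() {
  while should_restart; do
    "$@"    # e.g. /software/elasticsearch/bin/elasticsearch -f (assumed path)
  done
}
```

Touching the marker before calling the _shutdown API lets the loop exit cleanly instead of restarting the node mid-shutdown.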

Unfortunately I have been losing further indices, even when doing a
correct cluster restart.
I also pasted some further log information showing some nodes leaving
and immediately rejoining the cluster.

Best,
Michel

[2011-07-15 12:40:35,910][INFO ][cluster.service] [Blade] added {[Ryder][eBF3XoX4SuqabvbzvleOrg][inet[/192.168.6.5:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:40:35,912][INFO ][cluster.service] [Blade] added {[Phimster, Ellie][9GdlKjo6R6W0m_UPiZpgpQ][inet[/192.168.6.3:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:40:37,224][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:37,225][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:37,577][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:38,676][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:40,246][INFO ][cluster.service] [Blade] added {[Siryn][LKwg8eL7SyuvXhgqu70Vxg][inet[/192.168.5.6:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:40:44,294][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:44,295][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:51,338][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:51,344][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:54,317][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:54,317][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:46:04,810][INFO ][cluster.service] [Blade] removed {[Mephisto][eUWVtEr1QnCkEz1bU7O6lA][inet[/192.168.6.2:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:46:07,920][INFO ][cluster.service] [Blade] added {[Mephisto][eUWVtEr1QnCkEz1bU7O6lA][inet[/192.168.6.2:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:47:35,375][INFO ][cluster.service] [Blade] removed {[Advisor][J4FU6zSGQmaBrbR0J-Rzxg][inet[/192.168.5.11:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:47:35,869][INFO ][cluster.service] [Blade] added {[Advisor][J4FU6zSGQmaBrbR0J-Rzxg][inet[/192.168.5.11:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:54:03,654][INFO ][cluster.service] [Blade] removed {[Silver Fox][gjFx44gMQfqOIdh99hcdAw][inet[/192.168.5.15:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:54:04,893][INFO ][cluster.service] [Blade] added {[Silver Fox][gjFx44gMQfqOIdh99hcdAw][inet[/192.168.5.15:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:54:41,639][INFO ][cluster.service] [Blade] removed {[Morbius][dQqqzmK_RR2TQIC03wTd5w][inet[/192.168.5.19:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:54:43,805][INFO ][cluster.service] [Blade] added {[Morbius][dQqqzmK_RR2TQIC03wTd5w][inet[/192.168.5.19:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:56:24,793][INFO ][cluster.service] [Blade] removed {[Silver Fox][gjFx44gMQfqOIdh99hcdAw][inet[/192.168.5.15:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:56:28,903][INFO ][cluster.service] [Blade] added {[Silver Fox][gjFx44gMQfqOIdh99hcdAw][inet[/192.168.5.15:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 13:00:26,160][INFO ][cluster.service] [Blade] removed {[Blind Faith][HP9hnGZhR72XnpEJbHNBxw][inet[/192.168.6.16:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 13:00:32,395][INFO ][cluster.service] [Blade] added {[Blind Faith][HP9hnGZhR72XnpEJbHNBxw][inet[/192.168.6.16:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 13:16:26,690][WARN ][index.engine.robin] [Blade] [dsearch_de_00a858c20000][0] failed to flush after setting shard to inactive
org.elasticsearch.index.engine.EngineClosedException: [dsearch_de_00a858c20000][0] CurrentState[CLOSED]
at org.elasticsearch.index.engine.robin.RobinEngine.flush(RobinEngine.java:657)
at org.elasticsearch.index.engine.robin.RobinEngine.updateIndexingBufferSize(RobinEngine.java:207)
at org.elasticsearch.indices.memory.IndexingMemoryBufferController$ShardsIndicesStatusChecker.run(IndexingMemoryBufferController.java:147)
at org.elasticsearch.threadpool.ThreadPool$LoggingRunnable.run(ThreadPool.java:201)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

On Fri, Jul 15, 2011 at 10:46 AM, Michel Conrad
michel.conrad@trendiction.com wrote:

Hi Shay,

I gisted the configuration file: https://gist.github.com/1084292

The reason I am using unicast is that I want elasticsearch to listen
on two interfaces, binding inter-node communication to the local
interface only; the other interface is for the webserver.
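One way to express that split in the config is sketched below; the addresses are invented placeholders, and the setting names are what this kind of two-interface setup would typically use, not taken from the poster's gisted file:

```yaml
# Sketch only -- addresses below are invented.
network:
  bind_host: 192.168.5.20      # internal interface: inter-node transport
  publish_host: 192.168.5.20   # address other nodes use to reach this node
http:
  bind_host: 10.0.0.20         # webserver-facing interface for the REST API
```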

To shut down the cluster I am using "curl -XPOST
'http://localhost:9200/_shutdown'", followed by a script which calls
"pgrep -f /software/elasticsearch/lib/elasticsearch | xargs kill -9"
on every server.
At first I was not using the shutdown API and was killing the
nodes sequentially with kill -9; maybe it is here that things got
messed up. Could it be that my indices got lost and that the cluster
state became inconsistent when the cluster tried to relocate these
indices while the nodes were being killed? I am also using multiple
rivers that create new indices on the fly, which start during
cluster startup and are still running during cluster shutdown.
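A gentler variant of the shutdown script above would give the _shutdown API time to take effect before falling back to kill -9. A sketch, assuming pgrep is used as the liveness probe; the probe command and timeout are illustrative, not from the poster's scripts:

```shell
#!/bin/sh
# Hypothetical graceful-shutdown helper: poll until a probe command stops
# succeeding, then report whether the process exited within the timeout.
wait_for_exit() {
  probe=$1    # command that succeeds while the node is still running,
              # e.g. "pgrep -f /software/elasticsearch/lib/elasticsearch"
  timeout=$2  # seconds to wait before giving up
  i=0
  while [ "$i" -lt "$timeout" ]; do
    if ! $probe >/dev/null 2>&1; then
      return 0          # node is gone: no hard kill needed
    fi
    sleep 1
    i=$((i + 1))
  done
  return 1              # still running: caller may fall back to kill -9
}

# Intended use (not run here):
#   curl -XPOST 'http://localhost:9200/_shutdown'
#   wait_for_exit 'pgrep -f elasticsearch' 60 || pgrep -f elasticsearch | xargs kill -9
```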

Best,
Michel


(Shay Banon) #7

Great, and strange...

I don't understand why multicast is still enabled. Can you try setting this
in the config file (just as one setting string):

discovery.zen.ping.multicast.enabled: false

thanks!

On Mon, Jul 18, 2011 at 1:02 PM, Michel Conrad <
michel.conrad@trendiction.com> wrote:

Hi,
over the last two days I have been restarting the cluster twice, and
no indices have been lost. The issue did not reappear.

As for the multicast errors, I gisted a logging file with
discovery:trace enabled.
https://gist.github.com/1089067

Best,
Michel

On Sat, Jul 16, 2011 at 1:23 AM, Shay Banon
shay.banon@elasticsearch.com wrote:

You configuration looks good, can you maybe start a simple single server
and
set discovery to TRACE in the logging.yml file and gist it? Multicast
should
be disabled.
I got a bit confused by the last mail. Are things working fine now if you
don't have the nodes immediately start and then killed after shutdown?

On Friday, July 15, 2011 at 4:00 PM, Michel Conrad wrote:

Hi,

I think I possibly found the source of the problem, although this
doesn't explain why I am getting these multicast errors.

I have been running elasticsearch on the different servers in a loop,
in a way that if it would get killed, it would be immediately
restarted.
By shutting down the cluster using 'http://localhost:9200/_shutdown"
the nodes would exit, and be immediately relaunched.
So the cluster was in a state where at the same time, it was shutting
down and starting up. During this unwanted startup, I
have been killing the individual nodes using kill -9, exiting the loop.

Unfortunately I have been loosing further indices, even doing a
correct cluster restart.
I also pasted some further log information with some nodes leaving and
immediately rejoining the cluster.

Best,
Michel

[2011-07-15 12:40:35,910][INFO ][cluster.service          ] [Blade] added {[Ryder][eBF3XoX4SuqabvbzvleOrg][inet[/192.168.6.5:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:40:35,912][INFO ][cluster.service          ] [Blade] added {[Phimster, Ellie][9GdlKjo6R6W0m_UPiZpgpQ][inet[/192.168.6.3:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:40:37,224][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:37,225][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:37,577][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:38,676][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:40,246][INFO ][cluster.service          ] [Blade] added {[Siryn][LKwg8eL7SyuvXhgqu70Vxg][inet[/192.168.5.6:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:40:44,294][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:44,295][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:51,338][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:51,344][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:54,317][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:54,317][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:46:04,810][INFO ][cluster.service          ] [Blade] removed {[Mephisto][eUWVtEr1QnCkEz1bU7O6lA][inet[/192.168.6.2:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:46:07,920][INFO ][cluster.service          ] [Blade] added {[Mephisto][eUWVtEr1QnCkEz1bU7O6lA][inet[/192.168.6.2:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:47:35,375][INFO ][cluster.service          ] [Blade] removed {[Advisor][J4FU6zSGQmaBrbR0J-Rzxg][inet[/192.168.5.11:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:47:35,869][INFO ][cluster.service          ] [Blade] added {[Advisor][J4FU6zSGQmaBrbR0J-Rzxg][inet[/192.168.5.11:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:54:03,654][INFO ][cluster.service          ] [Blade] removed {[Silver Fox][gjFx44gMQfqOIdh99hcdAw][inet[/192.168.5.15:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:54:04,893][INFO ][cluster.service          ] [Blade] added {[Silver Fox][gjFx44gMQfqOIdh99hcdAw][inet[/192.168.5.15:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:54:41,639][INFO ][cluster.service          ] [Blade] removed {[Morbius][dQqqzmK_RR2TQIC03wTd5w][inet[/192.168.5.19:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:54:43,805][INFO ][cluster.service          ] [Blade] added {[Morbius][dQqqzmK_RR2TQIC03wTd5w][inet[/192.168.5.19:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:56:24,793][INFO ][cluster.service          ] [Blade] removed {[Silver Fox][gjFx44gMQfqOIdh99hcdAw][inet[/192.168.5.15:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:56:28,903][INFO ][cluster.service          ] [Blade] added {[Silver Fox][gjFx44gMQfqOIdh99hcdAw][inet[/192.168.5.15:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 13:00:26,160][INFO ][cluster.service          ] [Blade] removed {[Blind Faith][HP9hnGZhR72XnpEJbHNBxw][inet[/192.168.6.16:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 13:00:32,395][INFO ][cluster.service          ] [Blade] added {[Blind Faith][HP9hnGZhR72XnpEJbHNBxw][inet[/192.168.6.16:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 13:16:26,690][WARN ][index.engine.robin       ] [Blade] [dsearch_de_00a858c20000][0] failed to flush after setting shard to inactive
org.elasticsearch.index.engine.EngineClosedException: [dsearch_de_00a858c20000][0] CurrentState[CLOSED]
	at org.elasticsearch.index.engine.robin.RobinEngine.flush(RobinEngine.java:657)
	at org.elasticsearch.index.engine.robin.RobinEngine.updateIndexingBufferSize(RobinEngine.java:207)
	at org.elasticsearch.indices.memory.IndexingMemoryBufferController$ShardsIndicesStatusChecker.run(IndexingMemoryBufferController.java:147)
	at org.elasticsearch.threadpool.ThreadPool$LoggingRunnable.run(ThreadPool.java:201)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
	at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:662)

On Fri, Jul 15, 2011 at 10:46 AM, Michel Conrad
michel.conrad@trendiction.com wrote:

Hi Shay,

I gisted the configuration file: https://gist.github.com/1084292

The reason I am using unicast, is because I want to have elasticsearch
listen on two interfaces, and only bind the communication between
the nodes to the local interface, the other interface is for the
webserver.

To shut down the cluster I am using "curl -XPOST
'http://localhost:9200/_shutdown'", followed by a script which calls
"pgrep -f /software/elasticsearch/lib/elasticsearch | xargs kill -9"
on every server.
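The two-step shutdown described here, sketched as a dry-run script (the node list and the ssh fan-out are assumptions; the install path is the one from the thread; it only prints the commands):

```shell
#!/bin/sh
# Dry-run sketch of the full-cluster shutdown: step 1 asks the cluster to
# stop cleanly, step 2 kills any JVMs that survived. Node list is assumed.
NODES="192.168.5.1 192.168.5.2 192.168.5.3 192.168.5.4"

shutdown_cluster() {
  # Step 1: the _shutdown API stops every node in the cluster via any node.
  echo "curl -XPOST 'http://localhost:9200/_shutdown'"
  # Step 2: make sure no stray elasticsearch JVM survives on any server.
  for host in $NODES; do
    echo "ssh $host 'pgrep -f /software/elasticsearch/lib/elasticsearch | xargs -r kill -9'"
  done
}

shutdown_cluster
```

Note that a supervisor loop which relaunches killed nodes immediately will race with step 2, which is exactly the shutdown/startup overlap described earlier in the thread.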
At first I have not been using the shutdown API and was killing the
nodes sequentially by calling kill -9; maybe it's here that things got
messed up. Could it be that my indices got lost and that the cluster
state got inconsistent, when the cluster tries to relocate these
indices, during the killing of the nodes? I am also using multiple
rivers creating new indices on the fly, which are starting during
cluster startup and still running during cluster shutdown.

Best,
Michel

On Fri, Jul 15, 2011 at 12:49 AM, Shay Banon
shay.banon@elasticsearch.com wrote:

Heya,
First, let's try to get the discovery working, since disabling multicast
should not give those failures. Can you gist your config (it gets messed
up in the mail)?
Just ran a quick 40-node test with 200 indices and it seems to always
recover the data... How do you shut down the cluster?
-shay.banon

On Thursday, July 14, 2011 at 7:16 PM, Michel Conrad wrote:

Hi,

Sometimes, when doing a full restart of my cluster, some indices
turned up empty (0 docs and green). All data in these indices seems to
be lost!

My configuration is the following:
version 0.16.4 (snapshot); this also occurred in a snapshotted version of 0.16.3
40 nodes, ~100 indices.
I use 1 shard / index and 1 replica.
I am using the local gateway with unicast discovery.
I am using rivers to index new data into the database, and they are
automatically started during the recovery process.
Indices are also created on the fly.

When I start my cluster, the nodes get added to the cluster. When the
40 nodes are reached, the recovery starts. All the indices turn to
yellow, and then slowly to green.
The problem is, some indices quickly turned green and lost all their
data. Grepping for the concerned indices over the log file does not
reveal anything.

By looking at the data directory, I found some servers still containing
data in /elasticsearch/search/nodes/0/indices/index001858.
The size of this directory is the following:
Server A) 4.0KB
Server B) 1.4GB
Server C) 36KB
Server D) 564MB
Server E) 4.0KB
Server F) 946MB
Server G) 28KB

The cluster state has the index allocated on servers C + G, having 36KB
and 28KB, and not, for instance, on server B holding 1.4GB.

I was wondering why the data available on the hdd is not being
recovered, and why an empty index is being recovered.

On startup I am also getting some warning messages about multicast,
although I am using unicast discovery:
discovery:
  zen.ping:
    multicast:
      enabled:false
    unicast:
      hosts: 192.168.5.1[9300], 192.168.5.2[9300], 192.168.5.3[9300], 192.168.5.4[9300]

the warning messages I am getting:

[2011-07-14 17:16:54,481][WARN ][transport.netty          ] [Wagner, Kurt] Exception caught on netty layer [[id: 0x1ebe99f8]]
java.nio.channels.UnresolvedAddressException
	at sun.nio.ch.Net.checkAddress(Net.java:30)
	at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:480)
	at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink.connect(NioClientSocketPipelineSink.java:140)
	at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink.eventSunk(NioClientSocketPipelineSink.java:103)
	at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:555)
	at org.elasticsearch.common.netty.channel.Channels.connect(Channels.java:541)
	at org.elasticsearch.common.netty.channel.AbstractChannel.connect(AbstractChannel.java:218)
	at org.elasticsearch.common.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:227)
	at org.elasticsearch.common.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:188)
	at org.elasticsearch.transport.netty.NettyTransport.connectToChannels(NettyTransport.java:504)
	at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:475)
	at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:126)
	at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing$3.run(UnicastZenPing.java:198)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:662)
[2011-07-14 17:16:54,482][WARN ][transport.netty          ] [Wagner, Kurt] Exception caught on netty layer [[id: 0x736e788c]]
java.nio.channels.UnresolvedAddressException
	at sun.nio.ch.Net.checkAddress(Net.java:30)
	at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:480)
	at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink.connect(NioClientSocketPipelineSink.java:140)
	at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink.eventSunk(NioClientSocketPipelineSink.java:103)
	at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:555)
	at org.elasticsearch.common.netty.channel.Channels.connect(Channels.java:541)
	at org.elasticsearch.common.netty.channel.AbstractChannel.connect(AbstractChannel.java:218)
	at org.elasticsearch.common.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:227)
	at org.elasticsearch.common.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:188)
	at org.elasticsearch.transport.netty.NettyTransport.connectToChannels(NettyTransport.java:507)
	at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:475)
	at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:126)
	at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing$3.run(UnicastZenPing.java:198)
	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
	at java.lang.Thread.run(Thread.java:662)
[2011-07-14 17:16:54,625][WARN ][discovery.zen.ping.multicast] [Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:54,676][INFO ][cluster.service          ] [Wagner, Kurt] detected_master [Cage, Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]], added {[Kine, Benedict][0gST1OnzQbSrk5VMoUVmTw][inet[/192.168.6.9:9300]],[Matador][K2cbli_9QRyUO4fRwr21Xg][inet[/192.168.5.2:9300]],[Death-Stalker][6X5mT0NAQ2aoHellP-5U0A][inet[/192.168.5.8:9300]],[Briquette][GaKPrDx-RaqBudM2ZOuj1Q][inet[/192.168.5.10:9300]],[Grenade][wDpt3uSVReuPePjgiMdMXg][inet[/192.168.5.4:9300]],[Cage, Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]],[Red Guardian][HhnTAOKbRdatVMGBrBEKTw][inet[/192.168.6.19:9300]],[Dragonwing][UZxd5EIuRMCy1wT6sExdfQ][inet[/192.168.6.20:9300]],[Hobgoblin][GpfT4JseQpWvrTYS62KsEw][inet[/192.168.5.12:9300]],[Tom Cassidy][LscBtoZVQ9m5PF8yM5c1Fw][inet[/192.168.5.14:9300]],[Bob][nmw1mj92SKCLIxJ3ZxQ4ww][inet[/192.168.6.18:9300]],[Sligguth][UoqEO4uAR_ORTnIbx-508Q][inet[/192.168.5.15:9300]],[Invisible Woman][kMRXVw57QgKA_UkE5_OfjQ][inet[/192.168.5.7:9300]],[Nathaniel Richards][fmd7wh52Sc6FhB9zTxCj3Q][inet[/192.168.6.1:9300]],[Xavin][_oDz7oXhT8SiwTlZkTqZBg][inet[/192.168.5.1:9300]],}, reason: zen-disco-receive(from master [[Cage, Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]]])
[2011-07-14 17:16:55,061][WARN ][discovery.zen.ping.multicast] [Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:55,365][WARN ][discovery.zen.ping.multicast] [Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:55,635][WARN ][discovery.zen.ping.multicast] [Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:55,943][WARN ][discovery.zen.ping.multicast] [Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:55,953][WARN ][discovery.zen.ping.multicast] [Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:55,961][WARN ][discovery.zen.ping.multicast] [Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:56,176][WARN ][discovery.zen.ping.multicast] [Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:56,537][WARN ][discovery.zen.ping.multicast] [Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:56,798][WARN ][discovery.zen.ping.multicast] [Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:56,888][WARN ][discovery.zen.ping.multicast] [Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:56,891][WARN ][discovery.zen.ping.multicast] [Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:57,000][WARN ][discovery.zen.ping.multicast] [Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:57,048][WARN ][discovery.zen.ping.multicast] [Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:57,246][WARN ][discovery.zen.ping.multicast] [Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:57,630][WARN ][discovery.zen.ping.multicast] [Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:57,797][WARN ][discovery.zen.ping.multicast] [Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:57,801][WARN ][discovery.zen.ping.multicast] [Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:57,991][WARN ][discovery.zen.ping.multicast] [Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:58,263][WARN ][discovery.zen.ping.multicast] [Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:58,372][WARN ][discovery.zen.ping.multicast] [Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:58,768][WARN ][discovery.zen.ping.multicast] [Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:58,789][WARN ][discovery.zen.ping.multicast] [Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:58,983][WARN ][discovery.zen.ping.multicast] [Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:59,186][WARN ][discovery.zen.ping.multicast] [Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:59,824][WARN ][discovery.zen.ping.multicast] [Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:59,839][WARN ][discovery.zen.ping.multicast] [Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:17:00,059][WARN ][discovery.zen.ping.multicast] [Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:17:00,253][WARN ][discovery.zen.ping.multicast] [Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:17:00,522][INFO ][cluster.service          ] [Wagner, Kurt] added {[Mangle][DZCqZ0yIR8mb7_HC1Kbouw][inet[/192.168.6.11:9300]],}, reason: zen-disco-receive(from master [[Cage, Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]]])
[2011-07-14 17:17:00,524][INFO ][cluster.service          ] [Wagner, Kurt] added {[Kala][cV775zgaRke6KCz1HZe-fA][inet[/192.168.6.14:9300]],}, reason: zen-disco-receive(from master [[Cage, Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]]])
[2011-07-14 17:17:00,525][INFO ][cluster.service          ] [Wagner, Kurt] added {[Tiboldt, Maynard][orYVQrKyQdiDK-FdB0TaFw][inet[/192.168.6.10:9300]],}, reason: zen-disco-receive(from master [[Cage, Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]]])

I also observe that on each cluster restart, elasticsearch moves all
replicas to different nodes, thus causing lots of traffic. Is this
normal behavior, and could it cause the error above?
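On the restart question, the local gateway of this era has settings to delay recovery until enough nodes have rejoined, which is the usual guard against recovering a stale or empty shard copy after a full restart. A sketch, with illustrative values for a 40-node cluster:

```yaml
# elasticsearch.yml (sketch; the values are assumptions, not from the thread)
gateway:
  recover_after_nodes: 38   # don't start recovery until most nodes are back
  expected_nodes: 40        # recover immediately once all 40 have joined
  recover_after_time: 5m    # otherwise wait this long past recover_after_nodes
```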

Best,
Michel


(Michel Conrad) #8

Hi Shay,
My bad, multicast was still enabled because of an error in the
configuration file: there was a whitespace missing, "enabled:false"
should have been "enabled: false". Maybe elasticsearch could give out
a warning if the configuration file has an error instead of silently
ignoring it.
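The difference is a single space, which YAML parses very differently:

```yaml
# Broken: without a space after the colon, "enabled:false" is read as one
# plain scalar rather than a key/value pair, so the setting is silently
# ignored and multicast stays on (the default).
discovery:
  zen.ping:
    multicast:
      enabled:false

# Correct: with the space it is a key/value pair and multicast is disabled.
discovery:
  zen.ping:
    multicast:
      enabled: false
```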

Thanks,
Michel

On Mon, Jul 18, 2011 at 7:15 PM, Shay Banon
shay.banon@elasticsearch.com wrote:

Great, and strange...
I don't understand why multicast is still enabled. Can you try setting this
in the config file (just as one setting string):
discovery.zen.ping.multicast.enabled: false
thanks!

On Mon, Jul 18, 2011 at 1:02 PM, Michel Conrad
michel.conrad@trendiction.com wrote:

Hi,
over the last two days I have restarted the cluster twice, and
no indices have been lost. The issue did not reappear.

As for the multicast errors, I gisted a logging file with
discovery:trace enabled.
https://gist.github.com/1089067

Best,
Michel

On Sat, Jul 16, 2011 at 1:23 AM, Shay Banon
shay.banon@elasticsearch.com wrote:

You configuration looks good, can you maybe start a simple single server
and
set discovery to TRACE in the logging.yml file and gist it? Multicast
should
be disabled.
I got a bit confused by the last mail. Are things working fine now if
you
don't have the nodes immediately start and then killed after shutdown?

On Friday, July 15, 2011 at 4:00 PM, Michel Conrad wrote:

Hi,

I think I possibly found the source of the problem, although this
doesn't explain why I am getting these multicast errors.

I have been running elasticsearch on the different servers in a loop,
in a way that if it would get killed, it would be immediately
restarted.
By shutting down the cluster using 'http://localhost:9200/_shutdown"
the nodes would exit, and be immediately relaunched.
So the cluster was in a state where at the same time, it was shutting
down and starting up. During this unwanted startup, I
have been killing the individual nodes using kill -9, exiting the loop.

Unfortunately I have been loosing further indices, even doing a
correct cluster restart.
I also pasted some further log information with some nodes leaving and
immediately rejoining the cluster.

Best,
Michel

[2011-07-15 12:40:35,910][ESC[32mINFO ESC[0m][cluster.service
] [Blade] added
{[Ryder][eBF3XoX4SuqabvbzvleOrg][inet[/192.168.6.5:9300]],}, reason:
zen-disco-receive(from master [[Jumbo
Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/19
2.168.5.1:9300]]])
[2011-07-15 12:40:35,912][ESC[32mINFO ESC[0m][cluster.service
] [Blade] added {[Phimster,
Ellie][9GdlKjo6R6W0m_UPiZpgpQ][inet[/192.168.6.3:9300]],}, reason:
zen-disco-receive(from master [[Jumbo
Carnation][5axELonGSWSw3h9Jt8g_0g
][inet[/192.168.5.1:9300]]])
[2011-07-15 12:40:37,224][ESC[33mWARN
ESC[0m][discovery.zen.ping.multicast] [Blade] received ping response
with no matching id [1]
[2011-07-15 12:40:37,225][ESC[33mWARN
ESC[0m][discovery.zen.ping.multicast] [Blade] received ping response
with no matching id [1]
[2011-07-15 12:40:37,577][ESC[33mWARN
ESC[0m][discovery.zen.ping.multicast] [Blade] received ping response
with no matching id [1]
[2011-07-15 12:40:38,676][ESC[33mWARN
ESC[0m][discovery.zen.ping.multicast] [Blade] received ping response
with no matching id [1]
[2011-07-15 12:40:40,246][ESC[32mINFO ESC[0m][cluster.service
] [Blade] added
{[Siryn][LKwg8eL7SyuvXhgqu70Vxg][inet[/192.168.5.6:9300]],}, reason:
zen-disco-receive(from master [[Jumbo
Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/19
2.168.5.1:9300]]])
[2011-07-15 12:40:44,294][ESC[33mWARN
ESC[0m][discovery.zen.ping.multicast] [Blade] received ping response
with no matching id [1]
[2011-07-15 12:40:44,295][ESC[33mWARN
ESC[0m][discovery.zen.ping.multicast] [Blade] received ping response
with no matching id [1]
[2011-07-15 12:40:51,338][ESC[33mWARN
ESC[0m][discovery.zen.ping.multicast] [Blade] received ping response
with no matching id [1]
[2011-07-15 12:40:51,344][ESC[33mWARN
ESC[0m][discovery.zen.ping.multicast] [Blade] received ping response
with no matching id [1]
[2011-07-15 12:40:54,317][ESC[33mWARN
ESC[0m][discovery.zen.ping.multicast] [Blade] received ping response
with no matching id [1]
[2011-07-15 12:40:54,317][ESC[33mWARN
ESC[0m][discovery.zen.ping.multicast] [Blade] received ping response
with no matching id [1]
[2011-07-15 12:46:04,810][ESC[32mINFO ESC[0m][cluster.service
] [Blade] removed
{[Mephisto][eUWVtEr1QnCkEz1bU7O6lA][inet[/192.168.6.2:9300]],},
reason: zen-disco-receive(from master [[Jumbo
Carnation][5axELonGSWSw3h9Jt8g_0g][ine
t[/192.168.5.1:9300]]])
[2011-07-15 12:46:07,920][ESC[32mINFO ESC[0m][cluster.service
] [Blade] added
{[Mephisto][eUWVtEr1QnCkEz1bU7O6lA][inet[/192.168.6.2:9300]],},
reason: zen-disco-receive(from master [[Jumbo
Carnation][5axELonGSWSw3h9Jt8g_0g][inet[
/192.168.5.1:9300]]])
[2011-07-15 12:47:35,375][ESC[32mINFO ESC[0m][cluster.service
] [Blade] removed
{[Advisor][J4FU6zSGQmaBrbR0J-Rzxg][inet[/192.168.5.11:9300]],},
reason: zen-disco-receive(from master [[Jumbo
Carnation][5axELonGSWSw3h9Jt8g_0g][ine
t[/192.168.5.1:9300]]])
[2011-07-15 12:47:35,869][ESC[32mINFO ESC[0m][cluster.service
] [Blade] added
{[Advisor][J4FU6zSGQmaBrbR0J-Rzxg][inet[/192.168.5.11:9300]],},
reason: zen-disco-receive(from master [[Jumbo
Carnation][5axELonGSWSw3h9Jt8g_0g][inet[
/192.168.5.1:9300]]])
[2011-07-15 12:54:03,654][ESC[32mINFO ESC[0m][cluster.service
] [Blade] removed {[Silver
Fox][gjFx44gMQfqOIdh99hcdAw][inet[/192.168.5.15:9300]],}, reason:
zen-disco-receive(from master [[Jumbo
Carnation][5axELonGSWSw3h9Jt8g_0g][
inet[/192.168.5.1:9300]]])
[2011-07-15 12:54:04,893][ESC[32mINFO ESC[0m][cluster.service
] [Blade] added {[Silver
Fox][gjFx44gMQfqOIdh99hcdAw][inet[/192.168.5.15:9300]],}, reason:
zen-disco-receive(from master [[Jumbo
Carnation][5axELonGSWSw3h9Jt8g_0g][in
et[/192.168.5.1:9300]]])
[2011-07-15 12:54:41,639][ESC[32mINFO ESC[0m][cluster.service
] [Blade] removed
{[Morbius][dQqqzmK_RR2TQIC03wTd5w][inet[/192.168.5.19:9300]],},
reason: zen-disco-receive(from master [[Jumbo
Carnation][5axELonGSWSw3h9Jt8g_0g][ine
t[/192.168.5.1:9300]]])
[2011-07-15 12:54:43,805][ESC[32mINFO ESC[0m][cluster.service
] [Blade] added
{[Morbius][dQqqzmK_RR2TQIC03wTd5w][inet[/192.168.5.19:9300]],},
reason: zen-disco-receive(from master [[Jumbo
Carnation][5axELonGSWSw3h9Jt8g_0g][inet[
/192.168.5.1:9300]]])
[2011-07-15 12:56:24,793][ESC[32mINFO ESC[0m][cluster.service
] [Blade] removed {[Silver
Fox][gjFx44gMQfqOIdh99hcdAw][inet[/192.168.5.15:9300]],}, reason:
zen-disco-receive(from master [[Jumbo
Carnation][5axELonGSWSw3h9Jt8g_0g][
inet[/192.168.5.1:9300]]])
[2011-07-15 12:56:28,903][ESC[32mINFO ESC[0m][cluster.service
] [Blade] added {[Silver
Fox][gjFx44gMQfqOIdh99hcdAw][inet[/192.168.5.15:9300]],}, reason:
zen-disco-receive(from master [[Jumbo
Carnation][5axELonGSWSw3h9Jt8g_0g][in
et[/192.168.5.1:9300]]])
[2011-07-15 13:00:26,160][ESC[32mINFO ESC[0m][cluster.service
] [Blade] removed {[Blind
Faith][HP9hnGZhR72XnpEJbHNBxw][inet[/192.168.6.16:9300]],}, reason:
zen-disco-receive(from master [[Jumbo
Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 13:00:32,395][ESC[32mINFO ESC[0m][cluster.service
] [Blade] added {[Blind
Faith][HP9hnGZhR72XnpEJbHNBxw][inet[/192.168.6.16:9300]],}, reason:
zen-disco-receive(from master [[Jumbo
Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 13:16:26,690][ESC[33mWARN ESC[0m][index.engine.robin
] [Blade] [dsearch_de_00a858c20000][0] failed to flush after setting
shard to inactive
org.elasticsearch.index.engine.EngineClosedException:
[dsearch_de_00a858c20000][0] CurrentState[CLOSED]
at

org.elasticsearch.index.engine.robin.RobinEngine.flush(RobinEnginne.java:657)
at

org.elasticsearch.index.engine.robin.RobinEngine.updateIndexingBufferSize(RobinEngine.java:207)
at

org.elasticsearch.indices.memory.IndexingMemoryBufferController$ShardsIndicesStatusChecker.run(IndexingMemoryBufferController.java:147)
at

org.elasticsearch.threadpool.ThreadPool$LoggingRunnable.run(ThreadPool.java:201)
at
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at

java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
at

java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
at

java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
at

java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
at

java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

On FFri, Jul 15, 2011 at 10:46 AM, Michel Conrad
michel.conrad@trendiction.com wrote:

Hi Shay,

I gisted the configuration file: https://gist.github.com/1084292

The reason I am using unicast, is because I want to have elasticsearch
listen on two interfaces, and only bind the communication between
the nodes to the local interface, the other interface is for the
webserver.

To shutdown the cluster I am using "curl -XPOST
'http://localhost:9200/_shutdown", followed by a script which calls
"pgrep -f /software/elasticsearch/lib/elasticsearch | xargs kill -9"
on every server.
At first I have not been using the shutdown API and was killing the
nodes sequentially by calling kill -9, maybe its here that things got
messed up. Could it be that my indices got lost and that the cluster
state got inconsistent, when the cluster tries to relocate these
indices, during the killing of the nodes? I am also using multiple
rivers creating new indices on the fly, which are starting during
cluster startup and still running during cluster shutdown.

Best,
Michel

On Fri, Jul 15, 2011 at 12:49 AM, Shay Banon
shay.banon@elasticsearch.com wrote:

Heya,
First, lest try and get the discovery working, since disabling
multicast
should not give those failures. Can you gist your config (it gets messed
up
in the mail)?
Just ran a quick 40 nodes test with 200 indices and it seems to
always
recover the data... . How do you shutdown the cluster?
-shay.banon

On Thursday, July 14, 2011 at 7:16 PM, Michel Conrad wrote:

Hi,

Sometimes, when doing a full restart of my cluster, some indices
turned up empty (0 docs and green). All data in these indices seems to
be lost!

My configuration is the following:
version 0.16.4 (snapshot), also occured in a snapshoted version of
0.16.3
40 nodes, +-100 indices.
I use 1 shard / index and 1 replica.
I am using the local gateway with unicast discovery.
I am using rivers to index new data into the database, and they are
automatically already being started during the recovery process.
Indices are also created on the fly.

When I start my cluster, the nodes get added to the cluster. When the
40 nodes are
reached, the recovery is starting. All the indices turn to yellow, and
then slowly to green.
The problem is, some indices quickly turned to green and lost all
their data. Greping
for the concerned indices over the log file does not reveal anything.

By looking at the data directory, I found some servers containing
still data in /elasticsearch/search/nodes/0/indices/index001858.
The size of this directory is the following:
Server A) 4.0KB
Server B) 1.4GB
Server C) 36KB
Server D) 564MB
Server E) 4.0K
Server F) 946MB
Server G) 28KB

The cluster state has the index allocated on server C + G, having 36KB
and 28KB, and not f.i. on server B holding 1.4GB.

I was wondering why the data available on the hdd is not being
recovered, and why an empty index is being recovered.

On startup I am also getting some warning messages about multicast,
altough I am using unicast discovery:
discovery:
zen.ping:
multicast:
enabled:false
unicast:
hosts: 192.168.5.1[9300], 192.168.5.2[9300], 192.168.5.3[9300],
192.168.5.4[9300]

the warning messages I am getting:

[2011-07-14 17:16:54,481][WARN ][transport.netty ] [Wagner,
Kurt] Exception caught on netty layer [[id: 0x1ebe99f8]]
java.nio.channels.UnresolvedAddressException
at sun.nio.ch.Net.checkAddress(Net.java:30)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:480)
at

org.elasticsearch.common.netty.channel.socket.nio.NioClientSockettPipelineSink.connect(NioClientSocketPipelineSink.java:140)
at

org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink.eventSunk(NioClientSocketPipelineSink.java:103)
at

org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:555)
at

org.elasticsearch.common.netty.channel.Channels.connect(Channels.java:541)
at

org.elasticsearch.common.netty.channel.AbstractChannel.connect(AbstractChannel.java:218)
at

org.elasticsearch.common.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:227)
at

org.elasticsearch.common.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:188)
at

org.elasticsearch.transport.netty.NettyTransport.connectToChanneels(NettyTransport.java:504)
at

org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:475)
at

org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:126)
at

org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing$3.run(UnicastZenPing.java:198)
at

java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
[2011-07-14 17:16:54,482][WARN ][transport.netty ] [Wagner,
Kurt] Exception caught on netty layer [[id: 0x736e788c]]
java.nio.channels.UnresolvedAddressException
at sun.nio.ch.Net.checkAddress(Net.java:30)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:480)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink.connect(NioClientSocketPipelineSink.java:140)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink.eventSunk(NioClientSocketPipelineSink.java:103)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:555)
at org.elasticsearch.common.netty.channel.Channels.connect(Channels.java:541)
at org.elasticsearch.common.netty.channel.AbstractChannel.connect(AbstractChannel.java:218)
at org.elasticsearch.common.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:227)
at org.elasticsearch.common.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:188)
at org.elasticsearch.transport.netty.NettyTransport.connectToChannels(NettyTransport.java:507)
at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:475)
at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:126)
at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing$3.run(UnicastZenPing.java:198)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
[2011-07-14 17:16:54,625][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:54,676][INFO ][cluster.service] [Wagner, Kurt] detected_master [Cage, Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]], added {[Kine, Benedict][0gST1OnzQbSrk5VMoUVmTw][inet[/192.168.6.9:9300]],[Matador][K2cbli_9QRyUO4fRwr21Xg][inet[/192.168.5.2:9300]],[Death-Stalker][6X5mT0NAQ2aoHellP-5U0A][inet[/192.168.5.8:9300]],[Briquette][GaKPrDx-RaqBudM2ZOuj1Q][inet[/192.168.5.10:9300]],[Grenade][wDpt3uSVReuPePjgiMdMXg][inet[/192.168.5.4:9300]],[Cage, Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]],[Red Guardian][HhnTAOKbRdatVMGBrBEKTw][inet[/192.168.6.19:9300]],[Dragonwing][UZxd5EIuRMCy1wT6sExdfQ][inet[/192.168.6.20:9300]],[Hobgoblin][GpfT4JseQpWvrTYS62KsEw][inet[/192.168.5.12:9300]],[Tom Cassidy][LscBtoZVQ9m5PF8yM5c1Fw][inet[/192.168.5.14:9300]],[Bob][nmw1mj92SKCLIxJ3ZxQ4ww][inet[/192.168.6.18:9300]],[Sligguth][UoqEO4uAR_ORTnIbx-508Q][inet[/192.168.5.15:9300]],[Invisible Woman][kMRXVw57QgKA_UkE5_OfjQ][inet[/192.168.5.7:9300]],[Nathaniel Richards][fmd7wh52Sc6FhB9zTxCj3Q][inet[/192.168.6.1:9300]],[Xavin][_oDz7oXhT8SiwTlZkTqZBg][inet[/192.168.5.1:9300]],}, reason: zen-disco-receive(from master [[Cage, Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]]])
[2011-07-14 17:16:55,061][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:55,365][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:55,635][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:55,943][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:55,953][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:55,961][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:56,176][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:56,537][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:56,798][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:56,888][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:56,891][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:57,000][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:57,048][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:57,246][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:57,630][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:57,797][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:57,801][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:57,991][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:58,263][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:58,372][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:58,768][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:58,789][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:58,983][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:59,186][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:59,824][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:59,839][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:17:00,059][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:17:00,253][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:17:00,522][INFO ][cluster.service] [Wagner, Kurt] added {[Mangle][DZCqZ0yIR8mb7_HC1Kbouw][inet[/192.168.6.11:9300]],}, reason: zen-disco-receive(from master [[Cage, Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]]])
[2011-07-14 17:17:00,524][INFO ][cluster.service] [Wagner, Kurt] added {[Kala][cV775zgaRke6KCz1HZe-fA][inet[/192.168.6.14:9300]],}, reason: zen-disco-receive(from master [[Cage, Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]]])
[2011-07-14 17:17:00,525][INFO ][cluster.service] [Wagner, Kurt] added {[Tiboldt, Maynard][orYVQrKyQdiDK-FdB0TaFw][inet[/192.168.6.10:9300]],}, reason: zen-disco-receive(from master [[Cage, Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]]])

I also observe that on each cluster restart, Elasticsearch moves all
replicas to different nodes, causing lots of traffic. Is this normal
behavior, and could it cause the error above?

Best,
Michel


(Shay Banon) #9

It can't really give an error (well, not easily). It's hard, since it's
considered a single key then, and you need to do the reverse (i.e. find out
which settings are not being used). Possible, but it gets complicated when
each module is independent and pluggable.

On Tue, Jul 19, 2011 at 11:10 AM, Michel Conrad <
michel.conrad@trendiction.com> wrote:

Hi Shay,
My bad, multicast was still enabled because of an error in the
configuration file: a whitespace was missing, "enabled:false"
should have been "enabled: false". Maybe elasticsearch could give
a warning if the configuration file has an error instead of silently
ignoring it.
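For readers hitting the same pitfall: in YAML, a missing space after the colon turns the whole token into a plain scalar instead of a key/value pair, so the setting is silently treated as an unknown string rather than a boolean. A sketch of the broken and corrected forms, using the keys from this thread:

```yaml
# Broken (no space after the colon): YAML parses "enabled:false" as one
# plain scalar, so the value of multicast becomes the string
# "enabled:false" and the setting is silently ignored.
#   multicast:
#     enabled:false

# Fixed nested form:
discovery:
  zen.ping:
    multicast:
      enabled: false
---
# Equivalent flat form (one setting string), harder to get wrong:
discovery.zen.ping.multicast.enabled: false
```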

Thanks,
Michel

On Mon, Jul 18, 2011 at 7:15 PM, Shay Banon
shay.banon@elasticsearch.com wrote:

Great, and strange...
Don't understand why multicast is still enabled, can you try setting this
in the config file (just as one setting string):
discovery.zen.ping.multicast.enabled: false
thanks!

On Mon, Jul 18, 2011 at 1:02 PM, Michel Conrad
michel.conrad@trendiction.com wrote:

Hi,
over the last two days I have been restarting the cluster twice, and
no indices have been lost. The issue did not reappear.

As for the multicast errors, I gisted a logging file with
discovery:trace enabled.
https://gist.github.com/1089067

Best,
Michel

On Sat, Jul 16, 2011 at 1:23 AM, Shay Banon
shay.banon@elasticsearch.com wrote:

Your configuration looks good. Can you maybe start a simple single server,
set discovery to TRACE in the logging.yml file, and gist it? Multicast
should be disabled.
I got a bit confused by the last mail. Are things working fine now if you
don't have the nodes immediately start and then get killed after shutdown?
On Friday, July 15, 2011 at 4:00 PM, Michel Conrad wrote:

Hi,

I think I possibly found the source of the problem, although this
doesn't explain why I am getting these multicast errors.

I have been running elasticsearch on the different servers in a loop,
so that if it got killed, it would be immediately restarted.
By shutting down the cluster using 'http://localhost:9200/_shutdown',
the nodes would exit and be immediately relaunched.
So the cluster was simultaneously shutting down and starting up.
During this unwanted startup, I killed the individual nodes using
kill -9, exiting the loop.
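A supervisor loop like the one described will race a cluster-wide shutdown, because each node is relaunched the moment _shutdown stops it. A minimal sketch of a loop that checks a sentinel file first, so an operator can break it cleanly; the stop-file location is an assumption, only the binary path is taken from the thread:

```shell
#!/bin/sh
# Hypothetical supervisor loop with a sentinel file.
# Create the stop file before calling /_shutdown so the node is not
# immediately relaunched; remove it again to resume supervision.
STOP_FILE=/tmp/elasticsearch.stop                 # assumed path
ES_BIN=/software/elasticsearch/bin/elasticsearch  # path from the thread

while [ ! -f "$STOP_FILE" ]; do
  # -f keeps the node in the foreground so the loop blocks on it
  "$ES_BIN" -f || break   # stop looping if the node cannot start at all
done
```

With this in place, a full restart becomes: touch the stop file on every server, call /_shutdown once, wait for the JVMs to exit, then delete the stop files.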

Unfortunately I have been losing further indices, even when doing a
correct cluster restart.
I also pasted some further log information with some nodes leaving and
immediately rejoining the cluster.

Best,
Michel

[2011-07-15 12:40:35,910][INFO ][cluster.service] [Blade] added {[Ryder][eBF3XoX4SuqabvbzvleOrg][inet[/192.168.6.5:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:40:35,912][INFO ][cluster.service] [Blade] added {[Phimster, Ellie][9GdlKjo6R6W0m_UPiZpgpQ][inet[/192.168.6.3:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:40:37,224][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:37,225][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:37,577][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:38,676][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:40,246][INFO ][cluster.service] [Blade] added {[Siryn][LKwg8eL7SyuvXhgqu70Vxg][inet[/192.168.5.6:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:40:44,294][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:44,295][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:51,338][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:51,344][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:54,317][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:54,317][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:46:04,810][INFO ][cluster.service] [Blade] removed {[Mephisto][eUWVtEr1QnCkEz1bU7O6lA][inet[/192.168.6.2:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:46:07,920][INFO ][cluster.service] [Blade] added {[Mephisto][eUWVtEr1QnCkEz1bU7O6lA][inet[/192.168.6.2:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:47:35,375][INFO ][cluster.service] [Blade] removed {[Advisor][J4FU6zSGQmaBrbR0J-Rzxg][inet[/192.168.5.11:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:47:35,869][INFO ][cluster.service] [Blade] added {[Advisor][J4FU6zSGQmaBrbR0J-Rzxg][inet[/192.168.5.11:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:54:03,654][INFO ][cluster.service] [Blade] removed {[Silver Fox][gjFx44gMQfqOIdh99hcdAw][inet[/192.168.5.15:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:54:04,893][INFO ][cluster.service] [Blade] added {[Silver Fox][gjFx44gMQfqOIdh99hcdAw][inet[/192.168.5.15:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:54:41,639][INFO ][cluster.service] [Blade] removed {[Morbius][dQqqzmK_RR2TQIC03wTd5w][inet[/192.168.5.19:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:54:43,805][INFO ][cluster.service] [Blade] added {[Morbius][dQqqzmK_RR2TQIC03wTd5w][inet[/192.168.5.19:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:56:24,793][INFO ][cluster.service] [Blade] removed {[Silver Fox][gjFx44gMQfqOIdh99hcdAw][inet[/192.168.5.15:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:56:28,903][INFO ][cluster.service] [Blade] added {[Silver Fox][gjFx44gMQfqOIdh99hcdAw][inet[/192.168.5.15:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 13:00:26,160][INFO ][cluster.service] [Blade] removed {[Blind Faith][HP9hnGZhR72XnpEJbHNBxw][inet[/192.168.6.16:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 13:00:32,395][INFO ][cluster.service] [Blade] added {[Blind Faith][HP9hnGZhR72XnpEJbHNBxw][inet[/192.168.6.16:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 13:16:26,690][WARN ][index.engine.robin] [Blade] [dsearch_de_00a858c20000][0] failed to flush after setting shard to inactive
org.elasticsearch.index.engine.EngineClosedException: [dsearch_de_00a858c20000][0] CurrentState[CLOSED]
at org.elasticsearch.index.engine.robin.RobinEngine.flush(RobinEngine.java:657)
at org.elasticsearch.index.engine.robin.RobinEngine.updateIndexingBufferSize(RobinEngine.java:207)
at org.elasticsearch.indices.memory.IndexingMemoryBufferController$ShardsIndicesStatusChecker.run(IndexingMemoryBufferController.java:147)
at org.elasticsearch.threadpool.ThreadPool$LoggingRunnable.run(ThreadPool.java:201)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

On Fri, Jul 15, 2011 at 10:46 AM, Michel Conrad
michel.conrad@trendiction.com wrote:

Hi Shay,

I gisted the configuration file: https://gist.github.com/1084292

The reason I am using unicast is that I want Elasticsearch to listen
on two interfaces, and to bind the communication between the nodes
only to the local interface; the other interface is for the
webserver.

To shut down the cluster I am using "curl -XPOST
'http://localhost:9200/_shutdown'", followed by a script which calls
"pgrep -f /software/elasticsearch/lib/elasticsearch | xargs kill -9"
on every server.
At first I was not using the shutdown API and was killing the nodes
sequentially with kill -9; maybe it is here that things got messed up.
Could it be that my indices got lost and the cluster state became
inconsistent when the cluster tried to relocate these indices while
the nodes were being killed? I am also using multiple rivers creating
new indices on the fly, which start during cluster startup and are
still running during cluster shutdown.
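Since kill -9 gives a node no chance to flush or leave the cluster cleanly, a gentler variant of the shutdown script would wait for the processes to exit after the /_shutdown call and only escalate as a last resort. A sketch using the paths from this thread; the 60-second grace period is an assumption:

```shell
#!/bin/sh
# Hedged sketch: API shutdown first, SIGKILL only as a last resort.
# The [e] bracket trick keeps pgrep from matching this script itself.
ES_PATTERN='/software/elasticsearch/lib/[e]lasticsearch'

curl -s -XPOST 'http://localhost:9200/_shutdown' || true

# Wait up to 60 seconds for the JVMs to exit on their own.
i=0
while pgrep -f "$ES_PATTERN" >/dev/null && [ "$i" -lt 60 ]; do
  sleep 1
  i=$((i + 1))
done

# Escalate only if a node is still running after the grace period.
if pgrep -f "$ES_PATTERN" >/dev/null; then
  pkill -9 -f "$ES_PATTERN"
fi
```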

Best,
Michel

On Fri, Jul 15, 2011 at 12:49 AM, Shay Banon
shay.banon@elasticsearch.com wrote:

Heya,
First, let's try and get the discovery working, since disabling multicast
should not give those failures. Can you gist your config (it gets messed
up in the mail)?
Just ran a quick 40-node test with 200 indices and it seems to always
recover the data... How do you shut down the cluster?
-shay.banon

On Thursday, July 14, 2011 at 7:16 PM, Michel Conrad wrote:



(Michel Conrad) #10

Hi Shay,

The issue with the empty indices reappeared; I lost an index again.
Yesterday I programmatically deleted some indices, and they were
deleted correctly. After restarting Elasticsearch, another index
(dsearch_en_00a858cc8000), which I did not delete, came up empty.

Grepping over the log files gives the following error.
[2011-07-19 19:13:32,999][DEBUG][action.search.type ] [Hank
McCoy] [dsearch_en_00a858cc8000][0], node[WPUPrpBDT-G5jAgCDdxZXg],
[P], s[STARTED]: Failed to execute
[org.elasticsearch.action.search.SearchRequest@628b54f4]
org.elasticsearch.transport.RemoteTransportException:
[Xemu][inet[/192.168.5.14:9300]][search/phase/query]
Caused by: org.elasticsearch.indices.IndexMissingException:
[dsearch_en_00a858cc8000] missing
at org.elasticsearch.indices.InternalIndicesService.indexServiceSafe(InternalIndicesService.java:208)
at org.elasticsearch.search.SearchService.createContext(SearchService.java:377)
at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:218)
at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:447)
at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:438)
at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:236)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

The master gave the following output, trying to send the cluster state
to the node later serving the primary shard:
[2011-07-19 19:13:26,985][DEBUG][discovery.zen.publish ] [Cloud]
failed to send cluster state to
[[Xemu][WPUPrpBDT-G5jAgCDdxZXg][inet[/192.168.5.14:9300]]], should be
detected as failed soon...
org.elasticsearch.transport.NodeDisconnectedException:
[Xemu][inet[/192.168.5.14:9300]][discovery/zen/publish] disconnected
[2011-07-19 19:13:26,985][DEBUG][discovery.zen.publish ] [Cloud]
failed to send cluster state to
[[Xemu][WPUPrpBDT-G5jAgCDdxZXg][inet[/192.168.5.14:9300]]], should be
detected as failed soon...
org.elasticsearch.transport.NodeDisconnectedException:
[Xemu][inet[/192.168.5.14:9300]][discovery/zen/publish] disconnected
[2011-07-19 19:13:26,986][DEBUG][discovery.zen.publish ] [Cloud]
failed to send cluster state to
[[Xemu][WPUPrpBDT-G5jAgCDdxZXg][inet[/192.168.5.14:9300]]], should be
detected as failed soon...
org.elasticsearch.transport.NodeDisconnectedException:
[Xemu][inet[/192.168.5.14:9300]][discovery/zen/publish] disconnected

The data is still available on a node, but that node isn't serving the
data (I am using 1 replica and 1 shard, local gateway):
The nodes containing data:
192.168.6.5 (6.9G)
192.168.5.14 (28K) -> primary shard
192.168.5.12 (24K) -> secondary shard
192.168.5.8 (4.0K)

Also, when I start up my cluster, the data is always recovered from a
primary shard and then streamed to another node. Is it normal
behaviour that all the data is replicated on startup, instead of
using the data that is locally available?
Is it possible that the wrong (empty?) node is chosen as the primary
shard, or that the indices are not correctly deleted?

Please tell me if you need more debugging output and which classes I
should enable logging for.
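For reference, the kind of logging.yml change Shay asked for earlier in the thread would look roughly like this; apart from `discovery`, the logger names below are assumptions about what is most relevant here, not taken from the thread:

```yaml
# logging.yml sketch: raise verbosity for the modules discussed above
logger:
  discovery: TRACE          # Shay's suggestion for the multicast warnings
  gateway: DEBUG            # local-gateway recovery decisions
  cluster.service: DEBUG    # node joins/leaves, cluster state publishing
  indices.recovery: DEBUG   # shard recovery between nodes
```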

Best,
Michel

On Tue, Jul 19, 2011 at 11:44 PM, Shay Banon
shay.banon@elasticsearch.com wrote:

It can't really give an error (well, not easily). Its hard since its
considered a single key then, and you need to do the reverse (i.e. find out
which settings are not being used). Possible, but gets complicated when each
module is independent and pluggable.

On Tue, Jul 19, 2011 at 11:10 AM, Michel Conrad
michel.conrad@trendiction.com wrote:

Hi Shay,
My bad, multicast was still enabled because of an error in the
configuration file: a space was missing, "enabled:false"
should have been "enabled: false". Maybe elasticsearch could give a
warning if the configuration file has an error instead of silently
ignoring it.
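
The failure mode here is a YAML rule: the colon only separates key and value when it is followed by whitespace, so `enabled:false` is read as a single scalar (one key, no value) rather than as the setting `enabled: false`. A minimal hand-rolled sketch of that distinction (not the real YAML parser):

```python
def parse_setting(line):
    """Split a 'key: value' line the way YAML does: the colon only
    separates key and value when followed by whitespace."""
    key, sep, value = line.strip().partition(": ")
    if not sep:
        # No 'colon + space' found: the whole token is one scalar,
        # so a line like 'enabled:false' never defines 'enabled'.
        return line.strip(), None
    return key, value.strip()
```

With `enabled: false` this yields `("enabled", "false")`, while `enabled:false` yields `("enabled:false", None)`, which is why the setting was silently ignored.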

Thanks,
Michel

On Mon, Jul 18, 2011 at 7:15 PM, Shay Banon
shay.banon@elasticsearch.com wrote:

Great, and strange...
Don't understand why multicast is still enabled, can you try setting
this in
the config file (just as one setting string):
discovery.zen.ping.multicast.enabled: false
thanks!

On Mon, Jul 18, 2011 at 1:02 PM, Michel Conrad
michel.conrad@trendiction.com wrote:

Hi,
over the last two days I have been restarting the cluster twice, and
no indices have been lost. The issue did not reappear.

As for the multicast errors, I gisted a logging file with
discovery:trace enabled.
https://gist.github.com/1089067

Best,
Michel

On Sat, Jul 16, 2011 at 1:23 AM, Shay Banon
shay.banon@elasticsearch.com wrote:

Your configuration looks good. Can you maybe start a simple single
server, set discovery to TRACE in the logging.yml file, and gist it?
Multicast should be disabled.
I got a bit confused by the last mail. Are things working fine now if
you don't have the nodes immediately restart and then get killed after
shutdown?

On Friday, July 15, 2011 at 4:00 PM, Michel Conrad wrote:

Hi,

I think I possibly found the source of the problem, although this
doesn't explain why I am getting these multicast errors.

I have been running elasticsearch on the different servers in a loop,
so that if it got killed, it would be immediately restarted.
By shutting down the cluster using 'http://localhost:9200/_shutdown',
the nodes would exit and be immediately relaunched.
So the cluster was in a state where it was shutting down and starting
up at the same time. During this unwanted startup, I have been killing
the individual nodes using kill -9, exiting the loop.
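
The race described above can be modeled with a tiny supervisor sketch (all names here are illustrative, not part of elasticsearch): the loop restarts the node process whenever it exits, so a cluster-wide `_shutdown` only stays down if the stop condition is checked before each relaunch.

```python
def supervise(start_node, should_stop):
    """Restart the node whenever it exits, until a stop flag is set.
    Without the should_stop() check, a node killed by _shutdown is
    relaunched immediately - the situation described above."""
    restarts = 0
    while not should_stop():
        start_node()  # blocks until the node process exits
        restarts += 1
    return restarts
```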

Unfortunately I have been losing further indices, even when doing a
correct cluster restart.
I also pasted some further log information showing some nodes leaving
and immediately rejoining the cluster.

Best,
Michel

[2011-07-15 12:40:35,910][INFO ][cluster.service ] [Blade] added {[Ryder][eBF3XoX4SuqabvbzvleOrg][inet[/192.168.6.5:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:40:35,912][INFO ][cluster.service ] [Blade] added {[Phimster, Ellie][9GdlKjo6R6W0m_UPiZpgpQ][inet[/192.168.6.3:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:40:37,224][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:37,225][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:37,577][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:38,676][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:40,246][INFO ][cluster.service ] [Blade] added {[Siryn][LKwg8eL7SyuvXhgqu70Vxg][inet[/192.168.5.6:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:40:44,294][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:44,295][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:51,338][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:51,344][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:54,317][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:54,317][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:46:04,810][INFO ][cluster.service ] [Blade] removed {[Mephisto][eUWVtEr1QnCkEz1bU7O6lA][inet[/192.168.6.2:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:46:07,920][INFO ][cluster.service ] [Blade] added {[Mephisto][eUWVtEr1QnCkEz1bU7O6lA][inet[/192.168.6.2:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:47:35,375][INFO ][cluster.service ] [Blade] removed {[Advisor][J4FU6zSGQmaBrbR0J-Rzxg][inet[/192.168.5.11:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:47:35,869][INFO ][cluster.service ] [Blade] added {[Advisor][J4FU6zSGQmaBrbR0J-Rzxg][inet[/192.168.5.11:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:54:03,654][INFO ][cluster.service ] [Blade] removed {[Silver Fox][gjFx44gMQfqOIdh99hcdAw][inet[/192.168.5.15:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:54:04,893][INFO ][cluster.service ] [Blade] added {[Silver Fox][gjFx44gMQfqOIdh99hcdAw][inet[/192.168.5.15:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:54:41,639][INFO ][cluster.service ] [Blade] removed {[Morbius][dQqqzmK_RR2TQIC03wTd5w][inet[/192.168.5.19:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:54:43,805][INFO ][cluster.service ] [Blade] added {[Morbius][dQqqzmK_RR2TQIC03wTd5w][inet[/192.168.5.19:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:56:24,793][INFO ][cluster.service ] [Blade] removed {[Silver Fox][gjFx44gMQfqOIdh99hcdAw][inet[/192.168.5.15:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:56:28,903][INFO ][cluster.service ] [Blade] added {[Silver Fox][gjFx44gMQfqOIdh99hcdAw][inet[/192.168.5.15:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 13:00:26,160][INFO ][cluster.service ] [Blade] removed {[Blind Faith][HP9hnGZhR72XnpEJbHNBxw][inet[/192.168.6.16:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 13:00:32,395][INFO ][cluster.service ] [Blade] added {[Blind Faith][HP9hnGZhR72XnpEJbHNBxw][inet[/192.168.6.16:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 13:16:26,690][WARN ][index.engine.robin ] [Blade] [dsearch_de_00a858c20000][0] failed to flush after setting shard to inactive
org.elasticsearch.index.engine.EngineClosedException: [dsearch_de_00a858c20000][0] CurrentState[CLOSED]
at org.elasticsearch.index.engine.robin.RobinEngine.flush(RobinEngine.java:657)
at org.elasticsearch.index.engine.robin.RobinEngine.updateIndexingBufferSize(RobinEngine.java:207)
at org.elasticsearch.indices.memory.IndexingMemoryBufferController$ShardsIndicesStatusChecker.run(IndexingMemoryBufferController.java:147)
at org.elasticsearch.threadpool.ThreadPool$LoggingRunnable.run(ThreadPool.java:201)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

On Fri, Jul 15, 2011 at 10:46 AM, Michel Conrad
michel.conrad@trendiction.com wrote:

Hi Shay,

I gisted the configuration file: https://gist.github.com/1084292

The reason I am using unicast is that I want elasticsearch to listen
on two interfaces, and to bind the communication between the nodes to
the local interface only; the other interface is for the webserver.

To shut down the cluster I am using "curl -XPOST
'http://localhost:9200/_shutdown'", followed by a script which calls
"pgrep -f /software/elasticsearch/lib/elasticsearch | xargs kill -9"
on every server.
At first I was not using the shutdown API and was killing the nodes
sequentially by calling kill -9; maybe it's here that things got
messed up. Could it be that my indices got lost and that the cluster
state became inconsistent when the cluster tried to relocate these
indices while the nodes were being killed? I am also using multiple
rivers creating new indices on the fly, which start during cluster
startup and are still running during cluster shutdown.

Best,
Michel

On Fri, Jul 15, 2011 at 12:49 AM, Shay Banon
shay.banon@elasticsearch.com wrote:

Heya,
First, let's try to get the discovery working, since disabling
multicast should not give those failures. Can you gist your config (it
gets messed up in the mail)?
Just ran a quick 40-node test with 200 indices and it seems to always
recover the data... How do you shut down the cluster?
-shay.banon

On Thursday, July 14, 2011 at 7:16 PM, Michel Conrad wrote:

Hi,

Sometimes, when doing a full restart of my cluster, some indices
turned up empty (0 docs and green). All data in these indices seems
to
be lost!

My configuration is the following:
version 0.16.4 (snapshot); this also occurred in a snapshotted version
of 0.16.3
40 nodes, +-100 indices.
I use 1 shard / index and 1 replica.
I am using the local gateway with unicast discovery.
I am using rivers to index new data into the database, and they are
automatically already being started during the recovery process.
Indices are also created on the fly.

When I start my cluster, the nodes get added to the cluster. When the
40 nodes are reached, the recovery starts. All the indices turn
yellow, and then slowly green.
The problem is, some indices quickly turned green and lost all
their data. Grepping
for the concerned indices over the log file does not reveal anything.

By looking at the data directory, I found some servers containing
still data in /elasticsearch/search/nodes/0/indices/index001858.
The size of this directory is the following:
Server A) 4.0KB
Server B) 1.4GB
Server C) 36KB
Server D) 564MB
Server E) 4.0K
Server F) 946MB
Server G) 28KB

The cluster state has the index allocated on server C + G, having
36KB
and 28KB, and not f.i. on server B holding 1.4GB.

I was wondering why the data available on the hdd is not being
recovered, and why an empty index is being recovered.

On startup I am also getting some warning messages about multicast,
although I am using unicast discovery:
discovery:
  zen.ping:
    multicast:
      enabled:false
    unicast:
      hosts: 192.168.5.1[9300], 192.168.5.2[9300], 192.168.5.3[9300],
192.168.5.4[9300]

the warning messages I am getting:

[2011-07-14 17:16:54,481][WARN ][transport.netty ] [Wagner, Kurt] Exception caught on netty layer [[id: 0x1ebe99f8]]
java.nio.channels.UnresolvedAddressException
at sun.nio.ch.Net.checkAddress(Net.java:30)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:480)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink.connect(NioClientSocketPipelineSink.java:140)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink.eventSunk(NioClientSocketPipelineSink.java:103)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:555)
at org.elasticsearch.common.netty.channel.Channels.connect(Channels.java:541)
at org.elasticsearch.common.netty.channel.AbstractChannel.connect(AbstractChannel.java:218)
at org.elasticsearch.common.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:227)
at org.elasticsearch.common.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:188)
at org.elasticsearch.transport.netty.NettyTransport.connectToChannels(NettyTransport.java:504)
at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:475)
at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:126)
at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing$3.run(UnicastZenPing.java:198)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
[2011-07-14 17:16:54,482][WARN ][transport.netty ] [Wagner, Kurt] Exception caught on netty layer [[id: 0x736e788c]]
java.nio.channels.UnresolvedAddressException
at sun.nio.ch.Net.checkAddress(Net.java:30)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:480)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink.connect(NioClientSocketPipelineSink.java:140)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink.eventSunk(NioClientSocketPipelineSink.java:103)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:555)
at org.elasticsearch.common.netty.channel.Channels.connect(Channels.java:541)
at org.elasticsearch.common.netty.channel.AbstractChannel.connect(AbstractChannel.java:218)
at org.elasticsearch.common.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:227)
at org.elasticsearch.common.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:188)
at org.elasticsearch.transport.netty.NettyTransport.connectToChannels(NettyTransport.java:507)
at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:475)
at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:126)
at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing$3.run(UnicastZenPing.java:198)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
[2011-07-14 17:16:54,625][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:54,676][INFO ][cluster.service ] [Wagner, Kurt] detected_master [Cage, Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]], added {[Kine, Benedict][0gST1OnzQbSrk5VMoUVmTw][inet[/192.168.6.9:9300]],[Matador][K2cbli_9QRyUO4fRwr21Xg][inet[/192.168.5.2:9300]],[Death-Stalker][6X5mT0NAQ2aoHellP-5U0A][inet[/192.168.5.8:9300]],[Briquette][GaKPrDx-RaqBudM2ZOuj1Q][inet[/192.168.5.10:9300]],[Grenade][wDpt3uSVReuPePjgiMdMXg][inet[/192.168.5.4:9300]],[Cage, Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]],[Red Guardian][HhnTAOKbRdatVMGBrBEKTw][inet[/192.168.6.19:9300]],[Dragonwing][UZxd5EIuRMCy1wT6sExdfQ][inet[/192.168.6.20:9300]],[Hobgoblin][GpfT4JseQpWvrTYS62KsEw][inet[/192.168.5.12:9300]],[Tom Cassidy][LscBtoZVQ9m5PF8yM5c1Fw][inet[/192.168.5.14:9300]],[Bob][nmw1mj92SKCLIxJ3ZxQ4ww][inet[/192.168.6.18:9300]],[Sligguth][UoqEO4uAR_ORTnIbx-508Q][inet[/192.168.5.15:9300]],[Invisible Woman][kMRXVw57QgKA_UkE5_OfjQ][inet[/192.168.5.7:9300]],[Nathaniel Richards][fmd7wh52Sc6FhB9zTxCj3Q][inet[/192.168.6.1:9300]],[Xavin][_oDz7oXhT8SiwTlZkTqZBg][inet[/192.168.5.1:9300]],}, reason: zen-disco-receive(from master [[Cage, Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]]])
[2011-07-14 17:16:55,061][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:55,365][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:55,635][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:55,943][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:55,953][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:55,961][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:56,176][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:56,537][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:56,798][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:56,888][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:56,891][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:57,000][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:57,048][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:57,246][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:57,630][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:57,797][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:57,801][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:57,991][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:58,263][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:58,372][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:58,768][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:58,789][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:58,983][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:59,186][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:59,824][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:59,839][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:17:00,059][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:17:00,253][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:17:00,522][INFO ][cluster.service ] [Wagner, Kurt] added {[Mangle][DZCqZ0yIR8mb7_HC1Kbouw][inet[/192.168.6.11:9300]],}, reason: zen-disco-receive(from master [[Cage, Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]]])
[2011-07-14 17:17:00,524][INFO ][cluster.service ] [Wagner, Kurt] added {[Kala][cV775zgaRke6KCz1HZe-fA][inet[/192.168.6.14:9300]],}, reason: zen-disco-receive(from master [[Cage, Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]]])
[2011-07-14 17:17:00,525][INFO ][cluster.service ] [Wagner, Kurt] added {[Tiboldt, Maynard][orYVQrKyQdiDK-FdB0TaFw][inet[/192.168.6.10:9300]],}, reason: zen-disco-receive(from master [[Cage, Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]]])

I also observe that on each cluster restart, elasticsearch moves all
replicas to different nodes, thus causing lots of traffic. Is this
normal behavior, and could it cause the error above?

Best,
Michel


(Stephane Bastian) #11

Hi all,

FYI, I simply want to say that I had the same issue 2 weeks ago while
testing 0.17 on an index created with 0.16. Everything worked fine until
I switched back to using 0.16, at which point all indexes were deleted.

At that time I though that the problem came from the fact that I
switched back and forth between ES 0.16/0.17. Looking at this thread it
may indeed be an ES bug.
Note that I had the problem only once. However if it happens again, I'll
post as many detailed info as possible.

Stephane Bastian

On Wed, 2011-07-20 at 10:57 +0200, Michel Conrad wrote:

Hi Shay,

The issue with the empty indices reappeared, I lost an index again.
Yesterday I have been programmatically deleting some indices, which
were correctly deleted.
After restarting elasticsearch another index
(dsearch_en_00a858cc8000), which I didn't delete, has been empty.

Grepping over the log files gives the following error.
[2011-07-19 19:13:32,999][DEBUG][action.search.type ] [Hank
McCoy] [dsearch_en_00a858cc8000][0], node[WPUPrpBDT-G5jAgCDdxZXg],
[P], s[STARTED]: Failed to execute
[org.elasticsearch.action.search.SearchRequest@628b54f4]
org.elasticsearch.transport.RemoteTransportException:
[Xemu][inet[/192.168.5.14:9300]][search/phase/query]
Caused by: org.elasticsearch.indices.IndexMissingException:
[dsearch_en_00a858cc8000] missing
at org.elasticsearch.indices.InternalIndicesService.indexServiceSafe(InternalIndicesService.java:208)
at org.elasticsearch.search.SearchService.createContext(SearchService.java:377)
at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:218)
at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:447)
at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:438)
at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:236)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

The master gave the following output, trying to send the cluster state
to the node later serving the primary shard:
[2011-07-19 19:13:26,985][DEBUG][discovery.zen.publish ] [Cloud]
failed to send cluster state to
[[Xemu][WPUPrpBDT-G5jAgCDdxZXg][inet[/192.168.5.14:9300]]], should be
detected as failed soon...
org.elasticsearch.transport.NodeDisconnectedException:
[Xemu][inet[/192.168.5.14:9300]][discovery/zen/publish] disconnected
[2011-07-19 19:13:26,985][DEBUG][discovery.zen.publish ] [Cloud]
failed to send cluster state to
[[Xemu][WPUPrpBDT-G5jAgCDdxZXg][inet[/192.168.5.14:9300]]], should be
detected as failed soon...
org.elasticsearch.transport.NodeDisconnectedException:
[Xemu][inet[/192.168.5.14:9300]][discovery/zen/publish] disconnected
[2011-07-19 19:13:26,986][DEBUG][discovery.zen.publish ] [Cloud]
failed to send cluster state to
[[Xemu][WPUPrpBDT-G5jAgCDdxZXg][inet[/192.168.5.14:9300]]], should be
detected as failed soon...
org.elasticsearch.transport.NodeDisconnectedException:
[Xemu][inet[/192.168.5.14:9300]][discovery/zen/publish] disconnected

The data is still available on a node, but that node isn't serving the
data (I am using 1 replica and 1 shard, local gateway):
The nodes containing data:
192.168.6.5 (6.9G)
192.168.5.14 (28K) -> primary shard
192.168.5.12 (24K) -> secondary shard
192.168.5.8 (4.0K)

Also when I start up my cluster, the data is always recovered from a
primary shard and then streamed to another node. Is this
normal behaviour that the whole data is replicated on startup, instead
of using the data locally available?
Is it possible that the wrong(empty?) node is chosen as the primary
shard, or that the indices are not correctly deleted.

Can you please tell me, if you need some more debugging output and
what classes I can enable logging for.

Best,
Michel

On Tue, Jul 19, 2011 at 11:44 PM, Shay Banon
shay.banon@elasticsearch.com wrote:

It can't really give an error (well, not easily). Its hard since its
considered a single key then, and you need to do the reverse (i.e. find out
which settings are not being used). Possible, but gets complicated when each
module is independent and pluggable.

On Tue, Jul 19, 2011 at 11:10 AM, Michel Conrad
michel.conrad@trendiction.com wrote:

Hi Shay,
My bad, multicast was still enabled because of an error in the
configuration file, there was a whitespace missing, "enabled:false"
should have been "enabled: false". Maybe elasticsearch could give out
a warning if the configuration file has an error instead of silently
ignoring it.

Thanks,
Michel

On Mon, Jul 18, 2011 at 7:15 PM, Shay Banon
shay.banon@elasticsearch.com wrote:

Great, and strange...
Don't understand why multicast is still enabled, can you try setting
this in
the config file (just as one setting string):
discovery.zen.ping.multicast.enabled: false
thanks!

On Mon, Jul 18, 2011 at 1:02 PM, Michel Conrad
michel.conrad@trendiction.com wrote:

Hi,
over the last two days I have been restarting the cluster twice, and
no indices have been lost. The issue did not reappear.

As for the multicast errors, I gisted a logging file with
discovery:trace enabled.
https://gist.github.com/1089067

Best,
Michel

On Sat, Jul 16, 2011 at 1:23 AM, Shay Banon
shay.banon@elasticsearch.com wrote:

You configuration looks good, can you maybe start a simple single
server
and
set discovery to TRACE in the logging.yml file and gist it? Multicast
should
be disabled.
I got a bit confused by the last mail. Are things working fine now if
you
don't have the nodes immediately start and then killed after
shutdown?

On Friday, July 15, 2011 at 4:00 PM, Michel Conrad wrote:

Hi,

I think I possibly found the source of the problem, although this
doesn't explain why I am getting these multicast errors.

I have been running elasticsearch on the different servers in a loop,
in a way that if it would get killed, it would be immediately
restarted.
By shutting down the cluster using 'http://localhost:9200/_shutdown"
the nodes would exit, and be immediately relaunched.
So the cluster was in a state where at the same time, it was shutting
down and starting up. During this unwanted startup, I
have been killing the individual nodes using kill -9, exiting the
loop.

Unfortunately I have been loosing further indices, even doing a
correct cluster restart.
I also pasted some further log information with some nodes leaving
and
immediately rejoining the cluster.

Best,
Michel

[2011-07-15 12:40:35,910][ESC[32mINFO ESC[0m][cluster.service
] [Blade] added
{[Ryder][eBF3XoX4SuqabvbzvleOrg][inet[/192.168.6.5:9300]],}, reason:
zen-disco-receive(from master [[Jumbo
Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/19
2.168.5.1:9300]]])
[2011-07-15 12:40:35,912][ESC[32mINFO ESC[0m][cluster.service
] [Blade] added {[Phimster,
Ellie][9GdlKjo6R6W0m_UPiZpgpQ][inet[/192.168.6.3:9300]],}, reason:
zen-disco-receive(from master [[Jumbo
Carnation][5axELonGSWSw3h9Jt8g_0g
][inet[/192.168.5.1:9300]]])
[2011-07-15 12:40:37,224][ESC[33mWARN
ESC[0m][discovery.zen.ping.multicast] [Blade] received ping response
with no matching id [1]
[2011-07-15 12:40:37,225][ESC[33mWARN
ESC[0m][discovery.zen.ping.multicast] [Blade] received ping response
with no matching id [1]
[2011-07-15 12:40:37,577][ESC[33mWARN
ESC[0m][discovery.zen.ping.multicast] [Blade] received ping response
with no matching id [1]
[2011-07-15 12:40:38,676][ESC[33mWARN
ESC[0m][discovery.zen.ping.multicast] [Blade] received ping response
with no matching id [1]
[2011-07-15 12:40:40,246][ESC[32mINFO ESC[0m][cluster.service
] [Blade] added
{[Siryn][LKwg8eL7SyuvXhgqu70Vxg][inet[/192.168.5.6:9300]],}, reason:
zen-disco-receive(from master [[Jumbo
Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/19
2.168.5.1:9300]]])
[2011-07-15 12:40:44,294][ESC[33mWARN
ESC[0m][discovery.zen.ping.multicast] [Blade] received ping response
with no matching id [1]
[2011-07-15 12:40:44,295][ESC[33mWARN
ESC[0m][discovery.zen.ping.multicast] [Blade] received ping response
with no matching id [1]
[2011-07-15 12:40:51,338][ESC[33mWARN
ESC[0m][discovery.zen.ping.multicast] [Blade] received ping response
with no matching id [1]
[2011-07-15 12:40:51,344][ESC[33mWARN
ESC[0m][discovery.zen.ping.multicast] [Blade] received ping response
with no matching id [1]
[2011-07-15 12:40:54,317][ESC[33mWARN
ESC[0m][discovery.zen.ping.multicast] [Blade] received ping response
with no matching id [1]
[2011-07-15 12:40:54,317][ESC[33mWARN
ESC[0m][discovery.zen.ping.multicast] [Blade] received ping response
with no matching id [1]
[2011-07-15 12:46:04,810][ESC[32mINFO ESC[0m][cluster.service
] [Blade] removed
{[Mephisto][eUWVtEr1QnCkEz1bU7O6lA][inet[/192.168.6.2:9300]],},
reason: zen-disco-receive(from master [[Jumbo
Carnation][5axELonGSWSw3h9Jt8g_0g][ine
t[/192.168.5.1:9300]]])
[2011-07-15 12:46:07,920][ESC[32mINFO ESC[0m][cluster.service
] [Blade] added
{[Mephisto][eUWVtEr1QnCkEz1bU7O6lA][inet[/192.168.6.2:9300]],},
reason: zen-disco-receive(from master [[Jumbo
Carnation][5axELonGSWSw3h9Jt8g_0g][inet[
/192.168.5.1:9300]]])
[2011-07-15 12:47:35,375][INFO ][cluster.service] [Blade] removed {[Advisor][J4FU6zSGQmaBrbR0J-Rzxg][inet[/192.168.5.11:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:47:35,869][INFO ][cluster.service] [Blade] added {[Advisor][J4FU6zSGQmaBrbR0J-Rzxg][inet[/192.168.5.11:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:54:03,654][INFO ][cluster.service] [Blade] removed {[Silver Fox][gjFx44gMQfqOIdh99hcdAw][inet[/192.168.5.15:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:54:04,893][INFO ][cluster.service] [Blade] added {[Silver Fox][gjFx44gMQfqOIdh99hcdAw][inet[/192.168.5.15:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:54:41,639][INFO ][cluster.service] [Blade] removed {[Morbius][dQqqzmK_RR2TQIC03wTd5w][inet[/192.168.5.19:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:54:43,805][INFO ][cluster.service] [Blade] added {[Morbius][dQqqzmK_RR2TQIC03wTd5w][inet[/192.168.5.19:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:56:24,793][INFO ][cluster.service] [Blade] removed {[Silver Fox][gjFx44gMQfqOIdh99hcdAw][inet[/192.168.5.15:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:56:28,903][INFO ][cluster.service] [Blade] added {[Silver Fox][gjFx44gMQfqOIdh99hcdAw][inet[/192.168.5.15:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 13:00:26,160][INFO ][cluster.service] [Blade] removed {[Blind Faith][HP9hnGZhR72XnpEJbHNBxw][inet[/192.168.6.16:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 13:00:32,395][INFO ][cluster.service] [Blade] added {[Blind Faith][HP9hnGZhR72XnpEJbHNBxw][inet[/192.168.6.16:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 13:16:26,690][WARN ][index.engine.robin] [Blade] [dsearch_de_00a858c20000][0] failed to flush after setting shard to inactive
org.elasticsearch.index.engine.EngineClosedException: [dsearch_de_00a858c20000][0] CurrentState[CLOSED]
at org.elasticsearch.index.engine.robin.RobinEngine.flush(RobinEngine.java:657)
at org.elasticsearch.index.engine.robin.RobinEngine.updateIndexingBufferSize(RobinEngine.java:207)
at org.elasticsearch.indices.memory.IndexingMemoryBufferController$ShardsIndicesStatusChecker.run(IndexingMemoryBufferController.java:147)
at org.elasticsearch.threadpool.ThreadPool$LoggingRunnable.run(ThreadPool.java:201)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

On Fri, Jul 15, 2011 at 10:46 AM, Michel Conrad
michel.conrad@trendiction.com wrote:

Hi Shay,

I gisted the configuration file: https://gist.github.com/1084292

The reason I am using unicast is that I want elasticsearch to
listen on two interfaces: communication between the nodes is bound
to the local interface, while the other interface is for the
webserver.
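For reference, a minimal sketch of what such a split can look like in elasticsearch.yml. The addresses and the HTTP-side setting are illustrative assumptions on my part, not taken from the gisted config:

```yaml
network:
  bind_host: 192.168.5.1     # example: interface used for node-to-node transport
  publish_host: 192.168.5.1  # example: address advertised to the other nodes
http:
  host: 10.0.0.1             # example: webserver-facing interface for the REST API
```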

To shut down the cluster I am using curl -XPOST
"http://localhost:9200/_shutdown", followed by a script which calls
"pgrep -f /software/elasticsearch/lib/elasticsearch | xargs kill -9"
on every server.
At first I had not been using the shutdown API and was killing the
nodes sequentially by calling kill -9; maybe it's here that things got
messed up. Could it be that my indices got lost and the cluster
state became inconsistent when the cluster tried to relocate these
indices during the killing of the nodes? I am also using multiple
rivers creating new indices on the fly, which are starting during
cluster startup and still running during cluster shutdown.
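A sketch of a shutdown script that avoids racing a relaunch loop: stop the loop first, ask the cluster to shut down, give the JVMs time to exit, and only then force-kill. The pid file path and the 30-second grace period are my own illustrative assumptions, not part of the original setup:

```shell
#!/bin/sh
# Sketch of an orderly full-cluster shutdown, assuming the relaunch loop
# writes its pid to /var/run/es-loop.pid (an assumption; adapt as needed).
# With DRY_RUN=1 the commands are printed instead of executed.
run() {
  if [ -n "$DRY_RUN" ]; then echo "$*"; else "$@"; fi
}

shutdown_cluster() {
  # 1) stop the relaunch loop first, so exiting nodes are not restarted
  run kill "$(cat /var/run/es-loop.pid 2>/dev/null || echo 0)"
  # 2) ask every node to shut down cleanly via the cluster shutdown API
  run curl -XPOST "http://localhost:9200/_shutdown"
  # 3) give the JVMs time to close shards and exit on their own
  run sleep 30
  # 4) only then force-kill anything still running
  run sh -c 'pgrep -f /software/elasticsearch/lib/elasticsearch | xargs kill -9'
}
```

Killing the loop before calling _shutdown matters: otherwise every node that exits is immediately restarted, and the cluster is shutting down and starting up at the same time.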

Best,
Michel

On Fri, Jul 15, 2011 at 12:49 AM, Shay Banon
shay.banon@elasticsearch.com wrote:

Heya,
First, let's try to get the discovery working, since disabling
multicast should not give those failures. Can you gist your config
(it gets messed up in the mail)?
Just ran a quick 40-node test with 200 indices and it seems to always
recover the data... How do you shutdown the cluster?
-shay.banon

On Thursday, July 14, 2011 at 7:16 PM, Michel Conrad wrote:

Hi,

Sometimes, when doing a full restart of my cluster, some indices
turned up empty (0 docs and green). All data in these indices seems
to be lost!

My configuration is the following:
version 0.16.4 (snapshot), also occurred in a snapshotted version of 0.16.3
40 nodes, +-100 indices.
I use 1 shard / index and 1 replica.
I am using the local gateway with unicast discovery.
I am using rivers to index new data into the database, and they are
automatically already being started during the recovery process.
Indices are also created on the fly.

When I start my cluster, the nodes get added to the cluster. When the
40 nodes are reached, the recovery starts. All the indices turn
yellow, and then slowly green.
The problem is, some indices quickly turned green and lost all
their data. Grepping for the concerned indices over the log file does
not reveal anything.

By looking at the data directory, I found some servers containing
still data in /elasticsearch/search/nodes/0/indices/index001858.
The size of this directory is the following:
Server A) 4.0KB
Server B) 1.4GB
Server C) 36KB
Server D) 564MB
Server E) 4.0K
Server F) 946MB
Server G) 28KB

The cluster state has the index allocated on server C + G, having
36KB and 28KB, and not f.i. on server B holding 1.4GB.

I was wondering why the data available on the hdd is not being
recovered, and why an empty index is being recovered.

On startup I am also getting some warning messages about multicast,
although I am using unicast discovery:
discovery:
  zen.ping:
    multicast:
      enabled:false
    unicast:
      hosts: 192.168.5.1[9300], 192.168.5.2[9300], 192.168.5.3[9300], 192.168.5.4[9300]

the warning messages I am getting:

[2011-07-14 17:16:54,481][WARN ][transport.netty] [Wagner, Kurt] Exception caught on netty layer [[id: 0x1ebe99f8]]
java.nio.channels.UnresolvedAddressException
at sun.nio.ch.Net.checkAddress(Net.java:30)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:480)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink.connect(NioClientSocketPipelineSink.java:140)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink.eventSunk(NioClientSocketPipelineSink.java:103)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:555)
at org.elasticsearch.common.netty.channel.Channels.connect(Channels.java:541)
at org.elasticsearch.common.netty.channel.AbstractChannel.connect(AbstractChannel.java:218)
at org.elasticsearch.common.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:227)
at org.elasticsearch.common.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:188)
at org.elasticsearch.transport.netty.NettyTransport.connectToChannels(NettyTransport.java:504)
at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:475)
at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:126)
at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing$3.run(UnicastZenPing.java:198)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
[2011-07-14 17:16:54,482][WARN ][transport.netty] [Wagner, Kurt] Exception caught on netty layer [[id: 0x736e788c]]
java.nio.channels.UnresolvedAddressException
at sun.nio.ch.Net.checkAddress(Net.java:30)
at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:480)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink.connect(NioClientSocketPipelineSink.java:140)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink.eventSunk(NioClientSocketPipelineSink.java:103)
at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:555)
at org.elasticsearch.common.netty.channel.Channels.connect(Channels.java:541)
at org.elasticsearch.common.netty.channel.AbstractChannel.connect(AbstractChannel.java:218)
at org.elasticsearch.common.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:227)
at org.elasticsearch.common.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:188)
at org.elasticsearch.transport.netty.NettyTransport.connectToChannels(NettyTransport.java:507)
at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:475)
at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:126)
at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing$3.run(UnicastZenPing.java:198)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
[2011-07-14 17:16:54,625][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:54,676][INFO ][cluster.service] [Wagner, Kurt] detected_master [Cage, Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]], added {[Kine, Benedict][0gST1OnzQbSrk5VMoUVmTw][inet[/192.168.6.9:9300]],[Matador][K2cbli_9QRyUO4fRwr21Xg][inet[/192.168.5.2:9300]],[Death-Stalker][6X5mT0NAQ2aoHellP-5U0A][inet[/192.168.5.8:9300]],[Briquette][GaKPrDx-RaqBudM2ZOuj1Q][inet[/192.168.5.10:9300]],[Grenade][wDpt3uSVReuPePjgiMdMXg][inet[/192.168.5.4:9300]],[Cage, Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]],[Red Guardian][HhnTAOKbRdatVMGBrBEKTw][inet[/192.168.6.19:9300]],[Dragonwing][UZxd5EIuRMCy1wT6sExdfQ][inet[/192.168.6.20:9300]],[Hobgoblin][GpfT4JseQpWvrTYS62KsEw][inet[/192.168.5.12:9300]],[Tom Cassidy][LscBtoZVQ9m5PF8yM5c1Fw][inet[/192.168.5.14:9300]],[Bob][nmw1mj92SKCLIxJ3ZxQ4ww][inet[/192.168.6.18:9300]],[Sligguth][UoqEO4uAR_ORTnIbx-508Q][inet[/192.168.5.15:9300]],[Invisible Woman][kMRXVw57QgKA_UkE5_OfjQ][inet[/192.168.5.7:9300]],[Nathaniel Richards][fmd7wh52Sc6FhB9zTxCj3Q][inet[/192.168.6.1:9300]],[Xavin][_oDz7oXhT8SiwTlZkTqZBg][inet[/192.168.5.1:9300]],}, reason: zen-disco-receive(from master [[Cage, Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]]])
[2011-07-14 17:16:55,061][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:55,365][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:55,635][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:55,943][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:55,953][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:55,961][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:56,176][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:56,537][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:56,798][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:56,888][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:56,891][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:57,000][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:57,048][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:57,246][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:57,630][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:57,797][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:57,801][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:57,991][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:58,263][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:58,372][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:58,768][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:58,789][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:58,983][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:59,186][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:59,824][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:59,839][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:17:00,059][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:17:00,253][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:17:00,522][INFO ][cluster.service] [Wagner, Kurt] added {[Mangle][DZCqZ0yIR8mb7_HC1Kbouw][inet[/192.168.6.11:9300]],}, reason: zen-disco-receive(from master [[Cage, Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]]])
[2011-07-14 17:17:00,524][INFO ][cluster.service] [Wagner, Kurt] added {[Kala][cV775zgaRke6KCz1HZe-fA][inet[/192.168.6.14:9300]],}, reason: zen-disco-receive(from master [[Cage, Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]]])
[2011-07-14 17:17:00,525][INFO ][cluster.service] [Wagner, Kurt] added {[Tiboldt, Maynard][orYVQrKyQdiDK-FdB0TaFw][inet[/192.168.6.10:9300]],}, reason: zen-disco-receive(from master [[Cage, Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]]])

I also observe that on each cluster restart, elasticsearch moves all
replicas to different nodes, thus causing lots of traffic. Is this
normal behavior, and could it cause the error above?

Best,
Michel


(Michel Conrad) #12

I managed to restore the index on the running cluster by closing it,
copying the data over from the node which still holds it, and then
reopening the index.

This doesn't solve the issue, but at least it offers a way to
restore the indices.
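Roughly, that manual procedure can be scripted as below. The data path matches the one quoted earlier in the thread; the index name, source host, and the use of rsync are my own illustrative assumptions:

```shell
#!/bin/sh
# Sketch of the close / copy / reopen restore procedure, run from the
# node that should receive the data. With DRY_RUN=1 the commands are
# printed instead of executed.
run() {
  if [ -n "$DRY_RUN" ]; then echo "$*"; else "$@"; fi
}

restore_index() {
  idx="$1"; src_host="$2"
  data="/elasticsearch/search/nodes/0/indices/$idx"
  # 1) close the index so no shard is open while its files are replaced
  run curl -XPOST "http://localhost:9200/$idx/_close"
  # 2) pull the data from the node that still holds a good copy
  run rsync -a "$src_host:$data/" "$data/"
  # 3) reopen the index so it recovers from the copied files
  run curl -XPOST "http://localhost:9200/$idx/_open"
}
```

Closing the index first matters: copying segment files underneath an open shard would corrupt it.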

On Wed, Jul 20, 2011 at 1:59 PM, stephane
stephane.bastian.dev@gmail.com wrote:

Hi all,

FYI, I simply want to say that I had the same issue 2 weeks ago while
testing 0.17 on an index created with 0.16. Everything worked fine until
I switched back to using 0.16, at which point all indexes were deleted.

At that time I thought that the problem came from the fact that I
switched back and forth between ES 0.16/0.17. Looking at this thread it
may indeed be an ES bug.
Note that I had the problem only once. However, if it happens again, I'll
post as much detailed info as possible.

Stephane Bastian

On Wed, 2011-07-20 at 10:57 +0200, Michel Conrad wrote:

Hi Shay,

The issue with the empty indices reappeared, I lost an index again.
Yesterday I have been programmatically deleting some indices, which
were correctly deleted.
After restarting elasticsearch another index
(dsearch_en_00a858cc8000), which I didn't delete, has been empty.

Grepping over the log files gives the following error.
[2011-07-19 19:13:32,999][DEBUG][action.search.type ] [Hank
McCoy] [dsearch_en_00a858cc8000][0], node[WPUPrpBDT-G5jAgCDdxZXg],
[P], s[STARTED]: Failed to execute
[org.elasticsearch.action.search.SearchRequest@628b54f4]
org.elasticsearch.transport.RemoteTransportException:
[Xemu][inet[/192.168.5.14:9300]][search/phase/query]
Caused by: org.elasticsearch.indices.IndexMissingException:
[dsearch_en_00a858cc8000] missing
at org.elasticsearch.indices.InternalIndicesService.indexServiceSafe(InternalIndicesService.java:208)
at org.elasticsearch.search.SearchService.createContext(SearchService.java:377)
at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:218)
at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:447)
at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:438)
at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:236)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

The master gave the following output, trying to send the cluster state
to the node later serving the primary shard:
[2011-07-19 19:13:26,985][DEBUG][discovery.zen.publish ] [Cloud]
failed to send cluster state to
[[Xemu][WPUPrpBDT-G5jAgCDdxZXg][inet[/192.168.5.14:9300]]], should be
detected as failed soon...
org.elasticsearch.transport.NodeDisconnectedException:
[Xemu][inet[/192.168.5.14:9300]][discovery/zen/publish] disconnected
[2011-07-19 19:13:26,985][DEBUG][discovery.zen.publish ] [Cloud]
failed to send cluster state to
[[Xemu][WPUPrpBDT-G5jAgCDdxZXg][inet[/192.168.5.14:9300]]], should be
detected as failed soon...
org.elasticsearch.transport.NodeDisconnectedException:
[Xemu][inet[/192.168.5.14:9300]][discovery/zen/publish] disconnected
[2011-07-19 19:13:26,986][DEBUG][discovery.zen.publish ] [Cloud]
failed to send cluster state to
[[Xemu][WPUPrpBDT-G5jAgCDdxZXg][inet[/192.168.5.14:9300]]], should be
detected as failed soon...
org.elasticsearch.transport.NodeDisconnectedException:
[Xemu][inet[/192.168.5.14:9300]][discovery/zen/publish] disconnected

The data is still available on a node, but that node isn't serving the
data (I am using 1 replica and 1 shard, local gateway):
The nodes containing data:
192.168.6.5 (6.9G)
192.168.5.14 (28K) -> primary shard
192.168.5.12 (24K) -> secondary shard
192.168.5.8 (4.0K)

Also when I start up my cluster, the data is always recovered from a
primary shard and then streamed to another node. Is it normal
behaviour that the whole data is replicated on startup, instead of
using the data locally available?
Is it possible that the wrong (empty?) node is chosen as the primary
shard, or that the indices are not correctly deleted?

Can you please tell me, if you need some more debugging output and
what classes I can enable logging for.

Best,
Michel

On Tue, Jul 19, 2011 at 11:44 PM, Shay Banon
shay.banon@elasticsearch.com wrote:

It can't really give an error (well, not easily). It's hard since it's
considered a single key then, and you need to do the reverse (i.e. find
out which settings are not being used). Possible, but it gets complicated
when each module is independent and pluggable.

On Tue, Jul 19, 2011 at 11:10 AM, Michel Conrad
michel.conrad@trendiction.com wrote:

Hi Shay,
My bad, multicast was still enabled because of an error in the
configuration file: there was a whitespace missing, "enabled:false"
should have been "enabled: false". Maybe elasticsearch could give out
a warning if the configuration file has an error instead of silently
ignoring it.
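One way to catch this class of mistake before startup is to flag any config line whose colon is not followed by a space, since YAML then parses the whole line as a single opaque scalar key. A hedged sketch of such a local check (not an elasticsearch feature):

```shell
# Flag "key:value" lines missing the space after the colon; YAML reads
# such a line as one opaque key ("enabled:false") rather than the
# mapping enabled: false, so the intended setting is silently ignored.
check_yaml_spacing() {
  grep -nE '^[[:space:]]*[A-Za-z0-9_.]+:[^ ]' "$1" || true
}
```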

Thanks,
Michel

On Mon, Jul 18, 2011 at 7:15 PM, Shay Banon
shay.banon@elasticsearch.com wrote:

Great, and strange...
Don't understand why multicast is still enabled, can you try setting
this in
the config file (just as one setting string):
discovery.zen.ping.multicast.enabled: false
thanks!

On Mon, Jul 18, 2011 at 1:02 PM, Michel Conrad
michel.conrad@trendiction.com wrote:

Hi,
over the last two days I have been restarting the cluster twice, and
no indices have been lost. The issue did not reappear.

As for the multicast errors, I gisted a logging file with
discovery:trace enabled.
https://gist.github.com/1089067

Best,
Michel

On Sat, Jul 16, 2011 at 1:23 AM, Shay Banon
shay.banon@elasticsearch.com wrote:

Your configuration looks good. Can you maybe start a simple single
server, set discovery to TRACE in the logging.yml file, and gist it?
Multicast should be disabled.
I got a bit confused by the last mail. Are things working fine now if
you don't have the nodes immediately start and then get killed after
shutdown?

On Friday, July 15, 2011 at 4:00 PM, Michel Conrad wrote:

Hi,

I think I possibly found the source of the problem, although this
doesn't explain why I am getting these multicast errors.

I have been running elasticsearch on the different servers in a loop,
in a way that if it got killed, it would be immediately restarted.
By shutting down the cluster using "http://localhost:9200/_shutdown",
the nodes would exit, and be immediately relaunched.
So the cluster was in a state where, at the same time, it was shutting
down and starting up. During this unwanted startup, I
killed the individual nodes using kill -9, exiting the loop.

Unfortunately I have been losing further indices, even doing a
correct cluster restart.
I also pasted some further log information with some nodes leaving
and immediately rejoining the cluster.

Best,
Michel

[2011-07-15 12:40:35,910][INFO ][cluster.service] [Blade] added {[Ryder][eBF3XoX4SuqabvbzvleOrg][inet[/192.168.6.5:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:40:35,912][INFO ][cluster.service] [Blade] added {[Phimster, Ellie][9GdlKjo6R6W0m_UPiZpgpQ][inet[/192.168.6.3:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:40:37,224][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:37,225][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:37,577][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:38,676][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:40,246][INFO ][cluster.service] [Blade] added {[Siryn][LKwg8eL7SyuvXhgqu70Vxg][inet[/192.168.5.6:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:40:44,294][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:44,295][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:51,338][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:51,344][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:54,317][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:54,317][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:46:04,810][INFO ][cluster.service] [Blade] removed {[Mephisto][eUWVtEr1QnCkEz1bU7O6lA][inet[/192.168.6.2:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:46:07,920][INFO ][cluster.service] [Blade] added {[Mephisto][eUWVtEr1QnCkEz1bU7O6lA][inet[/192.168.6.2:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:47:35,375][ESC[32mINFO ESC[0m][cluster.service
] [Blade] removed
{[Advisor][J4FU6zSGQmaBrbR0J-Rzxg][inet[/192.168.5.11:9300]],},
reason: zen-disco-receive(from master [[Jumbo
Carnation][5axELonGSWSw3h9Jt8g_0g][ine
t[/192.168.5.1:9300]]])
[2011-07-15 12:47:35,869][ESC[32mINFO ESC[0m][cluster.service
] [Blade] added
{[Advisor][J4FU6zSGQmaBrbR0J-Rzxg][inet[/192.168.5.11:9300]],},
reason: zen-disco-receive(from master [[Jumbo
Carnation][5axELonGSWSw3h9Jt8g_0g][inet[
/192.168.5.1:9300]]])
[2011-07-15 12:54:03,654][ESC[32mINFO ESC[0m][cluster.service
] [Blade] removed {[Silver
Fox][gjFx44gMQfqOIdh99hcdAw][inet[/192.168.5.15:9300]],}, reason:
zen-disco-receive(from master [[Jumbo
Carnation][5axELonGSWSw3h9Jt8g_0g][
inet[/192.168.5.1:9300]]])
[2011-07-15 12:54:04,893][ESC[32mINFO ESC[0m][cluster.service
] [Blade] added {[Silver
Fox][gjFx44gMQfqOIdh99hcdAw][inet[/192.168.5.15:9300]],}, reason:
zen-disco-receive(from master [[Jumbo
Carnation][5axELonGSWSw3h9Jt8g_0g][in
et[/192.168.5.1:9300]]])
[2011-07-15 12:54:41,639][ESC[32mINFO ESC[0m][cluster.service
] [Blade] removed
{[Morbius][dQqqzmK_RR2TQIC03wTd5w][inet[/192.168.5.19:9300]],},
reason: zen-disco-receive(from master [[Jumbo
Carnation][5axELonGSWSw3h9Jt8g_0g][ine
t[/192.168.5.1:9300]]])
[2011-07-15 12:54:43,805][ESC[32mINFO ESC[0m][cluster.service
] [Blade] added
{[Morbius][dQqqzmK_RR2TQIC03wTd5w][inet[/192.168.5.19:9300]],},
reason: zen-disco-receive(from master [[Jumbo
Carnation][5axELonGSWSw3h9Jt8g_0g][inet[
/192.168.5.1:9300]]])
[2011-07-15 12:56:24,793][ESC[32mINFO ESC[0m][cluster.service
] [Blade] removed {[Silver
Fox][gjFx44gMQfqOIdh99hcdAw][inet[/192.168.5.15:9300]],}, reason:
zen-disco-receive(from master [[Jumbo
Carnation][5axELonGSWSw3h9Jt8g_0g][
inet[/192.168.5.1:9300]]])
[2011-07-15 12:56:28,903][ESC[32mINFO ESC[0m][cluster.service
] [Blade] added {[Silver
Fox][gjFx44gMQfqOIdh99hcdAw][inet[/192.168.5.15:9300]],}, reason:
zen-disco-receive(from master [[Jumbo
Carnation][5axELonGSWSw3h9Jt8g_0g][in
et[/192.168.5.1:9300]]])
[2011-07-15 13:00:26,160][ESC[32mINFO ESC[0m][cluster.service
] [Blade] removed {[Blind
Faith][HP9hnGZhR72XnpEJbHNBxw][inet[/192.168.6.16:9300]],}, reason:
zen-disco-receive(from master [[Jumbo
Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 13:00:32,395][ESC[32mINFO ESC[0m][cluster.service
] [Blade] added {[Blind
Faith][HP9hnGZhR72XnpEJbHNBxw][inet[/192.168.6.16:9300]],}, reason:
zen-disco-receive(from master [[Jumbo
Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 13:16:26,690][ESC[33mWARN ESC[0m][index.engine.robin
] [Blade] [dsearch_de_00a858c20000][0] failed to flush after setting
shard to inactive
org.elasticsearch.index.engine.EngineClosedException:
[dsearch_de_00a858c20000][0] CurrentState[CLOSED]
at

org.elasticsearch.index.engine.robin.RobinEngine.flush(RobinEnginne.java:657)
at

org.elasticsearch.index.engine.robin.RobinEngine.updateIndexingBufferSize(RobinEngine.java:207)
at

org.elasticsearch.indices.memory.IndexingMemoryBufferController$ShardsIndicesStatusChecker.run(IndexingMemoryBufferController.java:147)
at

org.elasticsearch.threadpool.ThreadPool$LoggingRunnable.run(ThreadPool.java:201)
at

java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at

java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
at

java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
at

java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
at

java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
at

java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

On FFri, Jul 15, 2011 at 10:46 AM, Michel Conrad
michel.conrad@trendiction.com wrote:

Hi Shay,

I gisted the configuration file: https://gist.github.com/1084292

The reason I am using unicast, is because I want to have
elasticsearch
listen on two interfaces, and only bind the communication between
the nodes to the local interface, the other interface is for the
webserver.

To shutdown the cluster I am using "curl -XPOST
'http://localhost:9200/_shutdown", followed by a script which calls
"pgrep -f /software/elasticsearch/lib/elasticsearch | xargs kill -9"
on every server.
At first I have not been using the shutdown API and was killing the
nodes sequentially by calling kill -9, maybe its here that things got
messed up. Could it be that my indices got lost and that the cluster
state got inconsistent, when the cluster tries to relocate these
indices, during the killing of the nodes? I am also using multiple
rivers creating new indices on the fly, which are starting during
cluster startup and still running during cluster shutdown.

Best,
Michel

On Fri, Jul 15, 2011 at 12:49 AM, Shay Banon
shay.banon@elasticsearch.com wrote:

Heya,
First, lest try and get the discovery working, since disabling
multicast
should not give those failures. Can you gist your config (it gets
messed
up
in the mail)?
Just ran a quick 40 nodes test with 200 indices and it seems to
always
recover the data... . How do you shutdown the cluster?
-shay.banon

On Thursday, July 14, 2011 at 7:16 PM, Michel Conrad wrote:

Hi,

Sometimes, when doing a full restart of my cluster, some indices
turned up empty (0 docs and green). All data in these indices seems
to
be lost!

My configuration is the following:
version 0.16.4 (snapshot), also occured in a snapshoted version of
0.16.3
40 nodes, +-100 indices.
I use 1 shard / index and 1 replica.
I am using the local gateway with unicast discovery.
I am using rivers to index new data into the database, and they are
automatically already being started during the recovery process.
Indices are also created on the fly.

When I start my cluster, the nodes get added to the cluster. When the
40 nodes are
reached, the recovery is starting. All the indices turn to yellow,
and
then slowly to green.
The problem is, some indices quickly turned to green and lost all
their data. Greping
for the concerned indices over the log file does not reveal anything.

By looking at the data directory, I found some servers containing
still data in /elasticsearch/search/nodes/0/indices/index001858.
The size of this directory is the following:
Server A) 4.0KB
Server B) 1.4GB
Server C) 36KB
Server D) 564MB
Server E) 4.0K
Server F) 946MB
Server G) 28KB

The cluster state has the index allocated on server C + G, having
36KB
and 28KB, and not f.i. on server B holding 1.4GB.

I was wondering why the data available on the hdd is not being
recovered, and why an empty index is being recovered.

On startup I am also getting some warning messages about multicast,
altough I am using unicast discovery:
discovery:
zen.ping:
multicast:
enabled:false
unicast:
hosts: 192.168.5.1[9300], 192.168.5.2[9300], 192.168.5.3[9300],
192.168.5.4[9300]

the warning messages I am getting:

[2011-07-14 17:16:54,481][WARN ][transport.netty] [Wagner, Kurt] Exception caught on netty layer [[id: 0x1ebe99f8]]
java.nio.channels.UnresolvedAddressException
    at sun.nio.ch.Net.checkAddress(Net.java:30)
    at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:480)
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink.connect(NioClientSocketPipelineSink.java:140)
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink.eventSunk(NioClientSocketPipelineSink.java:103)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:555)
    at org.elasticsearch.common.netty.channel.Channels.connect(Channels.java:541)
    at org.elasticsearch.common.netty.channel.AbstractChannel.connect(AbstractChannel.java:218)
    at org.elasticsearch.common.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:227)
    at org.elasticsearch.common.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:188)
    at org.elasticsearch.transport.netty.NettyTransport.connectToChannels(NettyTransport.java:504)
    at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:475)
    at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:126)
    at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing$3.run(UnicastZenPing.java:198)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
[2011-07-14 17:16:54,482][WARN ][transport.netty] [Wagner, Kurt] Exception caught on netty layer [[id: 0x736e788c]]
java.nio.channels.UnresolvedAddressException
    at sun.nio.ch.Net.checkAddress(Net.java:30)
    at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:480)
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink.connect(NioClientSocketPipelineSink.java:140)
    at org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink.eventSunk(NioClientSocketPipelineSink.java:103)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:555)
    at org.elasticsearch.common.netty.channel.Channels.connect(Channels.java:541)
    at org.elasticsearch.common.netty.channel.AbstractChannel.connect(AbstractChannel.java:218)
    at org.elasticsearch.common.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:227)
    at org.elasticsearch.common.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:188)
    at org.elasticsearch.transport.netty.NettyTransport.connectToChannels(NettyTransport.java:507)
    at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:475)
    at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:126)
    at org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing$3.run(UnicastZenPing.java:198)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
[2011-07-14 17:16:54,625][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:54,676][INFO ][cluster.service] [Wagner, Kurt] detected_master [Cage, Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]], added {[Kine, Benedict][0gST1OnzQbSrk5VMoUVmTw][inet[/192.168.6.9:9300]],[Matador][K2cbli_9QRyUO4fRwr21Xg][inet[/192.168.5.2:9300]],[Death-Stalker][6X5mT0NAQ2aoHellP-5U0A][inet[/192.168.5.8:9300]],[Briquette][GaKPrDx-RaqBudM2ZOuj1Q][inet[/192.168.5.10:9300]],[Grenade][wDpt3uSVReuPePjgiMdMXg][inet[/192.168.5.4:9300]],[Cage, Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]],[Red Guardian][HhnTAOKbRdatVMGBrBEKTw][inet[/192.168.6.19:9300]],[Dragonwing][UZxd5EIuRMCy1wT6sExdfQ][inet[/192.168.6.20:9300]],[Hobgoblin][GpfT4JseQpWvrTYS62KsEw][inet[/192.168.5.12:9300]],[Tom Cassidy][LscBtoZVQ9m5PF8yM5c1Fw][inet[/192.168.5.14:9300]],[Bob][nmw1mj92SKCLIxJ3ZxQ4ww][inet[/192.168.6.18:9300]],[Sligguth][UoqEO4uAR_ORTnIbx-508Q][inet[/192.168.5.15:9300]],[Invisible Woman][kMRXVw57QgKA_UkE5_OfjQ][inet[/192.168.5.7:9300]],[Nathaniel Richards][fmd7wh52Sc6FhB9zTxCj3Q][inet[/192.168.6.1:9300]],[Xavin][_oDz7oXhT8SiwTlZkTqZBg][inet[/192.168.5.1:9300]],}, reason: zen-disco-receive(from master [[Cage, Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]]])
[2011-07-14 17:16:55,061][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:55,365][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:55,635][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:55,943][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:55,953][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:55,961][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:56,176][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:56,537][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:56,798][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:56,888][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:56,891][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:57,000][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:57,048][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:57,246][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:57,630][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:57,797][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:57,801][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:57,991][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:58,263][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:58,372][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:58,768][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:58,789][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:58,983][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:59,186][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:59,824][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:59,839][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:17:00,059][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:17:00,253][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:17:00,522][INFO ][cluster.service ] [Wagner,
Kurt] added
{[Mangle][DZCqZ0yIR8mb7_HC1Kbouw][inet[/192.168.6.11:9300]],},
reason: zen-disco-receive(from master [[Cage,
Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]]])
[2011-07-14 17:17:00,524][INFO ][cluster.service ] [Wagner,
Kurt] added
{[Kala][cV775zgaRke6KCz1HZe-fA][inet[/192.168.6.14:9300]],},
reason: zen-disco-receive(from master [[Cage,
Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]]])
[2011-07-14 17:17:00,525][INFO ][cluster.service ] [Wagner,
Kurt] added {[Tiboldt,
Maynard][orYVQrKyQdiDK-FdB0TaFw][inet[/192.168.6.10:9300]],}, reason:
zen-disco-receive(from master [[Cage,
Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]]])

I also observe that on each cluster restart, elasticsearch moves all
replicas to different nodes, causing a lot of traffic. Is this
normal behavior, and could it cause the error above?

Best,
Michel


(Shay Banon) #13

Hi,

Can you set gateway.local to TRACE next time you restart? Let's see why it
decides to allocate those empty shards to the wrong nodes...

Also, is there a chance you can upgrade to 0.17.1? It will be simpler to
try to help and debug this. If this happens consistently (to a degree), I
would love to have a more real-time chat (IRC?) with you to recreate it and
try to understand why it happens.

On Wed, Jul 20, 2011 at 6:56 PM, Michel Conrad <
michel.conrad@trendiction.com> wrote:

I managed to restore the index while the cluster was running by closing
it, copying the data over from the node that still holds it, and then
reopening the index.

This doesn't solve the issue, but at least it offers a possibility to
restore the indices.
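The manual restore described above can be sketched as a short script. This is a hedged sketch, not a supported procedure: the index name, source hostname, and data path are illustrative assumptions taken from earlier in the thread, and the index open/close API calls match what the post describes.

```python
# Sketch of the manual restore: close the index, copy the shard data
# over from the node that still has it, then reopen the index.
# Index name, hostname, and data path are illustrative assumptions.
INDEX = "index001858"
DATA_DIR = f"/elasticsearch/search/nodes/0/indices/{INDEX}/"

def restore_commands(source_host: str) -> list:
    """Return the shell commands to run, in order."""
    return [
        ["curl", "-XPOST", f"http://localhost:9200/{INDEX}/_close"],
        ["rsync", "-a", f"{source_host}:{DATA_DIR}", DATA_DIR],
        ["curl", "-XPOST", f"http://localhost:9200/{INDEX}/_open"],
    ]

# Print the commands instead of executing them (dry run).
for cmd in restore_commands("serverB"):
    print(" ".join(cmd))
```

Running the commands for real would require the cluster to be reachable on localhost:9200 and passwordless SSH to the source node.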

On Wed, Jul 20, 2011 at 1:59 PM, stephane
stephane.bastian.dev@gmail.com wrote:

Hi all,

FYI, I simply want to say that I had the same issue two weeks ago while
testing 0.17 on an index created with 0.16. Everything worked fine until
I switched back to using 0.16, at which point all indexes were deleted.

At the time I thought the problem came from switching back and forth
between ES 0.16 and 0.17. Looking at this thread, it may indeed be an ES bug.
Note that I have had the problem only once; if it happens again, I'll
post as much detailed info as possible.

Stephane Bastian

On Wed, 2011-07-20 at 10:57 +0200, Michel Conrad wrote:

Hi Shay,

The issue with the empty indices has reappeared; I lost an index again.
Yesterday I programmatically deleted some indices, and they were
correctly deleted. After restarting elasticsearch, another index
(dsearch_en_00a858cc8000), which I didn't delete, turned up empty.

Grepping over the log files gives the following error.
[2011-07-19 19:13:32,999][DEBUG][action.search.type] [Hank McCoy] [dsearch_en_00a858cc8000][0], node[WPUPrpBDT-G5jAgCDdxZXg], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest@628b54f4]
org.elasticsearch.transport.RemoteTransportException: [Xemu][inet[/192.168.5.14:9300]][search/phase/query]
Caused by: org.elasticsearch.indices.IndexMissingException: [dsearch_en_00a858cc8000] missing
    at org.elasticsearch.indices.InternalIndicesService.indexServiceSafe(InternalIndicesService.java:208)
    at org.elasticsearch.search.SearchService.createContext(SearchService.java:377)
    at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:218)
    at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:447)
    at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:438)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:236)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)

The master gave the following output, trying to send the cluster state
to the node later serving the primary shard:
[2011-07-19 19:13:26,985][DEBUG][discovery.zen.publish ] [Cloud]
failed to send cluster state to
[[Xemu][WPUPrpBDT-G5jAgCDdxZXg][inet[/192.168.5.14:9300]]], should be
detected as failed soon...
org.elasticsearch.transport.NodeDisconnectedException:
[Xemu][inet[/192.168.5.14:9300]][discovery/zen/publish] disconnected
[2011-07-19 19:13:26,985][DEBUG][discovery.zen.publish ] [Cloud]
failed to send cluster state to
[[Xemu][WPUPrpBDT-G5jAgCDdxZXg][inet[/192.168.5.14:9300]]], should be
detected as failed soon...
org.elasticsearch.transport.NodeDisconnectedException:
[Xemu][inet[/192.168.5.14:9300]][discovery/zen/publish] disconnected
[2011-07-19 19:13:26,986][DEBUG][discovery.zen.publish ] [Cloud]
failed to send cluster state to
[[Xemu][WPUPrpBDT-G5jAgCDdxZXg][inet[/192.168.5.14:9300]]], should be
detected as failed soon...
org.elasticsearch.transport.NodeDisconnectedException:
[Xemu][inet[/192.168.5.14:9300]][discovery/zen/publish] disconnected

The data is still available on a node, but that node isn't serving it
(I am using 1 replica and 1 shard, local gateway).
The nodes containing data:
192.168.6.5 (6.9G)
192.168.5.14 (28K) -> primary shard
192.168.5.12 (24K) -> secondary shard
192.168.5.8 (4.0K)

Also, when I start up my cluster, the data is always recovered from a
primary shard and then streamed to another node. Is it normal behaviour
that all the data is replicated on startup instead of using the data
locally available? Is it possible that the wrong (empty?) node is chosen
as the primary shard, or that the indices are not correctly deleted?

Can you please tell me if you need more debugging output, and which
classes I should enable logging for.

Best,
Michel

On Tue, Jul 19, 2011 at 11:44 PM, Shay Banon
shay.banon@elasticsearch.com wrote:

It can't really give an error (well, not easily). It's hard since it's
considered a single key then, and you need to do the reverse (i.e. find out
which settings are not being used). Possible, but it gets complicated when
each module is independent and pluggable.

On Tue, Jul 19, 2011 at 11:10 AM, Michel Conrad
michel.conrad@trendiction.com wrote:

Hi Shay,
My bad, multicast was still enabled because of an error in the
configuration file: a whitespace was missing, "enabled:false"
should have been "enabled: false". Maybe elasticsearch could give
a warning if the configuration file has an error instead of silently
ignoring it.
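A minimal, hand-rolled sketch (a toy parser, not Elasticsearch's actual YAML handling) of why the missing space matters: in YAML, a plain `key: value` pair needs a space after the colon, so `enabled:false` is read as one scalar string rather than a key/value setting, and the setting is silently never set.

```python
# Toy illustration (assumption: simplified YAML-like line parsing) of why
# "enabled:false" is silently ignored while "enabled: false" works.
def parse_line(line: str):
    key, sep, value = line.strip().partition(": ")
    if sep:
        return {key: value.strip()}   # recognized key/value pair
    return line.strip()               # plain scalar: no setting is set

print(parse_line("enabled: false"))  # {'enabled': 'false'}
print(parse_line("enabled:false"))   # 'enabled:false' (ignored as a setting)
```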

Thanks,
Michel

On Mon, Jul 18, 2011 at 7:15 PM, Shay Banon
shay.banon@elasticsearch.com wrote:

Great, and strange...
Don't understand why multicast is still enabled. Can you try setting this
in the config file (just as one setting string):
discovery.zen.ping.multicast.enabled: false
Thanks!

On Mon, Jul 18, 2011 at 1:02 PM, Michel Conrad
michel.conrad@trendiction.com wrote:

Hi,
over the last two days I have been restarting the cluster twice,
and

no indices have been lost. The issue did not reappear.

As for the multicast errors, I gisted a logging file with
discovery:trace enabled.
https://gist.github.com/1089067

Best,
Michel

On Sat, Jul 16, 2011 at 1:23 AM, Shay Banon
shay.banon@elasticsearch.com wrote:

Your configuration looks good. Can you maybe start a simple single server,
set discovery to TRACE in the logging.yml file, and gist it? Multicast
should be disabled.
I got a bit confused by the last mail. Are things working fine now if you
don't have the nodes immediately restart and then get killed after
shutdown?

On Friday, July 15, 2011 at 4:00 PM, Michel Conrad wrote:

Hi,

I think I possibly found the source of the problem, although this
doesn't explain why I am getting these multicast errors.

I have been running elasticsearch on the different servers in a loop,
so that if a node got killed, it would be immediately restarted.
By shutting down the cluster using 'http://localhost:9200/_shutdown',
the nodes would exit and be immediately relaunched. So the cluster was
in a state where it was shutting down and starting up at the same time.
During this unwanted startup, I killed the individual nodes using
kill -9, exiting the loop.
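One way to avoid this shutdown/restart race, sketched below under the assumption of a simple wrapper around the launch command: guard the restart loop with a sentinel file that is created on every server before calling the _shutdown API, so an intentional shutdown is not immediately undone. The sentinel path and launch command are hypothetical, not from the original setup.

```python
# Hypothetical restart-loop guard: relaunch the node whenever it exits,
# but stop looping once a sentinel file exists. Create the sentinel on
# every server before calling the _shutdown API.
import os
import subprocess

SENTINEL = "/tmp/es-stop"  # assumed path

def should_restart(sentinel: str = SENTINEL) -> bool:
    """Keep relaunching only while the sentinel file is absent."""
    return not os.path.exists(sentinel)

def run_loop(launch_cmd: list, sentinel: str = SENTINEL) -> int:
    """Run the node in a loop; return how many times it was launched."""
    launches = 0
    while should_restart(sentinel):
        subprocess.call(launch_cmd)  # blocks until the node process exits
        launches += 1
    return launches
```

With this guard, the cluster-wide shutdown sequence becomes: touch the sentinel on every server, call _shutdown, and only then clean up any stragglers.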

Unfortunately I have been losing further indices, even when doing a
correct cluster restart. I also pasted some further log information
with some nodes leaving and immediately rejoining the cluster.

Best,
Michel

[2011-07-15 12:40:35,910][INFO ][cluster.service] [Blade] added {[Ryder][eBF3XoX4SuqabvbzvleOrg][inet[/192.168.6.5:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:40:35,912][INFO ][cluster.service] [Blade] added {[Phimster, Ellie][9GdlKjo6R6W0m_UPiZpgpQ][inet[/192.168.6.3:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:40:37,224][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:37,225][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:37,577][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:38,676][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:40,246][INFO ][cluster.service] [Blade] added {[Siryn][LKwg8eL7SyuvXhgqu70Vxg][inet[/192.168.5.6:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:40:44,294][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:44,295][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:51,338][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:51,344][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:54,317][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:54,317][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:46:04,810][INFO ][cluster.service] [Blade] removed {[Mephisto][eUWVtEr1QnCkEz1bU7O6lA][inet[/192.168.6.2:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:46:07,920][INFO ][cluster.service] [Blade] added {[Mephisto][eUWVtEr1QnCkEz1bU7O6lA][inet[/192.168.6.2:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:47:35,375][INFO ][cluster.service] [Blade] removed {[Advisor][J4FU6zSGQmaBrbR0J-Rzxg][inet[/192.168.5.11:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:47:35,869][INFO ][cluster.service] [Blade] added {[Advisor][J4FU6zSGQmaBrbR0J-Rzxg][inet[/192.168.5.11:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:54:03,654][INFO ][cluster.service] [Blade] removed {[Silver Fox][gjFx44gMQfqOIdh99hcdAw][inet[/192.168.5.15:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:54:04,893][INFO ][cluster.service] [Blade] added {[Silver Fox][gjFx44gMQfqOIdh99hcdAw][inet[/192.168.5.15:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:54:41,639][INFO ][cluster.service] [Blade] removed {[Morbius][dQqqzmK_RR2TQIC03wTd5w][inet[/192.168.5.19:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:54:43,805][INFO ][cluster.service] [Blade] added {[Morbius][dQqqzmK_RR2TQIC03wTd5w][inet[/192.168.5.19:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:56:24,793][INFO ][cluster.service] [Blade] removed {[Silver Fox][gjFx44gMQfqOIdh99hcdAw][inet[/192.168.5.15:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:56:28,903][INFO ][cluster.service] [Blade] added {[Silver Fox][gjFx44gMQfqOIdh99hcdAw][inet[/192.168.5.15:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 13:00:26,160][INFO ][cluster.service] [Blade] removed {[Blind Faith][HP9hnGZhR72XnpEJbHNBxw][inet[/192.168.6.16:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 13:00:32,395][INFO ][cluster.service] [Blade] added {[Blind Faith][HP9hnGZhR72XnpEJbHNBxw][inet[/192.168.6.16:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 13:16:26,690][WARN ][index.engine.robin] [Blade] [dsearch_de_00a858c20000][0] failed to flush after setting shard to inactive
org.elasticsearch.index.engine.EngineClosedException: [dsearch_de_00a858c20000][0] CurrentState[CLOSED]
    at org.elasticsearch.index.engine.robin.RobinEngine.flush(RobinEngine.java:657)
    at org.elasticsearch.index.engine.robin.RobinEngine.updateIndexingBufferSize(RobinEngine.java:207)
    at org.elasticsearch.indices.memory.IndexingMemoryBufferController$ShardsIndicesStatusChecker.run(IndexingMemoryBufferController.java:147)
    at org.elasticsearch.threadpool.ThreadPool$LoggingRunnable.run(ThreadPool.java:201)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
    at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)

On Fri, Jul 15, 2011 at 10:46 AM, Michel Conrad
michel.conrad@trendiction.com wrote:

Hi Shay,

I gisted the configuration file:
https://gist.github.com/1084292

The reason I am using unicast is that I want elasticsearch to listen on
two interfaces and bind the inter-node communication only to the local
interface; the other interface is for the webserver.

To shut down the cluster I am using "curl -XPOST
'http://localhost:9200/_shutdown'", followed by a script which calls
"pgrep -f /software/elasticsearch/lib/elasticsearch | xargs kill -9"
on every server.
At first I was not using the shutdown API and was killing the nodes
sequentially with kill -9; maybe that is where things got messed up.
Could it be that my indices got lost and the cluster state became
inconsistent when the cluster tried to relocate these indices while the
nodes were being killed? I am also using multiple rivers creating new
indices on the fly, which start during cluster startup and are still
running during cluster shutdown.

Best,
Michel

On Fri, Jul 15, 2011 at 12:49 AM, Shay Banon
shay.banon@elasticsearch.com wrote:

Heya,
First, let's try to get the discovery working, since disabling multicast
should not give those failures. Can you gist your config (it gets messed
up in the mail)?
Just ran a quick 40-node test with 200 indices and it seems to always
recover the data... How do you shut down the cluster?
-shay.banon

On Thursday, July 14, 2011 at 7:16 PM, Michel Conrad wrote:

Hi,

Sometimes, when doing a full restart of my cluster, some indices
turned up empty (0 docs and green). All data in these indices
seems

to
be lost!

My configuration is the following:
version 0.16.4 (snapshot), also occured in a snapshoted version
of

0.16.3
40 nodes, +-100 indices.
I use 1 shard / index and 1 replica.
I am using the local gateway with unicast discovery.
I am using rivers to index new data into the database, and they
are

automatically already being started during the recovery process.
Indices are also created on the fly.

When I start my cluster, the nodes get added to the cluster.
When the

40 nodes are
reached, the recovery is starting. All the indices turn to
yellow,

and
then slowly to green.
The problem is, some indices quickly turned to green and lost
all

their data. Greping
for the concerned indices over the log file does not reveal
anything.

By looking at the data directory, I found some servers
containing

still data in /elasticsearch/search/nodes/0/indices/index001858.
The size of this directory is the following:
Server A) 4.0KB
Server B) 1.4GB
Server C) 36KB
Server D) 564MB
Server E) 4.0K
Server F) 946MB
Server G) 28KB

The cluster state has the index allocated on server C + G,
having

36KB
and 28KB, and not f.i. on server B holding 1.4GB.

I was wondering why the data available on the hdd is not being
recovered, and why an empty index is being recovered.

On startup I am also getting some warning messages about
multicast,

altough I am using unicast discovery:
discovery:
zen.ping:
multicast:
enabled:false
unicast:
hosts: 192.168.5.1[9300], 192.168.5.2[9300], 192.168.5.3[9300],
192.168.5.4[9300]

the warning messages I am getting:

[2011-07-14 17:16:54,481][WARN ][transport.netty ] [Wagner,
Kurt] Exception caught on netty layer [[id: 0x1ebe99f8]]
java.nio.channels.UnresolvedAddressException
at sun.nio.ch.Net.checkAddress(Net.java:30)
at
sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:480)

at

org.elasticsearch.common.netty.channel.socket.nio.NioClientSockettPipelineSink.connect(NioClientSocketPipelineSink.java:140)

at

org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink.eventSunk(NioClientSocketPipelineSink.java:103)

at

org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:555)

at

org.elasticsearch.common.netty.channel.Channels.connect(Channels.java:541)

at

org.elasticsearch.common.netty.channel.AbstractChannel.connect(AbstractChannel.java:218)

at

org.elasticsearch.common.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:227)

at

org.elasticsearch.common.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:188)

at

org.elasticsearch.transport.netty.NettyTransport.connectToChanneels(NettyTransport.java:504)

at

org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:475)

at

org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:126)

at

org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing$3.run(UnicastZenPing.java:198)

at

java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

at

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

at java.lang.Thread.run(Thread.java:662)
[2011-07-14 17:16:54,482][WARN ][transport.netty ] [Wagner,
Kurt] Exception caught on netty layer [[id: 0x736e788c]]
java.nio.channels.UnresolvedAddressException
at sun.nio.ch.Net.checkAddress(Net.java:30)
at
sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:480)

at

org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink.connect(NioClientSocketPipelineSink.java:140)

at

org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink.eventSunk(NioClientSocketPipelineSink.java:103)

at

org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:555)

at

org.elasticsearch.common.netty.channel.Channels.connect(Channels.java:541)

at

org.elasticsearch.common.netty.channel.AbstractChannel.connect(AbstractChannel.java:218)

at

org.elasticsearch.common.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:227)

at

org.elasticsearch.common.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:188)

at

org.elasticsearch.transport.netty.NettyTransport.connectToChannels(NettyTransport.java:507)

at

org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:475)

at

org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:126)

at

org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing$$3.run(UnicastZenPing.java:198)

at

java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)

at

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

at java.lang.Thread.run(Thread.java:662)
[2011-07-14 17:16:54,625][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:54,676][INFO ][cluster.service ] [Wagner,
Kurt] detected_master [Cage,
Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]], added
{[Kine,

Benedict][0gST1OnzQbSrk5VMoUVmTw][inet[/192.168.6.9:9300
]],[Matador][K2cbli_9QRyUO4fRwr21Xg][inet[/192.168.5.2:9300
]],[Death-Stalker][6X5mT0NAQ2aoHellP-5U0A][inet[/192.168.5.8:9300
]],[Briquette][GaKPrDx-RaqBudM2ZOuj1Q][inet[/192.168.5.10:9300
]],[Grenade][wDpt3uSVReuPePjgiMdMXg][inet[/192.168.5.4:9300]],[Cage,

Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]],[Red

Guardian][HhnTAOKbRdatVMGBrBEKTw][inet[/192.168.6.19:9300
]],[Dragonwing][UZxd5EIuRMCy1wT6sExdfQ][inet[/192.168.6.20:9300
]],[Hobgoblin][GpfT4JseQpWvrTYS62KsEw][inet[/192.168.5.12:9300]],[Tom

Cassidy][LscBtoZVQ9m5PF8yM5c1Fw][inet[/192.168.5.14:9300
]],[Bob][nmw1mj92SKCLIxJ3ZxQ4ww][inet[/192.168.6.18:9300
]],[Sligguth][UoqEO4uAR_ORTnIbx-508Q][inet[/192.168.5.15:9300]],[Invisible

Woman][kMRXVw57QgKA_UkE5_OfjQ][inet[/192.168.5.7:9300
]],[Nathaniel

Richards][fmd7wh52Sc6FhB9zTxCj3Q][inet[/192.168.6.1:9300
]],[Xavin][_oDz7oXhT8SiwTlZkTqZBg][inet[/192.168.5.1:9300]],},

reason: zen-disco-receive(from master [[Cage,
Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]]])
[2011-07-14 17:16:55,061][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:55,365][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:55,635][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:55,943][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:55,953][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:55,961][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:56,176][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:56,537][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:56,798][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:56,888][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:56,891][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:57,000][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:57,048][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:57,246][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:57,630][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:57,797][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:57,801][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:57,991][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:58,263][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:58,372][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:58,768][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:58,789][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:58,983][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:59,186][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:59,824][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:59,839][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:17:00,059][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:17:00,253][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:17:00,522][INFO ][cluster.service ] [Wagner,
Kurt] added
{[Mangle][DZCqZ0yIR8mb7_HC1Kbouw][inet[/192.168.6.11:9300]],},
reason: zen-disco-receive(from master [[Cage,
Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]]])
[2011-07-14 17:17:00,524][INFO ][cluster.service ] [Wagner,
Kurt] added
{[Kala][cV775zgaRke6KCz1HZe-fA][inet[/192.168.6.14:9300]],},
reason: zen-disco-receive(from master [[Cage,
Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]]])
[2011-07-14 17:17:00,525][INFO ][cluster.service ] [Wagner,
Kurt] added {[Tiboldt,
Maynard][orYVQrKyQdiDK-FdB0TaFw][inet[/192.168.6.10:9300]],},
reason: zen-disco-receive(from master [[Cage,
Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]]])

I also observe that on each cluster restart, elasticsearch moves all
replicas to different nodes, thus causing lots of traffic. Is this
normal behavior, and could it cause the error above?

Best,
Michel


(Michel Conrad) #14

Hi Shay,

I've set gateway.local to TRACE and restarted. All indices came
up, although it always takes some hours to go from yellow to green.
I will update to 0.17.1 on Monday, do some more testing, and would
be glad to chat with you if I manage to recreate the issue.
I don't know if it means anything, but on some servers there are 2
node directories. (The second one is only a few KB; I think it was
created while restarting the nodes some weeks ago, when a node had
not been shut down correctly.)

Thanks,
Michel

On Thu, Jul 21, 2011 at 7:59 PM, Shay Banon
shay.banon@elasticsearch.com wrote:

Hi,
Can you set gateway.local to TRACE next time you restart? Let's see why it
decides to allocate those empty shards to the wrong nodes.
Also, is there a chance you can upgrade to 0.17.1? It will be simpler to
try and help debug this. If this happens consistently (to a degree), I
would love to have a more realtime chat (IRC?) with you to recreate it and
try to understand why it happens.
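For reference, the TRACE setting can go in logging.yml; a minimal sketch
(the exact logger key layout is an assumption based on the 0.17-era
logging.yml format, where the org.elasticsearch. prefix is dropped):

```yaml
# logging.yml — assumed layout; enables TRACE for the local gateway logger
# (org.elasticsearch.gateway.local)
logger:
  gateway.local: TRACE
```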

On Wed, Jul 20, 2011 at 6:56 PM, Michel Conrad
michel.conrad@trendiction.com wrote:

I managed to restore the index while the cluster was running by closing
it, copying the data over from the node which still holds it, and then
reopening the index.

This doesn't solve the issue, but at least it offers a way to
restore the indices.
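For anyone hitting the same problem, the manual restore described above can
be sketched roughly as follows. The index name, node address, and data path
are taken from this thread and are assumptions for your own setup; the
_close/_open calls are the standard index open/close API, and the rsync
step stands in for whatever copy mechanism you use:

```shell
# Sketch of the manual restore: close the index, copy the shard data over
# from the node that still holds it, then reopen the index.
# DRY_RUN=1 (the default) only prints each command instead of running it.
INDEX="dsearch_en_00a858cc8000"                      # index to restore (assumed)
GOOD_NODE="192.168.6.5"                              # node that still has the data
DATA="/elasticsearch/search/nodes/0/indices/$INDEX"  # local-gateway data dir (assumed)

# Print the command in dry-run mode, execute it otherwise.
run() { if [ "${DRY_RUN:-1}" = "1" ]; then echo "$@"; else "$@"; fi; }

run curl -XPOST "http://localhost:9200/$INDEX/_close"
run rsync -a "root@$GOOD_NODE:$DATA/" "$DATA/"
run curl -XPOST "http://localhost:9200/$INDEX/_open"
```

With DRY_RUN=0 this would execute against a live cluster; review the
printed commands first.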

On Wed, Jul 20, 2011 at 1:59 PM, stephane
stephane.bastian.dev@gmail.com wrote:

Hi all,

FYI, I simply want to say that I had the same issue 2 weeks ago while
testing 0.17 on an index created with 0.16. Everything worked fine until
I switched back to 0.16, at which point all indexes were deleted.

At the time I thought the problem came from switching back and forth
between ES 0.16/0.17. Looking at this thread, it may indeed be an ES bug.
Note that I had the problem only once. If it happens again, I'll
post as much detailed info as possible.

Stephane Bastian

On Wed, 2011-07-20 at 10:57 +0200, Michel Conrad wrote:

Hi Shay,

The issue with the empty indices reappeared: I lost an index again.
Yesterday I programmatically deleted some indices, and they were
correctly deleted. After restarting elasticsearch, another index
(dsearch_en_00a858cc8000), which I didn't delete, turned up empty.

Grepping over the log files gives the following error.
[2011-07-19 19:13:32,999][DEBUG][action.search.type ] [Hank
McCoy] [dsearch_en_00a858cc8000][0], node[WPUPrpBDT-G5jAgCDdxZXg],
[P], s[STARTED]: Failed to execute
[org.elasticsearch.action.search.SearchRequest@628b54f4]
org.elasticsearch.transport.RemoteTransportException:
[Xemu][inet[/192.168.5.14:9300]][search/phase/query]
Caused by: org.elasticsearch.indices.IndexMissingException:
[dsearch_en_00a858cc8000] missing
at
org.elasticsearch.indices.InternalIndicesService.indexServiceSafe(InternalIndicesService.java:208)
at
org.elasticsearch.search.SearchService.createContext(SearchService.java:377)
at
org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:218)
at
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:447)
at
org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:438)
at
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:236)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

The master gave the following output, trying to send the cluster state
to the node later serving the primary shard:
[2011-07-19 19:13:26,985][DEBUG][discovery.zen.publish ] [Cloud]
failed to send cluster state to
[[Xemu][WPUPrpBDT-G5jAgCDdxZXg][inet[/192.168.5.14:9300]]], should be
detected as failed soon...
org.elasticsearch.transport.NodeDisconnectedException:
[Xemu][inet[/192.168.5.14:9300]][discovery/zen/publish] disconnected
[2011-07-19 19:13:26,985][DEBUG][discovery.zen.publish ] [Cloud]
failed to send cluster state to
[[Xemu][WPUPrpBDT-G5jAgCDdxZXg][inet[/192.168.5.14:9300]]], should be
detected as failed soon...
org.elasticsearch.transport.NodeDisconnectedException:
[Xemu][inet[/192.168.5.14:9300]][discovery/zen/publish] disconnected
[2011-07-19 19:13:26,986][DEBUG][discovery.zen.publish ] [Cloud]
failed to send cluster state to
[[Xemu][WPUPrpBDT-G5jAgCDdxZXg][inet[/192.168.5.14:9300]]], should be
detected as failed soon...
org.elasticsearch.transport.NodeDisconnectedException:
[Xemu][inet[/192.168.5.14:9300]][discovery/zen/publish] disconnected

The data is still available on a node, but that node isn't serving the
data (I am using 1 replica and 1 shard, local gateway):
The nodes containing data:
192.168.6.5 (6.9G)
192.168.5.14 (28K) -> primary shard
192.168.5.12 (24K) -> secondary shard
192.168.5.8 (4.0K)

Also, when I start up my cluster, the data is always recovered from a
primary shard and then streamed to another node. Is it normal
behaviour that all the data is replicated on startup, instead
of reusing the data locally available?
Is it possible that the wrong (empty?) node is chosen as the primary
shard, or that the indices are not correctly deleted?

Please tell me if you need some more debugging output, and
which classes I should enable logging for.

Best,
Michel

On Tue, Jul 19, 2011 at 11:44 PM, Shay Banon
shay.banon@elasticsearch.com wrote:

It can't really give an error (well, not easily). It's hard since it's
considered a single key then, and you need to do the reverse (i.e.
find out which settings are not being used). Possible, but it gets
complicated when each module is independent and pluggable.

On Tue, Jul 19, 2011 at 11:10 AM, Michel Conrad
michel.conrad@trendiction.com wrote:

Hi Shay,
My bad, multicast was still enabled because of an error in the
configuration file: a whitespace was missing, so "enabled:false"
should have been "enabled: false". Maybe elasticsearch could give
a warning if the configuration file has an error instead of silently
ignoring it.
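To make the pitfall concrete: in YAML, a colon only starts a key/value
pair when it is followed by a space, so "enabled:false" is parsed as one
plain scalar string and the setting is silently dropped. The corrected
fragment of the discovery config from this thread:

```yaml
discovery:
  zen.ping:
    multicast:
      # "enabled:false" (no space) would be read as a single scalar string
      # and ignored; the space after the colon makes it a real key/value pair.
      enabled: false
    unicast:
      hosts: 192.168.5.1[9300], 192.168.5.2[9300], 192.168.5.3[9300], 192.168.5.4[9300]
```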

Thanks,
Michel

On Mon, Jul 18, 2011 at 7:15 PM, Shay Banon
shay.banon@elasticsearch.com wrote:

Great, and strange...
I don't understand why multicast is still enabled. Can you try setting
this in the config file (just as one setting string):
discovery.zen.ping.multicast.enabled: false
Thanks!

On Mon, Jul 18, 2011 at 1:02 PM, Michel Conrad
michel.conrad@trendiction.com wrote:

Hi,
over the last two days I have been restarting the cluster twice,
and
no indices have been lost. The issue did not reappear.

As for the multicast errors, I gisted a logging file with
discovery:trace enabled.
https://gist.github.com/1089067

Best,
Michel

On Sat, Jul 16, 2011 at 1:23 AM, Shay Banon
shay.banon@elasticsearch.com wrote:

Your configuration looks good. Can you maybe start a simple single
server, set discovery to TRACE in the logging.yml file, and gist the
output? Multicast should be disabled.
I got a bit confused by the last mail. Are things working fine now if
you don't have the nodes immediately restart and then get killed after
shutdown?

On Friday, July 15, 2011 at 4:00 PM, Michel Conrad wrote:

Hi,

I think I possibly found the source of the problem, although this
doesn't explain why I am getting these multicast errors.

I have been running elasticsearch on the different servers in a loop,
so that if it got killed, it would be immediately restarted.
By shutting down the cluster using 'http://localhost:9200/_shutdown'
the nodes would exit, and be immediately relaunched.
So the cluster was in a state where it was shutting down and starting
up at the same time. During this unwanted startup, I killed the
individual nodes using kill -9, exiting the loop.

Unfortunately I have been losing further indices, even doing a
correct cluster restart.
I also pasted some further log information with some nodes leaving
and immediately rejoining the cluster.

Best,
Michel

[2011-07-15 12:40:35,910][INFO ][cluster.service] [Blade] added {[Ryder][eBF3XoX4SuqabvbzvleOrg][inet[/192.168.6.5:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:40:35,912][INFO ][cluster.service] [Blade] added {[Phimster, Ellie][9GdlKjo6R6W0m_UPiZpgpQ][inet[/192.168.6.3:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:40:37,224][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:37,225][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:37,577][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:38,676][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:40,246][INFO ][cluster.service] [Blade] added {[Siryn][LKwg8eL7SyuvXhgqu70Vxg][inet[/192.168.5.6:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:40:44,294][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:44,295][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:51,338][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:51,344][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:54,317][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:40:54,317][WARN ][discovery.zen.ping.multicast] [Blade] received ping response with no matching id [1]
[2011-07-15 12:46:04,810][INFO ][cluster.service] [Blade] removed {[Mephisto][eUWVtEr1QnCkEz1bU7O6lA][inet[/192.168.6.2:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:46:07,920][INFO ][cluster.service] [Blade] added {[Mephisto][eUWVtEr1QnCkEz1bU7O6lA][inet[/192.168.6.2:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:47:35,375][INFO ][cluster.service] [Blade] removed {[Advisor][J4FU6zSGQmaBrbR0J-Rzxg][inet[/192.168.5.11:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:47:35,869][INFO ][cluster.service] [Blade] added {[Advisor][J4FU6zSGQmaBrbR0J-Rzxg][inet[/192.168.5.11:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:54:03,654][INFO ][cluster.service] [Blade] removed {[Silver Fox][gjFx44gMQfqOIdh99hcdAw][inet[/192.168.5.15:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:54:04,893][INFO ][cluster.service] [Blade] added {[Silver Fox][gjFx44gMQfqOIdh99hcdAw][inet[/192.168.5.15:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:54:41,639][INFO ][cluster.service] [Blade] removed {[Morbius][dQqqzmK_RR2TQIC03wTd5w][inet[/192.168.5.19:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:54:43,805][INFO ][cluster.service] [Blade] added {[Morbius][dQqqzmK_RR2TQIC03wTd5w][inet[/192.168.5.19:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:56:24,793][INFO ][cluster.service] [Blade] removed {[Silver Fox][gjFx44gMQfqOIdh99hcdAw][inet[/192.168.5.15:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 12:56:28,903][INFO ][cluster.service] [Blade] added {[Silver Fox][gjFx44gMQfqOIdh99hcdAw][inet[/192.168.5.15:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 13:00:26,160][INFO ][cluster.service] [Blade] removed {[Blind Faith][HP9hnGZhR72XnpEJbHNBxw][inet[/192.168.6.16:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 13:00:32,395][INFO ][cluster.service] [Blade] added {[Blind Faith][HP9hnGZhR72XnpEJbHNBxw][inet[/192.168.6.16:9300]],}, reason: zen-disco-receive(from master [[Jumbo Carnation][5axELonGSWSw3h9Jt8g_0g][inet[/192.168.5.1:9300]]])
[2011-07-15 13:16:26,690][WARN ][index.engine.robin] [Blade] [dsearch_de_00a858c20000][0] failed to flush after setting shard to inactive
org.elasticsearch.index.engine.EngineClosedException:
[dsearch_de_00a858c20000][0] CurrentState[CLOSED]
at org.elasticsearch.index.engine.robin.RobinEngine.flush(RobinEngine.java:657)
at org.elasticsearch.index.engine.robin.RobinEngine.updateIndexingBufferSize(RobinEngine.java:207)
at org.elasticsearch.indices.memory.IndexingMemoryBufferController$ShardsIndicesStatusChecker.run(IndexingMemoryBufferController.java:147)
at org.elasticsearch.threadpool.ThreadPool$LoggingRunnable.run(ThreadPool.java:201)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:317)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:150)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$101(ScheduledThreadPoolExecutor.java:98)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.runPeriodic(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:204)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

On Fri, Jul 15, 2011 at 10:46 AM, Michel Conrad
michel.conrad@trendiction.com wrote:

Hi Shay,

I gisted the configuration file:
https://gist.github.com/1084292

The reason I am using unicast is that I want elasticsearch to listen
on two interfaces, and to bind the communication between the nodes
only to the local interface; the other interface is for the
webserver.

To shut down the cluster I am using "curl -XPOST
'http://localhost:9200/_shutdown'", followed by a script which calls
"pgrep -f /software/elasticsearch/lib/elasticsearch | xargs kill -9"
on every server.
At first I was not using the shutdown API and was killing the
nodes sequentially by calling kill -9; maybe it's here that things
got messed up. Could it be that my indices got lost and the cluster
state became inconsistent when the cluster tried to relocate these
indices during the killing of the nodes? I am also using multiple
rivers creating new indices on the fly, which start during
cluster startup and are still running during cluster shutdown.

Best,
Michel

On Fri, Jul 15, 2011 at 12:49 AM, Shay Banon
shay.banon@elasticsearch.com wrote:

Heya,
First, let's try and get the discovery working, since disabling
multicast should not give those failures. Can you gist your config
(it gets messed up in the mail)?
I just ran a quick 40-node test with 200 indices and it seems to
always recover the data... . How do you shut down the cluster?
-shay.banon

On Thursday, July 14, 2011 at 7:16 PM, Michel Conrad wrote:

Hi,

Sometimes, when doing a full restart of my cluster, some
indices
turned up empty (0 docs and green). All data in these indices
seems
to
be lost!

My configuration is the following:
version 0.16.4 (snapshot), also occured in a snapshoted version
of
0.16.3
40 nodes, +-100 indices.
I use 1 shard / index and 1 replica.
I am using the local gateway with unicast discovery.
I am using rivers to index new data into the database, and they
are
automatically already being started during the recovery
process.
Indices are also created on the fly.

When I start my cluster, the nodes get added to the cluster.
When the
40 nodes are
reached, the recovery is starting. All the indices turn to
yellow,
and
then slowly to green.
The problem is, some indices quickly turned to green and lost
all
their data. Greping
for the concerned indices over the log file does not reveal
anything.

By looking at the data directory, I found some servers
containing
still data in
/elasticsearch/search/nodes/0/indices/index001858.
The size of this directory is the following:
Server A) 4.0KB
Server B) 1.4GB
Server C) 36KB
Server D) 564MB
Server E) 4.0K
Server F) 946MB
Server G) 28KB

The cluster state has the index allocated on server C + G,
having
36KB
and 28KB, and not f.i. on server B holding 1.4GB.

I was wondering why the data available on the hdd is not being
recovered, and why an empty index is being recovered.

On startup I am also getting some warning messages about
multicast,
altough I am using unicast discovery:
discovery:
zen.ping:
multicast:
enabled:false
unicast:
hosts: 192.168.5.1[9300], 192.168.5.2[9300], 192.168.5.3[9300],
192.168.5.4[9300]

the warning messages I am getting:

[2011-07-14 17:16:54,481][WARN ][transport.netty ] [Wagner,
Kurt] Exception caught on netty layer [[id: 0x1ebe99f8]]
java.nio.channels.UnresolvedAddressException
at sun.nio.ch.Net.checkAddress(Net.java:30)
at
sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:480)
at

org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink.connect(NioClientSocketPipelineSink.java:140)
at

org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink.eventSunk(NioClientSocketPipelineSink.java:103)
at

org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:555)
at

org.elasticsearch.common.netty.channel.Channels.connect(Channels.java:541)
at

org.elasticsearch.common.netty.channel.AbstractChannel.connect(AbstractChannel.java:218)
at

org.elasticsearch.common.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:227)
at

org.elasticsearch.common.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:188)
at

org.elasticsearch.transport.netty.NettyTransport.connectToChannels(NettyTransport.java:504)
at

org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:475)
at

org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:126)
at

org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing$3.run(UnicastZenPing.java:198)
at

java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
[2011-07-14 17:16:54,482][WARN ][transport.netty ] [Wagner,
Kurt] Exception caught on netty layer [[id: 0x736e788c]]
java.nio.channels.UnresolvedAddressException
at sun.nio.ch.Net.checkAddress(Net.java:30)
at
sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:480)
at

org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink.connect(NioClientSocketPipelineSink.java:140)
at

org.elasticsearch.common.netty.channel.socket.nio.NioClientSocketPipelineSink.eventSunk(NioClientSocketPipelineSink.java:103)
at

org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendDownstream(DefaultChannelPipeline.java:555)
at

org.elasticsearch.common.netty.channel.Channels.connect(Channels.java:541)
at

org.elasticsearch.common.netty.channel.AbstractChannel.connect(AbstractChannel.java:218)
at

org.elasticsearch.common.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:227)
at

org.elasticsearch.common.netty.bootstrap.ClientBootstrap.connect(ClientBootstrap.java:188)
at

org.elasticsearch.transport.netty.NettyTransport.connectToChannels(NettyTransport.java:507)
at

org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:475)
at

org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:126)
at

org.elasticsearch.discovery.zen.ping.unicast.UnicastZenPing$3.run(UnicastZenPing.java:198)
at

java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
[2011-07-14 17:16:54,625][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:54,676][INFO ][cluster.service ] [Wagner,
Kurt] detected_master [Cage,
Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]], added
{[Kine,

Benedict][0gST1OnzQbSrk5VMoUVmTw][inet[/192.168.6.9:9300]],[Matador][K2cbli_9QRyUO4fRwr21Xg][inet[/192.168.5.2:9300]],[Death-Stalker][6X5mT0NAQ2aoHellP-5U0A][inet[/192.168.5.8:9300]],[Briquette][GaKPrDx-RaqBudM2ZOuj1Q][inet[/192.168.5.10:9300]],[Grenade][wDpt3uSVReuPePjgiMdMXg][inet[/192.168.5.4:9300]],[Cage,
Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]],[Red

Guardian][HhnTAOKbRdatVMGBrBEKTw][inet[/192.168.6.19:9300]],[Dragonwing][UZxd5EIuRMCy1wT6sExdfQ][inet[/192.168.6.20:9300]],[Hobgoblin][GpfT4JseQpWvrTYS62KsEw][inet[/192.168.5.12:9300]],[Tom

Cassidy][LscBtoZVQ9m5PF8yM5c1Fw][inet[/192.168.5.14:9300]],[Bob][nmw1mj92SKCLIxJ3ZxQ4ww][inet[/192.168.6.18:9300]],[Sligguth][UoqEO4uAR_ORTnIbx-508Q][inet[/192.168.5.15:9300]],[Invisible

Woman][kMRXVw57QgKA_UkE5_OfjQ][inet[/192.168.5.7:9300]],[Nathaniel

Richards][fmd7wh52Sc6FhB9zTxCj3Q][inet[/192.168.6.1:9300]],[Xavin][_oDz7oXhT8SiwTlZkTqZBg][inet[/192.168.5.1:9300]],},
reason: zen-disco-receive(from master [[Cage,
Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]]])
[2011-07-14 17:16:55,061][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:55,365][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:55,635][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:55,943][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:55,953][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:55,961][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:56,176][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:56,537][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:56,798][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:56,888][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:56,891][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:57,000][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:57,048][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:57,246][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:57,630][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:57,797][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:57,801][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:57,991][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:58,263][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:58,372][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:58,768][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:58,789][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:58,983][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:59,186][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:16:59,824][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:16:59,839][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [2]
[2011-07-14 17:17:00,059][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:17:00,253][WARN ][discovery.zen.ping.multicast]
[Wagner, Kurt] received ping response with no matching id [1]
[2011-07-14 17:17:00,522][INFO ][cluster.service ] [Wagner,
Kurt] added
{[Mangle][DZCqZ0yIR8mb7_HC1Kbouw][inet[/192.168.6.11:9300]],},
reason: zen-disco-receive(from master [[Cage,
Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]]])
[2011-07-14 17:17:00,524][INFO ][cluster.service ] [Wagner,
Kurt] added
{[Kala][cV775zgaRke6KCz1HZe-fA][inet[/192.168.6.14:9300]],},
reason: zen-disco-receive(from master [[Cage,
Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]]])
[2011-07-14 17:17:00,525][INFO ][cluster.service ] [Wagner,
Kurt] added {[Tiboldt,
Maynard][orYVQrKyQdiDK-FdB0TaFw][inet[/192.168.6.10:9300]],},
reason:
zen-disco-receive(from master [[Cage,
Luke][613gLOsUSbGjE5hfZYWutQ][inet[/192.168.5.3:9300]]])

I also observe that on each cluster restart, elastic search
moves all
replicas to different nodes, thus causing lots of traffic. Is
this
normal behavior and could this cause the error above?

Best,
Michel


(system) #15