Failed to start shard


(gnulinux) #1

Hi

I am evaluating ElasticSearch (0.17.8) for a spatial search platform.
I was able to set up a two-node cluster and everything was working
fine. But after rebooting both nodes, I am getting the following
error on both.

[2011-10-19 06:02:45,243][WARN ][indices.cluster ] [linux Ubu2] [books][1] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [books][1] shard allocated for local recovery (post api), should exists, but doesn't
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:99)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:179)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:679)

Config Files:

Node01 (Master)

cluster:
  name: gnulinux

node.name: "linux Ubu2"
node.master: true
node.data: true
node.rack: rack01

network:
  bindHost: 192.168.2.10
  publishHost: 192.168.2.10

index.engine.robin.refreshInterval: -1
index.gateway.snapshot_interval: -1
index.gateway.type: local
index.number_of_shards: 5
index.number_of_replicas: 1

gateway.recover_after_nodes: 2
gateway.recover_after_time: 5m
gateway.expected_nodes: 2
cluster.routing.allocation.node_initial_primaries_recoveries: 4
cluster.routing.allocation.node_concurrent_recoveries: 2
indices.recovery.concurrent_streams: 5

index:
  store:
    fs:
      memory:
        enabled: true
discovery:
  jgroups:
    config: tcp
    bind_port: 9700
    bind_address: 192.168.2.10
    tcpping:
      initial_hosts: 192.168.2.10[9700], 192.168.2.11[9700]

Node02:

cluster:
  name: gnulinux

node.name: "linux Ubu1"
node.master: false
node.data: true
node.rack: rack01

network:
  bindHost: 192.168.2.11
  publishHost: 192.168.2.11

index.engine.robin.refreshInterval: -1
index.gateway.snapshot_interval: -1
index.gateway.type: local
index.number_of_shards: 5
index.number_of_replicas: 1

gateway.recover_after_nodes: 2
gateway.recover_after_time: 5m
gateway.expected_nodes: 2
cluster.routing.allocation.node_initial_primaries_recoveries: 4
cluster.routing.allocation.node_concurrent_recoveries: 2
indices.recovery.concurrent_streams: 5

index:
  store:
    fs:
      memory:
        enabled: true
discovery:
  jgroups:
    config: tcp
    bind_port: 9700
    bind_address: 192.168.2.11
    tcpping:
      initial_hosts: 192.168.2.10[9700], 192.168.2.11[9700]


(Shay Banon) #2

Are you sure the two nodes find each other? Your configuration sets up
jgroups for discovery, which was removed in version 0.6.
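
One way to catch leftover settings like this before a restart is to scan a node's settings for keys under `discovery.jgroups`. The sketch below is only an illustration. The helper names `flatten` and `find_removed_discovery_keys` are hypothetical, the settings are passed in as a plain Python dict rather than read from `elasticsearch.yml`, and the nesting of the jgroups block is assumed, since the pasted config lost its indentation.

```python
def flatten(settings, prefix=""):
    """Flatten a nested settings dict into dotted keys (e.g. discovery.jgroups.config)."""
    flat = {}
    for key, value in settings.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, dotted))
        else:
            flat[dotted] = value
    return flat


def find_removed_discovery_keys(settings):
    """Return the sorted dotted keys that still configure jgroups discovery."""
    return sorted(k for k in flatten(settings) if k.startswith("discovery.jgroups."))


# Settings mirroring the Node01 discovery block from the original post.
# The nesting is an assumption; the values are copied from the post.
node01 = {
    "cluster": {"name": "gnulinux"},
    "discovery": {
        "jgroups": {
            "config": "tcp",
            "bind_port": 9700,
            "bind_address": "192.168.2.10",
            "tcpping": {"initial_hosts": "192.168.2.10[9700], 192.168.2.11[9700]"},
        }
    },
}

for key in find_removed_discovery_keys(node01):
    print("leftover jgroups setting:", key)
```

The check only encodes the one rule stated above (jgroups discovery is gone); it says nothing about whether the remaining settings are valid.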


(Viji Nair) #3

Hi,

Yes, I missed that, though I don't understand how the cluster health API
was still reporting "green" and a total node count of 2.

I have now switched to zen discovery and deleted all the existing indexes.
Node discovery works properly, but after creating an index (this time
following the twitter example) and adding some data, a subsequent reboot
gives the same issue. These are the steps I followed.

  1. Installed Java

java -version

java version "1.6.0_29"
Java(TM) SE Runtime Environment (build 1.6.0_29-b11)
Java HotSpot(TM) 64-Bit Server VM (build 20.4-b02, mixed mode)

  2. This is an Ubuntu 64-bit machine (11.04)

uname -a

Linux ubu-ser 2.6.38-8-generic #42-Ubuntu SMP Mon Apr 11 03:31:24 UTC 2011
x86_64 x86_64 x86_64 GNU/Linux

  3. Installed ElasticSearch 0.17.9 and Service Wrapper.

  4. Configured a two-node cluster

Node01 Configuration

cat /root/elasticsearch-0.17.8/config/elasticsearch.yml

cluster:
  name: gnulinux

node.name: "Ubu2"
node.master: true
node.data: true
node.rack: rack01

network:
  bindHost: 192.168.2.10
  publishHost: 192.168.2.10

index.engine.robin.refreshInterval: -1
index.gateway.snapshot_interval: -1
index.gateway.type: local
index.number_of_shards: 5
index.number_of_replicas: 1

gateway.recover_after_nodes: 2
gateway.recover_after_time: 5m
gateway.expected_nodes: 2
cluster.routing.allocation.node_initial_primaries_recoveries: 4
cluster.routing.allocation.node_concurrent_recoveries: 2
indices.recovery.concurrent_streams: 5

index:
  store:
    fs:
      memory:
        enabled: true

discovery:
  zen:
    ping_timeout: 30s
    ping:
      multicast:
        enabled: false
      unicast:
        enabled: true
        hosts: 192.168.2.10, 192.168.2.11
    fd:
      ping_retries: 10
      ping_interval: 5s
      ping_timeout: 30s

Node02 Configuration

cat /root/elasticsearch-0.17.8/config/elasticsearch.yml

cluster:
  name: gnulinux

node.name: "Ubu1"
node.master: false
node.data: true
node.rack: rack01

network:
  bindHost: 192.168.2.11
  publishHost: 192.168.2.11

index.engine.robin.refreshInterval: -1
index.gateway.snapshot_interval: -1
index.gateway.type: local
index.number_of_shards: 5
index.number_of_replicas: 1

gateway.recover_after_nodes: 2
gateway.recover_after_time: 5m
gateway.expected_nodes: 2
cluster.routing.allocation.node_initial_primaries_recoveries: 4
cluster.routing.allocation.node_concurrent_recoveries: 2
indices.recovery.concurrent_streams: 5

index:
  store:
    fs:
      memory:
        enabled: true

discovery:
  zen:
    ping_timeout: 30s
    ping:
      multicast:
        enabled: false
      unicast:
        enabled: true
        hosts: 192.168.2.10, 192.168.2.11
    fd:
      ping_retries: 10
      ping_interval: 5s
      ping_timeout: 30s

  5. Started both nodes and checked the status; both were up. Verified the
    log file as well. Everything was fine up to this step.

curl -XGET 'http://192.168.2.10:9200/_cluster/health?pretty=true'

{
"cluster_name" : "gnulinux",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 2,
"number_of_data_nodes" : 2,
"active_primary_shards" : 0,
"active_shards" : 0,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0
}

curl -XGET 'http://192.168.2.11:9200/_cluster/health?pretty=true'

{
"cluster_name" : "gnulinux",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 2,
"number_of_data_nodes" : 2,
"active_primary_shards" : 0,
"active_shards" : 0,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0
}

  6. Added some data and restarted the nodes. The cluster status is now red
    and the log file shows the same error.

curl -XGET 'http://192.168.2.10:9200/_cluster/health?pretty=true'

{
"cluster_name" : "gnulinux",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 2,
"number_of_data_nodes" : 2,
"active_primary_shards" : 0,
"active_shards" : 0,
"relocating_shards" : 0,
"initializing_shards" : 5,
"unassigned_shards" : 5
}

curl -XGET 'http://192.168.2.11:9200/_cluster/health?pretty=true'

{
"cluster_name" : "gnulinux",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 2,
"number_of_data_nodes" : 2,
"active_primary_shards" : 0,
"active_shards" : 0,
"relocating_shards" : 0,
"initializing_shards" : 5,
"unassigned_shards" : 5
}

[2011-10-22 01:18:46,703][WARN ][cluster.action.shard ] [Ubu1] sending failed shard for [twitter][4], node[fgELpN11R6m2XbIKdLHYgg], [P], s[INITIALIZING], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[twitter][4] shard allocated for local recovery (post api), should exists, but doesn't]]]
[2011-10-22 01:18:46,704][WARN ][cluster.action.shard ] [Ubu1] sending failed shard for [twitter][1], node[fgELpN11R6m2XbIKdLHYgg], [P], s[INITIALIZING], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[twitter][1] shard allocated for local recovery (post api), should exists, but doesn't]]]
[2011-10-22 01:18:46,704][WARN ][cluster.action.shard ] [Ubu1] sending failed shard for [twitter][0], node[fgELpN11R6m2XbIKdLHYgg], [P], s[INITIALIZING], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[twitter][0] shard allocated for local recovery (post api), should exists, but doesn't]]]
[2011-10-22 01:18:46,725][WARN ][indices.cluster ] [Ubu1] [twitter][1] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [twitter][1] shard allocated for local recovery (post api), should exists, but doesn't
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:99)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:179)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
[2011-10-22 01:18:46,736][WARN ][indices.cluster ] [Ubu1] [twitter][4] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [twitter][4] shard allocated for local recovery (post api), should exists, but doesn't
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:99)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:179)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
[2011-10-22 01:18:46,737][WARN ][cluster.action.shard ] [Ubu1] sending failed shard for [twitter][4], node[fgELpN11R6m2XbIKdLHYgg], [P], s[INITIALIZING], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[twitter][4] shard allocated for local recovery (post api), should exists, but doesn't]]]
[2011-10-22 01:18:46,737][WARN ][cluster.action.shard ] [Ubu1] sending failed shard for [twitter][1], node[fgELpN11R6m2XbIKdLHYgg], [P], s[INITIALIZING], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[twitter][1] shard allocated for local recovery (post api), should exists, but doesn't]]]
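
To see at a glance which shards are affected, the repeated warnings can be reduced to a count per shard. This is a rough sketch, not an official tool: `failed_shards` is a hypothetical helper, and the pattern is written only against the two message shapes that appear in the log excerpt in this post. It collapses whitespace first, so it works whether or not the log lines were wrapped by the mail client.

```python
import re
from collections import Counter

# Matches both shapes seen in the excerpt:
#   IndexShardGatewayRecoveryException[[twitter][4] shard allocated ...
#   IndexShardGatewayRecoveryException: [twitter][1] shard allocated ...
FAILED_SHARD = re.compile(
    r"IndexShardGatewayRecoveryException(?:\[|:\s)"
    r"\[([^\]\[]+)\]\[(\d+)\] shard allocated for local recovery"
)


def failed_shards(log_text):
    """Count recovery failures per (index, shard) in a chunk of log text."""
    normalized = " ".join(log_text.split())  # undo hard line wrapping
    return Counter(
        (m.group(1), int(m.group(2))) for m in FAILED_SHARD.finditer(normalized)
    )


# Two entries copied from the log in this post, wrapping included.
sample = """
[2011-10-22 01:18:46,703][WARN ][cluster.action.shard ] [Ubu1] sending
failed shard for [twitter][4], node[fgELpN11R6m2XbIKdLHYgg], [P],
s[INITIALIZING], reason [Failed to start shard, message
[IndexShardGatewayRecoveryException[[twitter][4] shard allocated for local
recovery (post api), should exists, but doesn't]]]
[2011-10-22 01:18:46,725][WARN ][indices.cluster ] [Ubu1]
[twitter][1] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException:
[twitter][1] shard allocated for local recovery (post api), should exists,
but doesn't
"""

for (index, shard), count in sorted(failed_shards(sample).items()):
    print(f"{index}[{shard}]: {count} failure(s)")
```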

Thanks
Viji
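
The two health responses in this post differ in a way the `status` field alone hides: "green" before indexing simply meant there were no shards yet, while "red" after the restart means no primary came back. A small sketch of that reading follows; `summarize_health` and its messages are hypothetical, and it only looks at fields that appear in the responses quoted in this post.

```python
import json


def summarize_health(health, expected_nodes):
    """Turn a parsed _cluster/health response into a list of plain-text notes."""
    notes = []
    if health["number_of_nodes"] != expected_nodes:
        notes.append(
            f"expected {expected_nodes} nodes, cluster reports {health['number_of_nodes']}"
        )
    pending = health["initializing_shards"] + health["unassigned_shards"]
    if health["active_shards"] == 0 and pending == 0:
        # No indices exist yet, so the status cannot say anything about recovery.
        notes.append("no shards at all: status says nothing about recovery yet")
    elif health["active_primary_shards"] == 0 and pending > 0:
        notes.append(f"no primary shard is active, {pending} shards are still pending")
    return notes


# Response body copied from the post-restart check in this post.
after_restart = json.loads("""
{
  "cluster_name" : "gnulinux",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 0,
  "active_shards" : 0,
  "relocating_shards" : 0,
  "initializing_shards" : 5,
  "unassigned_shards" : 5
}
""")

print(after_restart["status"], summarize_health(after_restart, expected_nodes=2))
```

Here `expected_nodes=2` mirrors the `gateway.expected_nodes: 2` setting in the configuration; the function itself does not talk to the cluster.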



(Shay Banon) #4

Are you sure that you don't delete the index content between restarts?
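
One way to answer this with evidence is to list the files under the node's data directory before the restart and compare the listing afterwards. A minimal sketch, with loud assumptions: `snapshot` and `missing_after_restart` are hypothetical helpers, and `DATA_DIR` is a placeholder, not a path taken from this thread. Point it at wherever the node actually stores its data.

```python
import os


def snapshot(data_dir):
    """Map each file under data_dir (as a relative path) to its size in bytes."""
    files = {}
    for root, _dirs, names in os.walk(data_dir):
        for name in names:
            full = os.path.join(root, name)
            files[os.path.relpath(full, data_dir)] = os.path.getsize(full)
    return files


def missing_after_restart(before, after):
    """Return the files that existed before the restart but are gone afterwards."""
    return sorted(set(before) - set(after))


# Placeholder path (assumption): replace with the node's real data directory.
DATA_DIR = "/path/to/elasticsearch/data"

if os.path.isdir(DATA_DIR):
    before = snapshot(DATA_DIR)
    print(len(before), "files on disk before the restart")
    # ... restart the node, then take a second snapshot and compare:
    # print(missing_after_restart(before, snapshot(DATA_DIR)))
```

If the first snapshot is already empty after indexing, the data was never written to that directory in the first place, which is a different problem from something deleting it during the restart.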



(Viji Nair) #5

Yes, I am sure. I deleted the old index, reconfigured from scratch as
explained, added data, tested, and restarted the nodes. Nothing was deleted
in between.

On Sat, Oct 22, 2011 at 5:23 AM, Shay Banon kimchy@gmail.com wrote:

Are you sure that you don't delete the index content between restarts?

On Fri, Oct 21, 2011 at 10:09 PM, Viji Nair viji@linux.com wrote:

Hi,

Yes, I missed it. But I don't know how, the cluster API was reporting
"green" and showing the total number of nodes as "2"

Now, I changed to zen discovery and deleted all the existing indexes. The
node discovery happens properly, but after adding index (this time followed
the twitter example) and putting some data, a subsequent reboot is giving
the same issue. Please find the steps I have followed.

  1. Installed Java

java -version

java version "1.6.0_29"
Java(TM) SE Runtime Environment (build 1.6.0_29-b11)
Java HotSpot(TM) 64-Bit Server VM (build 20.4-b02, mixed mode)

  1. This is an ubuntu 64 bit machine (11.04)

uname -a

Linux ubu-ser 2.6.38-8-generic #42-Ubuntu SMP Mon Apr 11 03:31:24 UTC 2011
x86_64 x86_64 x86_64 GNU/Linux

  1. Installed Elastic Search 0.17.9 and Service Wrapper.

  2. Configured a two node cluster

Node01 Configuration*

cat /root/elasticsearch-0.17.8/config/elasticsearch.yml

cluster:
name: gnulinux

node.name: "Ubu2"

node.master: true
node.data: true
node.rack: rack01

network:
bindHost: 192.168.2.10
publishHost: 192.168.2.10

index.engine.robin.refreshInterval: -1
index.gateway.snapshot_interval: -1
index.gateway.type: local
index.number_of_shards: 5
index.number_of_replicas: 1

gateway.recover_after_nodes: 2
gateway.recover_after_time: 5m
gateway.expected_nodes: 2
cluster.routing.allocation.node_initial_primaries_recoveries: 4
cluster.routing.allocation.node_concurrent_recoveries: 2
indices.recovery.concurrent_streams: 5

index:
store:
fs:
memory:
enabled: true

discovery:
zen:
ping_timeout: 30s
ping:
multicast:
enabled: false
unicast:
enabled: true
hosts: 192.168.2.10, 192.168.2.11
fd:
ping_retries: 10
ping_interval: 5s
ping_timeout: 30s

Node02 Configuration

#cat /root/elasticsearch-0.17.8/config/elasticsearch.yml
cluster:
name: gnulinux

node.name: "Ubu1"

node.master: false
node.data: true
node.rack: rack01

network:
bindHost: 192.168.2.11
publishHost: 192.168.2.11

index.engine.robin.refreshInterval: -1
index.gateway.snapshot_interval: -1
index.gateway.type: local
index.number_of_shards: 5
index.number_of_replicas: 1

gateway.recover_after_nodes: 2
gateway.recover_after_time: 5m
gateway.expected_nodes: 2
cluster.routing.allocation.node_initial_primaries_recoveries: 4
cluster.routing.allocation.node_concurrent_recoveries: 2
indices.recovery.concurrent_streams: 5

index:
store:
fs:
memory:
enabled: true

discovery:
zen:
ping_timeout: 30s
ping:
multicast:
enabled: false
unicast:
enabled: true
hosts: 192.168.2.10, 192.168.2.11
fd:
ping_retries: 10
ping_interval: 5s
ping_timeout: 30s

  1. Started both the nodes and checked the status, both were up. Verified
    the log file as well. Everything was fine till this step.

curl -XGET 'http://192.168.2.10:9200/_cluster/health?pretty=true'

{
"cluster_name" : "gnulinux",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 2,
"number_of_data_nodes" : 2,
"active_primary_shards" : 0,
"active_shards" : 0,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0
}

curl -XGET 'http://192.168.2.11:9200/_cluster/health?pretty=true'

{
"cluster_name" : "gnulinux",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 2,
"number_of_data_nodes" : 2,
"active_primary_shards" : 0,
"active_shards" : 0,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0
}

  1. Added some data and restarted the nodes, the cluster status is red and
    log file is giving the same error.

curl -XGET 'http://192.168.2.10:9200/_cluster/health?pretty=true'

{
"cluster_name" : "gnulinux",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 2,
"number_of_data_nodes" : 2,
"active_primary_shards" : 0,
"active_shards" : 0,
"relocating_shards" : 0,
"initializing_shards" : 5,
"unassigned_shards" : 5
}

curl -XGET 'http://192.168.2.11:9200/_cluster/health?pretty=true'

{
"cluster_name" : "gnulinux",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 2,
"number_of_data_nodes" : 2,
"active_primary_shards" : 0,
"active_shards" : 0,
"relocating_shards" : 0,
"initializing_shards" : 5,
"unassigned_shards" : 5

[2011-10-22 01:18:46,703][WARN ][cluster.action.shard ] [Ubu1] sending
failed shard for [twitter][4], node[fgELpN11R6m2XbIKdLHYgg], [P],
s[INITIALIZING], reason [Failed to start shard, message
[IndexShardGatewayRecoveryException[[twitter][4] shard allocated for local
recovery (post api), should exists, but doesn't]]]
[2011-10-22 01:18:46,704][WARN ][cluster.action.shard ] [Ubu1] sending
failed shard for [twitter][1], node[fgELpN11R6m2XbIKdLHYgg], [P],
s[INITIALIZING], reason [Failed to start shard, message
[IndexShardGatewayRecoveryException[[twitter][1] shard allocated for local
recovery (post api), should exists, but doesn't]]]
[2011-10-22 01:18:46,704][WARN ][cluster.action.shard ] [Ubu1] sending
failed shard for [twitter][0], node[fgELpN11R6m2XbIKdLHYgg], [P],
s[INITIALIZING], reason [Failed to start shard, message
[IndexShardGatewayRecoveryException[[twitter][0] shard allocated for local
recovery (post api), should exists, but doesn't]]]
[2011-10-22 01:18:46,725][WARN ][indices.cluster ] [Ubu1]
[twitter][1] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException:
[twitter][1] shard allocated for local recovery (post api), should exists,
but doesn't

at

org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:99)
at
org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:179)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
[2011-10-22 01:18:46,736][WARN ][indices.cluster ] [Ubu1]
[twitter][4] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException:
[twitter][4] shard allocated for local recovery (post api), should exists,
but doesn't

at

org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:99)
at
org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:179)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
[2011-10-22 01:18:46,737][WARN ][cluster.action.shard ] [Ubu1] sending
failed shard for [twitter][4], node[fgELpN11R6m2XbIKdLHYgg], [P],
s[INITIALIZING], reason [Failed to start shard, message
[IndexShardGatewayRecoveryException[[twitter][4] shard allocated for local
recovery (post api), should exists, but doesn't]]]
[2011-10-22 01:18:46,737][WARN ][cluster.action.shard ] [Ubu1] sending
failed shard for [twitter][1], node[fgELpN11R6m2XbIKdLHYgg], [P],
s[INITIALIZING], reason [Failed to start shard, message
[IndexShardGatewayRecoveryException[[twitter][1] shard allocated for local
recovery (post api), should exists, but doesn't]]]

Thanks
Viji


(Shay Banon) #6

The error occurs when a shard is allocated to a node where it expects the
index to exist, but it's not there. Maybe you can try to recreate it locally
(you can easily start 2 nodes on your machine) and see if it happens then.
If so, gist the steps you use and I can check it.
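
As a rough illustration of the condition described above (this is not ElasticSearch's actual recovery code), the sketch below lists the shards of an index that have no data on a node's disk. The directory layout `<indices_root>/<index_name>/<shard_id>/` and the function name `missing_shard_dirs` are assumptions made for the example; check the data path of your own install before relying on it.

```python
from pathlib import Path


def missing_shard_dirs(indices_root, index_name, number_of_shards):
    """Return the shard ids that have no data under indices_root.

    Assumed (hypothetical) layout: <indices_root>/<index_name>/<shard_id>/
    """
    missing = []
    for shard_id in range(number_of_shards):
        shard_dir = Path(indices_root) / index_name / str(shard_id)
        # A shard counts as missing if its directory is absent or holds no entries.
        if not shard_dir.is_dir() or not any(shard_dir.iterdir()):
            missing.append(shard_id)
    return missing
```

Running it on each node before and after a restart, with the index name and shard count from the thread (`twitter`, 5), would show whether the shard data actually disappears between restarts.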


(Viji Nair) #7

I am not sure what exactly went wrong. I upgraded to the latest version of
ES today (0.18.1) and everything started working fine. Even after stopping
and starting the instances multiple times, my cluster seems stable in all
aspects.

  1. Downloaded the latest ES and Service Wrapper binaries.
  2. Copied the config file from the old setup
  3. Started the cluster and added some data
  4. Restarted the instances
  5. Cluster is stable and "green"

Cheers,
Viji
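
The restart check in steps 4 and 5 can be sketched roughly as follows, assuming the `_cluster/health` response has the fields shown in the outputs earlier in this thread. The helper names `fetch_health` and `is_recovered` and the `expected_nodes` parameter are made up for this example.

```python
import json
from urllib.request import urlopen


def fetch_health(base_url):
    """GET <base_url>/_cluster/health and decode the JSON body."""
    with urlopen(base_url + "/_cluster/health") as response:
        return json.loads(response.read().decode("utf-8"))


def is_recovered(health, expected_nodes=2):
    """True when all expected nodes joined and no shard is still pending."""
    return (
        health["status"] == "green"
        and health["number_of_nodes"] >= expected_nodes
        and health["initializing_shards"] == 0
        and health["unassigned_shards"] == 0
    )
```

For example, `is_recovered(fetch_health("http://192.168.2.10:9200"))` could be called after each restart. It requires "green" plus zero pending shards because a cluster with no indices at all also reports "green", as the first health outputs in this thread show.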


(system) #8