Failed to start shard

gnulinux · October 21, 2011, 9:53am

Hi

I am evaluating ElasticSearch (0.17.8) for a spatial search platform.
I was able to set up a two-node cluster and everything was working
fine. But after rebooting both nodes, I am getting the following
error on both.

[2011-10-19 06:02:45,243][WARN ][indices.cluster ] [linux Ubu2] [books][1] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [books][1] shard allocated for local recovery (post api), should exists, but doesn't
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:99)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:179)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:679)

Config Files:

Node01 (Master)

cluster:
  name: gnulinux

node.name: "linux Ubu2"
node.master: true
node.data: true
node.rack: rack01

network:
  bindHost: 192.168.2.10
  publishHost: 192.168.2.10

index.engine.robin.refreshInterval: -1
index.gateway.snapshot_interval: -1
index.gateway.type: local
index.number_of_shards: 5
index.number_of_replicas: 1

gateway.recover_after_nodes: 2
gateway.recover_after_time: 5m
gateway.expected_nodes: 2
cluster.routing.allocation.node_initial_primaries_recoveries: 4
cluster.routing.allocation.node_concurrent_recoveries: 2
indices.recovery.concurrent_streams: 5

index:
  store:
    fs:
      memory:
        enabled: true
discovery:
  jgroups:
    config: tcp
    bind_port: 9700
    bind_address: 192.168.2.10
    tcpping:
      initial_hosts: 192.168.2.10[9700], 192.168.2.11[9700]

Node02:

cluster:
  name: gnulinux

node.name: "linux Ubu1"
node.master: false
node.data: true
node.rack: rack01

network:
  bindHost: 192.168.2.11
  publishHost: 192.168.2.11

index.engine.robin.refreshInterval: -1
index.gateway.snapshot_interval: -1
index.gateway.type: local
index.number_of_shards: 5
index.number_of_replicas: 1

gateway.recover_after_nodes: 2
gateway.recover_after_time: 5m
gateway.expected_nodes: 2
cluster.routing.allocation.node_initial_primaries_recoveries: 4
cluster.routing.allocation.node_concurrent_recoveries: 2
indices.recovery.concurrent_streams: 5

index:
  store:
    fs:
      memory:
        enabled: true
discovery:
  jgroups:
    config: tcp
    bind_port: 9700
    bind_address: 192.168.2.11
    tcpping:
      initial_hosts: 192.168.2.10[9700], 192.168.2.11[9700]

kimchy · October 21, 2011, 6:27pm

Are you sure the two nodes find each other? The configuration you have
configures jgroups for discovery, which was removed in version 0.6.

Viji_Nair · October 21, 2011, 8:09pm

Hi,

Yes, I missed it. I don't know how, but the cluster health API was reporting
"green" and showing the total number of nodes as 2.

I have now changed to zen discovery and deleted all the existing indexes.
Node discovery works properly, but after adding an index (this time following
the twitter example) and putting in some data, a subsequent reboot gives
the same issue. These are the steps I followed.

Installed Java

java -version

java version "1.6.0_29"
Java(TM) SE Runtime Environment (build 1.6.0_29-b11)
Java HotSpot(TM) 64-Bit Server VM (build 20.4-b02, mixed mode)

This is a 64-bit Ubuntu machine (11.04)

uname -a

Linux ubu-ser 2.6.38-8-generic #42-Ubuntu SMP Mon Apr 11 03:31:24 UTC 2011
x86_64 x86_64 x86_64 GNU/Linux

Installed Elastic Search 0.17.9 and the Service Wrapper.
Configured a two-node cluster.

Node01 Configuration

cat /root/elasticsearch-0.17.8/config/elasticsearch.yml

cluster:
  name: gnulinux

node.name: "Ubu2"
node.master: true
node.data: true
node.rack: rack01

network:
  bindHost: 192.168.2.10
  publishHost: 192.168.2.10

index.engine.robin.refreshInterval: -1
index.gateway.snapshot_interval: -1
index.gateway.type: local
index.number_of_shards: 5
index.number_of_replicas: 1

gateway.recover_after_nodes: 2
gateway.recover_after_time: 5m
gateway.expected_nodes: 2
cluster.routing.allocation.node_initial_primaries_recoveries: 4
cluster.routing.allocation.node_concurrent_recoveries: 2
indices.recovery.concurrent_streams: 5

index:
  store:
    fs:
      memory:
        enabled: true

discovery:
  zen:
    ping_timeout: 30s
    ping:
      multicast:
        enabled: false
      unicast:
        enabled: true
        hosts: 192.168.2.10, 192.168.2.11
    fd:
      ping_retries: 10
      ping_interval: 5s
      ping_timeout: 30s

Node02 Configuration

#cat /root/elasticsearch-0.17.8/config/elasticsearch.yml
cluster:
  name: gnulinux

node.name: "Ubu1"
node.master: false
node.data: true
node.rack: rack01

network:
  bindHost: 192.168.2.11
  publishHost: 192.168.2.11

index.engine.robin.refreshInterval: -1
index.gateway.snapshot_interval: -1
index.gateway.type: local
index.number_of_shards: 5
index.number_of_replicas: 1

gateway.recover_after_nodes: 2
gateway.recover_after_time: 5m
gateway.expected_nodes: 2
cluster.routing.allocation.node_initial_primaries_recoveries: 4
cluster.routing.allocation.node_concurrent_recoveries: 2
indices.recovery.concurrent_streams: 5

index:
  store:
    fs:
      memory:
        enabled: true

discovery:
  zen:
    ping_timeout: 30s
    ping:
      multicast:
        enabled: false
      unicast:
        enabled: true
        hosts: 192.168.2.10, 192.168.2.11
    fd:
      ping_retries: 10
      ping_interval: 5s
      ping_timeout: 30s

Started both nodes and checked the status; both were up. Verified the
log files as well. Everything was fine up to this step.

curl -XGET 'http://192.168.2.10:9200/_cluster/health?pretty=true'

{
"cluster_name" : "gnulinux",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 2,
"number_of_data_nodes" : 2,
"active_primary_shards" : 0,
"active_shards" : 0,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0
}

curl -XGET 'http://192.168.2.11:9200/_cluster/health?pretty=true'

{
"cluster_name" : "gnulinux",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 2,
"number_of_data_nodes" : 2,
"active_primary_shards" : 0,
"active_shards" : 0,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0
}
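
To repeat the check above without reading the JSON by eye, the two fields that matter here (status and number_of_nodes) can be pulled out of the pretty-printed response. This is only a minimal sketch, assuming a POSIX shell. health_field and nodes_joined are hypothetical helper names, not part of Elasticsearch, and the parsing only handles the flat one-field-per-line layout shown above.

```shell
#!/bin/sh
# Hypothetical helper: print the value of one field from pretty-printed
# _cluster/health JSON read on stdin. Only suitable for the flat,
# one-field-per-line layout; values containing ':' would be truncated.
health_field() {
  grep "\"$1\"" | head -n 1 | cut -d ':' -f 2 | tr -d ' ",\t'
}

# Hypothetical helper: succeed only if the health JSON on stdin reports
# status green and the expected node count ($1).
nodes_joined() {
  json=$(cat)
  status=$(printf '%s\n' "$json" | health_field status)
  nodes=$(printf '%s\n' "$json" | health_field number_of_nodes)
  [ "$status" = "green" ] && [ "$nodes" = "$1" ]
}
```

It could be used as: curl -s -XGET 'http://192.168.2.10:9200/_cluster/health?pretty=true' | nodes_joined 2 && echo "both nodes joined". Since the cluster above is green with zero shards, a passing check says only that the nodes see each other, not that shard data will survive a restart.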

Added some data and restarted the nodes. The cluster status is now red and
the log file shows the same error.

curl -XGET 'http://192.168.2.10:9200/_cluster/health?pretty=true'

{
"cluster_name" : "gnulinux",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 2,
"number_of_data_nodes" : 2,
"active_primary_shards" : 0,
"active_shards" : 0,
"relocating_shards" : 0,
"initializing_shards" : 5,
"unassigned_shards" : 5
}

curl -XGET 'http://192.168.2.11:9200/_cluster/health?pretty=true'

{
"cluster_name" : "gnulinux",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 2,
"number_of_data_nodes" : 2,
"active_primary_shards" : 0,
"active_shards" : 0,
"relocating_shards" : 0,
"initializing_shards" : 5,
"unassigned_shards" : 5
}

[2011-10-22 01:18:46,703][WARN ][cluster.action.shard ] [Ubu1] sending failed shard for [twitter][4], node[fgELpN11R6m2XbIKdLHYgg], [P], s[INITIALIZING], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[twitter][4] shard allocated for local recovery (post api), should exists, but doesn't]]]
[2011-10-22 01:18:46,704][WARN ][cluster.action.shard ] [Ubu1] sending failed shard for [twitter][1], node[fgELpN11R6m2XbIKdLHYgg], [P], s[INITIALIZING], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[twitter][1] shard allocated for local recovery (post api), should exists, but doesn't]]]
[2011-10-22 01:18:46,704][WARN ][cluster.action.shard ] [Ubu1] sending failed shard for [twitter][0], node[fgELpN11R6m2XbIKdLHYgg], [P], s[INITIALIZING], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[twitter][0] shard allocated for local recovery (post api), should exists, but doesn't]]]
[2011-10-22 01:18:46,725][WARN ][indices.cluster ] [Ubu1] [twitter][1] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [twitter][1] shard allocated for local recovery (post api), should exists, but doesn't
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:99)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:179)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
[2011-10-22 01:18:46,736][WARN ][indices.cluster ] [Ubu1] [twitter][4] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [twitter][4] shard allocated for local recovery (post api), should exists, but doesn't
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:99)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:179)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
[2011-10-22 01:18:46,737][WARN ][cluster.action.shard ] [Ubu1] sending failed shard for [twitter][4], node[fgELpN11R6m2XbIKdLHYgg], [P], s[INITIALIZING], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[twitter][4] shard allocated for local recovery (post api), should exists, but doesn't]]]
[2011-10-22 01:18:46,737][WARN ][cluster.action.shard ] [Ubu1] sending failed shard for [twitter][1], node[fgELpN11R6m2XbIKdLHYgg], [P], s[INITIALIZING], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[twitter][1] shard allocated for local recovery (post api), should exists, but doesn't]]]
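
The red health output above is consistent with the configuration: 5 shards with 1 replica each gives 10 shard copies, and the response shows 5 initializing plus 5 unassigned with none active. A rough sketch of that arithmetic in shell follows. expected_shard_copies and health_color are hypothetical helpers, and the colour rule (red when any primary is inactive, yellow when only replicas are missing) is a simplified reading of the health API, assuming a single index.

```shell
#!/bin/sh
# Hypothetical helper: total shard copies for one index.
# $1 = index.number_of_shards, $2 = index.number_of_replicas
expected_shard_copies() {
  echo $(( $1 * (1 + $2) ))
}

# Hypothetical helper: simplified health colour from shard counts.
# $1 = primaries, $2 = active primaries, $3 = total copies, $4 = active copies
health_color() {
  if [ "$2" -lt "$1" ]; then
    echo red        # at least one primary is not active
  elif [ "$4" -lt "$3" ]; then
    echo yellow     # all primaries active, some replicas are not
  else
    echo green      # every copy is active (also true for zero shards)
  fi
}
```

With the values from this thread, expected_shard_copies 5 1 prints 10 and health_color 5 0 10 0 prints red. The empty cluster before any index was created corresponds to health_color 0 0 0 0, which prints green.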

Thanks
Viji

kimchy · October 21, 2011, 11:53pm

Are you sure that you don't delete the index content between restarts?
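
One way to answer this is to look at the shard directories on disk before and after the restart. The sketch below assumes the data directory layout <path.data>/<cluster name>/nodes/0/indices/<index>/<shard>/index, which is an assumption about this version and is not confirmed anywhere in this thread. list_shards is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: for every <index>/<shard>/index directory under the
# given indices directory, print the shard and how many files it holds.
# $1 = indices directory (layout is an assumption, see the note above)
list_shards() {
  for d in "$1"/*/*/index; do
    [ -d "$d" ] || continue            # skip when the glob matched nothing
    count=$(find "$d" -type f | wc -l | tr -d ' ')
    shard_dir=${d%/index}              # .../<index>/<shard>
    shard=${shard_dir##*/}
    index_dir=${shard_dir%/*}          # .../<index>
    index=${index_dir##*/}
    echo "$index/$shard files=$count"
  done
}
```

Running list_shards /root/elasticsearch-0.17.8/data/gnulinux/nodes/0/indices on each node before shutdown and again after startup (the path assumes the default data location under the install directory) would show whether any shard's files disappeared in between.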

Viji_Nair · October 22, 2011, 3:26am

Yes, I am sure. I deleted the old index, reconfigured from scratch as explained,
added data, tested, and restarted the nodes. Nothing was deleted in between.

On Sat, Oct 22, 2011 at 5:23 AM, Shay Banon kimchy@gmail.com wrote:

Are you sure that you don't delete the index content between restarts?

On Fri, Oct 21, 2011 at 10:09 PM, Viji Nair viji@linux.com wrote:
Hi,

Yes, I missed it. But I don't know how, the cluster API was reporting
"green" and showing the total number of nodes as "2"

Now, I changed to zen discovery and deleted all the existing indexes. The
node discovery happens properly, but after adding index (this time followed
the twitter example) and putting some data, a subsequent reboot is giving
the same issue. Please find the steps I have followed.

Installed Java

java -version

java version "1.6.0_29"
Java(TM) SE Runtime Environment (build 1.6.0_29-b11)
Java HotSpot(TM) 64-Bit Server VM (build 20.4-b02, mixed mode)

This is an ubuntu 64 bit machine (11.04)

uname -a

Linux ubu-ser 2.6.38-8-generic #42-Ubuntu SMP Mon Apr 11 03:31:24 UTC 2011
x86_64 x86_64 x86_64 GNU/Linux

Installed Elastic Search 0.17.9 and Service Wrapper.

Configured a two node cluster

Node01 Configuration*

cat /root/elasticsearch-0.17.8/config/elasticsearch.yml

cluster:
name: gnulinux

node.name: "Ubu2"

node.master: true
node.data: true
node.rack: rack01

network:
bindHost: 192.168.2.10
publishHost: 192.168.2.10

index.engine.robin.refreshInterval: -1
index.gateway.snapshot_interval: -1
index.gateway.type: local
index.number_of_shards: 5
index.number_of_replicas: 1

gateway.recover_after_nodes: 2
gateway.recover_after_time: 5m
gateway.expected_nodes: 2
cluster.routing.allocation.node_initial_primaries_recoveries: 4
cluster.routing.allocation.node_concurrent_recoveries: 2
indices.recovery.concurrent_streams: 5

index:
store:
fs:
memory:
enabled: true

discovery:
zen:
ping_timeout: 30s
ping:
multicast:
enabled: false
unicast:
enabled: true
hosts: 192.168.2.10, 192.168.2.11
fd:
ping_retries: 10
ping_interval: 5s
ping_timeout: 30s

Node02 Configuration

#cat /root/elasticsearch-0.17.8/config/elasticsearch.yml
cluster:
name: gnulinux

node.name: "Ubu1"

node.master: false
node.data: true
node.rack: rack01

network:
bindHost: 192.168.2.11
publishHost: 192.168.2.11

index.engine.robin.refreshInterval: -1
index.gateway.snapshot_interval: -1
index.gateway.type: local
index.number_of_shards: 5
index.number_of_replicas: 1

gateway.recover_after_nodes: 2
gateway.recover_after_time: 5m
gateway.expected_nodes: 2
cluster.routing.allocation.node_initial_primaries_recoveries: 4
cluster.routing.allocation.node_concurrent_recoveries: 2
indices.recovery.concurrent_streams: 5

index:
store:
fs:
memory:
enabled: true

discovery:
zen:
ping_timeout: 30s
ping:
multicast:
enabled: false
unicast:
enabled: true
hosts: 192.168.2.10, 192.168.2.11
fd:
ping_retries: 10
ping_interval: 5s
ping_timeout: 30s

Started both the nodes and checked the status, both were up. Verified
the log file as well. Everything was fine till this step.

curl -XGET 'http://192.168.2.10:9200/_cluster/health?pretty=true'

{
"cluster_name" : "gnulinux",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 2,
"number_of_data_nodes" : 2,
"active_primary_shards" : 0,
"active_shards" : 0,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0
}

curl -XGET 'http://192.168.2.11:9200/_cluster/health?pretty=true'

{
"cluster_name" : "gnulinux",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 2,
"number_of_data_nodes" : 2,
"active_primary_shards" : 0,
"active_shards" : 0,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0
}

Added some data and restarted the nodes, the cluster status is red and
log file is giving the same error.

curl -XGET 'http://192.168.2.10:9200/_cluster/health?pretty=true'

{
"cluster_name" : "gnulinux",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 2,
"number_of_data_nodes" : 2,
"active_primary_shards" : 0,
"active_shards" : 0,
"relocating_shards" : 0,
"initializing_shards" : 5,
"unassigned_shards" : 5
}

curl -XGET 'http://192.168.2.11:9200/_cluster/health?pretty=true'

{
"cluster_name" : "gnulinux",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 2,
"number_of_data_nodes" : 2,
"active_primary_shards" : 0,
"active_shards" : 0,
"relocating_shards" : 0,
"initializing_shards" : 5,
"unassigned_shards" : 5

[2011-10-22 01:18:46,703][WARN ][cluster.action.shard ] [Ubu1] sending
failed shard for [twitter][4], node[fgELpN11R6m2XbIKdLHYgg], [P],
s[INITIALIZING], reason [Failed to start shard, message
[IndexShardGatewayRecoveryException[[twitter][4] shard allocated for local
recovery (post api), should exists, but doesn't]]]
[2011-10-22 01:18:46,704][WARN ][cluster.action.shard ] [Ubu1] sending
failed shard for [twitter][1], node[fgELpN11R6m2XbIKdLHYgg], [P],
s[INITIALIZING], reason [Failed to start shard, message
[IndexShardGatewayRecoveryException[[twitter][1] shard allocated for local
recovery (post api), should exists, but doesn't]]]
[2011-10-22 01:18:46,704][WARN ][cluster.action.shard ] [Ubu1] sending
failed shard for [twitter][0], node[fgELpN11R6m2XbIKdLHYgg], [P],
s[INITIALIZING], reason [Failed to start shard, message
[IndexShardGatewayRecoveryException[[twitter][0] shard allocated for local
recovery (post api), should exists, but doesn't]]]
[2011-10-22 01:18:46,725][WARN ][indices.cluster ] [Ubu1]
[twitter][1] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException:
[twitter][1] shard allocated for local recovery (post api), should exists,
but doesn't
at
org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:99)
at
org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:179)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
[2011-10-22 01:18:46,736][WARN ][indices.cluster ] [Ubu1]
[twitter][4] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException:
[twitter][4] shard allocated for local recovery (post api), should exists,
but doesn't
at
org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:99)
at
org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:179)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
[2011-10-22 01:18:46,737][WARN ][cluster.action.shard ] [Ubu1] sending
failed shard for [twitter][4], node[fgELpN11R6m2XbIKdLHYgg], [P],
s[INITIALIZING], reason [Failed to start shard, message
[IndexShardGatewayRecoveryException[[twitter][4] shard allocated for local
recovery (post api), should exists, but doesn't]]]
[2011-10-22 01:18:46,737][WARN ][cluster.action.shard ] [Ubu1] sending
failed shard for [twitter][1], node[fgELpN11R6m2XbIKdLHYgg], [P],
s[INITIALIZING], reason [Failed to start shard, message
[IndexShardGatewayRecoveryException[[twitter][1] shard allocated for local
recovery (post api), should exists, but doesn't]]]

Thanks
Viji

On Fri, Oct 21, 2011 at 11:57 PM, Shay Banon kimchy@gmail.com wrote:

Are you sure the two nodes find each other? The configuration you have
configures jgroups for discovery, which was removed in version 0.6.


kimchy · October 23, 2011, 12:15am

The error comes when a shard is allocated on a node where it expects the
index to exist, but it's not there. Maybe you can try to recreate it
locally (you can easily start 2 nodes on your own machine) and see if it
happens then. If so, gist the steps you use and I can check it.
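
When putting together a reproduction like the one suggested here, it can help to list exactly which shards hit the error. The following is a minimal sketch, assuming only the log line format quoted in this thread; the names `FAILED_SHARD` and `failed_shards` are hypothetical helpers, not part of Elasticsearch.

```python
import re
from collections import Counter

# Matches the "[index][shard] failed to start shard" fragment from the
# indices.cluster WARN entries quoted in this thread.
FAILED_SHARD = re.compile(r"\[([^\[\]\s]+)\]\[(\d+)\]\s+failed to start shard")


def failed_shards(log_text):
    """Count 'failed to start shard' entries per (index, shard) pair."""
    # The log lines in the thread are hard-wrapped, so collapse all
    # whitespace before matching.
    flat = " ".join(log_text.split())
    return Counter((index, int(shard)) for index, shard in FAILED_SHARD.findall(flat))


sample = """
[2011-10-22 01:18:46,725][WARN ][indices.cluster ] [Ubu1]
[twitter][1] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException:
[twitter][1] shard allocated for local recovery (post api), should exists,
but doesn't
[2011-10-22 01:18:46,736][WARN ][indices.cluster ] [Ubu1]
[twitter][4] failed to start shard
"""

for (index, shard), count in sorted(failed_shards(sample).items()):
    print(f"{index}[{shard}]: {count} failure(s)")
```

The "sending failed shard" entries are deliberately not counted, since they repeat the same failure from the cluster.action.shard logger.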

On Sat, Oct 22, 2011 at 5:26 AM, Viji Nair viji@linux.com wrote:

Yes, I am sure. I deleted the old index, reconfigured from scratch as
explained, added data, tested, and restarted the nodes. No deletion in between.

On Sat, Oct 22, 2011 at 5:23 AM, Shay Banon kimchy@gmail.com wrote:
Are you sure that you don't delete the index content between restarts?

On Fri, Oct 21, 2011 at 10:09 PM, Viji Nair viji@linux.com wrote:
Hi,

Yes, I missed it. I don't know how, but the cluster API was reporting
"green" and showing the total number of nodes as "2".

Now I have changed to zen discovery and deleted all the existing indexes.
Node discovery works properly, but after adding an index (this time following
the twitter example) and putting in some data, a subsequent reboot gives
the same issue. Here are the steps I followed.

Installed Java

java -version

java version "1.6.0_29"
Java(TM) SE Runtime Environment (build 1.6.0_29-b11)
Java HotSpot(TM) 64-Bit Server VM (build 20.4-b02, mixed mode)

This is an ubuntu 64 bit machine (11.04)

uname -a

Linux ubu-ser 2.6.38-8-generic #42-Ubuntu SMP Mon Apr 11 03:31:24 UTC
2011 x86_64 x86_64 x86_64 GNU/Linux

Installed Elastic Search 0.17.9 and Service Wrapper.

Configured a two node cluster

Node01 Configuration

cat /root/elasticsearch-0.17.8/config/elasticsearch.yml

cluster:
  name: gnulinux

node.name: "Ubu2"

node.master: true
node.data: true
node.rack: rack01

network:
  bindHost: 192.168.2.10
  publishHost: 192.168.2.10

index.engine.robin.refreshInterval: -1
index.gateway.snapshot_interval: -1
index.gateway.type: local
index.number_of_shards: 5
index.number_of_replicas: 1

gateway.recover_after_nodes: 2
gateway.recover_after_time: 5m
gateway.expected_nodes: 2
cluster.routing.allocation.node_initial_primaries_recoveries: 4
cluster.routing.allocation.node_concurrent_recoveries: 2
indices.recovery.concurrent_streams: 5

index:
  store:
    fs:
      memory:
        enabled: true

discovery:
  zen:
    ping_timeout: 30s
    ping:
      multicast:
        enabled: false
      unicast:
        enabled: true
        hosts: 192.168.2.10, 192.168.2.11
    fd:
      ping_retries: 10
      ping_interval: 5s
      ping_timeout: 30s

Node02 Configuration

cat /root/elasticsearch-0.17.8/config/elasticsearch.yml

cluster:
  name: gnulinux

node.name: "Ubu1"

node.master: false
node.data: true
node.rack: rack01

network:
  bindHost: 192.168.2.11
  publishHost: 192.168.2.11

index.engine.robin.refreshInterval: -1
index.gateway.snapshot_interval: -1
index.gateway.type: local
index.number_of_shards: 5
index.number_of_replicas: 1

gateway.recover_after_nodes: 2
gateway.recover_after_time: 5m
gateway.expected_nodes: 2
cluster.routing.allocation.node_initial_primaries_recoveries: 4
cluster.routing.allocation.node_concurrent_recoveries: 2
indices.recovery.concurrent_streams: 5

index:
  store:
    fs:
      memory:
        enabled: true

discovery:
  zen:
    ping_timeout: 30s
    ping:
      multicast:
        enabled: false
      unicast:
        enabled: true
        hosts: 192.168.2.10, 192.168.2.11
    fd:
      ping_retries: 10
      ping_interval: 5s
      ping_timeout: 30s

Started both the nodes and checked the status, both were up. Verified
the log file as well. Everything was fine till this step.

curl -XGET 'http://192.168.2.10:9200/_cluster/health?pretty=true'

{
"cluster_name" : "gnulinux",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 2,
"number_of_data_nodes" : 2,
"active_primary_shards" : 0,
"active_shards" : 0,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0
}

curl -XGET 'http://192.168.2.11:9200/_cluster/health?pretty=true'

{
"cluster_name" : "gnulinux",
"status" : "green",
"timed_out" : false,
"number_of_nodes" : 2,
"number_of_data_nodes" : 2,
"active_primary_shards" : 0,
"active_shards" : 0,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 0
}

Added some data and restarted the nodes. The cluster status is now red and
the log file shows the same error.

curl -XGET 'http://192.168.2.10:9200/_cluster/health?pretty=true'

{
"cluster_name" : "gnulinux",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 2,
"number_of_data_nodes" : 2,
"active_primary_shards" : 0,
"active_shards" : 0,
"relocating_shards" : 0,
"initializing_shards" : 5,
"unassigned_shards" : 5
}

curl -XGET 'http://192.168.2.11:9200/_cluster/health?pretty=true'

{
"cluster_name" : "gnulinux",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 2,
"number_of_data_nodes" : 2,
"active_primary_shards" : 0,
"active_shards" : 0,
"relocating_shards" : 0,
"initializing_shards" : 5,
"unassigned_shards" : 5
}

Viji_Nair · October 27, 2011, 11:09am

I am not sure what exactly went wrong. I upgraded to the latest version of
ES today (0.18.1) and everything started working fine. Even after multiple
stops and starts of the instances, my cluster seems stable in all respects.

Downloaded the latest ES and Service Wrapper binaries.
Copied the config file from the old setup.
Started the cluster and added some data.
Restarted the instances.
Cluster is stable and "green".
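
The restart check in the steps above could be scripted rather than read off by eye. This is a rough sketch, assuming the `_cluster/health` endpoint and the response fields shown in the curl output in this thread; `fetch_health` and `health_problems` are hypothetical helper names, and the default of two expected nodes is taken from this particular setup.

```python
import json
from urllib.request import urlopen


def fetch_health(base_url, timeout=5):
    """GET <base_url>/_cluster/health and return the decoded JSON body."""
    with urlopen(f"{base_url}/_cluster/health", timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def health_problems(health, expected_nodes=2):
    """List the reasons a health response does not look like a recovered cluster."""
    problems = []
    if health["status"] != "green":
        problems.append(f"status is {health['status']}")
    if health["number_of_nodes"] != expected_nodes:
        problems.append(f"{health['number_of_nodes']} of {expected_nodes} nodes joined")
    # Shards that are still initializing or unassigned have not recovered.
    stuck = health["initializing_shards"] + health["unassigned_shards"]
    if stuck:
        problems.append(f"{stuck} shards not started")
    return problems


# The "red" response reported after the restart on 0.17.x, reduced to the
# fields this sketch reads.
red = {"status": "red", "number_of_nodes": 2,
       "initializing_shards": 5, "unassigned_shards": 5}
print(health_problems(red))
```

Against a live node, the same check would be `health_problems(fetch_health("http://192.168.2.10:9200"))`; an empty list means the cluster came back green with both nodes and no pending shards.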

Cheers,
Viji

