Unexpected node failure by using transport client


(shengcer) #1

I configured Elasticsearch to run on two nodes in unicast mode; both nodes are of type data/master. My program initializes a transport client that connects to both nodes. One node then failed, either because the network was slow or because the node itself died. At the time, a scheduled job was indexing a large amount of data into the cluster. The transport client started complaining repeatedly that one node was unavailable, and the whole cluster ended up in a bad state. Below is a sample of the failure messages I found in the log after I bounced the cluster. What can I do to prevent this from happening?

WARNING: [Blackout] [coverage-elastic1345266122391][0] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [coverage-elastic1345266122391][0] shard allocated for local recovery (post api), should exists, but doesn't
at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:120)
at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:177)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

My understanding is that Elasticsearch is built to prevent this: when one node dies, the other node should automatically take over the master role, and when the dead node comes back, or the whole cluster is bounced, that node should be recovered automatically from the healthy node. Am I wrong?


(Drew Raines) #2

shengcer wrote:

I configured elastic search to run on two nodes, both are of type
data/master, in unicast mode. I then wrote my program to initialize
a transport client to connect to both nodes.

[...]

WARNING: [Blackout] [coverage-elastic1345266122391][0] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [coverage-elastic1345266122391][0] shard allocated for local recovery (post api), should exists, but doesn't
at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:120)
at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:177)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)

Look in the logs before you restarted the nodes for anything related
to "added", "removed", "ping". We need to be able to piece together
the sequence of events. How much data are you indexing? How many
client threads? Bulk or one-doc-at-time?
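
A minimal sketch of that log search, in case it helps: it keeps every line that mentions one of the three keywords, in the original order, so the sequence of events can be read top to bottom. The class name LogEventFilter and the command-line argument are hypothetical, and the match is a plain case-insensitive substring test, so "ping" will also match unrelated words that contain it.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of Elasticsearch.
class LogEventFilter {
    // Keywords named in the post above.
    static final List<String> KEYWORDS = Arrays.asList("added", "removed", "ping");

    // Returns the lines that contain any keyword, preserving their order.
    static List<String> filter(List<String> lines) {
        List<String> hits = new ArrayList<>();
        for (String line : lines) {
            String lower = line.toLowerCase(Locale.ROOT);
            for (String keyword : KEYWORDS) {
                if (lower.contains(keyword)) {
                    hits.add(line);
                    break; // one match is enough; avoid adding the line twice
                }
            }
        }
        return hits;
    }

    // Usage: java LogEventFilter <path-to-log-file>
    public static void main(String[] args) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        for (String hit : filter(lines)) {
            System.out.println(hit);
        }
    }
}
```

Running it against the log of each node separately, and comparing timestamps across the two outputs, is one way to line up what each node saw.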

-Drew



(shengcer) #3

Hi Drew,

We are indexing a large amount of data through Elasticsearch, roughly 20 GB. There is one client thread per Elasticsearch node, so we have two clients. We use ZooKeeper to synchronize loading between these two client threads, so only one client is writing at any time. Note, however, that this one client can be used by multiple threads, each indexing into a different index. And yes, we are using a bulk load writer, which commits to the cluster every 10,000 records.
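
To make the batching concrete, here is a rough sketch of how records get split into groups of 10,000 before each commit. It is not our actual writer and it makes no Elasticsearch calls; the class name BulkBatcher and the method partition are made up for illustration, and each returned batch stands for one bulk request.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: only shows the batching, not the bulk request itself.
class BulkBatcher {
    // Splits records into consecutive batches of at most batchSize items.
    static <T> List<List<T>> partition(List<T> records, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < records.size(); start += batchSize) {
            int end = Math.min(start + batchSize, records.size());
            // Copy the slice so each batch is independent of the source list.
            batches.add(new ArrayList<>(records.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> records = new ArrayList<>();
        for (int i = 0; i < 25000; i++) {
            records.add(i);
        }
        for (List<Integer> batch : partition(records, 10000)) {
            System.out.println("batch of " + batch.size() + " records");
        }
    }
}
```

The last batch is usually smaller than the rest, so the writer has to flush it explicitly at the end of the load.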

On Aug 24, 2012, at 9:48 AM, Drew Raines wrote:

How much data are you indexing? How many client threads? Bulk or one-doc-at-time?


(system) #4