I configured Elasticsearch to run on two nodes, both of type data/master, in unicast mode. I then wrote my program to initialize a transport client that connects to both nodes. At some point one node failed, either because the network was slow or because the node itself died. Meanwhile, Elasticsearch was running a scheduled job that indexed a large amount of data into the cluster. The transport client started to complain repeatedly that one node was unavailable, and the whole cluster then ended up in a bad state. Below is a sample of the failure messages I found in the log after I bounced the cluster. What can I do to prevent this from happening?
WARNING: [Blackout] [coverage-elastic1345266122391][0] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [coverage-elastic1345266122391][0] shard allocated for local recovery (post api), should exists, but doesn't
at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:120)
at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:177)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
My understanding is that Elasticsearch is built to prevent this: when one node dies, the other node should automatically take over the master role, and when the dead node comes back, or the whole cluster is bounced, that node should be recovered automatically from the healthy node. Am I wrong?
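Whether the surviving node can take over depends on how many master-eligible nodes must agree before a master is elected. The sketch below shows only the majority arithmetic, not a diagnosis of this cluster; the class and method names are hypothetical, and it assumes a majority-based election (in Elasticsearch versions of this era that threshold is configured with the `discovery.zen.minimum_master_nodes` setting).

```java
// Hypothetical helper: majority arithmetic for master-eligible nodes.
// It does not talk to Elasticsearch; it only illustrates the numbers.
final class MasterQuorum {

    // Smallest number of nodes that forms a strict majority.
    static int majority(int masterEligibleNodes) {
        if (masterEligibleNodes < 1) {
            throw new IllegalArgumentException("need at least one master-eligible node");
        }
        return masterEligibleNodes / 2 + 1;
    }

    // How many master-eligible nodes can be lost while a majority remains.
    static int tolerableFailures(int masterEligibleNodes) {
        return masterEligibleNodes - majority(masterEligibleNodes);
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 5; n++) {
            System.out.println(n + " node(s): majority=" + majority(n)
                    + ", tolerable failures=" + tolerableFailures(n));
        }
    }
}
```

With two master-eligible nodes the majority is two, so a majority-based election tolerates no node loss; with three it tolerates one.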
Look in the logs from before you restarted the nodes for anything related
to "added", "removed", or "ping". We need to be able to piece together
the sequence of events. How much data are you indexing? How many
client threads? Bulk or one doc at a time?
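One way to pull those lines out of a large log is a small filter like the sketch below. The keywords are the ones named above; the class name and the command-line argument (a path to a log file) are assumptions for illustration, and the match is a plain case-insensitive substring test, so it may also catch unrelated lines that happen to contain those words.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: keeps only log lines that mention cluster membership events.
final class ClusterEventFilter {

    // Keywords suggested in the reply above.
    static final List<String> KEYWORDS = Arrays.asList("added", "removed", "ping");

    // Returns matching lines in their original order, so the sequence of events is preserved.
    static List<String> filter(List<String> logLines) {
        List<String> hits = new ArrayList<String>();
        for (String line : logLines) {
            String lower = line.toLowerCase(Locale.ROOT);
            for (String keyword : KEYWORDS) {
                if (lower.contains(keyword)) {
                    hits.add(line);
                    break; // one match is enough; avoid adding the same line twice
                }
            }
        }
        return hits;
    }

    // Usage: java ClusterEventFilter.java <path-to-log-file>
    public static void main(String[] args) throws Exception {
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        for (String hit : filter(lines)) {
            System.out.println(hit);
        }
    }
}
```

Running it against the log of each node and comparing timestamps should make it easier to line up what each node saw.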
We are indexing a large amount of data through Elasticsearch, around 20 GB. There is one client thread per Elasticsearch node, so we have two clients. We use ZooKeeper to synchronize loading across these two client threads, so at any time only one client is writing. Note, however, that this one client can be accessed by multiple threads, each indexing a different index. And yes, we are using a bulk load writer, which commits to the cluster every 10,000 records.
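To make the loading pattern concrete, here is a minimal sketch of a writer that buffers records and hands them off in fixed-size batches. It is not our actual writer and it does not call Elasticsearch: `BatchingWriter` and its `sink` callback are hypothetical names, with the sink standing in for whatever sends the bulk request. The methods are synchronized because, as described above, several threads may share one writer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch: buffers records and passes them to a sink in fixed-size batches.
final class BatchingWriter<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink; // stands in for the bulk request
    private List<T> buffer = new ArrayList<T>();

    BatchingWriter(int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    // Adds one record and flushes as soon as the buffer reaches batchSize.
    synchronized void add(T record) {
        buffer.add(record);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Sends whatever is buffered; call once at the end to commit the last partial batch.
    synchronized void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        List<T> batch = buffer;
        buffer = new ArrayList<T>(); // start a fresh buffer before handing the batch off
        sink.accept(batch);
    }
}
```

In our setup the batch size would be 10,000, e.g. `new BatchingWriter<String>(10000, batch -> sendBulk(batch))`, where `sendBulk` is a placeholder for the real bulk call.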