Failed to start shard


(shengcer) #1

Hi,

My Elasticsearch instance has crashed for the 3rd time after the
system rebooted and the instance was bounced as a result. I would much
appreciate it if someone could help me out with this...

I have tried increasing the number of replicas and upgrading the Elasticsearch
instance to the newer version 0.19.8, but neither helped.
Every time Elasticsearch crashed, the pattern was much the same (listed
below). The warning keeps repeating, which suggests almost all indices have
this issue, and it produces a log of monstrous size. For your information, I
indexed 30 indices with size around 25GB into an Elasticsearch cluster with
2 nodes, using 5 shards and 2 replicas per index.

WARNING: [dev-bry200163108d] [coverage-elastic1346994255418][0] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [coverage-elastic1346994255418][0] failed recovery
at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:228)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
Caused by: org.elasticsearch.index.engine.EngineCreationFailureException: [coverage-elastic1346994255418][0] Failed to open reader on writer
at org.elasticsearch.index.engine.robin.RobinEngine.start(RobinEngine.java:286)
at org.elasticsearch.index.shard.service.InternalIndexShard.performRecoveryPrepareForTranslog(InternalIndexShard.java:579)
at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:188)
at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:177)
... 3 more

I do notice that if I disable flush on the translog
(index.translog.disable_flush=true), the cluster is fine even if it is
killed by a system reboot. If I can guarantee that a flush is executed
after every index write/update operation, can I safely leave the translog
flush disabled forever?
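
For anyone who wants to reproduce the setup described above, here is a minimal Python sketch of the two REST calls involved: toggling index.translog.disable_flush through the update-settings endpoint and issuing an explicit flush. The base URL (http://localhost:9200) and the helper names are assumptions made for illustration, not details from this thread, and the sketch does not settle whether leaving the automatic flush disabled is safe.

```python
import json
import urllib.request


def settings_request(base_url, index, disable_flush):
    """Build a PUT /{index}/_settings request that toggles translog flushing.

    base_url is a hypothetical HTTP endpoint; adjust it to your node.
    """
    body = json.dumps({"index": {"translog.disable_flush": disable_flush}})
    return urllib.request.Request(
        f"{base_url}/{index}/_settings",
        data=body.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def flush_request(base_url, index):
    """Build a POST /{index}/_flush request (an explicit flush)."""
    return urllib.request.Request(
        f"{base_url}/{index}/_flush", data=b"", method="POST"
    )


def send(request):
    """Send a prepared request and return the decoded response body."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

A typical sequence would be send(settings_request(url, index, True)) once, then send(flush_request(url, index)) after each batch of writes.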

--


(shengcer) #2

I forgot to mention that the OS hard limit on open files is 16384, and I
wrote a tool that always starts Elasticsearch with that limit. When the OS
reboots, that tool is called automatically to bring up Elasticsearch. But I
still get a "Too many open files" error, and the log shows some files as missing:

WARNING: [dev-bry200163111d] sending failed shard for
[client-elastic1347095014701][1], node[coS8PH1jQdq78FCkjY_TLw], [R],
s[INITIALIZING], reason [Failed to start shard, message
[RecoveryFailedException[[client-elastic1347095014701][1]: Recovery failed
from
[dev-bry200163108d][qqCnD112Qiqu8qLWxelo1w][inet[/171.150.210.106:14701]]
into
[dev-bry200163111d][coS8PH1jQdq78FCkjY_TLw][inet[/171.150.210.107:14701]]];
nested:
RemoteTransportException[[dev-bry200163108d][inet[/171.150.210.106:14701]][index/shard/recovery/startRecovery]];
nested: RecoveryEngineException[[client-elastic1347095014701][1] Phase[1]
Execution failed]; nested:
RecoverFilesRecoveryException[[client-elastic1347095014701][1] Failed to
transfer [1] files with total size of [147.9mb]]; nested:
FileNotFoundException[/.statelite/tmpfs/data/clobber/local0/services/dev/elasticsearch/elasticsearch-0.0.0/cache/clobber-dev/nodes/0/indices/client-elastic1347095014701/1/index/segments_m
(Too many open files)]; ]]

This only happens when the cluster crashes in a disastrous manner, so I
don't really buy that raising the OS hard limit would be the fix.
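
As a way to double-check what limit a launcher like the one described above really applies, here is a minimal Python sketch that raises the soft open-files limit toward a target before the server process is started. The function name and the 16384 default are illustrative assumptions; the actual tool used in this thread is not shown, and this is not a claim that the limit is the root cause.

```python
import resource


def raise_nofile_limit(target=16384):
    """Raise the soft RLIMIT_NOFILE toward target, capped by the hard limit.

    Returns the (soft, hard) pair in effect afterwards. The soft limit is
    never lowered, and the hard limit is left untouched.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # An unprivileged process can only raise the soft limit up to the hard one.
    wanted = target if hard == resource.RLIM_INFINITY else min(target, hard)
    if soft == resource.RLIM_INFINITY or soft >= wanted:
        return soft, hard  # already at or above the requested value
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
    except (ValueError, OSError):
        pass  # the platform refused the change; report what is in effect
    return resource.getrlimit(resource.RLIMIT_NOFILE)


if __name__ == "__main__":
    # Print the limits a child process started from here would inherit.
    print(raise_nofile_limit())
```

Logging the returned pair at startup shows whether the process inherited the intended limit after a reboot.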

On Saturday, September 8, 2012 9:05:09 PM UTC-4, Sheng wrote:


--


(system) #3