Failed to create shard exception

I got the following exception in my Elasticsearch log file:

[2016-04-07 11:26:37,740][WARN ][indices.cluster ] [data-node-vm2] [[mmx_20160210][2]] marking and sending shard failed due to [failed to create shard]
org.elasticsearch.index.shard.IndexShardCreationException: [mmx_20160210][2] failed to create shard
at org.elasticsearch.index.IndexService.createShard(IndexService.java:357)
at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyInitializingShard(IndicesClusterStateService.java:704)
at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyNewOrUpdatedShards(IndicesClusterStateService.java:605)
at org.elasticsearch.indices.cluster.IndicesClusterStateService.clusterChanged(IndicesClusterStateService.java:185)
at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:480)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:188)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:158)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.lucene.store.LockObtainFailedException: Can't lock shard [mmx_20160210][2], timed out after 5000ms
at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:582)
at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:510)
at org.elasticsearch.index.IndexService.createShard(IndexService.java:310)

What are the possible reasons for "failed to create shard" and "can't lock shard, timed out after xxxms"?

Which version of Elasticsearch are you running?

Could you run the following command and post the result here?

curl -XGET "http://localhost:9202/_nodes/hot_threads?threads=10000&ignore_idle_threads=false"

Thanks Igor,
we are running ES 1.7.1.
Regarding the hot_threads result, the output is around 9 MB, so I am afraid I cannot post the whole thing here.

What is the best way to send the output to you?

You can email it to me - it's my first name @elastic.co. However, since it is 1.7.1, it's most likely one of the known issues that were causing deadlocks like this. They were fixed in 2.1.

It seems the hot_threads API only returns data for the current day? So far all the results are for 8 April, but the issue we hit was on 7 April, and the cluster works fine now.
Is there any way to get historical hot threads data?

Hot threads is a point-in-time API; there is no history for it.
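
If you need this data for future incidents, one workaround is to sample the API yourself on a schedule while the problem is happening. A rough sketch (assuming the same node address and port 9202 as the command above, and enough disk space for the output):

# sample hot threads once a minute into timestamped files (hypothetical helper loop)
while true; do
  curl -s -XGET "http://localhost:9202/_nodes/hot_threads?threads=10000&ignore_idle_threads=false" > "hot_threads_$(date +%Y%m%d_%H%M%S).txt"
  sleep 60
done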

Thanks.
So the only way forward is to dig into the logs for more hints.
I will update you when I find something.

Judging from the logs that you sent me, a large portion of your cluster disconnected from the master node. It could have been caused by network disconnects or by node overload (for example, because a node ran out of memory). A careful analysis of the log files on the master node and on the nodes that disconnected could reveal the root cause. Everything else seems to be a consequence of this initial disconnect.

Thank you very much!

Since we built our cluster on top of Azure IaaS VMs and a virtual network, there might have been a network issue during that time period.

Still, a few questions need your confirmation:

  1. Regarding the possibility of a node running out of memory, is there any specific ES exception for this kind of issue? I didn't find any OOM errors in our nodes' log files.

  2. As I mentioned in the email, we found that CPU usage on several data nodes was greater than 90%. Could the disconnection of so many nodes ultimately cause that high CPU usage? BTW: per your report, there was NO traffic spike to our cluster.

  3. Two indices are still in RED status even after all nodes rejoined the cluster. Is there any issue on the ES side, or any way to recover these two indices? I tried closing and re-opening them, but that did not work.

Thanks for your time!

Br
-Xudong

Sometimes when a node runs out of memory you can see an OOM exception in the log, but this is not always the case. A node can sometimes get into a state where it goes into longer and longer GC cycles without ever throwing an OOM exception. So the best way to detect this condition is by monitoring Elasticsearch heap usage.
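
For example, a quick way to check heap usage is the node stats API (a sketch, assuming the default HTTP port 9200):

curl -XGET "http://localhost:9200/_nodes/stats/jvm?pretty"

Look at jvm.mem.heap_used_percent for each node; heap that stays close to 100% usually goes hand in hand with the long GC cycles described above.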

One processor or the node's overall capacity? That is, is it 90% of a single core, or are pretty much all cores at 90%?

It depends on why the indices are red. Typically, by analyzing the logs on the master you can determine why the primary shards for these indices were not allocated.
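
To see which shards are still unassigned and keeping the indices red, something like this can help narrow the search (a sketch, assuming port 9200):

curl -XGET "http://localhost:9200/_cat/shards?v" | grep UNASSIGNED

Each line of the output names an index and shard that has no copy allocated, which tells you what to look for in the master logs.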

Thanks Igor.
I did find the following GC warnings:

[2016-04-07 10:24:05,017][WARN ][monitor.jvm ] [data-node-vm3] [gc][old][1458984][36] duration [37.6s], collections [1]/[1.2m], total [37.6s]/[1.2m], memory [5gb]->[4.1gb]/[6.9gb], all_pools {[young] [1.6mb]->[386.5kb]/[532.5mb]}{[survivor] [66.5mb]->[43.6mb]/[66.5mb]}{[old] [4.9gb]->[4.1gb]/[6.3gb]}

[2016-04-07 10:29:37,438][WARN ][monitor.jvm ] [data-node-vm4] [gc][old][1458801][18] duration [26.4s], collections [1]/[1m], total [26.4s]/[41.5s], memory [6.2gb]->[4.6gb]/[6.9gb], all_pools {[young] [431.1mb]->[592.4kb]/[532.5mb]}{[survivor] [66.5mb]->[66.5mb]/[66.5mb]}{[old] [5.7gb]->[4.5gb]/[6.3gb]}

But all these warnings happened around 1.5 hours after several nodes suddenly disconnected, so it seems most likely that this was a network issue.

BTW: Could the disconnection of several nodes eventually cause the GC issue?

As a lesson learnt, next time we see a similar issue we will check the hot threads right away.

Given your GC is taking longer than 30 seconds, it's probable that this is causing the timeouts, as that is the zen discovery timeout threshold.
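
For reference, the threshold mentioned here is the zen discovery fault-detection ping timeout, which defaults to 30 seconds. If I remember the 1.x settings correctly, they look roughly like this in elasticsearch.yml (raising them only hides the symptom; the long GC pauses are the real problem):

discovery.zen.fd.ping_interval: 1s
discovery.zen.fd.ping_timeout: 30s
discovery.zen.fd.ping_retries: 3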

What is the best practice to avoid such long GC pauses? Before this long GC, the only exceptions I found in the log were the sudden disconnections of several nodes.

Please check the chapters on monitoring and heap sizing in the documentation. I suspect these long GC pauses might have been caused by swapping. If that is the case, it might have caused nodes to leave the cluster as well.
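
If swapping is indeed the cause, the usual mitigation on 1.x is to lock the heap in memory and size it explicitly. A sketch, assuming a Linux package install and a machine with enough physical RAM:

# elasticsearch.yml: ask the JVM to lock its heap so the OS cannot swap it out
bootstrap.mlockall: true

# environment (e.g. /etc/default/elasticsearch): heap around half of RAM, below 32 GB
ES_HEAP_SIZE=7g

You can verify the lock took effect by checking that process.mlockall is true in the output of curl -XGET "http://localhost:9200/_nodes/process?pretty"; on many systems you also need to raise the memlock ulimit for the elasticsearch user.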

I also encountered the same problem. My logs:

org.elasticsearch.index.shard.IndexShardCreationException: [xs_meta][0] failed to create shard
at org.elasticsearch.index.IndexService.createShard(IndexService.java:344)
at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyInitializingShard(IndicesClusterStateService.java:704)
at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyNewOrUpdatedShards(IndicesClusterStateService.java:605)
at org.elasticsearch.indices.cluster.IndicesClusterStateService.clusterChanged(IndicesClusterStateService.java:185)
at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:480)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:204)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:167)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.ElasticsearchParseException: Failed to parse [64mb]
at org.elasticsearch.common.unit.TimeValue.parseTimeValue(TimeValue.java:256)
at org.elasticsearch.common.settings.ImmutableSettings.getAsTime(ImmutableSettings.java:381)
at org.elasticsearch.index.translog.TranslogService.(TranslogService.java:85)
at sun.reflect.GeneratedConstructorAccessor43.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at org.elasticsearch.common.inject.DefaultConstructionProxyFactory$1.newInstance(DefaultConstructionProxyFactory.java:54)
at org.elasticsearch.common.inject.ConstructorInjector.construct(ConstructorInjector.java:86)
at org.elasticsearch.common.inject.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:98)
at org.elasticsearch.common.inject.ProviderToInternalFactoryAdapter$1.call(ProviderToInternalFactoryAdapter.java:45)
at org.elasticsearch.common.inject.InjectorImpl.callInContext(InjectorImpl.java:837)
at org.elasticsearch.common.inject.ProviderToInternalFactoryAdapter.get(ProviderToInternalFactoryAdapter.java:42)
at org.elasticsearch.common.inject.Scopes$1$1.get(Scopes.java:57)
at org.elasticsearch.common.inject.InternalFactoryToProviderAdapter.get(InternalFactoryToProviderAdapter.java:45)
at org.elasticsearch.common.inject.InjectorBuilder$1.call(InjectorBuilder.java:200)
at org.elasticsearch.common.inject.InjectorBuilder$1.call(InjectorBuilder.java:193)
at org.elasticsearch.common.inject.InjectorImpl.callInContext(InjectorImpl.java:830)
at org.elasticsearch.common.inject.InjectorBuilder.loadEagerSingletons(InjectorBuilder.java:193)
at org.elasticsearch.common.inject.InjectorBuilder.injectDynamically(InjectorBuilder.java:175)
at org.elasticsearch.common.inject.InjectorBuilder.build(InjectorBuilder.java:110)
at org.elasticsearch.common.inject.InjectorImpl.createChildInjector(InjectorImpl.java:134)
at org.elasticsearch.common.inject.ModulesBuilder.createChildInjector(ModulesBuilder.java:69)
at org.elasticsearch.index.IndexService.createShard(IndexService.java:342)
... 9 more
Caused by: java.lang.NumberFormatException: For input string: "64mb"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Long.parseLong(Long.java:589)
at java.lang.Long.parseLong(Long.java:631)
at org.elasticsearch.common.unit.TimeValue.parseTimeValue(TimeValue.java:252)
... 31 more

64mb is not a time value.
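
It looks like one of the translog time settings for that index was given a size value. A quick way to find it (a sketch, assuming port 9200 and the index name from your stack trace):

curl -XGET "http://localhost:9200/xs_meta/_settings?pretty"

Look for a translog setting whose value is 64mb but whose name expects a duration (for example index.translog.interval or index.translog.flush_threshold_period, as opposed to index.translog.flush_threshold_size, which does take a size).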


I don't know where the 64mb comes from, because there is no 64mb value in my data. When I downloaded the elasticsearch tar file again and extracted it with tar -zxvf, the problem went away.

So the main reason seems to be that an Elasticsearch jar file or sh file was damaged. How strange this problem is!
But I still do not know which file was damaged. I am sure the index files are not damaged.

That is the issue; it's in the logs.

Please start a new thread for this, it's ultimately a different issue.

OK, I have resolved this problem now.