I got the following exception in the Elasticsearch log file:
[2016-04-07 11:26:37,740][WARN ][indices.cluster ] [data-node-vm2] [[mmx_20160210][2]] marking and sending shard failed due to [failed to create shard]
org.elasticsearch.index.shard.IndexShardCreationException: [mmx_20160210][2] failed to create shard
at org.elasticsearch.index.IndexService.createShard(IndexService.java:357)
at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyInitializingShard(IndicesClusterStateService.java:704)
at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyNewOrUpdatedShards(IndicesClusterStateService.java:605)
at org.elasticsearch.indices.cluster.IndicesClusterStateService.clusterChanged(IndicesClusterStateService.java:185)
at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:480)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:188)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:158)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.lucene.store.LockObtainFailedException: Can't lock shard [mmx_20160210][2], timed out after 5000ms
at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:582)
at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:510)
at org.elasticsearch.index.IndexService.createShard(IndexService.java:310)
What are the possible reasons for "failed to create shard" and "Can't lock shard, timed out after xxxms"?
You can email it to me - it's my first name @elastic.co. However, since it is 1.7.1, it's most likely one of the known issues that were causing deadlocks like this. They were fixed in 2.1.
It seems the hot_threads API only returns data for the current day? So far all the results are for 8 April, but the issue we hit was on 7 April, and the cluster works fine now.
Is there any way to get historical hot threads data?
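For reference, this is roughly how the hot threads API is called (a sketch; localhost:9200 and the node name are assumptions). It only captures a live snapshot of what a node is doing at the moment of the request, which is why it cannot report on a past incident by itself:

curl -s 'http://localhost:9200/_nodes/hot_threads?threads=5'
# or for a single node, e.g.:
curl -s 'http://localhost:9200/_nodes/data-node-vm2/hot_threads?threads=5'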
Judging from the logs that you sent me, a large portion of your cluster has disconnected from the master node. It could have been caused by network disconnects or node overload (for example, because of a node running out of memory). A careful analysis of the log files on the master node and on the nodes that disconnected could reveal the root cause. Everything else seems to be a consequence of this initial disconnect.
Since we built our cluster on top of Azure IaaS VMs and a virtual network, there might have been a network issue during that time period.
A few questions still need your confirmation:
Regarding the possibility of a node running out of memory, is there any specific ES exception for that kind of issue? I didn't find any OOM errors in our nodes' log files.
As I mentioned in the email, we found that the CPU processor time on several data nodes was greater than 90%. Could the many node disconnections ultimately cause this high CPU usage? BTW: per your report, there was NO traffic spike to our cluster.
Two indices are still in RED status even after all nodes were added back to the cluster. Is there any issue on the ES side, or any way to recover these two indices? I tried closing and re-opening them, which did not work.
Sometimes when a node runs out of memory you can see an OOM exception in the log, but this is not always the case. A node can sometimes get into a state where it goes through longer and longer GC cycles without actually throwing any OOM exceptions. So, the best way to detect this condition is by monitoring Elasticsearch heap usage.
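As a rough illustration of what monitoring heap usage can look like from the command line (localhost:9200 is an assumption; any node's HTTP endpoint works), the nodes stats API exposes heap and GC counters:

# Heap usage and cumulative GC time per node; a heap_used_percent that stays high
# while old-gen collection times keep growing usually indicates memory pressure.
curl -s 'http://localhost:9200/_nodes/stats/jvm?pretty' | grep -E '"name"|heap_used_percent|collection_time_in_millis'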
One processor or overall node capacity? That is, is it 90% of a single core, or are pretty much all cores at 90%?
It depends on why the indices are red. Typically, by analyzing the logs on the master you can determine why the primary shards for these indices were not allocated.
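For example, a quick way to see which shards are stuck (a sketch; the host is an assumption) is the cat shards API:

# Lists every shard with its state; the red indices are the ones whose
# primary shards (prirep = p) show up here as UNASSIGNED.
curl -s 'http://localhost:9200/_cat/shards?v' | grep UNASSIGNED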
What is the best practice to avoid such long GC pauses? Before this long GC, the only exceptions I found in the log were the sudden disconnections of several nodes.
Please check the chapters on monitoring and heap sizing. I suspect these long GC pauses might have been caused by swapping. If that is the case, it might have caused nodes to leave the cluster as well.
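For what it's worth, a minimal sketch of the usual anti-swapping setup on Elasticsearch 1.x (assuming root access on the VMs and the default config paths):

# In elasticsearch.yml on each node (1.x setting name):
#   bootstrap.mlockall: true
# Then keep the OS from swapping the heap out:
sudo swapoff -a                     # disable swap entirely, or
sudo sysctl vm.swappiness=1         # strongly discourage swapping
# Verify that the lock actually took effect after a restart:
curl -s 'http://localhost:9200/_nodes/process?pretty' | grep mlockall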
org.elasticsearch.index.shard.IndexShardCreationException: [xs_meta][0] failed to create shard
at org.elasticsearch.index.IndexService.createShard(IndexService.java:344)
at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyInitializingShard(IndicesClusterStateService.java:704)
at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyNewOrUpdatedShards(IndicesClusterStateService.java:605)
at org.elasticsearch.indices.cluster.IndicesClusterStateService.clusterChanged(IndicesClusterStateService.java:185)
at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:480)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:204)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:167)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.ElasticsearchParseException: Failed to parse [64mb]
at org.elasticsearch.common.unit.TimeValue.parseTimeValue(TimeValue.java:256)
at org.elasticsearch.common.settings.ImmutableSettings.getAsTime(ImmutableSettings.java:381)
at org.elasticsearch.index.translog.TranslogService.<init>(TranslogService.java:85)
at sun.reflect.GeneratedConstructorAccessor43.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at org.elasticsearch.common.inject.DefaultConstructionProxyFactory$1.newInstance(DefaultConstructionProxyFactory.java:54)
at org.elasticsearch.common.inject.ConstructorInjector.construct(ConstructorInjector.java:86)
at org.elasticsearch.common.inject.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:98)
at org.elasticsearch.common.inject.ProviderToInternalFactoryAdapter$1.call(ProviderToInternalFactoryAdapter.java:45)
at org.elasticsearch.common.inject.InjectorImpl.callInContext(InjectorImpl.java:837)
at org.elasticsearch.common.inject.ProviderToInternalFactoryAdapter.get(ProviderToInternalFactoryAdapter.java:42)
at org.elasticsearch.common.inject.Scopes$1$1.get(Scopes.java:57)
at org.elasticsearch.common.inject.InternalFactoryToProviderAdapter.get(InternalFactoryToProviderAdapter.java:45)
at org.elasticsearch.common.inject.InjectorBuilder$1.call(InjectorBuilder.java:200)
at org.elasticsearch.common.inject.InjectorBuilder$1.call(InjectorBuilder.java:193)
at org.elasticsearch.common.inject.InjectorImpl.callInContext(InjectorImpl.java:830)
at org.elasticsearch.common.inject.InjectorBuilder.loadEagerSingletons(InjectorBuilder.java:193)
at org.elasticsearch.common.inject.InjectorBuilder.injectDynamically(InjectorBuilder.java:175)
at org.elasticsearch.common.inject.InjectorBuilder.build(InjectorBuilder.java:110)
at org.elasticsearch.common.inject.InjectorImpl.createChildInjector(InjectorImpl.java:134)
at org.elasticsearch.common.inject.ModulesBuilder.createChildInjector(ModulesBuilder.java:69)
at org.elasticsearch.index.IndexService.createShard(IndexService.java:342)
... 9 more
Caused by: java.lang.NumberFormatException: For input string: "64mb"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Long.parseLong(Long.java:589)
at java.lang.Long.parseLong(Long.java:631)
at org.elasticsearch.common.unit.TimeValue.parseTimeValue(TimeValue.java:252)
... 31 more
I don't know where the 64mb value comes from, because there is no 64mb value in my data. When I downloaded the elasticsearch .tar file again and extracted it with tar -zxvf, the node recovered.
So the main reason seems to be that an Elasticsearch jar file or shell script was damaged. How strange this problem is!
But I still don't know which file was damaged. I am sure the index files are not damaged.
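Judging from the stack trace, TranslogService passed the value 64mb to TimeValue.parseTimeValue via ImmutableSettings.getAsTime, i.e. a byte-size value ended up where a time value was expected in the translog settings. One way to hunt for it (a sketch; the config path is an assumption, the index name is taken from the trace) is to grep the index and node settings:

curl -s 'http://localhost:9200/xs_meta/_settings?pretty' | grep -n '64mb'
# Also check each node's config for translog settings that carry a size where a time is expected:
grep -n 'translog' /etc/elasticsearch/elasticsearch.yml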