I got the following exception in the Elasticsearch log file:
[2016-04-07 11:26:37,740][WARN ][indices.cluster ] [data-node-vm2] [[mmx_20160210][2]] marking and sending shard failed due to [failed to create shard]
org.elasticsearch.index.shard.IndexShardCreationException: [mmx_20160210][2] failed to create shard
at org.elasticsearch.index.IndexService.createShard(IndexService.java:357)
at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyInitializingShard(IndicesClusterStateService.java:704)
at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyNewOrUpdatedShards(IndicesClusterStateService.java:605)
at org.elasticsearch.indices.cluster.IndicesClusterStateService.clusterChanged(IndicesClusterStateService.java:185)
at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:480)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:188)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:158)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.lucene.store.LockObtainFailedException: Can't lock shard [mmx_20160210][2], timed out after 5000ms
at org.elasticsearch.env.NodeEnvironment$InternalShardLock.acquire(NodeEnvironment.java:582)
at org.elasticsearch.env.NodeEnvironment.shardLock(NodeEnvironment.java:510)
at org.elasticsearch.index.IndexService.createShard(IndexService.java:310)
What are the possible reasons for "failed to create shard" and "Can't lock shard, timed out after xxxms"?
You can email it to me - it's my first name @elastic.co. However, since it is 1.7.1, it's most likely one of the known issues that were causing deadlocks like this. They were fixed in 2.1.
It seems the hot_threads API only returns data for the current day? So far all the results are for 8 April, but the issue we hit was on 7 April, and the cluster works fine now.
Is there any way to get historical hot threads data?
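For reference, this is roughly how the hot threads API is called (a sketch; localhost:9200 and the node name are assumptions). It only captures a live snapshot of what a node is doing at the moment of the request, which is why it cannot report on a past incident by itself:

curl -s 'http://localhost:9200/_nodes/hot_threads?threads=5'
# or for a single node, e.g.:
curl -s 'http://localhost:9200/_nodes/data-node-vm2/hot_threads?threads=5'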
Judging from the logs that you sent me, a large portion of your cluster has disconnected from the master node. It could have been caused by network disconnects or node overload (for example, because of a node running out of memory). A careful analysis of the log files on the master node and on the nodes that disconnected could reveal the root cause. Everything else seems to be a consequence of this initial disconnect.
Since we built our cluster on top of Azure IaaS VMs and a virtual network, there might have been a network issue during that time period.
A few questions still need your confirmation:
Regarding the possibility of a node running out of memory, is there any specific ES exception for that kind of issue? I didn't find any OOM errors in our nodes' log files.
As I mentioned in the email, we found that the CPU processor time on several data nodes was greater than 90%. Could the many node disconnections ultimately cause this high CPU usage? BTW: per your report, there was NO traffic spike to our cluster.
Two indices are still in RED status even after all nodes were added back to the cluster. Is there any issue on the ES side, or any way to recover these two indices? I tried closing and re-opening them, which did not work.
Sometimes when a node runs out of memory you can see an OOM exception in the log, but this is not always the case. A node can sometimes get into a state where it goes through longer and longer GC cycles without actually throwing any OOM exceptions. So, the best way to detect this condition is by monitoring Elasticsearch heap usage.
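As a rough illustration of what monitoring heap usage can look like from the command line (localhost:9200 is an assumption; any node's HTTP endpoint works), the nodes stats API exposes heap and GC counters:

# Heap usage and cumulative GC time per node; a heap_used_percent that stays high
# while old-gen collection times keep growing usually indicates memory pressure.
curl -s 'http://localhost:9200/_nodes/stats/jvm?pretty' | grep -E '"name"|heap_used_percent|collection_time_in_millis'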
One processor or overall node capacity? That is, is it 90% of a single core, or are pretty much all cores at 90%?
It depends on why the indices are red. Typically, by analyzing the logs on the master you can determine why the primary shards for these indices were not allocated.
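For example, a quick way to see which shards are stuck (a sketch; the host is an assumption) is the cat shards API:

# Lists every shard with its state; the red indices are the ones whose
# primary shards (prirep = p) show up here as UNASSIGNED.
curl -s 'http://localhost:9200/_cat/shards?v' | grep UNASSIGNED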
What is the best practice to avoid such long GC pauses? Before this long GC, the only exceptions I found in the log were the sudden disconnections of several nodes.
Please check the chapters on monitoring and heap sizing. I suspect these long GC pauses might have been caused by swapping. If that is the case, it might have caused nodes to leave the cluster as well.
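For what it's worth, a minimal sketch of the usual anti-swapping setup on Elasticsearch 1.x (assuming root access on the VMs and the default config paths):

# In elasticsearch.yml on each node (1.x setting name):
#   bootstrap.mlockall: true
# Then keep the OS from swapping the heap out:
sudo swapoff -a                     # disable swap entirely, or
sudo sysctl vm.swappiness=1         # strongly discourage swapping
# Verify that the lock actually took effect after a restart:
curl -s 'http://localhost:9200/_nodes/process?pretty' | grep mlockall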
org.elasticsearch.index.shard.IndexShardCreationException: [xs_meta][0] failed to create shard
at org.elasticsearch.index.IndexService.createShard(IndexService.java:344)
at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyInitializingShard(IndicesClusterStateService.java:704)
at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyNewOrUpdatedShards(IndicesClusterStateService.java:605)
at org.elasticsearch.indices.cluster.IndicesClusterStateService.clusterChanged(IndicesClusterStateService.java:185)
at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:480)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:204)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:167)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.ElasticsearchParseException: Failed to parse [64mb]
at org.elasticsearch.common.unit.TimeValue.parseTimeValue(TimeValue.java:256)
at org.elasticsearch.common.settings.ImmutableSettings.getAsTime(ImmutableSettings.java:381)
at org.elasticsearch.index.translog.TranslogService.<init>(TranslogService.java:85)
at sun.reflect.GeneratedConstructorAccessor43.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
at org.elasticsearch.common.inject.DefaultConstructionProxyFactory$1.newInstance(DefaultConstructionProxyFactory.java:54)
at org.elasticsearch.common.inject.ConstructorInjector.construct(ConstructorInjector.java:86)
at org.elasticsearch.common.inject.ConstructorBindingImpl$Factory.get(ConstructorBindingImpl.java:98)
at org.elasticsearch.common.inject.ProviderToInternalFactoryAdapter$1.call(ProviderToInternalFactoryAdapter.java:45)
at org.elasticsearch.common.inject.InjectorImpl.callInContext(InjectorImpl.java:837)
at org.elasticsearch.common.inject.ProviderToInternalFactoryAdapter.get(ProviderToInternalFactoryAdapter.java:42)
at org.elasticsearch.common.inject.Scopes$1$1.get(Scopes.java:57)
at org.elasticsearch.common.inject.InternalFactoryToProviderAdapter.get(InternalFactoryToProviderAdapter.java:45)
at org.elasticsearch.common.inject.InjectorBuilder$1.call(InjectorBuilder.java:200)
at org.elasticsearch.common.inject.InjectorBuilder$1.call(InjectorBuilder.java:193)
at org.elasticsearch.common.inject.InjectorImpl.callInContext(InjectorImpl.java:830)
at org.elasticsearch.common.inject.InjectorBuilder.loadEagerSingletons(InjectorBuilder.java:193)
at org.elasticsearch.common.inject.InjectorBuilder.injectDynamically(InjectorBuilder.java:175)
at org.elasticsearch.common.inject.InjectorBuilder.build(InjectorBuilder.java:110)
at org.elasticsearch.common.inject.InjectorImpl.createChildInjector(InjectorImpl.java:134)
at org.elasticsearch.common.inject.ModulesBuilder.createChildInjector(ModulesBuilder.java:69)
at org.elasticsearch.index.IndexService.createShard(IndexService.java:342)
... 9 more
Caused by: java.lang.NumberFormatException: For input string: "64mb"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Long.parseLong(Long.java:589)
at java.lang.Long.parseLong(Long.java:631)
at org.elasticsearch.common.unit.TimeValue.parseTimeValue(TimeValue.java:252)
... 31 more
I don't know where the 64mb value comes from, because there is no 64mb value in my data. When I downloaded the elasticsearch .tar file again and extracted it with tar -zxvf, the node recovered.
So the main reason seems to be that an Elasticsearch jar file or shell script was damaged. How strange this problem is!
But I still don't know which file was damaged. I am sure the index files are not damaged.
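Judging from the stack trace, TranslogService passed the value 64mb to TimeValue.parseTimeValue via ImmutableSettings.getAsTime, i.e. a byte-size value ended up where a time value was expected in the translog settings. One way to hunt for it (a sketch; the config path is an assumption, the index name is taken from the trace) is to grep the index and node settings:

curl -s 'http://localhost:9200/xs_meta/_settings?pretty' | grep -n '64mb'
# Also check each node's config for translog settings that carry a size where a time is expected:
grep -n 'translog' /etc/elasticsearch/elasticsearch.yml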