A shard closed suddenly while indexing data

Elasticsearch version

bin/elasticsearch --version
7.1.1

Logstash version

7.1.1

Plugins installed

bin/elasticsearch-plugin list
analysis-ik 

JVM version

java -version
java version "1.8.0_91"
Java(TM) SE Runtime Environment (build 1.8.0_91-b14)
Java HotSpot(TM) 64-Bit Server VM (build 25.91-b14, mixed mode)

OS version

uname -a
Linux log-es05.com 2.6.32-642.6.2.el6.x86_64 #1 SMP Wed Oct 26 06:52:09 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

Related config

node.master: false
node.data: true
node.ingest: true
node.ml: false
xpack.ml.enabled: true

Issue

I am doing log analysis with a Filebeat (7.1.1) -> Logstash (7.1.1) -> Elasticsearch (7.1.1) pipeline.
Suddenly one of my index's shards was relocated, and I found the following error in the master node's log.
If you want the complete log, I have posted the whole file here: https://github.com/chenchuangc/test_git/blob/master/es_error.md

[2019-08-07T14:52:05,693][WARN ][o.e.c.r.a.AllocationService] [ES01] failing shard [failed shard, shard [ktest-service-2019.08.07][24], node[fkr_Uo8IROeC-SN2Qs0MbA], [R], s[STARTED], a[id=ViYqHgZBR4mdVXa76ZDyUQ], message [failed to perform indices:data/write/bulk[s] on replica [ktest-service-2019.08.07][24], node[fkr_Uo8IROeC-SN2Qs0MbA], [R], s[STARTED], a[id=ViYqHgZBR4mdVXa76ZDyUQ]], failure [RemoteTransportException[[ESV11][10.66.3.247:9300][indices:data/write/bulk[s][r]]]; nested: AlreadyClosedException[[ktest-service-2019.08.07][24] engine is closed]; nested: ArrayIndexOutOfBoundsException[67]; ], markAsStale [true]]
org.elasticsearch.transport.RemoteTransportException: [ESV11][10.66.3.247:9300][indices:data/write/bulk[s][r]]
Caused by: org.apache.lucene.store.AlreadyClosedException: [ktest-service-2019.08.07][24] engine is closed
        at org.elasticsearch.index.engine.Engine.ensureOpen(Engine.java:789) ~[elasticsearch-7.1.1.jar:7.1.1]
        at org.elasticsearch.index.engine.Engine.ensureOpen(Engine.java:798) ~[elasticsearch-7.1.1.jar:7.1.1]
        at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:826) ~[elasticsearch-7.1.1.jar:7.1.1]
        at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:789) ~[elasticsearch-7.1.1.jar:7.1.1]
        at org.elasticsearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:762) ~[elasticsearch-7.1.1.jar:7.1.1]
        at org.elasticsearch.index.shard.IndexShard.applyIndexOperationOnReplica(IndexShard.java:726) ~[elasticsearch-7.1.1.jar:7.1.1]
        at org.elasticsearch.action.bulk.TransportShardBulkAction.performOpOnReplica(TransportShardBulkAction.java:416) ~[elasticsearch-7.1.1.jar:7.1.1]
        at org.elasticsearch.action.bulk.TransportShardBulkAction.performOnReplica(TransportShardBulkAction.java:386) ~[elasticsearch-7.1.1.jar:7.1.1]
        at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnReplica(TransportShardBulkAction.java:373) ~[elasticsearch-7.1.1.jar:7.1.1]
        at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnReplica(TransportShardBulkAction.java:79) ~[elasticsearch-7.1.1.jar:7.1.1]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.onResponse(TransportReplicationAction.java:628) ~[elasticsearch-7.1.1.jar:7.1.1]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.onResponse(TransportReplicationAction.java:588) ~[elasticsearch-7.1.1.jar:7.1.1]
        at org.elasticsearch.index.shard.IndexShard$4.onResponse(IndexShard.java:2733) ~[elasticsearch-7.1.1.jar:7.1.1]
        at org.elasticsearch.index.shard.IndexShard$4.onResponse(IndexShard.java:2711) ~[elasticsearch-7.1.1.jar:7.1.1]
        at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:269) ~[elasticsearch-7.1.1.jar:7.1.1]
        at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:236) ~[elasticsearch-7.1.1.jar:7.1.1]
        at org.elasticsearch.index.shard.IndexShard.lambda$acquireReplicaOperationPermit$18(IndexShard.java:2671) ~[elasticsearch-7.1.1.jar:7.1.1]
        at org.elasticsearch.index.shard.IndexShard.innerAcquireReplicaOperationPermit(IndexShard.java:2778) ~[elasticsearch-7.1.1.jar:7.1.1]
        at org.elasticsearch.index.shard.IndexShard.acquireReplicaOperationPermit(IndexShard.java:2670) ~[elasticsearch-7.1.1.jar:7.1.1]
        at org.elasticsearch.action.support.replication.TransportReplicationAction.acquireReplicaOperationPermit(TransportReplicationAction.java:992) ~[elasticsearch-7.1.1.jar:7.1.1]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.doRun(TransportReplicationAction.java:698) ~[elasticsearch-7.1.1.jar:7.1.1]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.1.1.jar:7.1.1]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:571) ~[elasticsearch-7.1.1.jar:7.1.1]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:556) ~[elasticsearch-7.1.1.jar:7.1.1]
        at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:251) ~[?:?]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.1.1.jar:7.1.1]
        at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:192) ~[elasticsearch-7.1.1.jar:7.1.1]
        at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.lambda$messageReceived$0(SecurityServerTransportInterceptor.java:300) ~[?:?]
        at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:61) ~[elasticsearch-7.1.1.jar:7.1.1]
        at org.elasticsearch.xpack.security.authz.AuthorizationService.lambda$runRequestInterceptors$15(AuthorizationService.java:341) ~[?:?]
        at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:61) ~[elasticsearch-7.1.1.jar:7.1.1]
        at org.elasticsearch.common.util.concurrent.ListenableFuture$1.run(ListenableFuture.java:99) ~[elasticsearch-7.1.1.jar:7.1.1]
        at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:192) ~[elasticsearch-7.1.1.jar:7.1.1]
        at org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:92) ~[elasticsearch-7.1.1.jar:7.1.1]
        .....

There is also a "Caused by" section; the full trace is too long to include in this post:


Caused by: java.lang.ArrayIndexOutOfBoundsException: 67
        at org.apache.lucene.codecs.lucene50.ForUtil.readBlock(ForUtil.java:196) ~[lucene-core-8.0.0.jar:8.0.0 2ae4746365c1ee72a0047ced7610b2096e438979 - jimczi - 2019-03-08 11:58:55]
        at org.apache.lucene.codecs.lucene50.Lucene50PostingsReader$BlockPostingsEnum.refillDocs(Lucene50PostingsReader.java:615) ~[lucene-core-8.0.0.jar:8.0.0 2ae4746365c1ee72a0047ced7610b2096e438979 - jimczi - 2019-03-08 11:58:55]
        at org.apache.lucene.codecs.lucene50.Lucene50PostingsReader$BlockPostingsEnum.nextDoc(Lucene50PostingsReader.java:661) ~[lucene-core-8.0.0.jar:8.0.0 2ae4746365c1ee72a0047ced7610b2096e438979 - jimczi - 2019-03-08 11:58:55]
        at org.apache.lucene.index.MappingMultiPostingsEnum$MappingPostingsSub.nextDoc(MappingMultiPostingsEnum.java:51) ~[lucene-core-8.0.0.jar:8.0.0 2ae4746365c1ee72a0047ced7610b2096e438979 - jimczi - 2019-03-08 11:58:55]
        at org.apache.lucene.index.DocIDMerger$SequentialDocIDMerger.next(DocIDMerger.java:99) ~[lucene-core-8.0.0.jar:8.0.0 2ae4746365c1ee72a0047ced7610b2096e438979 - jimczi - 2019-03-08 11:58:55]
        at org.apache.lucene.index.MappingMultiPostingsEnum.nextDoc(MappingMultiPostingsEnum.java:103) ~[lucene-core-8.0.0.jar:8.0.0 2ae4746365c1ee72a0047ced7610b2096e438979 - jimczi - 2019-03-08 11:58:55]
        at org.apache.lucene.codecs.PushPostingsWriterBase.writeTerm(PushPostingsWriterBase.java:135) ~[lucene-core-8.0.0.jar:8.0.0 2ae4746365c1ee72a0047ced7610b2096e438979 - jimczi - 2019-03-08 11:58:55]
        at org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.write(BlockTreeTermsWriter.java:865) ~[lucene-core-8.0.0.jar:8.0.0 2ae4746365c1ee72a0047ced7610b2096e438979 - jimczi - 2019-03-08 11:58:55]
        at org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter.write(BlockTreeTermsWriter.java:344) ~[lucene-core-8.0.0.jar:8.0.0 2ae4746365c1ee72a0047ced7610b2096e438979 - jimczi - 2019-03-08 11:58:55]
        at org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:105) ~[lucene-core-8.0.0.jar:8.0.0 2ae4746365c1ee72a0047ced7610b2096e438979 - jimczi - 2019-03-08 11:58:55]
        at org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.merge(PerFieldPostingsFormat.java:169) ~[lucene-core-8.0.0.jar:8.0.0 2ae4746365c1ee72a0047ced7610b2096e438979 - jimczi - 2019-03-08 11:58:55]
        at org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:244) ~[lucene-core-8.0.0.jar:8.0.0 2ae4746365c1ee72a0047ced7610b2096e438979 - jimczi - 2019-03-08 11:58:55]
        at org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:139) ~[lucene-core-8.0.0.jar:8.0.0 2ae4746365c1ee72a0047ced7610b2096e438979 - jimczi - 2019-03-08 11:58:55]
        at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4459) ~[lucene-core-8.0.0.jar:8.0.0 2ae4746365c1ee72a0047ced7610b2096e438979 - jimczi - 2019-03-08 11:58:55]
        at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4054) ~[lucene-core-8.0.0.jar:8.0.0 2ae4746365c1ee72a0047ced7610b2096e438979 - jimczi - 2019-03-08 11:58:55]
        at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:625) ~[lucene-core-8.0.0.jar:8.0.0 2ae4746365c1ee72a0047ced7610b2096e438979 - jimczi - 2019-03-08 11:58:55]
        at org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge(ElasticsearchConcurrentMergeScheduler.java:101) ~[elasticsearch-7.1.1.jar:7.1.1]
        at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:662) ~[lucene-core-8.0.0.jar:8.0.0 2ae4746365c1ee72a0047ced7610b2096e438979 - jimczi - 2019-03-08 11:58:55]
[2019-08-07T14:52:05,800][INFO ][o.e.c.r.a.AllocationService] [ES01] Cluster health status changed from [GREEN] to [YELLOW] (reason: [shards failed [[ktest-service-2019.08.07][24]] ...]).





This looks like index corruption to me - there's a similar Lucene issue at https://issues.apache.org/jira/browse/LUCENE-8252. Can you run CheckIndex on the problematic shard? See https://www.elastic.co/blog/found-dive-into-elasticsearch-storage#fixing-problematic-shards for details on how to run it.
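For reference, a rough sketch of how CheckIndex can be invoked against a shard's on-disk Lucene directory. The data path, index UUID, shard number, and Elasticsearch install location below are assumptions and will differ per installation; the shard must not be in use while CheckIndex runs.

```shell
# Hypothetical shard directory; adjust the data path, <index-uuid>,
# and shard number (24 here) to match your cluster's layout.
SHARD_DIR=/var/lib/elasticsearch/nodes/0/indices/<index-uuid>/24/index

# Run Lucene's CheckIndex using the Lucene jars shipped with Elasticsearch.
# Run WITHOUT -exorcise first: -exorcise rewrites the index to drop corrupt
# segments and permanently loses the documents they contained.
java -cp "/usr/share/elasticsearch/lib/*" \
  org.apache.lucene.index.CheckIndex "$SHARD_DIR"
```

If CheckIndex reports broken segments, the safer recovery path is usually to restore the shard from a replica or snapshot rather than exorcising it.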

Thank you very much, @AlanWoodward.

  1. After the error, the cluster automatically rerouted the shard (taking about 5 minutes for 10 GB of data), and the problematic shard no longer exists on the original node, so I cannot run CheckIndex on it.

  2. The Lucene issue you linked lists the affected version as 6.6.2,
    but my cluster reports:

{
  "name" : "ESV09",
  "cluster_name" : "log-application",
  "cluster_uuid" : "i_y2nBUxRrWLydBgAlzilg",
  "version" : {
    "number" : "7.1.1",
    "build_flavor" : "default",
    "build_type" : "tar",
    "build_hash" : "7a013de",
    "build_date" : "2019-05-23T14:04:00.380842Z",
    "build_snapshot" : false,
    "lucene_version" : "8.0.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}


so the lucene_version is 8.0.0, not 6.6.2.

I don't think it's due to a bug in Lucene, but rather a hardware or OS/network error writing data. Has it happened again? If it's a one-off then this is just elasticsearch detecting a problem with the data and repairing it, but if it's recurrent then we'll need to dig deeper.
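One way to watch for a recurrence is to periodically list shard copies that are not in the STARTED state. This is a sketch assuming the default localhost:9200 endpoint and the 7.x _cat/shards API; adjust the host and add credentials if security is enabled.

```shell
# Show any shard copies that are unassigned, initializing, or relocating.
# Column names (index, shard, prirep, state, node) follow the _cat/shards API.
curl -s 'localhost:9200/_cat/shards?h=index,shard,prirep,state,node' \
  | grep -v STARTED
```

An empty result means every shard copy is started; repeated hits on the same node would point toward failing hardware there.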


Thank you so much. I have only seen this once since the cluster started, so it is probably a hardware error, as you said. I will keep watching for it.
Thank you one more time :grin: