Found unrecoverable error for ES 6.0

Log:
17/11/28 13:52:52 ERROR TaskContextImpl: Error in TaskCompletionListener
org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: Found unrecoverable error [x.x.x.x:9205] returned Internal Server Error(500) - compound sub-files must have a valid codec header and footer: file is too small (0 bytes) (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/data1/elasticsearch/data/nodes/0/indices/S93P5ab_S42YDhmslWDKAQ/33/index/_9qa_Lucene50_0.doc"))); Bailing out..
at org.elasticsearch.hadoop.rest.RestClient.processBulkResponse(RestClient.java:251)
at org.elasticsearch.hadoop.rest.RestClient.bulk(RestClient.java:203)
at org.elasticsearch.hadoop.rest.RestRepository.tryFlush(RestRepository.java:248)
at org.elasticsearch.hadoop.rest.RestRepository.flush(RestRepository.java:270)
at org.elasticsearch.hadoop.rest.RestRepository.close(RestRepository.java:295)
at org.elasticsearch.hadoop.rest.RestService$PartitionWriter.close(RestService.java:121)
at org.elasticsearch.spark.rdd.EsRDDWriter$$anonfun$write$1.apply(EsRDDWriter.scala:60)
at org.elasticsearch.spark.rdd.EsRDDWriter$$anonfun$write$1.apply(EsRDDWriter.scala:60)
at org.apache.spark.TaskContext$$anon$1.onTaskCompletion(TaskContext.scala:123)
at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:97)
at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:95)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:95)
at org.apache.spark.scheduler.Task.run(Task.scala:112)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
17/11/28 13:52:52 ERROR Executor: Exception in task 5980.0 in stage 1.0 (TID 7319)
org.apache.spark.util.TaskCompletionListenerException: Found unrecoverable error [x.x.x.x:9205] returned Internal Server Error(500) - compound sub-files must have a valid codec header and footer: file is too small (0 bytes) (resource=BufferedChecksumIndexInput(MMapIndexInput(path="/data1/elasticsearch/data/nodes/0/indices/S93P5ab_S42YDhmslWDKAQ/33/index/_9qa_Lucene50_0.doc"))); Bailing out..
at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:105)
at org.apache.spark.scheduler.Task.run(Task.scala:112)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

My cluster has 5 SSD servers running CentOS 7, with 6 Elasticsearch instances on each server; all of those instances are data nodes. In addition, 3 virtual machines are master nodes and another 3 virtual machines are ingest nodes.
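
For reference, the role layout can be verified with the _cat/nodes API (host and port below are placeholders for any node in the cluster):

curl -s 'http://localhost:9200/_cat/nodes?v&h=name,ip,node.role,master,heap.percent'
# node.role shows the m/d/i flags per node, so the 3 masters, 3 ingest VMs and
# the 30 data instances on the 5 SSD servers should all appear in one listing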

Server log:
[2017-11-29T20:37:40,449][WARN ][o.e.c.a.s.ShardStateAction] [host:9300] [auto-index-2017-11-28][3] received shard failed for shard id [[auto-index-2017-11-28][3]], allocation id [midSbxhxTqyKDGwhTPFmng], primary term [0], message [shard failure, reason [refresh failed]], failure [CorruptIndexException[Problem reading index from store(MMapDirectory@/data4/elasticsearch/data/nodes/0/indices/5MSqrobDQ2mSzZzCZGO-5g/3/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@cd8c03a) (resource=store(MMapDirectory@/data4/elasticsearch/data/nodes/0/indices/5MSqrobDQ2mSzZzCZGO-5g/3/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@cd8c03a))]; nested: EOFException[read past EOF: MMapIndexInput(path="/data4/elasticsearch/data/nodes/0/indices/5MSqrobDQ2mSzZzCZGO-5g/3/index/_3iu.cfe")]; ]
org.apache.lucene.index.CorruptIndexException: Problem reading index from store(MMapDirectory@/data4/elasticsearch/data/nodes/0/indices/5MSqrobDQ2mSzZzCZGO-5g/3/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@cd8c03a) (resource=store(MMapDirectory@/data4/elasticsearch/data/nodes/0/indices/5MSqrobDQ2mSzZzCZGO-5g/3/index lockFactory=org.apache.lucene.store.NativeFSLockFactory@cd8c03a))
at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:140) ~[lucene-core-7.0.1.jar:7.0.1 8d6c3889aa543954424d8ac1dbb3f03bf207140b - sarowe - 2017-10-02 14:36:35]
at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:78) ~[lucene-core-7.0.1.jar:7.0.1 8d6c3889aa543954424d8ac1dbb3f03bf207140b - sarowe - 2017-10-02 14:36:35]
at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:208) ~[lucene-core-7.0.1.jar:7.0.1 8d6c3889aa543954424d8ac1dbb3f03bf207140b - sarowe - 2017-10-02 14:36:35]
at org.apache.lucene.index.ReadersAndUpdates.getReadOnlyClone(ReadersAndUpdates.java:258) ~[lucene-core-7.0.1.jar:7.0.1 8d6c3889aa543954424d8ac1dbb3f03bf207140b - sarowe - 2017-10-02 14:36:35]
at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:105) ~[lucene-core-7.0.1.jar:7.0.1 8d6c3889aa543954424d8ac1dbb3f03bf207140b - sarowe - 2017-10-02 14:36:35]
at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:490) ~[lucene-core-7.0.1.jar:7.0.1 8d6c3889aa543954424d8ac1dbb3f03bf207140b - sarowe - 2017-10-02 14:36:35]
at org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:293) ~[lucene-core-7.0.1.jar:7.0.1 8d6c3889aa543954424d8ac1dbb3f03bf207140b - sarowe - 2017-10-02 14:36:35]
at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:268) ~[lucene-core-7.0.1.jar:7.0.1 8d6c3889aa543954424d8ac1dbb3f03bf207140b - sarowe - 2017-10-02 14:36:35]
at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:258) ~[lucene-core-7.0.1.jar:7.0.1 8d6c3889aa543954424d8ac1dbb3f03bf207140b - sarowe - 2017-10-02 14:36:35]
at org.apache.lucene.index.FilterDirectoryReader.doOpenIfChanged(FilterDirectoryReader.java:104) ~[lucene-core-7.0.1.jar:7.0.1 8d6c3889aa543954424d8ac1dbb3f03bf207140b - sarowe - 2017-10-02 14:36:35]
at org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:140) ~[lucene-core-7.0.1.jar:7.0.1 8d6c3889aa543954424d8ac1dbb3f03bf207140b - sarowe - 2017-10-02 14:36:35]
at org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:156) ~[lucene-core-7.0.1.jar:7.0.1 8d6c3889aa543954424d8ac1dbb3f03bf207140b - sarowe - 2017-10-02 14:36:35]
at org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:58) ~[lucene-core-7.0.1.jar:7.0.1 8d6c3889aa543954424d8ac1dbb3f03bf207140b - sarowe - 2017-10-02 14:36:35]
at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:176) ~[lucene-core-7.0.1.jar:7.0.1 8d6c3889aa543954424d8ac1dbb3f03bf207140b - sarowe - 2017-10-02 14:36:35]
at org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:253) ~[lucene-core-7.0.1.jar:7.0.1 8d6c3889aa543954424d8ac1dbb3f03bf207140b - sarowe - 2017-10-02 14:36:35]
at org.elasticsearch.index.engine.InternalEngine.refresh(InternalEngine.java:1207) ~[elasticsearch-6.0.0.jar:6.0.0]
at org.elasticsearch.index.shard.IndexShard.refresh(IndexShard.java:855) ~[elasticsearch-6.0.0.jar:6.0.0]
at org.elasticsearch.index.engine.InternalEngine.refresh(InternalEngine.java:1207) ~[elasticsearch-6.0.0.jar:6.0.0]
at org.elasticsearch.index.shard.IndexShard.refresh(IndexShard.java:855) ~[elasticsearch-6.0.0.jar:6.0.0]
at org.elasticsearch.index.IndexService.maybeRefreshEngine(IndexService.java:695) ~[elasticsearch-6.0.0.jar:6.0.0]
at org.elasticsearch.index.IndexService.access$400(IndexService.java:97) ~[elasticsearch-6.0.0.jar:6.0.0]
at org.elasticsearch.index.IndexService$AsyncRefreshTask.runInternal(IndexService.java:899) ~[elasticsearch-6.0.0.jar:6.0.0]
at org.elasticsearch.index.IndexService$BaseAsyncTask.run(IndexService.java:809) ~[elasticsearch-6.0.0.jar:6.0.0]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) ~[elasticsearch-6.0.0.jar:6.0.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_152]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_152]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_152]
Caused by: java.io.EOFException: read past EOF: MMapIndexInput(path="/data4/elasticsearch/data/nodes/0/indices/5MSqrobDQ2mSzZzCZGO-5g/3/index/_3iu.cfe")

It seems ES has a serious bug.

Can you check the log files of the affected node for any exceptions/corruptions? It looks to me as if some data on there is corrupted, because one of the files is 0 bytes in size, as mentioned at the top.
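
For example, something along these lines (the index path is taken from the stack trace above; the log path assumes the default /var/log/elasticsearch location, adjust to your installation):

# look for corruption-related exceptions in the node's log
grep -iE 'CorruptIndexException|EOFException|checksum' /var/log/elasticsearch/*.log
# check whether the segment file from the error really is 0 bytes on disk
ls -l /data1/elasticsearch/data/nodes/0/indices/S93P5ab_S42YDhmslWDKAQ/33/index/ | grep _9qa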

I found a workaround. When the file system is ext4 or xfs, this issue always happens, but after I changed it to btrfs the issue went away. So I think it is related to the SSDs, the file system, and the OS.

I think the majority of users run on ext4 or xfs instead of btrfs. Are you running any special Linux distribution? Also, is this an NFS volume? Curious about your setup...
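
For example, on each data node (the /dataN paths below are the ones that appear in the logs; adjust to the paths actually in use):

cat /etc/os-release        # exact distribution and release
df -T /data1 /data4        # filesystem type (ext4/xfs/btrfs/nfs) of the data paths
mount | grep -i nfs        # any NFS mounts at all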

The os is "CentOS Linux release 7.2.1511 (Core)". After weekend running, btrfs also has issue. The shards become unassign. They are not NFS volume. No clue for this issue now. My setup consist of 3 master, 2 ingest, 30 data. 30 data are deployed on 5 ssd machines. All roles are running on CentOS7.

Have you tried running on another physical hard disk on the same node and checked whether that keeps happening? I'd slowly start to consider a hardware failure here, unless there is some fancy script that nulls out your files...

Have you checked the dmesg output for anything that might indicate a hardware failure?
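
For example (smartctl requires the smartmontools package; /dev/sdX is a placeholder for the actual SSD device):

dmesg -T | grep -iE 'error|fail|ata|nvme|i/o'
smartctl -a /dev/sdX | grep -iE 'reallocated|pending|uncorrectable|media'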

Yes, one SSD machine has an issue, and that is why one node was disconnected. So btrfs is OK for ES 6.0 after all. I will keep watching it from now on. Thank you!
