Frequent shard failures

Hi,

We are running a single-node Elasticsearch cluster on Elastic Stack version 7.17.9. The node runs on a machine with SSD storage and 128 GB RAM. When I index data, indexing stops after a while: the Elasticsearch service is still running, but my documents can no longer be indexed and I also cannot see the previously ingested data. When I check the index status in Stack Management it shows as red, and when I look at the Elasticsearch logs I see the following:

[2024-12-27T15:00:00,356][WARN ][o.e.t.ThreadPool         ] [DESKTOP-NHDT04G] failed to run scheduled task [org.elasticsearch.indices.IndexingMemoryController$ShardsIndicesStatusChecker@56e6384c] on thread pool [same]
org.apache.lucene.store.AlreadyClosedException: this IndexWriter is closed
	at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:877) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
	at org.apache.lucene.index.IndexWriter.ensureOpen(IndexWriter.java:891) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
	at org.apache.lucene.index.IndexWriter.getFlushingBytes(IndexWriter.java:781) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
	at org.elasticsearch.index.engine.InternalEngine.getWritingBytes(InternalEngine.java:649) ~[elasticsearch-7.17.9.jar:7.17.9]
	at org.elasticsearch.index.shard.IndexShard.getWritingBytes(IndexShard.java:1296) ~[elasticsearch-7.17.9.jar:7.17.9]
	at org.elasticsearch.indices.IndexingMemoryController.getShardWritingBytes(IndexingMemoryController.java:184) ~[elasticsearch-7.17.9.jar:7.17.9]
	at org.elasticsearch.indices.IndexingMemoryController$ShardsIndicesStatusChecker.runUnlocked(IndexingMemoryController.java:312) ~[elasticsearch-7.17.9.jar:7.17.9]
	at org.elasticsearch.indices.IndexingMemoryController$ShardsIndicesStatusChecker.run(IndexingMemoryController.java:292) ~[elasticsearch-7.17.9.jar:7.17.9]
	at org.elasticsearch.threadpool.Scheduler$ReschedulingRunnable.doRun(Scheduler.java:214) [elasticsearch-7.17.9.jar:7.17.9]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:777) [elasticsearch-7.17.9.jar:7.17.9]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) [elasticsearch-7.17.9.jar:7.17.9]
	at org.elasticsearch.threadpool.ThreadPool$1.run(ThreadPool.java:444) [elasticsearch-7.17.9.jar:7.17.9]
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
	at java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:830) [?:?]
Caused by: java.lang.ArrayIndexOutOfBoundsException: Index -65536 out of bounds for length 66192
	at org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:207) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
	at org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:230) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
	at org.apache.lucene.index.FreqProxTermsWriterPerField.writeProx(FreqProxTermsWriterPerField.java:75) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
	at org.apache.lucene.index.FreqProxTermsWriterPerField.newTerm(FreqProxTermsWriterPerField.java:116) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
	at org.apache.lucene.index.TermsHashPerField.initStreamSlices(TermsHashPerField.java:165) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
	at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:186) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
	at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:974) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
	at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:527) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
	at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:491) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
	at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:208) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
	at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:415) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
	at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1471) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
	at org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1444) ~[lucene-core-8.11.1.jar:8.11.1 0b002b11819df70783e83ef36b42ed1223c14b50 - janhoy - 2021-12-14 13:46:43]
	at org.elasticsearch.index.engine.InternalEngine.addDocs(InternalEngine.java:1310) ~[elasticsearch-7.17.9.jar:7.17.9]
	at org.elasticsearch.index.engine.InternalEngine.indexIntoLucene(InternalEngine.java:1248) ~[elasticsearch-7.17.9.jar:7.17.9]
	at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:1051) ~[elasticsearch-7.17.9.jar:7.17.9]
	at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:1066) ~[elasticsearch-7.17.9.jar:7.17.9]
	at org.elasticsearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:998) ~[elasticsearch-7.17.9.jar:7.17.9]
	at org.elasticsearch.index.shard.IndexShard.applyIndexOperationOnPrimary(IndexShard.java:900) ~[elasticsearch-7.17.9.jar:7.17.9]
	at org.elasticsearch.action.bulk.TransportShardBulkAction.executeBulkItemRequest(TransportShardBulkAction.java:320) ~[elasticsearch-7.17.9.jar:7.17.9]
	at org.elasticsearch.action.bulk.TransportShardBulkAction$2.doRun(TransportShardBulkAction.java:181) ~[elasticsearch-7.17.9.jar:7.17.9]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-7.17.9.jar:7.17.9]
	at org.elasticsearch.action.bulk.TransportShardBulkAction.performOnPrimary(TransportShardBulkAction.java:245) ~[elasticsearch-7.17.9.jar:7.17.9]
	at org.elasticsearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(TransportShardBulkAction.java:134) ~[elasticsearch-7.17.9.jar:7.17.9]
	at org.elasticsearch.action.bulk.TransportShardBulkAction.dispatchedShardOperationOnPrimary(TransportShardBulkAction.java:74) ~[elasticsearch-7.17.9.jar:7.17.9]
	at org.elasticsearch.action.support.replication.TransportWriteAction$1.doRun(TransportWriteAction.java:196) ~[elasticsearch-7.17.9.jar:7.17.9]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:777) ~[elasticsearch-7.17.9.jar:7.17.9]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-7.17.9.jar:7.17.9]

I worked around this problem by running

elasticsearch-shard remove-corrupted-data --index mails --shard-id 0 --truncate-clean-translog -v

This command fixes the problem, but after some 2-3 hours I face the same problem again. How can I find the cause of this issue, and what is causing it? Please help urgently.
I went through Troubleshooting corruption | Elasticsearch Guide [8.17] | Elastic,
but I did not understand how to find the actual cause. What hardware architecture would you recommend for running Elasticsearch so that it does not produce these issues, or do I have to upgrade Elasticsearch (with backward compatibility)?
Please help me, this is a serious problem for me and I have to solve it as soon as possible.

Exactly what type of SSD storage do you have? Is it a local disk mounted on a physical machine?

I would also recommend you upgrade to at least the latest 7.17 release.

Yes, it is a local disk mounted on a physical machine (model name: WD Blue SN5000 4TB). So far a total of 1 TB of data has been indexed.

Does upgrading solve this issue permanently? With the upgrade we also need to make changes in our code, so it will take some time. After that work is done, will it really solve my issue?

You should not need to change any code to upgrade to the latest 7.17 release. I do not know whether any issues related to this have been fixed in newer versions, but upgrading will eliminate that as a potential cause. I suspect that there may be something wrong with your storage, and if that is the case upgrading will naturally not help. Have you checked your disk for issues?
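
If not, a couple of built-in checks could be a starting point (purely as examples; D: is a placeholder for whichever drive holds your Elasticsearch data, and the second line should be run from an elevated PowerShell prompt):

chkdsk D: /scan
Get-PhysicalDisk | Format-Table FriendlyName, MediaType, HealthStatus, OperationalStatus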

Do you have anything else on your server that may be messing with the Elasticsearch data files, like some anti-virus or anything like that?

This behavior is not normal. I would suggest that something else is messing with your files or that you are starting to have a hardware failure.

Please check the system logs for IO-related errors, e.g. /var/log/messages and similar log files. Output from "dmesg -T" might also be helpful, assuming a Linux system of some sort. If this is any kind of Windows, then ... the risk of something else interfering with your data files is just a lot more significant, e.g. some anti-virus/anti-malware software doing something unhelpful.
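
Something along these lines, for example (the grep patterns are just a rough filter; the last line is the rough Windows equivalent, pulling recent errors and warnings from the System event log via PowerShell):

dmesg -T | grep -iE 'error|fail|i/o|ata|nvme'
journalctl -k --since "1 hour ago" | grep -iE 'error|fail|ata|nvme'
Get-WinEvent -FilterHashtable @{LogName='System'; Level=2,3} -MaxEvents 100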

From the info provided it seems your storage is not stable, so now would be a good time to make a snapshot/backup of your indices, e.g. attach another disk or use network storage, unless it's trivial to re-index everything you have indexed already.
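
A minimal sketch of a filesystem snapshot repository plus a first snapshot (repository name, snapshot name and location are placeholders; the location must point at another disk or network share and must also be listed under path.repo in elasticsearch.yml, and you can paste the same requests into Kibana Dev Tools if curl is awkward on Windows):

curl -X PUT "localhost:9200/_snapshot/local_backup" -H 'Content-Type: application/json' -d '{"type":"fs","settings":{"location":"/path/on/another/disk"}}'
curl -X PUT "localhost:9200/_snapshot/local_backup/snapshot-1?wait_for_completion=true"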

Obviously, a single-node Elasticsearch cluster is not ideal in terms of resilience, so I hope this is a test/POC system and not a production/mission-critical environment.

There is only Microsoft Defender on the server. If hardware failure is starting, how do I check which hardware component is causing this issue? How do I check these things? I have tested all the hardware and it passes those tests, and this physical setup is only 2 months old. Is there any other way to check this, or any other possible reason for this issue? Please help me, sir.

Is this Windows machine solely and purely used for running Elasticsearch? Half-joking, but have you considered not running Windows on this server? Linux is ... better.

A "hardware issue" is just the current best suggestion based on the limited information you shared. Share some more details on how the problem presents itself, what you are doing, and your config, and maybe some other theory will appear.
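
For example, the output of something like the following (adjust host and credentials to your setup) would already show which shard is failing and why it ends up unassigned:

curl -s "localhost:9200/_cluster/health?pretty"
curl -s "localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason"
curl -s "localhost:9200/_cluster/allocation/explain?pretty"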

As pointed out, if anything starts messing around with the files on the data partition, this can corrupt them. So check the system logs, the Microsoft Defender logs, and look for anything around storage/files.
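
For example, from an elevated PowerShell prompt (the data path is a placeholder for wherever path.data points; the exclusion is optional, but it rules Defender out as the thing touching the files):

Get-WinEvent -LogName "Microsoft-Windows-Windows Defender/Operational" -MaxEvents 50
Add-MpPreference -ExclusionPath "D:\elasticsearch\data"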

Given what I know now, I'd strongly suggest you check the S.M.A.R.T. data of your storage on a periodic basis. There are many tools for this; as I'm not a Windows fan I cannot recommend a specific one.
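
Purely as examples rather than a recommendation: Windows can report a drive's own reliability counters without extra software, and smartmontools also has a Windows build (the device name may differ on your system).

Get-PhysicalDisk | Get-StorageReliabilityCounter | Format-List
smartctl -a /dev/sda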

If you happen to have another server, maybe consider duplicating your environment and seeing if the problem shows up there too.

And again, now is a really good time to make backups/snapshots!