Recurring index corruption

Yes, they have one replica each. These are system indices, so they are configured automatically. However, in one case a shard and its replica were both unallocated, most likely because both were corrupted.

What I don't understand is that there seems to be no official reference stating that SMB is not a supported store. I saw your post about NFS (https://discuss.elastic.co/t/why-nfs-is-to-be-avoided-for-data-directories/215240), which clearly states that NFS is not supported, but it makes no mention of SMB. Also, the existence of the store-smb plugin (although experimental, and intended to work around the Windows cache manager being bypassed on writes) suggests that SMB is supported.

From https://livebook.manning.com/concept/lucene/remote-file-system

| Remote file system | Notes |
| --- | --- |
| Samba/CIFS 1.0 | The standard remote file system for Windows computers. Sharing a Lucene index works fine. |
| Samba/CIFS 2.0 | The new version of Samba/CIFS that’s the default for Windows Server 2007 and Windows Vista. Lucene has trouble due to incoherent client-side caches. |
| Networked File System (NFS) | The standard remote file systems for most Unix OSs. Lucene has trouble due to both incoherent client-side caches as well as how NFS handles deletion of files that are held open by another computer. |
| Apple File Protocol (AFP) | Apple’s standard remote file system protocol. Lucene has trouble due to incoherent client-side caches. |

From the same link: "NFS, AFP, and Samba/CIFS 2.0 are known to have intermittent problems when opening or reopening an index due to incoherent client-side caching. The problem only occurs when the writer has just committed changes to an index, and then on another computer a reader or another writer is opened or reopened. Thus you’re more likely to encounter this if you frequently try to reopen your readers and writer and often commit changes to the index. When you do encounter the issue, you’ll see an unexpected FileNotFoundException inside the open or reopen methods. Fortunately, the workaround is quite simple: retry a bit later, because typically the client-side caches will correct themselves after a certain amount of time."
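To illustrate the book's "retry a bit later" workaround, here is a minimal sketch in Python (the helper name and parameters are my own, not part of Lucene or Elasticsearch): it retries an open operation that fails with a not-found error until the client-side cache has had a chance to catch up.

```python
import time


def open_with_retry(open_fn, attempts=5, delay=1.0):
    """Retry an index-open operation that may hit a stale client-side cache.

    Hypothetical helper: calls open_fn() and, if it raises
    FileNotFoundError (the symptom the book describes), waits `delay`
    seconds and tries again, up to `attempts` times in total.
    """
    for attempt in range(attempts):
        try:
            return open_fn()
        except FileNotFoundError:
            if attempt == attempts - 1:
                raise  # give up after the last attempt
            time.sleep(delay)
```

This is only a sketch of the retry pattern; in a real deployment the retries happen around whatever code opens or reopens the index reader/writer.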

The same link mentions file deletion as an issue with NFS, but there is no mention of SMB. Locking may be another cause for concern, but again there is no explicit mention of SMB: "NativeFSLockFactory: This is the default locking for FSDirectory, using java.nio native OS locking, which will never leave leftover lock files when the JVM exits. But this locking implementation may not work correctly over certain shared file systems, notably NFS." On top of that, I cannot see any lock-related exception in the console logs.

Additionally, it looks like locks are managed using the suggested workaround: "Note that none of these locking implementations are “fair.” For example, if a lock is already held by an existing writer, the new writer will simply retry, every one second by default, to obtain the lock."

Also, the corruption is limited to these two indices (.kibana_8.4.1_001 and .kibana_task_manager_8.4.1_001). We pushed 50+ million log traces a day into the log indices of the ELK cluster without any issue (and we repeated this test several times). However, the log indices are append-only, whereas the two corrupted indices seem to be updated in place.