Shard recovery fails after resizing Google Cloud Platform's Persistent Disk

I am running Elasticsearch inside a Kubernetes cluster on Google Cloud Platform. The index is stored on a persistent disk. [Kubernetes Configuration]

Elasticsearch version is 2.0.0.

Recently, the disk filled up and I resized it from 100GB to 200GB using the Google Cloud Console. I stopped the Kubernetes service running ES:

kubectl delete -R -f /path/to/elasticsearch/configs

Then I created a Compute Engine instance and resized the partition following this document.

Then I recreated the ES deployment on Kubernetes and was welcomed by this error:

I  [2017-09-09 14:53:35,843][WARN ][cluster.action.shard     ] [Controller] [messages_week][4] received shard failed for [messages_week][4], node[3cz-CBk5Ro-0MCy0A9cp6A], [P], v[137], s[INITIALIZING], a[id=2gM_hcRETP2Eg2vsAjJkyQ], unassigned_info[[reason=ALLOCATION_FAILED], at[2017-09-09T14:52:56.492Z], details[failed recovery, failure IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to recover from translog]; nested: EngineException[failed to recover from translog]; nested: TranslogCorruptedException[translog corruption while reading from stream]; nested: TranslogCorruptedException[translog stream is corrupted, expected: 0x64332db9, got: 0x74223a30]; ]], indexUUID [vhW2TG5uTHiYPUwNZRJSiQ], message [failed recovery], failure [IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to recover from translog]; nested: EngineException[failed to recover from translog]; nested: TranslogCorruptedException[translog corruption while reading from stream]; nested: TranslogCorruptedException[translog stream is corrupted, expected: 0x64332db9, got: 0x74223a30]; ]
 
I  [messages_week][[messages_week][4]] IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to recover from translog]; nested: EngineException[failed to recover from translog]; nested: TranslogCorruptedException[translog corruption while reading from stream]; nested: TranslogCorruptedException[translog stream is corrupted, expected: 0x64332db9, got: 0x74223a30];
 
I  	at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:258)
 
I  	at org.elasticsearch.index.shard.StoreRecoveryService.access$100(StoreRecoveryService.java:60)
 
I  	at org.elasticsearch.index.shard.StoreRecoveryService$1.run(StoreRecoveryService.java:133)
 
I  	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 
I  	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 
I  	at java.lang.Thread.run(Thread.java:745)
 
I  Caused by: [messages_week][[messages_week][4]] EngineCreationFailureException[failed to recover from translog]; nested: EngineException[failed to recover from translog]; nested: TranslogCorruptedException[translog corruption while reading from stream]; nested: TranslogCorruptedException[translog stream is corrupted, expected: 0x64332db9, got: 0x74223a30];
 
I  	at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:157)
 
I  	at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25)
 
I  	at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:1349)
 
I  	at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:1344)
 
I  	at org.elasticsearch.index.shard.IndexShard.internalPerformTranslogRecovery(IndexShard.java:889)
 
I  	at org.elasticsearch.index.shard.IndexShard.performTranslogRecovery(IndexShard.java:866)
 
I  	at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:249)
 
I  	... 5 more
 
I  Caused by: [messages_week][[messages_week][4]] EngineException[failed to recover from translog]; nested: TranslogCorruptedException[translog corruption while reading from stream]; nested: TranslogCorruptedException[translog stream is corrupted, expected: 0x64332db9, got: 0x74223a30];
 
I  	at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:233)
 
I  	at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:154)
 
I  	... 11 more
 
I  Caused by: TranslogCorruptedException[translog corruption while reading from stream]; nested: TranslogCorruptedException[translog stream is corrupted, expected: 0x64332db9, got: 0x74223a30];
 
I  	at org.elasticsearch.index.translog.Translog.readOperation(Translog.java:1620)
 
I  	at org.elasticsearch.index.translog.TranslogReader.read(TranslogReader.java:132)
 
I  	at org.elasticsearch.index.translog.TranslogReader$ReaderSnapshot.readOperation(TranslogReader.java:299)
 
I  	at org.elasticsearch.index.translog.TranslogReader$ReaderSnapshot.next(TranslogReader.java:290)
 
I  	at org.elasticsearch.index.translog.MultiSnapshot.next(MultiSnapshot.java:70)
 
I  	at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:219)
 
I  	... 12 more
 
I  Caused by: TranslogCorruptedException[translog stream is corrupted, expected: 0x64332db9, got: 0x74223a30]
 
I  	at org.elasticsearch.index.translog.Translog.verifyChecksum(Translog.java:1577)
 
I  	at org.elasticsearch.index.translog.Translog.readOperation(Translog.java:1610)
 
I  	... 17 more

How can I resolve this state?

Thanks in advance.

You should really upgrade.

Did you/can you run a filesystem check?

I ran a filesystem check after resizing and cross-checked it just now. Everything is fine.

You can try deleting the translog files that are being referenced, but you lose whatever data is in them.

Not sure about the root cause though.
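
One hint about the root cause may be in the two checksum values themselves. A minimal Python sketch, assuming the values in the log are 32-bit big-endian integers (which is how Java's readInt would produce them), prints the bytes behind each value. The helper name as_ascii is made up for this example.

```python
import struct

def as_ascii(value: int) -> str:
    """Show a 32-bit value as its four big-endian bytes; non-printable bytes become '.'."""
    raw = struct.pack(">I", value)
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)

expected = 0x64332DB9  # "expected" value copied from the log above
got = 0x74223A30       # "got" value copied from the log above

print(as_ascii(expected))  # d3-.
print(as_ascii(got))       # t":0
```

The "got" value decodes to t":0, which looks like a fragment of JSON rather than a checksum. That would be consistent with the reader landing in the middle of a document body, for example after a partial write when the disk filled up, but this is a guess and not a confirmed diagnosis.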

What exactly would I lose? Will it delete all the rows populated in the index?

No, just things that may not have been written into the index. It's usually a small amount of data, but I can't say exactly what.

So shall I delete indices/*/*/translog/?
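
Whatever gets deleted, taking a copy of the translog directories first keeps the option of putting the files back. A minimal Python sketch, assuming the node is stopped and the data path follows the indices/*/*/translog layout from the question; data_root and backup_root are hypothetical paths you would supply, and the function name is made up for this example.

```python
import shutil
from pathlib import Path

def backup_translogs(data_root: Path, backup_root: Path) -> list:
    """Copy every indices/<index>/<shard>/translog directory under data_root into backup_root."""
    copied = []
    for translog in sorted(data_root.glob("indices/*/*/translog")):
        if not translog.is_dir():
            continue
        # Mirror the relative path so each copy can be matched to its shard later.
        target = backup_root / translog.relative_to(data_root)
        # copytree creates missing parent directories and fails if target already exists.
        shutil.copytree(translog, target)
        copied.append(target)
    return copied
```

Calling backup_translogs(Path(data_root), Path(backup_root)) returns the list of copied directories, so it is easy to confirm that every shard was captured before touching the originals.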

Is it complaining about all of them?
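
To answer that from the logs rather than by eye, the failing shards can be collected with a small script. This is a minimal Python sketch that relies only on the "received shard failed for [index][shard]" wording in the log lines quoted above; the sample line is abbreviated from that log, and the function name is made up for this example.

```python
import re
from collections import defaultdict

# Pattern based on the "received shard failed for [index][shard]" lines quoted in this thread.
SHARD_FAILED = re.compile(r"received shard failed for \[([^\]]+)\]\[(\d+)\]")

def failing_shards(log_text: str) -> dict:
    """Map each index name to the sorted shard numbers that failed with a translog corruption error."""
    found = defaultdict(set)
    for line in log_text.splitlines():
        match = SHARD_FAILED.search(line)
        if match and "TranslogCorruptedException" in line:
            found[match.group(1)].add(int(match.group(2)))
    return {index: sorted(shards) for index, shards in found.items()}

# Abbreviated copy of the first log line in this thread.
sample = (
    "[2017-09-09 14:53:35,843][WARN ][cluster.action.shard     ] [Controller] "
    "[messages_week][4] received shard failed for [messages_week][4], "
    "node[3cz-CBk5Ro-0MCy0A9cp6A], [P], v[137], s[INITIALIZING], "
    "details[failed recovery, failure IndexShardRecoveryException[failed to recovery from gateway]; "
    "nested: TranslogCorruptedException[translog stream is corrupted, "
    "expected: 0x64332db9, got: 0x74223a30]; ]"
)
print(failing_shards(sample))  # {'messages_week': [4]}
```

Running this over the full log output would show whether the corruption is limited to a few shards or affects every translog.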

So I deleted the ones that were throwing errors. After this, there was a "file not found" exception.

So I recreated the .ckp files with touch, and now there is an EOFException for those files:

I  [2017-09-10 05:22:08,522][WARN ][cluster.action.shard     ] [Bora] [import_profiles][4] received shard failed for [import_profiles][4], node[LYLghrxsSMOBydgtD_Xzgg], [P], v[141], s[INITIALIZING], a[id=i0RPT1IhT3yGgPoRKF7mUQ], unassigned_info[[reason=ALLOCATION_FAILED], at[2017-09-10T05:22:08.495Z], details[failed recovery, failure IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: EOFException; ]], indexUUID [mMC5fBsDS9yVan4DpKp5ng], message [failed recovery], failure [IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: EOFException; ]
 
I  [import_profiles][[import_profiles][4]] IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: EOFException;
 
I  	at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:258)
 
I  	at org.elasticsearch.index.shard.StoreRecoveryService.access$100(StoreRecoveryService.java:60)
 
I  	at org.elasticsearch.index.shard.StoreRecoveryService$1.run(StoreRecoveryService.java:133)
 
I  	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 
I  	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 
I  	at java.lang.Thread.run(Thread.java:745)
 
I  Caused by: [import_profiles][[import_profiles][4]] EngineCreationFailureException[failed to create engine]; nested: EOFException;
 
I  	at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:135)
 
I  	at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25)
 
I  	at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:1349)
 
I  	at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:1344)
 
I  	at org.elasticsearch.index.shard.IndexShard.internalPerformTranslogRecovery(IndexShard.java:889)
 
I  	at org.elasticsearch.index.shard.IndexShard.performTranslogRecovery(IndexShard.java:866)
 
I  	at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:249)
 
I  	... 5 more
 
I  Caused by: java.io.EOFException
	at org.apache.lucene.store.InputStreamDataInput.readByte(InputStreamDataInput.java:37)
	at org.apache.lucene.store.DataInput.readInt(DataInput.java:101)
	at org.apache.lucene.store.DataInput.readLong(DataInput.java:157)
	at org.elasticsearch.index.translog.Checkpoint.<init>(Checkpoint.java:53)
	at org.elasticsearch.index.translog.Checkpoint.read(Checkpoint.java:82)
	at org.elasticsearch.index.translog.Translog.<init>(Translog.java:165)
	at org.elasticsearch.index.engine.InternalEngine.openTranslog(InternalEngine.java:188)
	at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:131)
	... 11 more

Should I not have deleted the .ckp files? How can I recover the index from here?
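
For what it's worth, the EOFException is consistent with the touched .ckp files being empty: the stack trace shows Checkpoint.<init> calling readLong, which needs bytes that a zero-length file cannot supply. A minimal Python sketch of that failure mode follows. The record layout used here (a long, an int and a long, big-endian) and the field names are assumptions for illustration; only the leading readLong is confirmed by the trace.

```python
import io
import struct

# ASSUMPTION: a checkpoint holds a long, an int and a long (big-endian, 20 bytes).
# Only the leading long read is confirmed by the stack trace above.
CHECKPOINT = struct.Struct(">qiq")

def read_checkpoint(stream) -> tuple:
    """Read one fixed-size checkpoint record, raising EOFError if the stream is too short."""
    data = stream.read(CHECKPOINT.size)
    if len(data) < CHECKPOINT.size:
        raise EOFError(f"checkpoint needs {CHECKPOINT.size} bytes, found {len(data)}")
    return CHECKPOINT.unpack(data)

# A file created by `touch` is empty, so the very first read already hits end of file.
try:
    read_checkpoint(io.BytesIO(b""))
except EOFError as err:
    print("empty file:", err)

# A well-formed record parses; the field names here are illustrative only.
offset, num_ops, generation = read_checkpoint(io.BytesIO(CHECKPOINT.pack(43, 0, 1)))
print(offset, num_ops, generation)
```

In other words, an empty placeholder file does not stand in for a checkpoint; the reader expects real contents that match the translog file it describes.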

@warkolm: Could you please take a look here again? Thanks in advance.

I really don't know, sorry. This is not something I have seen before or know how to recover from.

Did you take backups?

I'm afraid not. Learning it the hard way.

Thanks for sticking around and helping out :smile:

Would https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-translog.html#corrupt-translog-truncation help at all?
