Shard recovery fails after resizing Google Cloud Platform's Persistent Disk


(Pratyush Singh) #1

I am running Elasticsearch inside a Kubernetes cluster on Google Cloud Platform. The index is stored on a persistent disk. [Kubernetes Configuration]

Elasticsearch version is 2.0.0.

Recently the disk filled up, so I resized it from 100 GB to 200 GB using the Google Cloud Console. I stopped the Kubernetes service running ES -

kubectl delete -R -f /path/to/elasticsearch/configs

Then I created a Compute Engine instance and resized the partition following this document.

Then I recreated the ES deployment on Kubernetes and was greeted by this error -

I  [2017-09-09 14:53:35,843][WARN ][cluster.action.shard     ] [Controller] [messages_week][4] received shard failed for [messages_week][4], node[3cz-CBk5Ro-0MCy0A9cp6A], [P], v[137], s[INITIALIZING], a[id=2gM_hcRETP2Eg2vsAjJkyQ], unassigned_info[[reason=ALLOCATION_FAILED], at[2017-09-09T14:52:56.492Z], details[failed recovery, failure IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to recover from translog]; nested: EngineException[failed to recover from translog]; nested: TranslogCorruptedException[translog corruption while reading from stream]; nested: TranslogCorruptedException[translog stream is corrupted, expected: 0x64332db9, got: 0x74223a30]; ]], indexUUID [vhW2TG5uTHiYPUwNZRJSiQ], message [failed recovery], failure [IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to recover from translog]; nested: EngineException[failed to recover from translog]; nested: TranslogCorruptedException[translog corruption while reading from stream]; nested: TranslogCorruptedException[translog stream is corrupted, expected: 0x64332db9, got: 0x74223a30]; ]
I  [messages_week][[messages_week][4]] IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to recover from translog]; nested: EngineException[failed to recover from translog]; nested: TranslogCorruptedException[translog corruption while reading from stream]; nested: TranslogCorruptedException[translog stream is corrupted, expected: 0x64332db9, got: 0x74223a30];
I  	at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:258)
I  	at org.elasticsearch.index.shard.StoreRecoveryService.access$100(StoreRecoveryService.java:60)
I  	at org.elasticsearch.index.shard.StoreRecoveryService$1.run(StoreRecoveryService.java:133)
I  	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
I  	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
I  	at java.lang.Thread.run(Thread.java:745)
I  Caused by: [messages_week][[messages_week][4]] EngineCreationFailureException[failed to recover from translog]; nested: EngineException[failed to recover from translog]; nested: TranslogCorruptedException[translog corruption while reading from stream]; nested: TranslogCorruptedException[translog stream is corrupted, expected: 0x64332db9, got: 0x74223a30];
I  	at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:157)
I  	at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25)
I  	at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:1349)
I  	at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:1344)
I  	at org.elasticsearch.index.shard.IndexShard.internalPerformTranslogRecovery(IndexShard.java:889)
I  	at org.elasticsearch.index.shard.IndexShard.performTranslogRecovery(IndexShard.java:866)
I  	at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:249)
I  	... 5 more
I  Caused by: [messages_week][[messages_week][4]] EngineException[failed to recover from translog]; nested: TranslogCorruptedException[translog corruption while reading from stream]; nested: TranslogCorruptedException[translog stream is corrupted, expected: 0x64332db9, got: 0x74223a30];
I  	at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:233)
I  	at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:154)
I  	... 11 more
I  Caused by: TranslogCorruptedException[translog corruption while reading from stream]; nested: TranslogCorruptedException[translog stream is corrupted, expected: 0x64332db9, got: 0x74223a30];
I  	at org.elasticsearch.index.translog.Translog.readOperation(Translog.java:1620)
I  	at org.elasticsearch.index.translog.TranslogReader.read(TranslogReader.java:132)
I  	at org.elasticsearch.index.translog.TranslogReader$ReaderSnapshot.readOperation(TranslogReader.java:299)
I  	at org.elasticsearch.index.translog.TranslogReader$ReaderSnapshot.next(TranslogReader.java:290)
I  	at org.elasticsearch.index.translog.MultiSnapshot.next(MultiSnapshot.java:70)
I  	at org.elasticsearch.index.engine.InternalEngine.recoverFromTranslog(InternalEngine.java:219)
I  	... 12 more
I  Caused by: TranslogCorruptedException[translog stream is corrupted, expected: 0x64332db9, got: 0x74223a30]
I  	at org.elasticsearch.index.translog.Translog.verifyChecksum(Translog.java:1577)
I  	at org.elasticsearch.index.translog.Translog.readOperation(Translog.java:1610)
I  	... 17 more

How can I resolve this state?

Thanks in advance.


(Mark Walkom) #2

You should really upgrade.

Did you/can you run a filesystem check?


(Pratyush Singh) #3

I ran a filesystem check after resizing and double-checked it just now. Everything is fine.


(Mark Walkom) #4

You can try deleting the translog files that are being referenced, but you lose whatever data is in them.

Not sure about the root cause though.
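
One small observation on the checksum line in the log, offered as a hint rather than a diagnosis: if the two 32-bit values are unpacked into bytes, the value that was read ("got") turns out to be four printable ASCII characters. The sketch below assumes big-endian byte order, which is how Java-style integer reads normally work; the helper name `describe_checksum` is made up for illustration.

```python
import struct


def describe_checksum(value: int) -> str:
    """Show a 32-bit value as ASCII if all four bytes are printable, else as hex."""
    raw = struct.pack(">I", value)  # assumption: big-endian, as in Java integer reads
    if all(32 <= b < 127 for b in raw):
        return raw.decode("ascii")
    return raw.hex()


expected = 0x64332DB9  # "expected: 0x64332db9" in the log
got = 0x74223A30       # "got: 0x74223a30" in the log

print(describe_checksum(expected))  # 64332db9
print(describe_checksum(got))       # t":0
```

The "got" bytes look like a fragment of a JSON document, which may mean the reader picked up document bytes at the position where it expected a checksum. That would be consistent with a truncated or misaligned translog file, but on its own it does not identify what damaged the file.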


(Pratyush Singh) #5

What exactly would I lose? Will it delete all the rows populated in the index?


(Mark Walkom) #6

No, just things that may not have been written into the index. It's usually a small amount of data, but I can't say exactly what.
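
Since the operations in a translog are gone for good once the files are deleted, one cautious variant is to move the translog directory aside instead, with the node stopped. A minimal sketch follows; the function name and the backup layout are made up for illustration, and the shard path in the usage comment assumes the usual 2.x data layout (`<path.data>/<cluster name>/nodes/0/indices/<index>/<shard>`), which should be checked against the actual volume. This only keeps the step reversible; it does not by itself make the shard recover.

```python
import shutil
import time
from pathlib import Path


def set_aside_translog(shard_dir: Path, backup_root: Path) -> Path:
    """Move <shard_dir>/translog under backup_root and return its new location."""
    translog = shard_dir / "translog"
    if not translog.is_dir():
        raise FileNotFoundError(f"no translog directory in {shard_dir}")
    # Keep the index name and shard number in the backup path so the
    # directory can be put back in the right place later.
    target = (
        backup_root
        / shard_dir.parent.name
        / shard_dir.name
        / f"translog-{int(time.time())}"
    )
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(translog), str(target))
    return target


# Hypothetical usage (paths are assumptions, not taken from this thread):
# set_aside_translog(
#     Path("/data/mycluster/nodes/0/indices/messages_week/4"),
#     Path("/data/translog-backup"),
# )
```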


(Pratyush Singh) #7

So shall I delete indices/*/*/translog/?


(Mark Walkom) #8

Is it complaining about all of them?
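
To answer that without reading the log by eye, the "received shard failed" warnings can be tallied per index and shard. This is a rough sketch that assumes the WARN line format shown in this thread; the regular expression and the function name are illustrative, and the sample lines are abbreviated copies of the two warnings posted here.

```python
import re
from collections import Counter

# Matches the "[index][shard] received shard failed" part of the WARN line.
SHARD_FAILED = re.compile(r"\[([^\]\[]+)\]\[(\d+)\] received shard failed")


def failed_shards(log_lines):
    """Count 'received shard failed' warnings per (index, shard) pair."""
    counts = Counter()
    for line in log_lines:
        match = SHARD_FAILED.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


# Abbreviated from the warnings in this thread (everything after "node[" cut).
sample = [
    "[2017-09-09 14:53:35,843][WARN ][cluster.action.shard     ] [Controller] "
    "[messages_week][4] received shard failed for [messages_week][4], node[...]",
    "[2017-09-10 05:22:08,522][WARN ][cluster.action.shard     ] [Bora] "
    "[import_profiles][4] received shard failed for [import_profiles][4], node[...]",
]

for (index, shard), count in sorted(failed_shards(sample).items()):
    print(index, shard, count)
```

Only the shards that show up in this tally would be candidates for any translog surgery; the rest are best left untouched.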


(Pratyush Singh) #9

I deleted the ones that were throwing errors. After that, there was a "file not found" exception.

So I created empty .ckp files with touch, and now there is an EOFException for those files -

I  [2017-09-10 05:22:08,522][WARN ][cluster.action.shard     ] [Bora] [import_profiles][4] received shard failed for [import_profiles][4], node[LYLghrxsSMOBydgtD_Xzgg], [P], v[141], s[INITIALIZING], a[id=i0RPT1IhT3yGgPoRKF7mUQ], unassigned_info[[reason=ALLOCATION_FAILED], at[2017-09-10T05:22:08.495Z], details[failed recovery, failure IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: EOFException; ]], indexUUID [mMC5fBsDS9yVan4DpKp5ng], message [failed recovery], failure [IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: EOFException; ]
I  [import_profiles][[import_profiles][4]] IndexShardRecoveryException[failed to recovery from gateway]; nested: EngineCreationFailureException[failed to create engine]; nested: EOFException;
I  	at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:258)
I  	at org.elasticsearch.index.shard.StoreRecoveryService.access$100(StoreRecoveryService.java:60)
I  	at org.elasticsearch.index.shard.StoreRecoveryService$1.run(StoreRecoveryService.java:133)
I  	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
I  	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
I  	at java.lang.Thread.run(Thread.java:745)
I  Caused by: [import_profiles][[import_profiles][4]] EngineCreationFailureException[failed to create engine]; nested: EOFException;
I  	at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:135)
I  	at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25)
I  	at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:1349)
I  	at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:1344)
I  	at org.elasticsearch.index.shard.IndexShard.internalPerformTranslogRecovery(IndexShard.java:889)
I  	at org.elasticsearch.index.shard.IndexShard.performTranslogRecovery(IndexShard.java:866)
I  	at org.elasticsearch.index.shard.StoreRecoveryService.recoverFromStore(StoreRecoveryService.java:249)
I  	... 5 more
I  Caused by: java.io.EOFException
	at org.apache.lucene.store.InputStreamDataInput.readByte(InputStreamDataInput.java:37)
	at org.apache.lucene.store.DataInput.readInt(DataInput.java:101)
	at org.apache.lucene.store.DataInput.readLong(DataInput.java:157)
	at org.elasticsearch.index.translog.Checkpoint.<init>(Checkpoint.java:53)
	at org.elasticsearch.index.translog.Checkpoint.read(Checkpoint.java:82)
	at org.elasticsearch.index.translog.Translog.<init>(Translog.java:165)
	at org.elasticsearch.index.engine.InternalEngine.openTranslog(InternalEngine.java:188)
	at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:131)
	... 11 more

Should I not have deleted the .ckp files? How can I recover the index from here now?
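
The stack trace above gives a hint about why the touched files fail: `Checkpoint.<init>` starts by calling `readLong`, and that read hits end-of-file on the very first byte. A file created with touch has zero length, so there is nothing to read. The sketch below is a Python analogue of that one read, not Elasticsearch code; the helper name `read_long` is made up, and the full layout of a .ckp file is not modelled here.

```python
import io
import struct


def read_long(stream) -> int:
    """Read one big-endian 64-bit integer, failing loudly on a short read."""
    data = stream.read(8)
    if len(data) < 8:
        raise EOFError(f"needed 8 bytes, got {len(data)}")
    return struct.unpack(">q", data)[0]


empty_ckp = io.BytesIO(b"")  # stands in for a zero-length file made by touch
try:
    read_long(empty_ckp)
except EOFError as err:
    print("empty checkpoint:", err)  # empty checkpoint: needed 8 bytes, got 0

# With eight bytes available the same read succeeds.
print(read_long(io.BytesIO(struct.pack(">q", 1234))))  # 1234
```

So an empty placeholder cannot stand in for a checkpoint; the file would need valid contents matching the translog it describes.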


(Pratyush Singh) #10

@warkolm: Could you please take a look here again? Thanks in advance.


(Mark Walkom) #11

I really don't know, sorry. This is not anything I have seen or know how to recover from.

Did you take backups?


(Pratyush Singh) #12

I'm afraid not. Learning it the hard way.

Thanks for sticking around and helping out :smile:


(Mark Walkom) #13

Would https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-translog.html#corrupt-translog-truncation help at all?


(system) #14

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.