Snapshots are failing


(maf) #1

Hello,

I'm experimenting with snapshots to S3, but I'm having no luck. The cluster
consists of 8 nodes (i2.2xlarge). The index I'm trying to snapshot is
2.91T, has 16 shards and 1 replica. I should perhaps also mention that this
is running Elasticsearch version 1.1.1.

When I start the snapshot, everything looks good at first, but after a
while shards start failing. In the logs I can find messages like this:
[2015-02-05 07:57:59,029][WARN ][index.merge.scheduler ] [machine_name] [cluster_day][13] failed to merge
java.io.IOException: Map failed
    at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:849)
    at org.apache.lucene.store.MMapDirectory.map(MMapDirectory.java:283)
    at org.apache.lucene.store.MMapDirectory$MMapIndexInput.<init>(MMapDirectory.java:228)
    at org.apache.lucene.store.MMapDirectory.openInput(MMapDirectory.java:195)
    at org.apache.lucene.store.FilterDirectory.openInput(FilterDirectory.java:80)
    at org.elasticsearch.index.store.Store$StoreDirectory.openInput(Store.java:473)
    at org.apache.lucene.codecs.lucene46.Lucene46FieldInfosReader.read(Lucene46FieldInfosReader.java:52)
    at org.apache.lucene.index.SegmentReader.readFieldInfos(SegmentReader.java:215)
    at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:95)
    at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:141)
    at org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4273)
    at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3743)
    at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
    at org.apache.lucene.index.TrackingConcurrentMergeScheduler.doMerge(TrackingConcurrentMergeScheduler.java:107)
    at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
Caused by: java.lang.OutOfMemoryError: Map failed
    at sun.nio.ch.FileChannelImpl.map0(Native Method)
    at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:846)
    ... 14 more

After a while these errors stop, but by then the snapshot has failed:
[2015-02-05 13:26:25,232][WARN ][snapshots ] [machine_name] [[cluster_day][13]] [rf_es_snapshot:cluster_full] failed to create snapshot
org.elasticsearch.index.snapshots.IndexShardSnapshotFailedException: [cluster_day][13] Failed to snapshot
    at org.elasticsearch.index.snapshots.IndexShardSnapshotAndRestoreService.snapshot(IndexShardSnapshotAndRestoreService.java:100)
    at org.elasticsearch.snapshots.SnapshotsService$5.run(SnapshotsService.java:694)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)
Caused by: org.elasticsearch.index.engine.EngineClosedException: [cluster_day][13] CurrentState[CLOSED]
    at org.elasticsearch.index.engine.internal.InternalEngine.ensureOpen(InternalEngine.java:900)
    at org.elasticsearch.index.engine.internal.InternalEngine.flush(InternalEngine.java:746)
    at org.elasticsearch.index.engine.internal.InternalEngine.snapshotIndex(InternalEngine.java:1045)
    at org.elasticsearch.index.shard.service.InternalIndexShard.snapshotIndex(InternalIndexShard.java:618)
    at org.elasticsearch.index.snapshots.IndexShardSnapshotAndRestoreService.snapshot(IndexShardSnapshotAndRestoreService.java:83)
    ... 4 more

There have been a number of other exceptions in between, but most seem to be
related to running out of memory. Should a snapshot really require so much memory?
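
One way to check the impression that most of the exceptions are memory related is to tally the root causes in the node log. The following is only a rough sketch: the function name `tally_root_causes` and the sample lines are illustrative assumptions (the sample mirrors the `Caused by:` lines in the traces above), and it is not part of Elasticsearch.

```python
import re
from collections import Counter

# Matches root-cause lines such as
# "Caused by: java.lang.OutOfMemoryError: Map failed"
# and captures only the exception class name.
CAUSED_BY = re.compile(r"^Caused by: ([\w.$]+)")


def tally_root_causes(lines):
    """Count exception class names that appear on 'Caused by:' lines."""
    counts = Counter()
    for line in lines:
        match = CAUSED_BY.match(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical sample; in practice, pass an open log file instead.
sample = [
    "java.io.IOException: Map failed",
    "Caused by: java.lang.OutOfMemoryError: Map failed",
    "Caused by: org.elasticsearch.index.engine.EngineClosedException:",
    "Caused by: java.lang.OutOfMemoryError: Map failed",
]

for name, count in tally_root_causes(sample).most_common():
    print(count, name)
```

To run it against a real log, something like `tally_root_causes(open(path))` should work, where `path` is wherever the node writes its log on your system.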



(Pratikshya Kuinkel) #2

Hi,

I am encountering the same problem. I am also creating snapshots to an S3 repository. It starts out fine, but after a while it fails with "Failed to snapshot]; nested: EngineClosedException". Did you find a solution for this? Or did it have to do with memory?

