Snapshot failed with partial state

Hi all!

We have been using Elasticsearch for a long time now, and we are getting an error during a snapshot.

These snapshots store metrics (1 index per day / 12 shards), so they are pretty big: ~180 GB / ~900 million documents.

Until now, we did not snapshot this data, but now we have to.

The snapshot takes a long time to complete, but we get a "PARTIAL" status at the end. If I don't set "partial" to "true" in the curator config file, the snapshot fails because some shards supposedly don't have a primary (which is wrong: the cluster is green, and there are no yellow/red shards in monitoring).
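
For reference, the option in question is the partial flag on curator's snapshot action; our action file looks roughly like this (the name pattern and filter are illustrative, not our exact config):

actions:
  1:
    action: snapshot
    description: "Snapshot the daily metrics indices (illustrative)"
    options:
      repository: repo_es01-dev
      name: snapshot-poc-metrics-%Y.%m.%d
      include_global_state: True
      partial: True              # without this, the run aborts when a shard is reported as not having a primary
      wait_for_completion: True
    filters:
    - filtertype: pattern
      kind: prefix
      value: metrics-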

Here is the (partial) status of the snapshot (the errors repeat multiple times):
{
"snapshots": [
{
"snapshot": "snapshot-poc-metrics-2018.12.10",
"uuid": "PGOWeVx5TeKWxcWt5ACayA",
"version_id": 6030099,
"version": "6.3.0",
"indices": [
"metrics-2018.11.30",
"metrics-2018.08.08",
.... (for a total of 140 indices)
],
"include_global_state": true,
"state": "PARTIAL",
"start_time": "2018-12-10T16:09:10.421Z",
"start_time_in_millis": 1544458150421,
"end_time": "2018-12-11T06:07:06.603Z",
"end_time_in_millis": 1544508426603,
"duration_in_millis": 50276182,
"failures": [
{
"index": "metrics-2018.11.30",
"index_uuid": "metrics-2018.11.30",
"shard_id": 6,
"reason": "IndexShardSnapshotFailedException[Failed to snapshot]; nested: AlreadyClosedException[engine is closed]; ",
"node_id": "NSuWs702Q3Wr0OqzygzGZQ",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "metrics-2018.11.19",
"index_uuid": "metrics-2018.11.19",
"shard_id": 6,
"reason": "IndexShardSnapshotFailedException[Aborted]",
"node_id": "NSuWs702Q3Wr0OqzygzGZQ",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "metrics-2018.09.16",
"index_uuid": "metrics-2018.09.16",
"shard_id": 4,
"reason": "IndexShardSnapshotFailedException[java.lang.IllegalStateException: Unable to move the shard snapshot status to [FINALIZE]: expecting [STARTED] but got [ABORTED]]; nested: IllegalStateException[Unable to move the shard snapshot status to [FINALIZE]: expecting [STARTED] but got [ABORTED]]; ",
"node_id": "ApCInEgCQ1aSmwYEr39k_w",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "metrics-2018.11.10",
"index_uuid": "metrics-2018.11.10",
"shard_id": 0,
"reason": "IndexShardSnapshotFailedException[Aborted]",
"node_id": "NSuWs702Q3Wr0OqzygzGZQ",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "metrics-2018.08.24",
"index_uuid": "metrics-2018.08.24",
"shard_id": 0,
"reason": "IndexShardSnapshotFailedException[Failed to snapshot]; nested: AlreadyClosedException[engine is closed]; ",
"node_id": "NSuWs702Q3Wr0OqzygzGZQ",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "metrics-2018.11.20",
"index_uuid": "metrics-2018.11.20",
"shard_id": 5,
"reason":"IndexShardSnapshotFailedException[org.apache.lucene.store.AlreadyClosedException: store is already closed can't increment refCount current count [0]]; nested: AlreadyClosedException[store is already closed can't increment refCount current count [0]]; ",
"node_id": "NSuWs702Q3Wr0OqzygzGZQ",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "metrics-2018.10.12",
"index_uuid": "metrics-2018.10.12",
"shard_id": 0,
"reason": "IndexShardSnapshotFailedException[Failed to snapshot]; nested: AlreadyClosedException[engine is closed]; ",
"node_id": "NSuWs702Q3Wr0OqzygzGZQ",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "metrics-2018.11.29",
"index_uuid": "metrics-2018.11.29",
"shard_id": 3,
"reason":"IndexShardSnapshotFailedException[org.apache.lucene.store.AlreadyClosedException: store is already closed can't increment refCount current count [0]]; nested: AlreadyClosedException[store is already closed can't increment refCount current count [0]]; ",
"node_id": "NSuWs702Q3Wr0OqzygzGZQ",
"status": "INTERNAL_SERVER_ERROR"
},
.....
],
"shards": {
"total": 70,
"failed": 65,
"successful": 5
}
}
]
}
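
(For reference, the output above is what the snapshot API returns; roughly these calls, with the repository name left as a placeholder:)

# overall state and per-shard failures (the output shown above):
GET _snapshot/<repository>/snapshot-poc-metrics-2018.12.10
# more detailed per-shard progress, useful while a snapshot is still running:
GET _snapshot/<repository>/snapshot-poc-metrics-2018.12.10/_status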

So, a few different errors, and I can't find anything about "engine is closed" on the web...

For information, we have other snapshots running every day on other indices, and we never get errors on them.

I can't figure out why it happens on these indices, or find a way to run it successfully.

More information:
Elasticsearch / Kibana / Logstash: 6.3.0
curator (via pip elasticsearch-curator): 5.5.4
System: CentOS 7
Snapshot repository: an NFS mountpoint, with compress set to true (see the sketch below).
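
For reference, the repository was registered more or less like this (the path is a placeholder for our NFS mountpoint, which is also listed in path.repo on every node):

PUT _snapshot/repo_es01-dev
{
  "type": "fs",
  "settings": {
    "location": "/mnt/nfs/elasticsearch-snapshots",
    "compress": true
  }
}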

I am trying another snapshot to see if it fixes itself, but it is still running.

Has anybody already run into these errors? And found a solution?

Thanks a lot for your help!
Mouglou

Sorry about the formatting of my snapshot output; I tried the two options, using the </> button or // at the beginning of the line, but it doesn't seem to be working... I will try again.

engine is closed suggests the primary shard performing the snapshot failed for some reason, and if so there should be a lot more detail in the server logs. Could you share any messages and stack traces from the logs from around this time?

Hi @DavidTurner!

Thanks for the feedback. I have an update too.

I ran a second snapshot to follow up on this, and it completed successfully. No errors.

But the first one I tried to run 3 times, and I got the same errors each time. So it seems like the first snapshot of this data is the tricky one.

I found these errors in the Elasticsearch log:

[2018-12-11T05:59:41,404][WARN ][o.e.s.SnapshotShardsService] [ord-es01-001-es01-dev][[metrics-2018.11.14][10]][repo_es01-dev:snapshot-poc-metrics-2018.12.10/PGOWeVx5TeKWxcWt5ACayA] failed to snapshot shard
org.elasticsearch.index.snapshots.IndexShardSnapshotFailedException: Failed to snapshot
    at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:419) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.snapshots.SnapshotShardsService.access$200(SnapshotShardsService.java:97) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.snapshots.SnapshotShardsService$1.doRun(SnapshotShardsService.java:354) [elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:724) [elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.3.0.jar:6.3.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_161]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_161]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]
Caused by: org.apache.lucene.store.AlreadyClosedException: engine is closed
    at org.elasticsearch.index.shard.IndexShard.getEngine(IndexShard.java:1980) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.index.shard.IndexShard.acquireLastIndexCommit(IndexShard.java:1094) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:409) ~[elasticsearch-6.3.0.jar:6.3.0]
    ... 7 more

[2018-12-11T05:59:41,415][WARN ][o.e.s.SnapshotShardsService] [ord-es01-001-es01-dev] [[metrics-2018.11.20][5]][repo_es01-dev:snapshot-poc-metrics-2018.12.10/PGOWeVx5TeKWxcWt5ACayA] failed to snapshot shard
org.elasticsearch.index.snapshots.IndexShardSnapshotFailedException: org.apache.lucene.store.AlreadyClosedException: store is already closed can't increment refCount current count [0]
    at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:819) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:410) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.snapshots.SnapshotShardsService.access$200(SnapshotShardsService.java:97) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.snapshots.SnapshotShardsService$1.doRun(SnapshotShardsService.java:354) [elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:724) [elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.3.0.jar:6.3.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_161]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_161]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]
Caused by: org.apache.lucene.store.AlreadyClosedException: store is already closed can't increment refCount current count [0]
    at org.elasticsearch.common.util.concurrent.AbstractRefCounted.alreadyClosed(AbstractRefCounted.java:71) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.common.util.concurrent.AbstractRefCounted.incRef(AbstractRefCounted.java:42) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.index.store.Store.incRef(Store.java:398) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshot(BlobStoreRepository.java:1158) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:813) ~[elasticsearch-6.3.0.jar:6.3.0]
    ... 8 more

[2018-12-11T05:59:41,620][WARN ][o.e.s.SnapshotShardsService] [ord-es01-001-es01-dev] [[metrics-2018.10.29][0]][repo_es01-dev:snapshot-poc-metrics-2018.12.10/PGOWeVx5TeKWxcWt5ACayA] failed to snapshot shard
org.elasticsearch.index.snapshots.IndexShardSnapshotFailedException: Aborted
    at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext$AbortableInputStream.checkAborted(BlobStoreRepository.java:1354) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext$AbortableInputStream.read(BlobStoreRepository.java:1347) ~[elasticsearch-6.3.0.jar:6.3.0]
    at java.io.FilterInputStream.read(FilterInputStream.java:107) ~[?:1.8.0_161]
    at org.elasticsearch.core.internal.io.Streams.copy(Streams.java:54) ~[elasticsearch-core-6.3.0.jar:6.3.0]
    at org.elasticsearch.common.blobstore.fs.FsBlobContainer.writeBlob(FsBlobContainer.java:131) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshotFile(BlobStoreRepository.java:1281) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshot(BlobStoreRepository.java:1217) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:813) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:410) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.snapshots.SnapshotShardsService.access$200(SnapshotShardsService.java:97) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.snapshots.SnapshotShardsService$1.doRun(SnapshotShardsService.java:354) [elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:724) [elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.3.0.jar:6.3.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_161]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_161]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]

Each error repeats multiple times for different indices.

These errors don't mean anything to me!
Mouglou

:slight_smile: Nor me in this case. They still tell me that the shard is unexpectedly closed during the snapshot, but not why. Are there any other messages? Is your cluster otherwise healthy?

I looked more closely at the logs, and I found these:

[2018-12-11T05:58:42,660][WARN ][o.e.t.n.Netty4Transport ] [ord-es01-001-es01-dev] send message failed [channel: NettyTcpChannel{localAddress=0.0.0.0/0.0.0.0:9300, remoteAddress=/10.4.32.98:58480}]
java.nio.channels.ClosedChannelException: null
    at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]

[2018-12-11T05:58:42,668][WARN ][o.e.t.n.Netty4Transport ] [ord-es01-001-es01-dev] send message failed [channel: NettyTcpChannel{localAddress=0.0.0.0/0.0.0.0:9300, remoteAddress=/10.4.32.96:52227}]
java.nio.channels.ClosedChannelException: null
    at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]

[2018-12-11T05:59:11,962][INFO ][o.e.d.z.ZenDiscovery ] [ord-es01-001-es01-dev] master_left [{ord-esmaster01-003-es01-dev}{43KrSM-eTkyQ1czSeYOWbA}{-wr405ErR522z8j_tc1Hug}{10.4.32.105}{10.4.32.105:9300}{ml.machine_memory=8894873600, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]

[2018-12-11T05:59:11,962][WARN ][o.e.d.z.ZenDiscovery ] [ord-es01-001-es01-dev] master left (reason = failed to ping, tried [3] times, each with maximum [30s] timeout), current nodes: nodes: {ord-esmaster01-003-es01-dev}{43KrSM-eTkyQ1czSeYOWbA}{-wr405ErR522z8j_tc1Hug}{10.4.32.105}{10.4.32.105:9300}{ml.machine_memory=8894873600, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}, master {ord-esmaster01-002-es01-dev}{rl6VPdkqSzyyVXezwXERNA}{itfFyYgjR5CG4TK98vdr8A}{10.4.32.104}{10.4.32.104:9300}{ml.machine_memory=8895926272, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} {ord-esclient01-002-es01-dev}{PU9oF_XOQ_CwHekzYs9Ltw}{olPMeNL0RZWcX2lA3BYC5w}{10.4.32.110}{10.4.32.110:9300}{ml.machine_memory=4667203584, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} {ord-es01-006-es01-dev}{RUxNImDGTIySQv9rVV3v3w}{ePhniAdOQUKOVpsLYGQLkA}{10.4.32.100}{10.4.32.100:9300}{rack_id=K13-4, ml.machine_memory=67255390208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} {ord-esmaster01-001-es01-dev}{ry1Nws6SRImi4c8h7FUdSg}{Tyl_DT8ETi2Nb-pfSIA89Q}{10.4.32.103}{10.4.32.103:9300}{ml.machine_memory=8504516608, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} {ord-esclient01-001-es01-dev}{Xzkpd09BS4ubsMGl6tV3ww}{JIlkvEJtR-2KbV_QgAvahg}{10.4.32.109}{10.4.32.109:9300}{ml.machine_memory=4667203584, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} {ord-es01-005-es01-dev}{ZGpgWzA2QKSFaray7lQcvA}{FIA5EOEERDaX5zixQm7AtQ}{10.4.32.99}{10.4.32.99:9300}{rack_id=K13-8, ml.machine_memory=67255390208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} {ord-es01-003-es01-dev}{ApCInEgCQ1aSmwYEr39k_w}{RbLw4dorS9utubmiCE_OEw}{10.4.32.97}{10.4.32.97:9300}{rack_id=K13-2, ml.machine_memory=67255390208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} {ord-es01-002-es01-dev}{e7zW4W1vRT6OBk-4DcZFZw}{PvqMK4sFQyO2_yjPbE_fbQ}{10.4.32.96}{10.4.32.96:9300}{rack_id=K13-1, ml.machine_memory=67255390208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} {ord-es01-001-es01-dev}{NSuWs702Q3Wr0OqzygzGZQ}{xSzQ4QkPT8GikNf8W9A91A}{10.4.32.95}{10.4.32.95:9300}{rack_id=K13-1, ml.machine_memory=67254702080, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}, local {ord-es01-007-es01-dev}{4Se9sK1VQpikeFEPf0QirQ}{ZXg4_H37TWKFfX_0Cd-c8Q}{10.4.32.101}{10.4.32.101:9300}{rack_id=K13-5, ml.machine_memory=67255390208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} {ord-es01-008-es01-dev}{6qPQtTn9SfOwBsNtAI0YpQ}{Idr8YzwfQwGoNrfmHPXKkQ}{10.4.32.102}{10.4.32.102:9300}{rack_id=K13-6, ml.machine_memory=67255390208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} {ord-es01-004-es01-dev}{m4su0lvxSU67nKv1Hw68Ug}{xLPER63fQYuDJNmL65uccg}{10.4.32.98}{10.4.32.98:9300}{rack_id=K13-2, ml.machine_memory=67255390208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}

[2018-12-11T05:59:11,980][INFO ][o.e.x.w.WatcherService ] [ord-es01-001-es01-dev] stopping watch service, reason [no master node]

[2018-12-11T05:59:17,203][INFO ][o.e.c.s.ClusterApplierService] [ord-es01-001-es01-dev] detected_master {ord-esmaster01-003-es01-dev}{43KrSM-eTkyQ1czSeYOWbA}{-wr405ErR522z8j_tc1Hug}{10.4.32.105}{10.4.32.105:9300}{ml.machine_memory=8894873600, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}, reason: apply cluster state (from master [master {ord-esmaster01-003-es01-dev}{43KrSM-eTkyQ1czSeYOWbA}{-wr405ErR522z8j_tc1Hug}{10.4.32.105}{10.4.32.105:9300}{ml.machine_memory=8894873600, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} committed version [26718]])

As we can see, we may have had a brief loss of connectivity with the master node, but the same node was re-elected as master...

We have 3 master nodes; if one fails, the role moves to another node and the service should continue. Could this be the reason why the snapshot failed?

I think the master didn't fail here, but this node (ord-es01-001-es01-dev) lost its connection to the master for long enough that the master decided to remove it from the cluster, cancelling the snapshot so it could allocate a new primary.

I suggest you look into that connectivity issue further. Elasticsearch is designed not to lose any data if the network is temporarily broken, but it's not designed to work well on a consistently unreliable network.
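
For reference, the "failed to ping, tried [3] times, each with maximum [30s] timeout" in the log above matches the 6.x Zen discovery fault-detection defaults (shown below as an illustration only); raising them would just mask the connectivity problem rather than fix it.

# elasticsearch.yml -- 6.x Zen discovery fault-detection defaults (illustration only)
discovery.zen.fd.ping_interval: 1s    # how often the master and the nodes ping each other
discovery.zen.fd.ping_timeout: 30s    # the "[30s] timeout" in the log message
discovery.zen.fd.ping_retries: 3      # the "tried [3] times" in the log message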
