Snapshot failed with partial state

Hi all!

We have been using Elasticsearch for a long time now, and we are getting an error during a snapshot.

These snapshots store metrics (1 index per day / 12 shards), so they are pretty big: ~180 GB / ~900 million documents.

Until now, we did not snapshot this data, but now we have to.

The snapshot takes a long time to complete, but we get a "PARTIAL" status at the end. If I don't set "partial" to "true" in the curator config file, the snapshot fails because some shards supposedly don't have a primary (which is wrong: the cluster is green, and there are no yellow/red shards in monitoring).
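
For reference, the option in question is the partial flag on curator's snapshot action; our action file looks roughly like this (the name pattern and filter are illustrative, not our exact config):

actions:
  1:
    action: snapshot
    description: "Snapshot the daily metrics indices (illustrative)"
    options:
      repository: repo_es01-dev
      name: snapshot-poc-metrics-%Y.%m.%d
      include_global_state: True
      partial: True              # without this, the run aborts when a shard is reported as not having a primary
      wait_for_completion: True
    filters:
    - filtertype: pattern
      kind: prefix
      value: metrics-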

Here is the (partial) status of the snapshot (the errors repeat multiple times):
{
"snapshots": [
{
"snapshot": "snapshot-poc-metrics-2018.12.10",
"uuid": "PGOWeVx5TeKWxcWt5ACayA",
"version_id": 6030099,
"version": "6.3.0",
"indices": [
"metrics-2018.11.30",
"metrics-2018.08.08",
.... (for a total of 140 indices)
],
"include_global_state": true,
"state": "PARTIAL",
"start_time": "2018-12-10T16:09:10.421Z",
"start_time_in_millis": 1544458150421,
"end_time": "2018-12-11T06:07:06.603Z",
"end_time_in_millis": 1544508426603,
"duration_in_millis": 50276182,
"failures": [
{
"index": "metrics-2018.11.30",
"index_uuid": "metrics-2018.11.30",
"shard_id": 6,
"reason": "IndexShardSnapshotFailedException[Failed to snapshot]; nested: AlreadyClosedException[engine is closed]; ",
"node_id": "NSuWs702Q3Wr0OqzygzGZQ",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "metrics-2018.11.19",
"index_uuid": "metrics-2018.11.19",
"shard_id": 6,
"reason": "IndexShardSnapshotFailedException[Aborted]",
"node_id": "NSuWs702Q3Wr0OqzygzGZQ",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "metrics-2018.09.16",
"index_uuid": "metrics-2018.09.16",
"shard_id": 4,
"reason": "IndexShardSnapshotFailedException[java.lang.IllegalStateException: Unable to move the shard snapshot status to [FINALIZE]: expecting [STARTED] but got [ABORTED]]; nested: IllegalStateException[Unable to move the shard snapshot status to [FINALIZE]: expecting [STARTED] but got [ABORTED]]; ",
"node_id": "ApCInEgCQ1aSmwYEr39k_w",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "metrics-2018.11.10",
"index_uuid": "metrics-2018.11.10",
"shard_id": 0,
"reason": "IndexShardSnapshotFailedException[Aborted]",
"node_id": "NSuWs702Q3Wr0OqzygzGZQ",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "metrics-2018.08.24",
"index_uuid": "metrics-2018.08.24",
"shard_id": 0,
"reason": "IndexShardSnapshotFailedException[Failed to snapshot]; nested: AlreadyClosedException[engine is closed]; ",
"node_id": "NSuWs702Q3Wr0OqzygzGZQ",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "metrics-2018.11.20",
"index_uuid": "metrics-2018.11.20",
"shard_id": 5,
"reason":"IndexShardSnapshotFailedException[org.apache.lucene.store.AlreadyClosedException: store is already closed can't increment refCount current count [0]]; nested: AlreadyClosedException[store is already closed can't increment refCount current count [0]]; ",
"node_id": "NSuWs702Q3Wr0OqzygzGZQ",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "metrics-2018.10.12",
"index_uuid": "metrics-2018.10.12",
"shard_id": 0,
"reason": "IndexShardSnapshotFailedException[Failed to snapshot]; nested: AlreadyClosedException[engine is closed]; ",
"node_id": "NSuWs702Q3Wr0OqzygzGZQ",
"status": "INTERNAL_SERVER_ERROR"
},
{
"index": "metrics-2018.11.29",
"index_uuid": "metrics-2018.11.29",
"shard_id": 3,
"reason":"IndexShardSnapshotFailedException[org.apache.lucene.store.AlreadyClosedException: store is already closed can't increment refCount current count [0]]; nested: AlreadyClosedException[store is already closed can't increment refCount current count [0]]; ",
"node_id": "NSuWs702Q3Wr0OqzygzGZQ",
"status": "INTERNAL_SERVER_ERROR"
},
.....
],
"shards": {
"total": 70,
"failed": 65,
"successful": 5
}
}
]
}
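
(For reference, the output above is what the snapshot API returns; roughly these calls, with the repository name left as a placeholder:)

# overall state and per-shard failures (the output shown above):
GET _snapshot/<repository>/snapshot-poc-metrics-2018.12.10
# more detailed per-shard progress, useful while a snapshot is still running:
GET _snapshot/<repository>/snapshot-poc-metrics-2018.12.10/_status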

So, a few different errors, and I can't find anything about "engine is closed" on the web...

For information, we have other snapshots running every day on other indices, and we never get errors on them.

I can't figure out why it happens on these indices, or find a way to run it successfully.

More information:
Elasticsearch / Kibana / Logstash: 6.3.0
curator (via pip elasticsearch-curator): 5.5.4
System: CentOS 7
Snapshot repository: an NFS mountpoint, with compress set to true (see the sketch below).
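
For reference, the repository was registered more or less like this (the path is a placeholder for our NFS mountpoint, which is also listed in path.repo on every node):

PUT _snapshot/repo_es01-dev
{
  "type": "fs",
  "settings": {
    "location": "/mnt/nfs/elasticsearch-snapshots",
    "compress": true
  }
}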

I am trying another snapshot to see if it fixes itself, but it is still running.

Has anybody already run into these errors? And found a solution?

Thanks a lot for your help!
Mouglou

Sorry about the formatting of my snapshot output; I tried the two options, using the </> button or // at the beginning of the line, but it doesn't seem to be working... I will try again.

engine is closed suggests the primary shard performing the snapshot failed for some reason, and if so there should be a lot more detail in the server logs. Could you share any messages and stack traces from the logs from around this time?

Hi @DavidTurner!

Thanks for the feedback. I have an update too.

I ran a second snapshot to follow up on this, and it completed successfully. No errors.

But the first one I tried to run 3 times, and I got the same errors each time. So it seems like the first snapshot of this data is the tricky one.

I found these errors in the Elasticsearch log:

[2018-12-11T05:59:41,404][WARN ][o.e.s.SnapshotShardsService] [ord-es01-001-es01-dev][[metrics-2018.11.14][10]][repo_es01-dev:snapshot-poc-metrics-2018.12.10/PGOWeVx5TeKWxcWt5ACayA] failed to snapshot shard
org.elasticsearch.index.snapshots.IndexShardSnapshotFailedException: Failed to snapshot
    at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:419) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.snapshots.SnapshotShardsService.access$200(SnapshotShardsService.java:97) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.snapshots.SnapshotShardsService$1.doRun(SnapshotShardsService.java:354) [elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:724) [elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.3.0.jar:6.3.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_161]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_161]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]
Caused by: org.apache.lucene.store.AlreadyClosedException: engine is closed
    at org.elasticsearch.index.shard.IndexShard.getEngine(IndexShard.java:1980) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.index.shard.IndexShard.acquireLastIndexCommit(IndexShard.java:1094) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:409) ~[elasticsearch-6.3.0.jar:6.3.0]
    ... 7 more

[2018-12-11T05:59:41,415][WARN ][o.e.s.SnapshotShardsService] [ord-es01-001-es01-dev] [[metrics-2018.11.20][5]][repo_es01-dev:snapshot-poc-metrics-2018.12.10/PGOWeVx5TeKWxcWt5ACayA] failed to snapshot shard
org.elasticsearch.index.snapshots.IndexShardSnapshotFailedException: org.apache.lucene.store.AlreadyClosedException: store is already closed can't increment refCount current count [0]
    at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:819) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:410) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.snapshots.SnapshotShardsService.access$200(SnapshotShardsService.java:97) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.snapshots.SnapshotShardsService$1.doRun(SnapshotShardsService.java:354) [elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:724) [elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.3.0.jar:6.3.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_161]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_161]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]
Caused by: org.apache.lucene.store.AlreadyClosedException: store is already closed can't increment refCount current count [0]
    at org.elasticsearch.common.util.concurrent.AbstractRefCounted.alreadyClosed(AbstractRefCounted.java:71) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.common.util.concurrent.AbstractRefCounted.incRef(AbstractRefCounted.java:42) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.index.store.Store.incRef(Store.java:398) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshot(BlobStoreRepository.java:1158) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:813) ~[elasticsearch-6.3.0.jar:6.3.0]
    ... 8 more

[2018-12-11T05:59:41,620][WARN ][o.e.s.SnapshotShardsService] [ord-es01-001-es01-dev] [[metrics-2018.10.29][0]][repo_es01-dev:snapshot-poc-metrics-2018.12.10/PGOWeVx5TeKWxcWt5ACayA] failed to snapshot shard
org.elasticsearch.index.snapshots.IndexShardSnapshotFailedException: Aborted
    at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext$AbortableInputStream.checkAborted(BlobStoreRepository.java:1354) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext$AbortableInputStream.read(BlobStoreRepository.java:1347) ~[elasticsearch-6.3.0.jar:6.3.0]
    at java.io.FilterInputStream.read(FilterInputStream.java:107) ~[?:1.8.0_161]
    at org.elasticsearch.core.internal.io.Streams.copy(Streams.java:54) ~[elasticsearch-core-6.3.0.jar:6.3.0]
    at org.elasticsearch.common.blobstore.fs.FsBlobContainer.writeBlob(FsBlobContainer.java:131) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshotFile(BlobStoreRepository.java:1281) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshot(BlobStoreRepository.java:1217) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:813) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:410) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.snapshots.SnapshotShardsService.access$200(SnapshotShardsService.java:97) ~[elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.snapshots.SnapshotShardsService$1.doRun(SnapshotShardsService.java:354) [elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:724) [elasticsearch-6.3.0.jar:6.3.0]
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.3.0.jar:6.3.0]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_161]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_161]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]

Each error repeats multiple times for different indices.

These errors don't mean anything to me!
Mouglou

:slight_smile: Nor me in this case. They still tell me that the shard is unexpectedly closed during the snapshot, but not why. Are there any other messages? Is your cluster otherwise healthy?

I looked more closely at the logs, and I found these:

[2018-12-11T05:58:42,660][WARN ][o.e.t.n.Netty4Transport ] [ord-es01-001-es01-dev] send message failed [channel: NettyTcpChannel{localAddress=0.0.0.0/0.0.0.0:9300, remoteAddress=/10.4.32.98:58480}]
java.nio.channels.ClosedChannelException: null
    at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]

[2018-12-11T05:58:42,668][WARN ][o.e.t.n.Netty4Transport ] [ord-es01-001-es01-dev] send message failed [channel: NettyTcpChannel{localAddress=0.0.0.0/0.0.0.0:9300, remoteAddress=/10.4.32.96:52227}]
java.nio.channels.ClosedChannelException: null
    at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source) ~[?:?]

[2018-12-11T05:59:11,962][INFO ][o.e.d.z.ZenDiscovery ] [ord-es01-001-es01-dev] master_left [{ord-esmaster01-003-es01-dev}{43KrSM-eTkyQ1czSeYOWbA}{-wr405ErR522z8j_tc1Hug}{10.4.32.105}{10.4.32.105:9300}{ml.machine_memory=8894873600, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], reason [failed to ping, tried [3] times, each with maximum [30s] timeout]

[2018-12-11T05:59:11,962][WARN ][o.e.d.z.ZenDiscovery ] [ord-es01-001-es01-dev] master left (reason = failed to ping, tried [3] times, each with maximum [30s] timeout), current nodes: nodes: {ord-esmaster01-003-es01-dev}{43KrSM-eTkyQ1czSeYOWbA}{-wr405ErR522z8j_tc1Hug}{10.4.32.105}{10.4.32.105:9300}{ml.machine_memory=8894873600, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}, master {ord-esmaster01-002-es01-dev}{rl6VPdkqSzyyVXezwXERNA}{itfFyYgjR5CG4TK98vdr8A}{10.4.32.104}{10.4.32.104:9300}{ml.machine_memory=8895926272, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} {ord-esclient01-002-es01-dev}{PU9oF_XOQ_CwHekzYs9Ltw}{olPMeNL0RZWcX2lA3BYC5w}{10.4.32.110}{10.4.32.110:9300}{ml.machine_memory=4667203584, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} {ord-es01-006-es01-dev}{RUxNImDGTIySQv9rVV3v3w}{ePhniAdOQUKOVpsLYGQLkA}{10.4.32.100}{10.4.32.100:9300}{rack_id=K13-4, ml.machine_memory=67255390208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} {ord-esmaster01-001-es01-dev}{ry1Nws6SRImi4c8h7FUdSg}{Tyl_DT8ETi2Nb-pfSIA89Q}{10.4.32.103}{10.4.32.103:9300}{ml.machine_memory=8504516608, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} {ord-esclient01-001-es01-dev}{Xzkpd09BS4ubsMGl6tV3ww}{JIlkvEJtR-2KbV_QgAvahg}{10.4.32.109}{10.4.32.109:9300}{ml.machine_memory=4667203584, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} {ord-es01-005-es01-dev}{ZGpgWzA2QKSFaray7lQcvA}{FIA5EOEERDaX5zixQm7AtQ}{10.4.32.99}{10.4.32.99:9300}{rack_id=K13-8, ml.machine_memory=67255390208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} {ord-es01-003-es01-dev}{ApCInEgCQ1aSmwYEr39k_w}{RbLw4dorS9utubmiCE_OEw}{10.4.32.97}{10.4.32.97:9300}{rack_id=K13-2, ml.machine_memory=67255390208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} {ord-es01-002-es01-dev}{e7zW4W1vRT6OBk-4DcZFZw}{PvqMK4sFQyO2_yjPbE_fbQ}{10.4.32.96}{10.4.32.96:9300}{rack_id=K13-1, ml.machine_memory=67255390208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} {ord-es01-001-es01-dev}{NSuWs702Q3Wr0OqzygzGZQ}{xSzQ4QkPT8GikNf8W9A91A}{10.4.32.95}{10.4.32.95:9300}{rack_id=K13-1, ml.machine_memory=67254702080, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}, local {ord-es01-007-es01-dev}{4Se9sK1VQpikeFEPf0QirQ}{ZXg4_H37TWKFfX_0Cd-c8Q}{10.4.32.101}{10.4.32.101:9300}{rack_id=K13-5, ml.machine_memory=67255390208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} {ord-es01-008-es01-dev}{6qPQtTn9SfOwBsNtAI0YpQ}{Idr8YzwfQwGoNrfmHPXKkQ}{10.4.32.102}{10.4.32.102:9300}{rack_id=K13-6, ml.machine_memory=67255390208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} {ord-es01-004-es01-dev}{m4su0lvxSU67nKv1Hw68Ug}{xLPER63fQYuDJNmL65uccg}{10.4.32.98}{10.4.32.98:9300}{rack_id=K13-2, ml.machine_memory=67255390208, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}

[2018-12-11T05:59:11,980][INFO ][o.e.x.w.WatcherService ] [ord-es01-001-es01-dev] stopping watch service, reason [no master node]

[2018-12-11T05:59:17,203][INFO ][o.e.c.s.ClusterApplierService] [ord-es01-001-es01-dev] detected_master {ord-esmaster01-003-es01-dev}{43KrSM-eTkyQ1czSeYOWbA}{-wr405ErR522z8j_tc1Hug}{10.4.32.105}{10.4.32.105:9300}{ml.machine_memory=8894873600, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}, reason: apply cluster state (from master [master {ord-esmaster01-003-es01-dev}{43KrSM-eTkyQ1czSeYOWbA}{-wr405ErR522z8j_tc1Hug}{10.4.32.105}{10.4.32.105:9300}{ml.machine_memory=8894873600, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true} committed version [26718]])

As we can see, we may have had a brief loss of connectivity with the master node, but the same node was re-elected as master...

We have 3 master nodes; if one fails, the role moves to another node and the service should continue. Could this be the reason why the snapshot failed?

I think the master didn't fail here, but this node (ord-es01-001-es01-dev) lost its connection to the master for long enough that the master decided to remove it from the cluster, cancelling the snapshot so it could allocate a new primary.

I suggest you look into that connectivity issue further. Elasticsearch is designed not to lose any data if the network is temporarily broken, but it's not designed to work well on a consistently unreliable network.
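
For reference, the "failed to ping, tried [3] times, each with maximum [30s] timeout" in the log above matches the 6.x Zen discovery fault-detection defaults (shown below as an illustration only); raising them would just mask the connectivity problem rather than fix it.

# elasticsearch.yml -- 6.x Zen discovery fault-detection defaults (illustration only)
discovery.zen.fd.ping_interval: 1s    # how often the master and the nodes ping each other
discovery.zen.fd.ping_timeout: 30s    # the "[30s] timeout" in the log message
discovery.zen.fd.ping_retries: 3      # the "tried [3] times" in the log message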
