Hi all!
We have been using Elasticsearch for a long time now, and we are getting errors during a snapshot.
This snapshot stores our metrics indices (1 index per day / 12 shards), so it is pretty big: ~180 GB / ~900 million documents.
Until now we didn't snapshot this data, but now we have to.
The snapshot takes a long time to complete, and we get a "PARTIAL" status at the end. If I don't set "partial" to "true" in the Curator config file, the snapshot fails because some shards supposedly have no primary available, which is wrong: the cluster is green, and there are no yellow/red shards in monitoring.
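For reference, the Curator snapshot action with partial enabled looks roughly like this. It is only a sketch: the repository name, description, and filter are illustrative placeholders, not our exact file; only partial: True and the daily metrics- naming reflect what we actually do.

actions:
  1:
    action: snapshot
    description: "Snapshot the daily metrics-* indices (illustrative sketch)"
    options:
      repository: poc_metrics_repo           # placeholder repository name
      name: snapshot-poc-metrics-%Y.%m.%d    # matches the snapshot name shown in the status output below
      partial: True                          # without this, the run fails outright instead of ending PARTIAL
      include_global_state: True
      wait_for_completion: True
    filters:
    - filtertype: pattern
      kind: prefix
      value: metrics-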
Here is the (partial) status of the snapshot (the same errors repeat multiple times):
{
  "snapshots": [
    {
      "snapshot": "snapshot-poc-metrics-2018.12.10",
      "uuid": "PGOWeVx5TeKWxcWt5ACayA",
      "version_id": 6030099,
      "version": "6.3.0",
      "indices": [
        "metrics-2018.11.30",
        "metrics-2018.08.08",
        .... (for a total of 140 indices)
      ],
      "include_global_state": true,
      "state": "PARTIAL",
      "start_time": "2018-12-10T16:09:10.421Z",
      "start_time_in_millis": 1544458150421,
      "end_time": "2018-12-11T06:07:06.603Z",
      "end_time_in_millis": 1544508426603,
      "duration_in_millis": 50276182,
      "failures": [
        {
          "index": "metrics-2018.11.30",
          "index_uuid": "metrics-2018.11.30",
          "shard_id": 6,
          "reason": "IndexShardSnapshotFailedException[Failed to snapshot]; nested: AlreadyClosedException[engine is closed]; ",
          "node_id": "NSuWs702Q3Wr0OqzygzGZQ",
          "status": "INTERNAL_SERVER_ERROR"
        },
        {
          "index": "metrics-2018.11.19",
          "index_uuid": "metrics-2018.11.19",
          "shard_id": 6,
          "reason": "IndexShardSnapshotFailedException[Aborted]",
          "node_id": "NSuWs702Q3Wr0OqzygzGZQ",
          "status": "INTERNAL_SERVER_ERROR"
        },
        {
          "index": "metrics-2018.09.16",
          "index_uuid": "metrics-2018.09.16",
          "shard_id": 4,
          "reason": "IndexShardSnapshotFailedException[java.lang.IllegalStateException: Unable to move the shard snapshot status to [FINALIZE]: expecting [STARTED] but got [ABORTED]]; nested: IllegalStateException[Unable to move the shard snapshot status to [FINALIZE]: expecting [STARTED] but got [ABORTED]]; ",
          "node_id": "ApCInEgCQ1aSmwYEr39k_w",
          "status": "INTERNAL_SERVER_ERROR"
        },
        {
          "index": "metrics-2018.11.10",
          "index_uuid": "metrics-2018.11.10",
          "shard_id": 0,
          "reason": "IndexShardSnapshotFailedException[Aborted]",
          "node_id": "NSuWs702Q3Wr0OqzygzGZQ",
          "status": "INTERNAL_SERVER_ERROR"
        },
        {
          "index": "metrics-2018.08.24",
          "index_uuid": "metrics-2018.08.24",
          "shard_id": 0,
          "reason": "IndexShardSnapshotFailedException[Failed to snapshot]; nested: AlreadyClosedException[engine is closed]; ",
          "node_id": "NSuWs702Q3Wr0OqzygzGZQ",
          "status": "INTERNAL_SERVER_ERROR"
        },
        {
          "index": "metrics-2018.11.20",
          "index_uuid": "metrics-2018.11.20",
          "shard_id": 5,
          "reason": "IndexShardSnapshotFailedException[org.apache.lucene.store.AlreadyClosedException: store is already closed can't increment refCount current count [0]]; nested: AlreadyClosedException[store is already closed can't increment refCount current count [0]]; ",
          "node_id": "NSuWs702Q3Wr0OqzygzGZQ",
          "status": "INTERNAL_SERVER_ERROR"
        },
        {
          "index": "metrics-2018.10.12",
          "index_uuid": "metrics-2018.10.12",
          "shard_id": 0,
          "reason": "IndexShardSnapshotFailedException[Failed to snapshot]; nested: AlreadyClosedException[engine is closed]; ",
          "node_id": "NSuWs702Q3Wr0OqzygzGZQ",
          "status": "INTERNAL_SERVER_ERROR"
        },
        {
          "index": "metrics-2018.11.29",
          "index_uuid": "metrics-2018.11.29",
          "shard_id": 3,
          "reason": "IndexShardSnapshotFailedException[org.apache.lucene.store.AlreadyClosedException: store is already closed can't increment refCount current count [0]]; nested: AlreadyClosedException[store is already closed can't increment refCount current count [0]]; ",
          "node_id": "NSuWs702Q3Wr0OqzygzGZQ",
          "status": "INTERNAL_SERVER_ERROR"
        },
        .....
      ],
      "shards": {
        "total": 70,
        "failed": 65,
        "successful": 5
      }
    }
  ]
}
So, a few different errors, and I can't find anything on the web about "engine is closed"...
For information, we have other snapshots running every day on other indices, and we never get errors on them.
I can't figure out why it happens on these indices, or how to get the snapshot to complete successfully.
More information:
Elasticsearch / Kibana / Logstash: 6.3.0
Curator (via pip, elasticsearch-curator): 5.5.4
System: CentOS 7
Snapshot repository: an NFS mountpoint, with compress set to true (a rough sketch of the registration is after this list).
I tried running another snapshot to see if it would fix itself, but it is still running.
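For illustration, the repository is registered roughly as below; the repository name and NFS path are placeholders, and only the fs type and compress: true come from our setup. The second request is how the progress of the still-running snapshot can be followed:

# Hypothetical repository name and NFS path; only "type": "fs" and "compress": true reflect our setup
PUT _snapshot/poc_metrics_repo
{
  "type": "fs",
  "settings": {
    "location": "/mnt/nfs/elasticsearch_snapshots",
    "compress": true
  }
}

# Per-shard progress of the snapshot that is currently running
GET _snapshot/poc_metrics_repo/<snapshot_name>/_status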
Has anybody already seen these errors? And found a solution?
Thanks a lot for your help!
Mouglou