Hang-up while creating a snapshot


(Marco Spadoni) #1

Hi all.
I am running a six-node cluster of bare-metal Lenovo ThinkServer RD550 MLK machines, each equipped with 128 GB RAM, two 8-core Xeon E5-2620 CPUs, and 8 x 800 GB SSDs in RAID 10. The six nodes mount an NFS share configured as the Elasticsearch snapshot repository.
The installed Elasticsearch version is 5.1.1.
Every day we create a snapshot of the index covering that day.
The cluster is continuously fed with log messages (via Logstash, consuming a Kafka topic) at a mean rate of about 15,000 msgs/sec and a peak rate of about 25,000 msgs/sec.
Occasionally the snapshot creation process hangs, as can be seen in this excerpt of the related log file:

Started index snapshot (snap-cm_sp_ops-20161202) creation task ...
Snapshot creation task status check 001; task still running at 2017-01-16.14:33:07 {"Total_files":772,"Processed_files":80,"Processed_bytes":33939100845} ...
Snapshot creation task status check 002; task still running at 2017-01-16.14:38:08 {"Total_files":772,"Processed_files":154,"Processed_bytes":66420738557} ...
Snapshot creation task status check 003; task still running at 2017-01-16.14:43:08 {"Total_files":772,"Processed_files":262,"Processed_bytes":101612034931} ...
Snapshot creation task status check 004; task still running at 2017-01-16.14:48:08 {"Total_files":772,"Processed_files":340,"Processed_bytes":134351782817} ...
Snapshot creation task status check 005; task still running at 2017-01-16.14:53:08 {"Total_files":772,"Processed_files":404,"Processed_bytes":157301923196} ...
Snapshot creation task status check 006; task still running at 2017-01-16.14:58:09 {"Total_files":772,"Processed_files":488,"Processed_bytes":181146599883} ...
Snapshot creation task status check 007; task still running at 2017-01-16.15:03:09 {"Total_files":772,"Processed_files":543,"Processed_bytes":202853102074} ...
Snapshot creation task status check 008; task still running at 2017-01-16.15:08:09 {"Total_files":772,"Processed_files":622,"Processed_bytes":230926406574} ...
Snapshot creation task status check 009; task still running at 2017-01-16.15:13:09 {"Total_files":772,"Processed_files":643,"Processed_bytes":243061286765} ...
Snapshot creation task status check 010; task still running at 2017-01-16.15:18:10 {"Total_files":772,"Processed_files":643,"Processed_bytes":243061286765} ...
Snapshot creation task status check 011; task still running at 2017-01-16.15:23:10 {"Total_files":971,"Processed_files":662,"Processed_bytes":245992908061} ...
Snapshot creation task status check 012; task still running at 2017-01-16.15:28:10 {"Total_files":971,"Processed_files":682,"Processed_bytes":257730086837} ...
Snapshot creation task status check 013; task still running at 2017-01-16.15:33:10 {"Total_files":971,"Processed_files":720,"Processed_bytes":266745978206} ...
Snapshot creation task status check 014; task still running at 2017-01-16.15:38:10 {"Total_files":971,"Processed_files":736,"Processed_bytes":272574285680} ...
Snapshot creation task status check 015; task still running at 2017-01-16.15:43:10 {"Total_files":971,"Processed_files":777,"Processed_bytes":285742203971} ...
Snapshot creation task status check 016; task still running at 2017-01-16.15:48:10 {"Total_files":971,"Processed_files":804,"Processed_bytes":292742580981} ...
Snapshot creation task status check 017; task still running at 2017-01-16.15:53:10 {"Total_files":971,"Processed_files":830,"Processed_bytes":304755046059} ...
Snapshot creation task status check 018; task still running at 2017-01-16.15:58:10 {"Total_files":971,"Processed_files":842,"Processed_bytes":312912373895} ...
Snapshot creation task status check 019; task still running at 2017-01-16.16:03:10 {"Total_files":971,"Processed_files":842,"Processed_bytes":312912373895} ...
Snapshot creation task status check 020; task still running at 2017-01-16.16:08:10 {"Total_files":971,"Processed_files":842,"Processed_bytes":312912373895} ...
... ... ... ... ...
Snapshot creation task status check 236; task still running at 2017-01-17.10:08:38 {"Total_files":971,"Processed_files":842,"Processed_bytes":312912373895} ...
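
For context, the snapshot is created by a daily script that essentially issues a request along these lines (names taken from the log above; the exact options our script passes may differ slightly), and then polls the snapshot status every five minutes, which produces the checks logged above:

curl -XPUT "http://oxelkk-05.iol.local:9200/_snapshot/snaprepo/snap-cm_sp_ops-20161202" -d '{"indices": "cm_sp_ops-2016.12.02"}'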

What I currently need to know is which indicators should be gathered (and how) to help investigate such an event.
Below are some commands I ran against the cluster, together with their output, in case they can be of help. The snapshot name is "snap-cm_sp_ops-20161202" and it refers to the index named "cm_sp_ops-2016.12.02".

curl --silent -X GET http://oxelkk-05.iol.local:9200/_snapshot/snaprepo/_all | jq '.snapshots[]'

{
  "snapshot": "snap-cm_sp_ops-20161202",
  "uuid": "OFCzvripR0igt0FUP_MDhQ",
  "version_id": 5010199,
  "version": "5.1.1",
  "indices": [
    "cm_sp_ops-2016.12.02"
  ],
  "state": "IN_PROGRESS",
  "start_time": "2017-01-16T13:28:07.308Z",
  "start_time_in_millis": 1484573287308,
  "failures": [],
  "shards": {
    "total": 0,
    "failed": 0,
    "successful": 0
  }
}

curl --silent -XGET http://oxelkk-05.iol.local:9200/_snapshot/snaprepo/snap-cm_sp_ops-20161202/_status | jq '.snapshots[].indices[].shards[].stage'

"STARTED"
"DONE"
"DONE"
"DONE"
"DONE"

curl --silent -XGET http://oxelkk-05.iol.local:9200/_cluster/state | jq '.snapshots.snapshots[].shards[].state'

"SUCCESS"
"INIT"
"SUCCESS"
"SUCCESS"
"SUCCESS"

Thank you for your hints,
Marco


(system) #2

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.