Snapshot starts from the beginning

Hi

We use Graylog with a 10-node Elasticsearch backend.
ES version: elasticsearch-6.6.0 (the same problem occurred with previous versions as well)
Snapshot repo: a local "fs" repository on an NFS share mounted to a local folder
OS: CentOS 7
10 Elasticsearch nodes in one cluster, no dedicated roles set.

The problem:
We run snapshots every night from a cron job. Before taking the new snapshots, the job deletes the old ones, so we have constant disk usage. We take snapshots per index.
But if we update a host OS (the ES version does not change) and reboot it (e.g. for a kernel update), the nightly snapshot job eats all the space: it takes all the snapshots from the beginning, so it writes the data to the disk again.
If we don't update the nodes, it can run for months without any problem.
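
For illustration, a minimal sketch of such a per-index snapshot call (the repository name "my_backup" is taken from the log lines further down in the thread; the index name, date suffix and localhost endpoint are assumptions, not the actual script):

# hypothetical nightly per-index snapshot, one PUT per index
curl -s -X PUT "localhost:9200/_snapshot/my_backup/index_a-20190829?wait_for_completion=true" \
  -H 'Content-Type: application/json' \
  -d '{"indices": "index_a", "include_global_state": false}'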

ES starts automatically, but the NFS share is mounted by hand, so after a reboot ES starts with an empty repository folder. We also tried restarting ES after mounting.

I tried to check the logs and didn't see any errors, but there are a lot of logs, so I'm not sure I'm right.
The restarted nodes' logs are empty at the time the snapshots start.

Have you got any idea where to start debugging? Or have you seen the same error before?

Thanks, Macko

I did a little debugging and found some new information.
The process creates the snapshots from the beginning for those indices where the restarted node (a service restart is enough) holds a replica shard of the index.

I tried to set the following, but I only get INFO-level logs about the snapshots in the cluster master node's log.

{
  "transient": {
    "logger.org.elasticsearch.snapshot": "trace"
  }
}
[2019-08-29T11:21:37,519][INFO ][o.e.s.SnapshotsService   ] [elastic_b1] snapshot [my_backup:XXX_603-20190829-112106/XXX] started
[2019-08-29T11:21:37,620][INFO ][o.e.s.SnapshotsService   ] [elastic_b1] snapshot [my_backup:XXX_603-20190829-112106/XXX] completed with state [SUCCESS]
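
For reference, the SnapshotsService in those log lines lives in the org.elasticsearch.snapshots package (plural), so a trace-level setting would need to target that logger name; a minimal sketch via the cluster settings API (the curl form and localhost endpoint are assumptions):

# raise snapshot logging to TRACE on the fly; revert later by setting the value to null
curl -s -X PUT "localhost:9200/_cluster/settings" \
  -H 'Content-Type: application/json' \
  -d '{"transient": {"logger.org.elasticsearch.snapshots": "TRACE"}}'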

I don't completely understand the issue you're trying to describe, but this one line stands out as a bad idea, for two reasons:

  1. between deleting one snapshot and completing the next you have no snapshot. If something fails in that period then you have lost your data.

  2. snapshots are incremental, so the second snapshot is usually much quicker than the first.

For both of these reasons you should delete a snapshot after creating the next one, not before.
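
A minimal sketch of that ordering (repository name from the logs above; the snapshot names, index name and localhost endpoint are assumptions):

# take today's snapshot first...
curl -s -X PUT "localhost:9200/_snapshot/my_backup/index_a-20190830?wait_for_completion=true" \
  -H 'Content-Type: application/json' \
  -d '{"indices": "index_a", "include_global_state": false}'
# ...and only delete the older one once the new snapshot has succeeded
curl -s -X DELETE "localhost:9200/_snapshot/my_backup/index_a-20190829"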

Thanks, you are right, I was unclear.

So I delete the old snapshots, not all snapshots. The frontend app rotates the indices, and I delete the snapshots that are 5 days old, so for the currently active indices I have 4-5 snapshots each.
E.g.: Index A - snapshots 0901, 0902, 0903, 0904 (date, as month and day)...

And my problem, which is simpler to describe in your words:
The snapshots work well; the first one is full and the later ones are incremental. Because the frontend rotates the active write index, most of the snapshots run for only 1-2 seconds.
BUT if I restart one server and then run my script, it takes incremental snapshots,
EXCEPT for the restarted server's indices (where a replica shard of an index is on the restarted server); for those, Elasticsearch takes a full snapshot instead of an incremental one.

Ok, I think I understand a little better.

This seems bad. I'm pretty sure strange things will happen if you start Elasticsearch before mounting the repository.

Can you try to verify your repository before starting the snapshot? I.e. run POST /_snapshot/$REPOSITORY_NAME/_verify first and check for success.
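
For example, with the "my_backup" repository from your logs, that check could look like this (the curl form and localhost endpoint are assumptions); the call should return the list of nodes that verified the repository, and an error otherwise:

# verify the repository on all nodes before taking snapshots
curl -s -X POST "localhost:9200/_snapshot/my_backup/_verify"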

// The system makes the same error when I only restart the Elasticsearch service, so the NFS share was available the whole time.

It seems the verify was the solution.
I ran it manually, and now the script takes only incremental snapshots.

Thank you!
