Feasible snapshot solution for 30 GB of data that keeps growing every day

I want to know how I can take differential snapshots in Elasticsearch, and how they work.

We receive around 30 GB of data monthly across all our Elasticsearch indices. A few indices are updated daily, and the data in a few indices is purged after a certain retention period. So I was thinking of going with incremental snapshots, so that snapshots do not take long and only modified data goes into each snapshot. But I don't know how this works, and whether it will be feasible in my case.

Could you please help me design a snapshot process that can run permanently and will not degrade over time?


Hi @priyanka10

All snapshots to a single repository are incremental; you will not have to do anything special to take differential snapshots. If you just take snapshots to the same repository, each snapshot will try to reuse as much data as possible from prior snapshots automatically.
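For example, here is a minimal sketch of what that looks like in practice. The repository name my_backup and the filesystem location are placeholders, and an fs repository additionally requires the location to be listed under path.repo on every node:

    # Register a repository once (a shared-filesystem repository here;
    # S3, Azure, GCS etc. work the same way via repository plugins).
    PUT _snapshot/my_backup
    {
      "type": "fs",
      "settings": {
        "location": "/mount/backups/my_backup"
      }
    }

    # Then just keep taking snapshots into it. The first snapshot uploads
    # everything; later ones only upload segments not already in the repository.
    PUT _snapshot/my_backup/snapshot-1?wait_for_completion=true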

Thanks Armin for the reply. Could you please help us understand your statement:

"If you just take snapshots to the same repository, each snapshot will try to reuse as much data as possible from prior snapshots automatically."

Here is what we understand. Suppose there is one index which initially contains three records, as below.

    {
      "_index": "test_inx",
      "_type": "doc",
      "_id": "2",
      "_score": 1,
      "_source": {
        "Empid": "2",
        "Name": "BCD"
      }
    },
    {
      "_index": "test_inx",
      "_type": "doc",
      "_id": "1",
      "_score": 1,
      "_source": {
        "Empid": "1",
        "Name": "ABC"
      }
    },
    {
      "_index": "test_inx",
      "_type": "doc",
      "_id": "3",
      "_score": 1,
      "_source": {
        "Empid": "3",
        "Name": "EFG"
      }
    }

Now, we take a snapshot of the above index into snapshot-1.
The next day, a few new records are inserted (records with EmpId 4 and 5) and one record is updated (the record with EmpId 2), so the final state of the index is as below.

    {
      "_index": "test_inx",
      "_type": "doc",
      "_id": "2",
      "_score": 1,
      "_source": {
        "Empid": "2",
        "Name": "NEW_BCD"
      }
    },
    {
      "_index": "test_inx",
      "_type": "doc",
      "_id": "1",
      "_score": 1,
      "_source": {
        "Empid": "1",
        "Name": "ABC"
      }
    },
    {
      "_index": "test_inx",
      "_type": "doc",
      "_id": "3",
      "_score": 1,
      "_source": {
        "Empid": "3",
        "Name": "EFG"
      }
    },
    {
      "_index": "test_inx",
      "_type": "doc",
      "_id": "4",
      "_score": 1,
      "_source": {
        "Empid": "4",
        "Name": "LMN"
      }
    },
    {
      "_index": "test_inx",
      "_type": "doc",
      "_id": "5",
      "_score": 1,
      "_source": {
        "Empid": "5",
        "Name": "XYZ"
      }
    }

Next, we take a new snapshot, snapshot-2, into the same repository.
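For reference, the calls we are running look roughly like this (assuming a repository named my_backup that is already registered):

    PUT _snapshot/my_backup/snapshot-1?wait_for_completion=true
    {
      "indices": "test_inx"
    }

    # ...after the next day's inserts and the update...

    PUT _snapshot/my_backup/snapshot-2?wait_for_completion=true
    {
      "indices": "test_inx"
    }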

Now, my concern is what happens in the background. Since the records with EmpId 1 and 3 are unchanged between the two snapshots, are they stored again in snapshot-2? Or, as per your statement that it reuses as much data as possible from prior snapshots, will it take these two records from snapshot-1 when we try to restore snapshot-2?

And if it does store all records in snapshot-2 again, is it possible to delete snapshot-1? Otherwise the repository size will keep growing over time if it keeps the same records multiple times.

First off, the unit of snapshotting isn't individual documents but rather Lucene segments, which can be roughly interpreted as groups of documents. So the level of incrementality isn't that granular. Still, the points your question raises remain logically the same; I just figured I'd point this out:

Yes, roughly speaking (if there are Lucene merges in the meantime, or a primary failover, this may not always hold), this is true: unchanged segments/documents will be reused in cases like this one, and the segment containing documents 1 and 3 will not be re-uploaded to the repository.
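You can actually observe this via the snapshot status API (again assuming the repository is called my_backup; in recent versions the per-snapshot stats split out the incremental part):

    # The "stats" in the response contain an "incremental" section (files
    # and bytes actually uploaded by this snapshot) next to "total" (all
    # files the snapshot references). For snapshot-2 the incremental part
    # will be much smaller than the total.
    GET _snapshot/my_backup/snapshot-2/_status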

Yes, you can delete snapshots as you see fit. The snapshot functionality will then simply remove the data that was only referenced by the deleted snapshot (in your example, snapshot-1) but will leave the data that is still required by other snapshots in the repository. The repository will not needlessly keep files around that aren't used by any snapshot.
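Concretely, continuing your example (once more assuming the repository is named my_backup):

    # Removes snapshot-1's metadata and any segment files referenced only
    # by it. Segments that snapshot-2 still needs stay in the repository,
    # so snapshot-2 remains fully restorable on its own.
    DELETE _snapshot/my_backup/snapshot-1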


If you want to automate this at the cron level, check this out as well:

https://discuss.elastic.co/t/automatic-elasticsearch-snapshot-backup/173223
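On Elasticsearch 7.4 and later you can also skip cron entirely and use snapshot lifecycle management. A sketch, with the schedule, policy name, and retention values as placeholders to adapt:

    # Take a snapshot every night at 01:30 and prune old ones automatically.
    PUT _slm/policy/nightly-snapshots
    {
      "schedule": "0 30 1 * * ?",
      "name": "<nightly-snap-{now/d}>",
      "repository": "my_backup",
      "config": {
        "indices": ["*"]
      },
      "retention": {
        "expire_after": "30d",
        "min_count": 5,
        "max_count": 50
      }
    }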

Thank you Armin, now it is clear 🙂

Sure, we will check it and let you know in case of any concerns.
