Unforeseen data loss in ES Cluster. Need help recovering it!

Our scenario below.

Self-Managed ES 3-node cluster on AWS EC2. How can we recover this data loss described below? Can your team of experts help?

  1. The loss of all user data was confirmed by looking at Elasticsearch and seeing that the User index is empty.
  2. There is a Lambda job, run hourly, that backs up all database tables; it has not operated properly since September 2019 (every table except the user table has had its data backed up to S3).
  • It remains unclear why this is the case. Nothing has changed with the job nor with the User index in Elasticsearch, so something else must have changed.
  • This fix should be priority #1 after the User table is back to normal.
  3. We checked the Lambda job's logs to find the period when the user table became empty: sometime between 6:38 and 7:38 UTC (log lines below).
  4. We checked the Elasticsearch logs to see if there was any visible cause for the data loss. There was none.
  5. We did find that the index was recreated at 10:50 UTC. Since it is a new index, its size is 0 bytes, as expected. (log lines below)
  6. There is no smoking gun for why the User index was deleted.
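Before going further, it may be worth confirming whether any Elasticsearch-level snapshots exist independently of the Lambda/S3 job. This is a hedged sketch assuming the cluster is reachable on localhost:9200 (adjust host and port for your nodes); `my_backup` is a hypothetical repository name:

```shell
# List all registered snapshot repositories (ES 2.x snapshot API).
curl -s "localhost:9200/_snapshot/_all?pretty" || echo "cluster not reachable"

# List all snapshots in a repository named "my_backup" (hypothetical name;
# substitute whatever the first command shows).
curl -s "localhost:9200/_snapshot/my_backup/_all?pretty" || echo "cluster not reachable"
```

If no repositories are registered, the S3 snapshots from the Lambda job are the only possible restore source.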

Lambda Job Logs (evidence of loss)

  • 6:38UTC - 802751f5-46bd-4b17-b56a-f20d98ac79e1 Snapshotting users_2018-03-23 - count: 1112408, size: 575540262
  • 7:38UTC - 7ac10956-18e8-4f67-83ee-e6fa9102a112 Snapshotting users_2018-03-23 - count: 0, size: 1590

Elasticsearch Logs (Index Recreated)

  • [2020-03-25 10:50:06,595][INFO ][cluster.metadata ] [ip-10-2-3-179] [users_2018-03-23] creating index, cause [api], templates , shards [5]/[1], mappings [role, user, token]
  • [2020-03-25 10:50:07,248][INFO ][cluster.routing.allocation] [ip-10-2-3-179] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[users_2018-03-23][4], [users_2018-03-23][2], [users_2018-03-23][0], [users_2018-03-23][3], [users_2018-03-23][1]] ...]).
  • [2020-03-25 10:50:08,239][INFO ][cluster.routing.allocation] [ip-10-2-3-179] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[users_2018-03-23][3], [users_2018-03-23][1], [users_2018-03-23][4]] ...]).

If the index was explicitly deleted it should be in the logs on one of the nodes.

Which version of Elasticsearch are you using? Is the cluster correctly configured? Was it only one index that was deleted?

We are using an older version, 2.x or something. I can confirm with my tech team.
Not sure if the cluster is configured correctly; we could use some help. Yes, one index (Users) was deleted.

I'm a bit confused by:

There is a lambda job that runs every hour that backs up all database tables that has not operated properly since September 2019 (every table except the user table has its data backed up to S3)

Your logs suggest that this has actually been backed up:

6:38UTC - 802751f5-46bd-4b17-b56a-f20d98ac79e1 Snapshotting users_2018-03-23 - count: 1112408, size: 575540262

But it is not part of the backup in the end?

We can't find any backups since 9/1/19. Not sure why they are missing. Everything seems normal. Can we see the log files for the ES cluster to check whether there was an explicit delete statement? Where does ES write its logs?

We need someone with the skills to look into the environment and help recover the data if possible.

The logging directory will depend on the installation mechanism you used (DEB/RPM behaves differently than unpacking a TAR.GZ, for example), but I would start looking in /var/log/elasticsearch/.
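Once you've found the log directory, a quick way to look for an explicit deletion is to grep across all node logs. A sketch, assuming a package install path and 2.x-style log messages:

```shell
# In ES 2.x an explicit index deletion is logged by cluster.metadata as
# "[users_2018-03-23] deleting index". Search all node logs for it:
grep -ri "deleting index" /var/log/elasticsearch/ 2>/dev/null \
  || echo "no 'deleting index' entries found (or the log path differs on this install)"
```

Run this on each of the three nodes; the entry will only appear on the node that was master at the time.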

If you don't have a snapshot and the data was deleted on all the nodes, you may be able to retrieve it from the filesystem (if it hasn't been overwritten yet), but that is an operating-system-level recovery problem. Otherwise no amount of skill will be able to recover that data.
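As a first step down that filesystem route, you could check what is left on disk for the index. A sketch, assuming the default DEB/RPM data path (it may differ on your nodes); note that the recreated empty index shares the same name, so tiny sizes here suggest the old segment files are already gone:

```shell
# In ES 2.x, index data lives under <path.data>/<cluster_name>/nodes/<n>/indices/<index>.
# Show on-disk size for the users index on this node (default package data path assumed):
du -sh /var/lib/elasticsearch/*/nodes/*/indices/users_2018-03-23 2>/dev/null \
  || echo "index directory not found under the default data path"
```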

If you cannot find any evidence in the logs that the index was deliberately deleted (I am not sure whether this has always been logged, and you are running a very, very old version), it is worth verifying that your cluster is correctly configured and that minimum_master_nodes is set to 2 (assuming all three of your nodes are master-eligible). Incorrect configuration can lead to split-brain and data loss. This is perhaps less likely, though, given that only a single index seems to have been deleted.
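For reference, on a 3-node ES 2.x cluster where all nodes are master-eligible, that setting in each node's elasticsearch.yml would look like this (a config sketch: quorum = number of master-eligible nodes / 2 + 1):

```yaml
# elasticsearch.yml (ES 2.x): with 3 master-eligible nodes, quorum = 3/2 + 1 = 2.
# This prevents a partitioned node from electing itself master (split-brain).
discovery.zen.minimum_master_nodes: 2
```

The setting must be the same on all three nodes and requires a restart (or a dynamic cluster settings update) to take effect.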