Unforeseen data loss in ES Cluster. Need help recovering it!

Our scenario below.

Self-Managed ES 3-node cluster on AWS EC2. How can we recover this data loss described below? Can your team of experts help?

  1. The loss of all user data was confirmed by looking at Elasticsearch and seeing that the User index is empty.
  2. There is a Lambda job, run hourly, that backs up all database tables; it has not operated properly since September 2019 (every table except the user table has had its data backed up to S3).
  • It remains unclear why this is the case. Nothing has changed with the job nor with the User index in Elasticsearch, so something else must have changed.
  • This fix should be priority #1 after the User table is back to normal.
  3. We checked the Lambda job's logs to find the period when the user table became empty: sometime between 6:38 and 7:38 UTC (log lines below).
  4. We checked the Elasticsearch logs to see if there was any visible cause for the data loss. There was none.
  5. We did find that the index was recreated at 10:50 UTC. Since it is a new index, its size is 0 bytes, as expected. (log lines below)
  6. There is no smoking gun for why the User index was deleted.
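Before going further, it may be worth confirming whether any Elasticsearch-level snapshots exist independently of the Lambda/S3 job. This is a hedged sketch assuming the cluster is reachable on localhost:9200 (adjust host and port for your nodes); `my_backup` is a hypothetical repository name:

```shell
# List all registered snapshot repositories (ES 2.x snapshot API).
curl -s "localhost:9200/_snapshot/_all?pretty" || echo "cluster not reachable"

# List all snapshots in a repository named "my_backup" (hypothetical name;
# substitute whatever the first command shows).
curl -s "localhost:9200/_snapshot/my_backup/_all?pretty" || echo "cluster not reachable"
```

If no repositories are registered, the S3 snapshots from the Lambda job are the only possible restore source.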

Lambda Job Logs (evidence of loss)

  • 6:38UTC - 802751f5-46bd-4b17-b56a-f20d98ac79e1 Snapshotting users_2018-03-23 - count: 1112408, size: 575540262
  • 7:38UTC - 7ac10956-18e8-4f67-83ee-e6fa9102a112 Snapshotting users_2018-03-23 - count: 0, size: 1590

Elasticsearch Logs (Index Recreated)

  • [2020-03-25 10:50:06,595][INFO ][cluster.metadata ] [ip-10-2-3-179] [users_2018-03-23] creating index, cause [api], templates , shards [5]/[1], mappings [role, user, token]
  • [2020-03-25 10:50:07,248][INFO ][cluster.routing.allocation] [ip-10-2-3-179] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[users_2018-03-23][4], [users_2018-03-23][2], [users_2018-03-23][0], [users_2018-03-23][3], [users_2018-03-23][1]] ...]).
  • [2020-03-25 10:50:08,239][INFO ][cluster.routing.allocation] [ip-10-2-3-179] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[users_2018-03-23][3], [users_2018-03-23][1], [users_2018-03-23][4]] ...]).

If the index was explicitly deleted it should be in the logs on one of the nodes.

Which version of Elasticsearch are you using? Is the cluster correctly configured? Was it only one index that was deleted?

We are using an older version, 2.x or something. I can confirm with my tech team.
Not sure if the cluster is configured correctly; we could use some help. Yes, one index (Users) was deleted.

I'm a bit confused by:

There is a lambda job that runs every hour that backs up all database tables that has not operated properly since September 2019 (every table except the user table has its data backed up to S3)

Your logs suggest that this has actually been backed up:

6:38UTC - 802751f5-46bd-4b17-b56a-f20d98ac79e1 Snapshotting users_2018-03-23 - count: 1112408, size: 575540262

But it is not part of the backup in the end?

We can't find any backups since 9/1/19. Not sure why they are missing. Everything seems normal. Can we see the log files for the ES cluster to check whether there was an explicit delete statement? Where does ES write its logs?

We need someone with the skills to look into the environment and help recover the data if possible.

The logging directory will depend on the installation mechanism you used (DEB/RPM behaves differently than unpacking a TAR.GZ, for example), but I would start looking in /var/log/elasticsearch/.
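Once you've found the log directory, a quick way to look for an explicit deletion is to grep across all node logs. A sketch, assuming a package install path and 2.x-style log messages:

```shell
# In ES 2.x an explicit index deletion is logged by cluster.metadata as
# "[users_2018-03-23] deleting index". Search all node logs for it:
grep -ri "deleting index" /var/log/elasticsearch/ 2>/dev/null \
  || echo "no 'deleting index' entries found (or the log path differs on this install)"
```

Run this on each of the three nodes; the entry will only appear on the node that was master at the time.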

If you don't have a snapshot and the data was deleted on all the nodes, you may be able to retrieve it from the filesystem (if it hasn't been overwritten yet), but that is an operating-system-level recovery problem. Otherwise no amount of skill will be able to recover that data.
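As a first step down that filesystem route, you could check what is left on disk for the index. A sketch, assuming the default DEB/RPM data path (it may differ on your nodes); note that the recreated empty index shares the same name, so tiny sizes here suggest the old segment files are already gone:

```shell
# In ES 2.x, index data lives under <path.data>/<cluster_name>/nodes/<n>/indices/<index>.
# Show on-disk size for the users index on this node (default package data path assumed):
du -sh /var/lib/elasticsearch/*/nodes/*/indices/users_2018-03-23 2>/dev/null \
  || echo "index directory not found under the default data path"
```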

If you cannot find any evidence in the logs that the index was deliberately deleted (I am not sure whether this has always been logged, and you are running a very, very old version), it is worth verifying that your cluster is correctly configured and that minimum_master_nodes is set to 2 (assuming all three of your nodes are master-eligible). Incorrect configuration can lead to split-brain and data loss. This is perhaps less likely, though, given that only a single index seems to have been deleted.
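For reference, on a 3-node ES 2.x cluster where all nodes are master-eligible, that setting in each node's elasticsearch.yml would look like this (a config sketch: quorum = number of master-eligible nodes / 2 + 1):

```yaml
# elasticsearch.yml (ES 2.x): with 3 master-eligible nodes, quorum = 3/2 + 1 = 2.
# This prevents a partitioned node from electing itself master (split-brain).
discovery.zen.minimum_master_nodes: 2
```

The setting must be the same on all three nodes and requires a restart (or a dynamic cluster settings update) to take effect.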