Self-managed ES 3-node cluster on AWS EC2. How can we recover from the data loss described below? Can your team of experts help?
The loss of all user data was confirmed by looking at Elasticsearch and seeing that the User index is empty.
There is a Lambda job that runs every hour and backs up all database tables, but it has not operated properly since September 2019 (every table except the user table has its data backed up to S3).
It remains unclear why this is the case. Nothing has changed with the job or with the User index in Elasticsearch, so something else must have changed.
This fix should be priority #1 after the User table is back to normal.
We checked the Lambda job's logs to find the period when the user table became empty: sometime between 06:38 and 07:38 UTC (log lines below).
We checked the elasticsearch logs to see if there was any visible cause for the data loss. There was none.
We did find that the index was recreated at 10:50 UTC. Since it is a new index, the size of the index is 0 bytes, as expected. (log lines below)
There is no smoking gun for why the User index was deleted
[2020-03-25 10:50:07,248][INFO ][cluster.routing.allocation] [ip-10-2-3-179] Cluster health status changed from [RED] to [YELLOW] (reason: [shards started [[users_2018-03-23][4], [users_2018-03-23][2], [users_2018-03-23][0], [users_2018-03-23][3], [users_2018-03-23][1]] ...]).
[2020-03-25 10:50:08,239][INFO ][cluster.routing.allocation] [ip-10-2-3-179] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[users_2018-03-23][3], [users_2018-03-23][1], [users_2018-03-23][4]] ...]).
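For reference, the recreation time and the empty document count can also be cross-checked from the API rather than only from the log lines. A minimal sketch, assuming the index is named users_2018-03-23 (as in the logs above) and that a node answers on localhost:9200; adjust both to your environment:

```python
import datetime
import requests  # pip install requests

ES = "http://localhost:9200"    # assumption: point this at one of your nodes
INDEX = "users_2018-03-23"      # index name taken from the log lines above

# Document count of the (re)created index -- expected to be 0 here.
count = requests.get(f"{ES}/{INDEX}/_count").json()["count"]

# The index settings include the creation timestamp in epoch milliseconds.
settings = requests.get(f"{ES}/{INDEX}/_settings").json()
created_ms = int(settings[INDEX]["settings"]["index"]["creation_date"])
created = datetime.datetime.utcfromtimestamp(created_ms / 1000)

print(f"{INDEX}: {count} docs, created {created} UTC")
```

If the creation date comes back as roughly 10:50 UTC on 2020-03-25, that confirms the index you see now is a brand-new one rather than the original index with its documents removed.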
We are using an older version, 2.x or something. I can confirm with my tech team.
Not sure if the cluster is configured correctly. We could use some help. Yes, one index (Users) was deleted.
There is a Lambda job that runs every hour and backs up all database tables, but it has not operated properly since September 2019 (every table except the user table has its data backed up to S3).
Your logs suggest that this has actually been backed up:
We can't find any backups since 9/1/19. Not sure why they are missing. Everything seems normal. Can we see the log files for the ES cluster to check whether there was an explicit delete statement? Where does ES write its logs?
We need someone with the skills to look into the environment and help recover the data if possible.
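One quick way to confirm when the backups actually stopped is to list the newest objects under the backup location in S3. A minimal sketch using boto3, with a hypothetical bucket and prefix (replace them with whatever your Lambda actually writes to):

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-backup-bucket"      # hypothetical name -- use your real backup bucket
PREFIX = "table-backups/users/"  # hypothetical prefix for the user table dumps

# Collect every object under the prefix, then show the most recent ones.
objects = []
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=PREFIX):
    objects.extend(page.get("Contents", []))

for obj in sorted(objects, key=lambda o: o["LastModified"], reverse=True)[:10]:
    print(obj["LastModified"], obj["Size"], obj["Key"])
```

If nothing newer than September 2019 shows up under the user-table prefix while the other tables keep receiving objects, that narrows the problem to how the Lambda handles that one table rather than to S3 itself.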
The logging directory will depend on the installation mechanism you have used (DEB / RPM will behave differently than unpacking a TAR.GZ for example), but I would start looking in /var/log/elasticsearch/.
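If the log files are still on disk, it may also be worth scanning them on every node for any line that mentions the users index together with a delete; depending on the version, the elected master typically logs an INFO-level "deleting index" message when an index is removed through the API. A minimal sketch, assuming the default /var/log/elasticsearch/ location (run it on each node, since only the node that was master at the time will have the line):

```python
import glob
import gzip
import os

LOG_DIR = "/var/log/elasticsearch"       # default for DEB/RPM installs; adjust if needed
NEEDLES = ("delet", "users_2018-03-23")  # matches "delete"/"deleting" plus the index name

for path in sorted(glob.glob(os.path.join(LOG_DIR, "*"))):
    opener = gzip.open if path.endswith(".gz") else open
    try:
        with opener(path, "rt", errors="replace") as fh:
            for line in fh:
                low = line.lower()
                if all(needle in low for needle in NEEDLES):
                    print(f"{path}: {line.rstrip()}")
    except OSError:
        pass  # skip directories and unreadable files
```

A hit here would point to an explicit, API-level delete (a client, a cleanup job, or an unsecured HTTP endpoint); no hit makes a configuration or node-level explanation more plausible.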
If you don't have a snapshot and the data was deleted on all the nodes, you might be able to retrieve it from the filesystem (if it hasn't been overwritten yet), but that is more of an operating-system problem. Otherwise, no amount of skill will be able to recover that data.
If you cannot find any evidence in the logs that the index was deliberately deleted (I am not sure whether this has always been logged, and you are running a very, very old version), it is worth verifying that your cluster is correctly configured and that minimum_master_nodes is set to 2 (assuming all three of your nodes are master-eligible). Incorrect configuration can lead to split-brain and data loss. This is perhaps less likely, though, given that only a single index seems to have been deleted.
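For a three-node cluster where every node is master-eligible, that means discovery.zen.minimum_master_nodes: 2 in each node's elasticsearch.yml (in 2.x it can also be set dynamically via the cluster settings API). A minimal sketch for checking what each node is actually running with, assuming the nodes-info API exposes the node settings (the exact JSON shape can vary between releases, so the sketch just searches for the key):

```python
import requests  # pip install requests

ES = "http://localhost:9200"  # assumption: point this at one of your nodes

def find_key(obj, key):
    """Recursively look for `key` anywhere in a nested JSON structure."""
    if isinstance(obj, dict):
        if key in obj:
            return obj[key]
        for value in obj.values():
            found = find_key(value, key)
            if found is not None:
                return found
    elif isinstance(obj, list):
        for value in obj:
            found = find_key(value, key)
            if found is not None:
                return found
    return None

nodes = requests.get(f"{ES}/_nodes/settings").json()["nodes"]
for node_id, info in nodes.items():
    value = find_key(info.get("settings", {}), "minimum_master_nodes")
    if value is None:
        value = "not set (defaults to 1)"
    print(f'{info["name"]}: minimum_master_nodes = {value}')
```

Setting it to 2 on a three-node cluster means a partitioned minority node cannot elect itself master and keep accepting writes (or deletes) on its own, which is the classic split-brain scenario.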