So I upgraded to 1.6.0 and all the data disappeared! Gone! Nada! Zilch!

javadevmtl · June 11, 2015, 7:04pm

Ok I'm in a test environment thankfully. Maybe how I upgraded?

I'm updating from 1.5.2 to 1.6.0
Running on Windows 2008 R2

The setup is 4 dedicated data nodes and 1 dedicated master (Plus running extra "sites" client node off master machine). That's all I have as machines.
I updated the data nodes first. Followed the recommended upgrade process. That went ok without a hitch!

So then I shut down the master, updated it to 1.6.0, restarted it, and boom: all data was deleted! It's like it doesn't exist. It's not even on the drives! Gone, like it disappeared into thin air! 1 billion records down the drain!

Checked my RAID arrays just in case. They seem ok. But highly doubt that I would get 4 arrays dead in one shot lol.

I posted the logs from my master and 2 of the nodes. There doesn't seem to be anything out of the ordinary except for the master node waiting to get a state back from the nodes.

You will see that when the data node came back up it knew that 2.2TBs were being used of the .2TBs. Now all drives on all nodes are basically empty!

nik9000 · June 11, 2015, 7:29pm

Having a single node with master=true seems like you are asking for trouble. That's no excuse for losing your data though.

I think this line from the master log is pretty telling:

[2015-06-11 14:31:49,289][INFO ][gateway                  ] [MY ES MASTER 01 (Master)] recovered [0] indices into cluster_state

Did you happen to delete the master's data directory as part of the upgrade process? I haven't checked the code, but I suspect that with only a single master, Elasticsearch will trust what is in its master's data directory over the slaves. At least that would explain what you are seeing.
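
A minimal sketch of how one might scan a master log for that symptom. The pattern is derived only from the single log line quoted above, so treating it as the format for other versions is an assumption, and `recovered_index_count` is a hypothetical helper name, not part of Elasticsearch.

```python
import re

# Pattern built from the one gateway line quoted in this thread.
# Assumption: other versions may log this differently.
RECOVERED = re.compile(
    r"\[gateway\s*\]\s*\[(?P<node>[^\]]+)\]\s*"
    r"recovered \[(?P<count>\d+)\] indices into cluster_state"
)


def recovered_index_count(log_line):
    """Return how many indices the master recovered, or None if the line does not match."""
    match = RECOVERED.search(log_line)
    return int(match.group("count")) if match else None


line = (
    "[2015-06-11 14:31:49,289][INFO ][gateway                  ] "
    "[MY ES MASTER 01 (Master)] recovered [0] indices into cluster_state"
)

if recovered_index_count(line) == 0:
    print("master started with an empty cluster state")
```

A count of zero on a cluster that is known to hold indices is the warning sign discussed here.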

javadevmtl · June 11, 2015, 7:33pm

Unfortunately that's all I have in my test environment. I know it's not desired but worst case the cluster downgrades to basic mode right? I'm willing to accept that in test env.

For the master, I unzipped 1.6.0 and copied over the elasticsearch.yml, but not the data folder from 1.5.2. But would this cause the nodes to wipe out everything like that?

jrgns · June 12, 2015, 6:55am

Hey

Did you check the data folder? Is the data still on the disk, and Elasticsearch is not picking it up?

Is it possible that the 1.6 install is looking for the data in a different location?

dadoonet · June 12, 2015, 7:09am

Agreed. Also check that you are using the same cluster name.

javadevmtl · June 12, 2015, 12:37pm

No, the actual data folder got wiped clean! That's the weird part!

I did a rolling upgrade of all the data nodes no problem.

Then I finally shut down the master, unzipped 1.6.0 for the master, copied elasticsearch.yml from the 1.5.2 master, and restarted. I forgot to copy the data folder from the 1.5.2 master to 1.6.0, but that's all. You can see in the logs that the data nodes had 2.5TB of data on startup.

dadoonet · June 12, 2015, 1:27pm

I'm not sure about what happened but IMO doing a rolling upgrade with one single master node could lead to errors.
If you don't have more than one master, then I'd do a full cluster upgrade.

I tried to simulate what you did with one single master node and two data only nodes.

Started the cluster with 1.5.2
Created an index with a doc
Stopped node1, upgraded it to 1.6.0 and restarted (same cluster name, same path...), waited for green
Stopped node2, upgraded it to 1.6.0 and restarted (same cluster name, same path...), waited for green
Stopped master, upgraded it to 1.6.0 and restarted (same cluster name, same path...)
GET my document back

Everything went well. So I have no idea about what happened in your case.
Are you sure you waited until all shards/indices were correctly restarted at each step?
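
A rough sketch of the "wait for green" check between steps, using the cluster health API with its `wait_for_status` and `timeout` parameters. The base URL `http://localhost:9200`, the 60s timeout, and the helper names are assumptions for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def health_url(base_url, timeout="60s"):
    """Build a cluster health URL that blocks until green or until the timeout."""
    query = urlencode({"wait_for_status": "green", "timeout": timeout})
    return f"{base_url.rstrip('/')}/_cluster/health?{query}"


def safe_to_continue(health):
    """True only if the cluster reported green and the wait did not time out."""
    return health.get("status") == "green" and not health.get("timed_out", True)


def wait_for_green(base_url="http://localhost:9200"):
    """Ask the cluster for its health and decide whether the next node can be upgraded."""
    with urlopen(health_url(base_url)) as response:
        return safe_to_continue(json.load(response))
```

Calling `wait_for_green()` after each node restart, and stopping the upgrade if it returns `False`, is one way to make sure a step is not skipped.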

jprante · June 12, 2015, 1:27pm

But that is the problem. You had a single master and did not transport the cluster state to the new version. Then it must be gone.

dadoonet · June 12, 2015, 1:28pm

Ha! Thanks Jörg! I missed that part!

javadevmtl · June 12, 2015, 2:09pm

Yes, I understand I should have more than one master, but this is a test environment and that's all I have as machines for now. And I accept that. In the worst case, if I lose my master in my test environment, I would expect the cluster to just not accept any requests until the master comes back.

If we lose the cluster state like I did, should that prompt the data nodes to just completely wipe out the data (if that's even the case)? If anything there should be a reconciliation phase, where the master doesn't enable the cluster until we can tell it what to do.

jprante · June 12, 2015, 3:12pm

I agree that is a troubling situation. The conflict arises when you explicitly tell it "master node of 1.6, now please start, but with empty cluster state." This is meant to override previous cluster setups.

Data nodes have no clue about past master node setups. They are passive, they do not even persist cluster states. Although they may have index data present, the master commands them "here is the new empty cluster state, forget all before". And that leads to cleaning up everything that exists.

Why not promote some of the data nodes to master nodes? If you had (at least) three master nodes, and minimum master nodes set to 2 (which I recommend for production), you would have to repeat the mistake on at least three master node setups before everything is erased. In that case, the chance is high that at least one master node would have survived and kept the previous cluster state on disk. There is a chance the cluster startup halts with an error message, or even continues if the masters decide to continue with the saved state.
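
The "three master nodes, minimum master nodes set to 2" figure follows the usual quorum rule for `discovery.zen.minimum_master_nodes`: (master-eligible nodes / 2) + 1. A small sketch of that arithmetic, with a hypothetical helper name:

```python
def minimum_master_nodes(master_eligible):
    """Quorum of master-eligible nodes: integer division by two, plus one."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


for nodes in (1, 3, 5):
    print(nodes, "master-eligible ->", minimum_master_nodes(nodes))
```

With a single master-eligible node the quorum is 1, which is why that one node's (empty) state is accepted without any second opinion.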

javadevmtl · June 12, 2015, 3:40pm

I used to have that setup before where data nodes could be eligible masters. I wanted to test dedicated master.

The latter seems more stable, but I have to do more testing to see if it's true. The cluster seems to feel more stable and perform better with a dedicated master. Which underlines the fact that dedicated masters are important, as is, of course, having more than one.

dadoonet · June 12, 2015, 4:04pm

A dedicated master works very well when you have a cluster with, for example, more than 10 data nodes, or when your cluster is heavily used (CPU/memory for example). But yeah, you need to have at least 3 master-eligible nodes.

javadevmtl · June 12, 2015, 5:12pm

Yes I'm quite high on the memory usage. Right now with 6 "monthly" indexes of 8 shards + replicas each and 1 billion records I'm at 15GB of RAM per node. Using doc values everywhere I can.

I expect a bit of growth for next year also.

Clinton_Gormley · June 12, 2015, 5:51pm

Sorry for the data loss. We've got a change coming in 2.0 that would prevent this situation, even with only one dedicated master:

javadevmtl · June 12, 2015, 6:48pm

Cool, thanks. I think the docs should also mention how to upgrade master nodes, or at least remind people who unzip a new version and then copy elasticsearch.yml that, if the config doesn't state a specific data folder location, they need to copy the data folder over for the master as well.
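
A minimal sketch of that reminder as a pre-upgrade check: it looks for a `path.data` setting in the text of elasticsearch.yml. It assumes the flat `path.data: ...` form on one line and does not handle nested or quoted YAML; `configured_data_path` is a hypothetical helper name.

```python
def configured_data_path(yml_text):
    """Return the path.data value from flat 'key: value' lines, or None if it is not set."""
    for raw in yml_text.splitlines():
        # Drop comments; assumes the path itself contains no '#'.
        line = raw.split("#", 1)[0].strip()
        if line.startswith("path.data:"):
            return line.split(":", 1)[1].strip() or None
    return None


sample = "cluster.name: test\n# path.data: /old/location\n"
if configured_data_path(sample) is None:
    print("path.data is not set: copy the old data folder before starting the new version")
```

If the check finds no explicit `path.data`, the data lives inside the old install directory and has to be carried over by hand.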
