OK, I'm in a test environment, thankfully. Maybe it's how I upgraded?
I'm updating from 1.5.2 to 1.6.0
Running on Windows 2008 R2
The setup is 4 dedicated data nodes and 1 dedicated master (plus an extra "sites" client node running off the master machine). That's all I have as machines.
I updated the data nodes first. Followed the recommended upgrade process. That went ok without a hitch!
So then I shut down the master, updated it to 1.6.0, restarted it and boom, all data was deleted! It's like it doesn't exist. It's not even on the drives! Gone, like it disappeared into thin air! 1 billion records down the drain!
Checked my RAID arrays just in case. They seem ok. But I highly doubt that I would get 4 arrays dead in one shot lol.
I posted my master and 2 of the nodes logs. There doesn't seem to be anything out of the ordinary except for the master node waiting to get a state back from the nodes.
You will see that when the data node came back up it knew that 2.2TBs were being used of the .2TBs. Now all drives on all nodes are basically empty!
Having a single node with master=true seems like you are asking for trouble. That's no excuse for losing your data though.
I think this line from the master log is pretty telling:
[2015-06-11 14:31:49,289][INFO ][gateway ] [MY ES MASTER 01 (Master)] recovered  indices into cluster_state
Did you happen to delete the master's data directory as part of the upgrade process? I haven't checked the code, but I suspect that with only a single master, Elasticsearch will trust what is in its master's data directory over the slaves. At least that would explain what you are seeing.
Unfortunately that's all I have in my test environment. I know it's not desired but worst case the cluster downgrades to basic mode right? I'm willing to accept that in test env.
For the master, I unzipped 1.6.0 and copied over the elasticsearch.yml, but not the data folder from 1.5.2. But would this cause the nodes to wipe out everything like that?
Did you check the data folder? Is the data still on the disk, and Elasticsearch is not picking it up?
Is it possible that the 1.6 install is looking for the data in a different location?
Agreed. Also check that you are using same cluster name.
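The two checks suggested above (same data location, same cluster name) can be compared mechanically. A minimal sketch, assuming both elasticsearch.yml files use flat `key: value` lines such as `cluster.name: ...` and `path.data: ...`; the helper names `parse_flat_settings` and `diff_settings` are made up for this example and are not part of Elasticsearch:

```python
def parse_flat_settings(text):
    """Parse flat 'key: value' lines, skipping blank lines and # comments.

    Assumption: the config uses flat dotted keys, not nested YAML.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        # Split on the first colon only, so Windows paths like D:\es survive.
        key, _, value = line.partition(":")
        settings[key.strip()] = value.strip().strip("\"'")
    return settings


def diff_settings(old_text, new_text, keys=("cluster.name", "path.data")):
    """Return {key: (old_value, new_value)} for every key whose value differs."""
    old = parse_flat_settings(old_text)
    new = parse_flat_settings(new_text)
    return {k: (old.get(k), new.get(k)) for k in keys if old.get(k) != new.get(k)}
```

An empty result means the two configs agree on those keys. A value of `None` on either side means that key is not set in that file, so the node falls back to its default.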
No, the actual data folder got wiped clean! That's the weird part!
I did a rolling upgrade of all the data nodes no problem.
Then I finally shut down the master, unzipped 1.6.0 for the master, copied elasticsearch.yml from the 1.5.2 master and restarted. I forgot to copy the master's data folder from 1.5.2 to 1.6.0, but that's all. You can see in the logs that the data nodes had 2.5TB of data on startup.
I'm not sure about what happened but IMO doing a rolling upgrade with one single master node could lead to errors.
If you don't have more than one master, then I'd do a full cluster upgrade.
I tried to simulate what you did with one single master node and two data only nodes.
- Started the cluster with 1.5.2
- Created an index with a doc
- Stopped node1, upgraded it to 1.6.0 and restarted (same cluster name, same path...), waited for green
- Stopped node2, upgraded it to 1.6.0 and restarted (same cluster name, same path...), waited for green
- Stopped master, upgraded it to 1.6.0 and restarted (same cluster name, same path...)
- GET my document back
Everything went well. So I have no idea about what happened in your case.
Are you sure you waited until all shards/indices were correctly restarted at each step?
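The "wait for green" step between node restarts can be scripted instead of checked by eye. A rough sketch that polls the cluster health API; the base URL `http://localhost:9200`, the retry counts and the function names are assumptions for illustration only:

```python
import json
import time
import urllib.request


def fetch_health(base_url="http://localhost:9200"):
    """GET /_cluster/health and return the parsed JSON body."""
    with urllib.request.urlopen(base_url + "/_cluster/health", timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_for_green(fetch=fetch_health, attempts=60, delay=5.0, sleep=time.sleep):
    """Poll until the cluster reports green.

    Returns True once status is green, False if all attempts are used up.
    """
    for _ in range(attempts):
        try:
            if fetch().get("status") == "green":
                return True
        except OSError:
            pass  # node may still be restarting; try again
        sleep(delay)
    return False
```

Calling `wait_for_green()` after each restart, and only moving on to the next node when it returns True, mirrors the steps in the list above. The `fetch` and `sleep` parameters are injectable so the loop can be tried out without a live cluster.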
But that is the problem. You had a single master and did not transport the cluster state to the new version. Then it must be gone.
Ha! Thanks Jörg! I missed that part!
Yes, I understand I should have more than one master, but this is a test environment and that's all I have as machines for now. And I accept that. In the worst case of losing my master in my test environment, I would expect the cluster to just not accept any requests until the master comes back.
If we lose the cluster state like I did, should that prompt the data nodes to just completely wipe out the data (if that's even the case)? If anything there should be a reconciliation phase, where the master doesn't enable the cluster until we can tell it what to do?
I agree that is a troubling situation. The conflict arises when you explicitly tell it "master node of 1.6, now please start, but with empty cluster state." This is meant to override previous cluster setups.
Data nodes have no clue about past master node setups. They are passive, they do not even persist cluster states. Although they may have index data present, the master commands them "here is the new empty cluster state, forget all before". And that leads to cleaning up everything that exists.
Why not promote some of the data nodes to master nodes? If you had (at least) three master nodes, and minimum master nodes set to 2 (which I recommend for production), you would have to repeat the mistake on at least three master node setups before everything is erased. In that case, the chance is high that at least one master node would have survived and kept the previous cluster state on disk. There is a chance the cluster startup halts with an error message, or even continues if the masters decide to continue with the saved state.
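The "three masters, minimum master nodes set to 2" recommendation follows the usual majority rule for the `discovery.zen.minimum_master_nodes` setting: half the master-eligible nodes, rounded down, plus one. A small sketch of that arithmetic (the function names are just for this example):

```python
def minimum_master_nodes(master_eligible):
    """Majority quorum for a given number of master-eligible nodes."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


def tolerated_master_losses(master_eligible):
    """How many master-eligible nodes can be lost while keeping a quorum."""
    return master_eligible - minimum_master_nodes(master_eligible)


for n in (1, 3, 5):
    print(n, minimum_master_nodes(n), tolerated_master_losses(n))
```

With a single master-eligible node the quorum is 1 and no loss is tolerated, which is the situation in this thread. With three, the quorum is 2 and one master can go away without the cluster losing its elected state holder.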
I used to have that setup before where data nodes could be eligible masters. I wanted to test dedicated master.
The latter seems more stable, but I have to do more testing to see if it's true. The cluster seems to feel more stable and perform better with a dedicated master. Which underlines the fact that dedicated masters are important, and of course so is having more than one.
A dedicated master works very well when you start to have a cluster with, for example, more than 10 data nodes, or when your cluster is heavily used (CPU/memory, for example). But yeah, you need to have at least 3 master-eligible nodes.
Yes I'm quite high on the memory usage. Right now with 6 "monthly" indexes of 8 shards + replicas each and 1 billion records I'm at 15GB of RAM per node. Using doc values everywhere I can.
I expect a bit of growth for next year also.
Sorry for the data loss. We've got a change coming in 2.0 that would prevent this situation, even with only one dedicated master:
Cool, thanks. I think the docs should also mention how to upgrade master nodes, or at least remind people who unzip a new version and copy elasticsearch.yml over: if the config doesn't state a specific data folder location, make sure to copy the master's data folder over too.
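Until the docs cover it, that reminder can be turned into a pre-start check. A hedged sketch, assuming a zip install keeps its data under `<install dir>/data` when no explicit data path is configured; the function names and the directory layout are assumptions for this example, not something the upgrade tooling provides:

```python
import os


def has_files(path):
    """True if path is a directory containing at least one entry."""
    return os.path.isdir(path) and bool(os.listdir(path))


def data_folder_left_behind(old_home, new_home, configured_path_data=None):
    """True if the old install holds data that the new install would not see.

    Assumption: without an explicit path.data, data lives in <home>/data.
    """
    if configured_path_data:
        # An explicit data path does not depend on the install directory.
        return False
    old_data = os.path.join(old_home, "data")
    new_data = os.path.join(new_home, "data")
    return has_files(old_data) and not has_files(new_data)
```

If this returns True for the master's old and new install directories, the old data folder (which holds the cluster state) has not been carried over, and the master should not be started yet.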