I will preface this by saying that I know CCR would do what we want, but we requested a quote for it and the cost was just way too high! We are only using ES very lightly and it doesn't contain any critical data; we can re-load it from source fairly easily. We are using this process to populate a test system from a production system, but we need the test system to be as up to date as possible, so we run it quite frequently (every 15 minutes currently, though we might drop to hourly). If this doesn't work then it doesn't work; I am just trying to understand why it doesn't, because as far as I can see it should: it is just a snapshot and restore process, which is part of the product feature set.
We have 2 sites, Site A and Site B. We have a 3 node cluster in Site A with a few indices in it used by an application we have here. In Site B we have a single node at this time.
In Site A the cluster has a UNC repo linked to it, which is on file server 1 (FS1). FS1 has MS DFSR replication to FS2 in Site B. The single ES node in Site B is pointed to the same folder location on FS2 and has it mounted as a RO repo.
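For reference, the two repo registrations look roughly like this (repo name and paths are illustrative, not our real ones):

```bash
# Site A: register the UNC share on FS1 as a normal fs repo.
# Assumes path.repo in elasticsearch.yml already whitelists the location.
curl -X PUT "siteA-node:9200/_snapshot/dfsr_repo" -H 'Content-Type: application/json' -d'
{
  "type": "fs",
  "settings": { "location": "\\\\FS1\\es-repo" }
}'

# Site B: the same folder, via the FS2 replica mounted on the Ubuntu host,
# registered read-only so the node never writes to it.
curl -X PUT "siteB-node:9200/_snapshot/dfsr_repo" -H 'Content-Type: application/json' -d'
{
  "type": "fs",
  "settings": { "location": "/mnt/es-repo", "readonly": true }
}'
```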
We have written a script that takes a snapshot of the Site A cluster and stores it in the FS1 repo; DFSR then replicates the files to FS2 almost instantly. The script waits 60 seconds, then checks the repo on the node in Site B to see if the snapshot is showing there; if not, it waits and checks again. If the snapshot is found within the timeout, the script closes and deletes the selected indices, then restores them to the node in Site B and checks they go green within the specified timeout.
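In sketch form, the script does something like this (hosts, repo and index names are placeholders):

```bash
#!/usr/bin/env bash
SNAP="snap-$(date +%Y%m%d%H%M%S)"
INDICES="app-index-1,app-index-2"

# 1. Snapshot the selected indices on the Site A cluster into the FS1 repo.
curl -X PUT "siteA-node:9200/_snapshot/dfsr_repo/$SNAP?wait_for_completion=true" \
  -H 'Content-Type: application/json' -d "{\"indices\": \"$INDICES\"}"

# 2. Wait (with retries) until the snapshot is visible in the Site B repo;
#    curl -f makes the 404 for a missing snapshot count as a failure.
sleep 60
until curl -sf "siteB-node:9200/_snapshot/dfsr_repo/$SNAP" > /dev/null; do
  sleep 30
done

# 3. Close and delete the old copies, then restore from the snapshot.
curl -X POST "siteB-node:9200/$INDICES/_close"
curl -X DELETE "siteB-node:9200/$INDICES"
curl -X POST "siteB-node:9200/_snapshot/dfsr_repo/$SNAP/_restore?wait_for_completion=true" \
  -H 'Content-Type: application/json' -d "{\"indices\": \"$INDICES\"}"

# 4. Check the restored indices go green within the 90s limit.
curl "siteB-node:9200/_cluster/health/$INDICES?wait_for_status=green&timeout=90s"
```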
Now, the issues/questions:
Occasionally the Site B node doesn't show the snapshot in the repo. 95% of the time this process works great and everything is fine. We have had it run fine for 6 days straight, then had 3 failures in a day, then run for 3 more days with no errors. We have checked the files on FS2 and they are there as far as we can tell, and the Ubuntu host in Site B can see them in the /mnt location. It seems to be only ES that is not showing the snapshot in the repo. If we do nothing but wait long enough it always seems to show eventually, but that can take 10+ minutes, when usually (95% of the time) it shows within 90 seconds. The snapshots are always roughly the same size (<1GB) so they copy from FS1 to FS2 very quickly over the 1Gb link. So it is just ES that is not showing the snapshot in the repo for some reason, though it does if we wait quite a while...
Can anyone explain why that might be happening?
Can anyone explain how ES actually checks for snapshots in repos? Is there some kind of 'check repo' schedule? Are we maybe somehow missing that schedule <5% of the time or something?
<1% of the time it finds the snapshot and restores it, but 1 of the 10 indices shows as 'red' after the time limit we set (the limit is 90 seconds, which is usually plenty of time, as it only takes around 45 seconds for all indices to go 'green' after a restore). If we automate a retry in the script it fails again and stays 'red'. However, if we retry manually a while later (when we see the failure and have time to look at it) it tends to go 'green'.
This does not make sense to me: the snapshot is the snapshot, it doesn't change, so if I try to restore it now vs in 30 minutes' time, it should fail both times, not fail the first time, fail again on a retry, but then work 30 minutes later... I don't get that.
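For what it's worth, when this happens I assume the allocation explain API is the right way to see why the shard is stuck (index name is a placeholder):

```bash
# Ask the cluster why a primary shard of the red index is not allocating;
# the response should include the per-node reasons.
curl -X GET "siteB-node:9200/_cluster/allocation/explain" \
  -H 'Content-Type: application/json' \
  -d '{"index": "app-index-1", "shard": 0, "primary": true}'
```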
Any help would be appreciated, but please don't post just to say pay for CCR. We are just taking and restoring snapshots, which is part of the product, so should work and will fulfil our requirements just fine.
The only other things I can think to try at the moment are:
Pointing the Site A UNC repo straight at the Site B FS2 server. This would eliminate any DFSR-type issue, but given that we can see the files on FS2, and Ubuntu can see them in /mnt, I don't think that is the issue. We would also like the files on FS1 anyway for local restore purposes within Site A, so that wouldn't be ideal.
Someone mentioned 'dual writing', but I am not sure that is really what we want, especially given that a connection to ES in Site B has additional latency, so it may cause issues.
You don't say what version you're using; this answer largely applies to all recent versions but for the avoidance of doubt I'm assuming you're running the most recent release, 7.15.1.
If the repository is registered as readonly then every read operation checks the actual files on disk. There's no schedule or anything like that, but nor is there any other magic going on: Elasticsearch is simply opening files and reading them, so if it says some data isn't there, it's because it isn't seeing the files it needs. It's possible that most of the data is in fact there, but there are other small files that contain metadata without which the data is meaningless. Every last byte is important.
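If you want to sanity-check that the node can reach the repo contents at all, there's also the verify API, which makes the node attempt to access the repository on demand (repo name is illustrative):

```bash
# Ask the node(s) to confirm they can access the registered repository.
curl -X POST "siteB-node:9200/_snapshot/dfsr_repo/_verify"
```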
You're not just taking and restoring snapshots: DFSR is adding a whole extra async replication step in the middle, and this is the bit that I expect is causing the problems you describe. Snapshots rely on the underlying object store being strongly consistent. I would expect that things will just work if you bypass the DFSR step and read directly from the source repo. If the problems persist even without DFSR then please share some logs and API outputs and so on, and we'll try to understand what else might be happening.
Version-wise we are currently running 6.8.13. We deployed some time ago and had been fighting bugs in other areas of the related application before being able to upgrade that app, which finally happened a month ago. We will be looking to update ES ASAP, but there is a lot to do elsewhere unfortunately (always the way!).
When you say 'every read operation checks the actual files on disk', can you clarify? Surely something has to say 'check what snapshots are in the repo'? There are lots of files that make up a snapshot, right? So how does it know that they are all there, given they might not be written into the folder at the same time? You see what I am saying? To me this is key to understanding why the Site B ES is not showing the snapshots straight away, sometimes taking 10 minutes to show one.
Yes, I accept that DFSR is in play, which is why I said that one of the only things I can think of is to point Site A at the Site B FS2 server, or, as you say, do it the other way around. But also, as I said, I can see all of the files in the repo from the Site B ES server's host OS (Ubuntu). I had both locations mounted on the Site B ES server, and a diff between them showed they were identical at the time the snapshot was not showing in Site B ES.
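Roughly how I did the compare, for reference (mount points are illustrative):

```bash
# Both file servers mounted on the Site B ES host; -r recurses into the
# repo folders, -q reports only files that differ or are missing.
diff -rq /mnt/fs1-es-repo /mnt/fs2-es-repo
```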
At the moment I have pointed the Site B ES repo address at FS1 in Site A. This will eliminate DFSR from the equation. I will monitor that for a while and see how it goes. If it resolves the issue then we can probably live with that.
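i.e. something along these lines (mount point and repo name are illustrative); re-registering an existing repo with PUT just updates its settings:

```bash
# Point the Site B read-only repo at the FS1 mount instead of FS2.
curl -X PUT "siteB-node:9200/_snapshot/dfsr_repo" -H 'Content-Type: application/json' -d'
{
  "type": "fs",
  "settings": { "location": "/mnt/fs1-es-repo", "readonly": true }
}'
```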
Sorry, I see the ambiguity. By "on disk" I meant "on the disk that holds the repo contents". Ultimately the repository is just a bunch of files in a filesystem somewhere, and Elasticsearch is reading these files using regular system calls.
Yes, I understand that the snapshot is just a bunch of files on a disk. Presumably (but maybe not) one of the files contains the data about the snapshot(s): when it was taken, how many indices it contained, etc.
But what actually makes them show up in ES as an actual 'snapshot' when you query the list of snapshots in ES? What makes ES decide to read the files in the repo? Something has to say 'read the files from disk' or 'check for snapshots in the repo'. I am trying to understand what does that, how often it does it, etc.
Perhaps ES only reads the snapshot data, i.e. the repo is only checked, when you actually query the list of snapshots available via curl etc.? Is it a 'live list' that isn't stored anywhere, only generated when you query _cat/snapshots/reponame?
If it IS a live list that only exists when you run the query then that might explain why I feel like I am missing something.
If so, then we are saying that the info about which snapshots are in the repo is not stored in ES anywhere; ES only reads the info about the snapshots in the repo when you actually run the query? So if there are 1000 snapshots in the repo, ES doesn't actually know that, or know about them, until you try to do a restore or something, at which point it checks the repo and tries to use snapshot x, only to find it isn't there?
Hopefully that is clearer about what I am trying to understand?
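To be concrete, these are the sort of queries I mean (repo name is a placeholder):

```bash
# One-line-per-snapshot summary for the repo.
curl "siteB-node:9200/_cat/snapshots/dfsr_repo?v"

# Full details of every snapshot in the repo.
curl "siteB-node:9200/_snapshot/dfsr_repo/_all"
```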
So the issue then is that when you run the query, ES looks at the repo and doesn't see the new snapshot as being there yet.
We still don't have an explanation as to why that is, when I can apparently see all of the files in the Site B repo via the parent Ubuntu OS.
I will continue with the testing where Site B ES is pointed at FS1 in Site A and will see what that throws up. If we still get errors then it can't be DFSR-related; if we don't get any more errors, then it would seem DFSR or FS2 is somehow at fault, despite the files seemingly being available to the parent OS.
I have made some progress on this issue, but there were some confusing elements.
Firstly, the issue was indeed DFSR taking too long to replicate at least some part of the snapshot file collection.
The issue was complicated by the testing we had done to compare the file server repo on FS1 in Site A to FS2 in Site B, where they looked to be identical even when ES in Site B was failing to see the snapshot. That check turned out to be wrong. Due to a strange issue with DFS namespaces, the Site B ES server was actually looking at FS1 in Site A for both the primary and secondary repo we had linked. So when comparing we were really comparing FS1 with FS1, which is why all the files appeared to be in both repos; we were never actually comparing against FS2 in Site B at all!
After fixing this weird glitch with the namespace and correctly connecting to FS1 and FS2 at the same time, we could indeed see that DFSR had some locks and was still in the process of copying some files to FS2 for quite some time after we felt it should have completed the copy. This delay was exceeding the timeout we had set up for the scripted index restore process.
We have now pointed ES in Site B at FS1 in Site A directly, and we have had 2 weeks of consistently successful restores without a single error. The snapshot/restore process now takes place without any need for re-checking/waiting, so instead of several wait/retries and the restore taking around 6-10 minutes, the whole process consistently takes under 60 seconds, which is great; the 100% reliability over 2 weeks is brilliant.
This issue can now be closed; it was a DFSR issue, not an ES issue. Thank you for your help.