Synced flush while keeping indexing on the primary

Hi,
I have the following problem:
I need to shut down a replica to perform some operations on it.
The index contains lots of data, so I'd like to use synced flush to speed up the replica restart.
In addition, I'd like to keep indexing on the primary.
Is this possible?
One way I thought of doing this is to make use of multiple commit points:
after the primary commits, the commit point it has in common with the replica would not be lost, and it would be possible to sync only the segments that have changed.

Thanks,
Matteo

No, it's not possible to reliably do a synced flush with ongoing indexing. Firstly, the synced flush can fail on some of the shard copies if they are not in sync with the primary, which can happen with ongoing indexing. Secondly, further activity on the shard might remove the sync marker.
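
For reference, a synced flush is issued like this (the index name is just a placeholder; the per-shard results in the response report exactly the kind of failures described above):

```
# Request a synced flush on one index; the response shows, per shard copy,
# whether the sync marker could be written.
POST /my-index/_flush/synced
```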

Elasticsearch 6+ can perform an operation-based recovery when the replica comes back, as long as the replica is allocated back to the same node and every missing operation is still retained in the translog.
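
If you want to verify what kind of recovery actually happened when the replica comes back, the cat recovery API should show it (purely illustrative; an operation-based recovery should show little or no file/byte copying, only translog operations being replayed):

```
# Lists ongoing and completed recoveries per shard
GET _cat/recovery?v
```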

I'm curious - what operations do you mean?

Hi David!
Thanks for the answer.
The operation in question is actually a backup, so we'd probably be better served by the snapshot API. But the shared-filesystem requirement would have taken some time to set up and we were in a hurry, so we decided to take a replica offline and copy its data folder.
We couldn't stop indexing because the copy takes about 3 hours, so the replica restart was very long: about 5 hours.
Had we planned this in time, we would have used the snapshot API and kept the replica online the whole time.
But if a node goes down for unforeseen reasons, the long restart time will still be a problem.
I'm not 100% sure, but I'm quite confident that Solr does keep multiple commit points, so that you don't have to stop indexing/flushing to get a fast restart. Does that make sense to you?
Thanks again

Note carefully the warning in the manual regarding filesystem-level backups:

It is not possible to back up an Elasticsearch cluster simply by taking a copy of the data directories of all of its nodes. Elasticsearch may be making changes to the contents of its data directories while it is running, and this means that copying its data directories cannot be expected to capture a consistent picture of their contents. Attempts to restore a cluster from such a backup may fail, reporting corruption and/or missing files, or may appear to have succeeded having silently lost some of its data. The only reliable way to back up a cluster is by using the snapshot and restore functionality.

If you restore from this kind of backup then you can end up with unreadable indices and all sorts of other complications, and I strongly suggest you rethink this.
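
For what it's worth, a minimal snapshot setup is roughly this sketch (repository name, snapshot name and path are placeholders, and the location must be listed under path.repo in elasticsearch.yml on every node):

```
# Register a shared-filesystem snapshot repository
PUT /_snapshot/my_backup
{
  "type": "fs",
  "settings": {
    "location": "/mnt/backups/my_backup"
  }
}

# Take a snapshot and wait for it to finish
PUT /_snapshot/my_backup/snapshot_1?wait_for_completion=true
```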

As long as it's down for a short enough time that it can recover any missing operations from the translog (512MB or 12 hours by default) then the recovery should just involve copying the missing operations over.
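
Those defaults correspond to these per-index settings (the values shown are the documented defaults; double-check them for your version):

```
index.translog.retention.size: 512mb
index.translog.retention.age: 12h
```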

Hi David,
The warning says you can't back up by copying the data dir... with ES up and running!
The replica we are backing up contains all the data we need, it's consistent, and it's shut down.
If you can explain to me how ES can make changes to that data while it is shut down, I'll be really surprised.
The transaction log, to my knowledge, doesn't survive a flush, and the documentation for flush just talks about heuristics. In our case 512MB is the limit that's reached first. Can I count on a flush not being triggered before that limit? Can I raise that limit temporarily? How much is it safe to raise it?

That's not what the first sentence says:

It is not possible to back up an Elasticsearch cluster simply by taking a copy of the data directories of all of its nodes.

Clearly Elasticsearch cannot make changes to the directory when it's not running, that's not the issue. The problem is that the running nodes can be making changes to the data that they hold on disk that render them incompatible with the so-called backup. If you restore from such a copy you risk rolling back something vital like a mapping update. The consequence could be that all the shard copies that had previously processed this mapping update might now be corrupt, because rolling back a mapping update is not supported.

There are a few safeguards in place to try to detect this kind of issue, but they're very much best-effort, as it's provably impossible to catch this kind of situation reliably. Distributed systems are hard.

It's your data so of course it's your decision whether it's worth the risk.

Thanks for the always interesting discussion, David!
There's something I don't quite understand from your reply, and I'd like to.
I hope you can help me.

What I'm going to do is restore my "backup" to a new single instance of ES.
I don't think there's any other state (e.g. cluster state) that needs to be migrated apart from the data and the (static) mapping configuration, right?

When you say "shard copies that had previously processed this mapping update might be corrupt", I can't quite follow you.
You seem to be implying that when I restore the "backup", the cluster somehow knows it had already moved past that state.
The state of the cluster is fully contained in the data dirs and the cluster state held by the master nodes.
Let's say I have my "backup" and the mapping configuration (which is statically defined in my case), and I want to restore this state into a new cluster (assuming the old one is dead).
Is there anything that could go wrong?

If you think any of my assumptions are wrong, please tell me.
Thanks

I see. Hmm. So the reasoning is that there's no way for this new instance to encounter a node "from the future" and therefore you can't see the inconsistencies I was worried about? I've not been able to construct a scenario in which this breaks so horribly, although I still feel that snapshot/restore is a better idea in the long run.

Hi David!
Yes, that's the reasoning.
Sure, in the long run it's snapshot/restore, but sometimes the "dirty way" is useful (if it works).

What can you tell me about the other question:

The transaction log, to my knowledge, doesn't survive a flush, and the documentation for flush just talks about heuristics. In our case 512MB is the limit that's reached first. Can I count on a flush not being triggered before that limit? Can I raise that limit temporarily? How much is it safe to raise it?

(I still haven't figured out how to quote parts of previous messages :roll_eyes:)
Thanks again

The translog settings allow you to change the retention of the translog on a per-index basis. Note that the number is actually per-shard, so if you have 10 shards and it's set to 512MB then that means you will need up to 5120MB of disk space for the translog of that index.

I think the main drawback with increasing the limit is the extra disk space it might consume.
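
As a sketch, raising the limits on one index could look like this (index name and values are only examples; both settings should be dynamic, so they can be changed on a live index and lowered again afterwards):

```
PUT /my-index/_settings
{
  "index.translog.retention.size": "2gb",
  "index.translog.retention.age": "24h"
}
```

Keep the per-shard caveat above in mind: with 10 shards and 2gb, that's up to 20GB of disk for that index's translog.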

Wonderful!
So the important settings for avoiding a costly file-based sync are:

index.translog.retention.size
index.translog.retention.age

I suppose it works like this:

the oldest flushed transaction log file is deleted if (and only if)
it exceeds the retention age OR the total size of the flushed transaction log files exceeds the retention size

Is that correct?
Does index.translog.retention.size also include the current transaction log?

I think this feature is really important for some people (us, for example), and the documentation could be more specific.

Yes, that's my understanding too.
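
If you want to check this empirically, watching the translog stats while indexing should help (index name is a placeholder):

```
# Reports translog size and operation counts for the index
GET /my-index/_stats/translog
```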

A PR to clarify the docs is always welcome :slight_smile:

I'd buy you a beer!!!

