What happens if the data folder is wiped out on individual nodes?

hi,

This is a theoretical question at this time, but I would like to understand how ES will handle it. Let's say I have a multi-node cluster with sufficient replicas of the indexes. Let's also assume that my installation folder, logs and data are on separate drives. If the data drive were to become corrupt or crash and I replaced it with a brand new drive, or if someone wiped the contents of the data folder completely, would ES be able to rebuild the data from the other replica shards to the same state? Would this need any kind of manual intervention?

Also, what would the typical recovery time be in such a case? (I know that it will depend on the size of the data, but if you have some info from previous experience of this kind, please do share.)

Thanks!

As long as there are replicas for it, yes.

The only manual intervention would be restarting the Elasticsearch service after the drive failure.

As you mention, that's based on a lot of variables and you'd need to check for yourself on your cluster.

> Also, what will be the typical recovery time in such a case?

Adding that, as you have a replica, your service will keep working well for your users however long it takes to recover the replicas.

@warkolm, @dadoonet - thanks for your replies. Now, what if I change the scenario from accidental deletion to knowingly setting it up this way: what would change?

Let's say that on a few nodes in the cluster, I create a RamDisk, assign it a drive letter and point my data location at this drive, but on the other nodes I keep the data folder on storage. Keeping the JVM allocation aside, this theoretically means all my ES data is now sitting in memory. What kind of challenges are expected when running this kind of setup? Is there a way that I can ensure that at least one copy (either primary or replica) of each shard is not on this kind of setup (essentially, to make sure I have physical copies on other nodes)?

Other than total loss of data on system restart? :wink:
Not a lot I can think of.

If you have a replica, then that is guaranteed.

Could you explain why you would want to do such a thing? If you have enough RAM to fit all your data then it'll all fit in the filesystem cache too, so reads should be served from RAM even if you're using proper storage.

No. I do not think Elasticsearch guarantees the existence of a physical copy of each write if there is even a single ramdisk-backed node. If there are two copies of a shard, one persistent and one ephemeral, and the persistent one fails then a new one will eventually be allocated elsewhere, but until that happens Elasticsearch will acknowledge indexing operations after they are written only to the ephemeral shard.

If you have n ramdisk-backed nodes then it's even worse. You would need to have at least n+1 copies of each shard to avoid them all being allocated to ramdisk-backed nodes even when the cluster is healthy, and if it's unhealthy then frankly all bets are off.

Please don't do this.

I think you technically might be able to do this using shard allocation awareness, but agree with David that it sounds like a bad idea. What are you looking to achieve through such a setup?
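For reference, a minimal sketch of what that might look like (the attribute name `disk_type` and its values are made up, and this assumes the 8.x Python client; the node attributes themselves have to go into each node's elasticsearch.yml):

```python
# Hypothetical sketch of shard allocation awareness across storage types.
#
# First, in each node's elasticsearch.yml (not settable via the API):
#   node.attr.disk_type: ram     # on the ramdisk-backed nodes
#   node.attr.disk_type: disk    # on the disk-backed nodes

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Make the allocator aware of the custom attribute so that copies of each
# shard are spread across the "ram" and "disk" groups where possible.
es.cluster.put_settings(
    persistent={"cluster.routing.allocation.awareness.attributes": "disk_type"}
)
```

Note that this only influences where shards are placed; as discussed below, it does not by itself guarantee that every write has reached a disk-backed copy.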


@Christian_Dahlqvist thanks for pointing to the shard allocation awareness.

@DavidTurner won't using shard allocation awareness ensure that ES always writes a copy to the non-RamDisk nodes, if that is how I segregate the nodes?

To give some further details and background: our ES cluster is focused on search of structured data, so we will be able to reindex the data from scratch from the SOR if everything goes south in the worst case. Our data itself is not too big (less than 15 GB) and can be reindexed in a matter of minutes.

Now coming back to why I want to try this out. We are partially re-purposing our existing servers, which currently run our app layer and web layer. Over time, all our apps have been converted to client-side apps, so the web servers are not really doing much apart from serving the initial page load. So we are running ES as a Windows service on both these types of servers, while they continue to support the current apps.

The thought was that on these web servers we could potentially run ES in a purely in-memory mode, and on the app servers we would run it in disk mode. From a request-routing perspective, the plan was to create a VIP with load balancing between the web servers and route all search requests to this VIP, whereas all indexing requests would go to another VIP on the app servers. We have 3 web and 3 app servers from a load-balancing perspective.
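To make the routing idea concrete, here's a rough sketch of how the client side would split the traffic (the VIP hostnames and index name are made up, and this assumes the 8.x Python client):

```python
from elasticsearch import Elasticsearch

# Hypothetical VIPs: one load-balanced across the 3 web servers,
# one across the 3 app servers.
search_client = Elasticsearch("http://search-vip.example.local:9200")
index_client = Elasticsearch("http://index-vip.example.local:9200")

# All search traffic is routed via the web-server VIP...
search_client.search(index="my-search-index", query={"match_all": {}})

# ...while all indexing traffic goes via the app-server VIP.
index_client.index(index="my-search-index", document={"field": "value"})
```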

I get your point about "if the data can fit in memory anyway, then ES will serve it from RAM". But for this to happen, the entire index has to be loaded into memory. Previously, we could use index warmers, but those were removed a long time back. So now, the way to achieve this would be via the index.store.preload option. The downside of using this setting is that it applies on every node in the cluster where the index resides, so the preloading will happen even on the servers where I don't intend to route my search traffic, and I will still need RAM to back this up on those servers.
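For reference, index.store.preload is a static index setting, so it has to be set at index creation time (or in elasticsearch.yml). A sketch, assuming the 8.x Python client and a made-up index name:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Ask the OS to preload these file extensions into the filesystem cache
# when the index is opened ("nvd" = norms, "dvd" = doc values).
es.indices.create(
    index="my-search-index",
    settings={"index.store.preload": ["nvd", "dvd"]},
)
```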

Similar to the shard awareness feature, if the preload setting could be controlled at an attribute level, then I think we could still go with a regular disk-based setup: only on the nodes where search traffic has to be routed would the attribute be set and the index preloaded. Is this something that can be requested as a feature?

To also add: shard allocation awareness works better here, as it will prefer local shards. So essentially, our search requests would try to serve the response only from the shards on the web servers, as we would be routing all search requests to them.

The OS ends up caching the files it uses, i.e. the Lucene index files that are created, in memory after they have been accessed a few times. That's why we have the 50%:50% guidance for heap:non-heap memory allocation.

But you can't have a primary on a node that is a RAM disk and then the replica elsewhere. Allocation awareness only applies to indices and shards that have the same awareness attributes.

No. Elasticsearch can perhaps be coerced to eventually allocate a copy of each shard onto at least one disk-backed node, but this process can take some time.

I think I see the confusion. An active index is a load of on-disk data together with a load of in-memory data structures that Elasticsearch builds. Warming (which still happens a bit) is more about the in-memory data structures, which are nothing to do with the underlying storage since they only live in-memory. The only possible advantage that a ramdisk could have here is to make accessing the underlying files faster by avoiding going to disk, but I think that this advantage does not really exist because the filesystem cache offers the same behaviour.

Preloading does something quite different from warming as far as I can see from a brief look at the code: it seems to load the index data into filesystem cache, on a "best effort" basis, so if there isn't enough filesystem cache then some of the data will be paged out again.

Not really: if you don't have the memory for the filesystem cache on some of the nodes then the only effect is that some of the data might be evicted from the cache on those nodes.


Why do you say so? If I look at the rack example from the shard awareness documentation page, it says:

> If Elasticsearch is aware of the physical configuration of your hardware, it can ensure that the primary shard and its replica shards are spread across different physical servers, racks, or zones, to minimise the risk of losing all shard copies at the same time.

So let's say I have the RamDisk-based nodes as rack1 and the physical-disk-based nodes as rack2: even with 1 replica, at least based on the documentation, Elasticsearch is supposed to write one copy to rack1 and the second copy to rack2.

Since index writes use "quorum" as the default for write consistency, if I had the index defined with 1 replica, wouldn't having shard awareness enabled always ensure that I get a success response only after both racks have a copy? BTW, I will probably need to use the "wait_for_active_shards" option on the indexing operation to ensure this, since the quorum concept now looks to be outdated for write consistency. Won't this work the way that I am expecting it to, and also ensure that I always have a copy on a node backed by disk storage?
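To illustrate the option I mean, something like this (a sketch, assuming the 8.x Python client; the index name and document are made up):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Check that all copies of the target shard (primary + replicas) are active
# before the write starts. Note this is a pre-flight check on shard health,
# not confirmation that every copy actually received the write.
es.index(
    index="my-search-index",
    document={"field": "value"},
    wait_for_active_shards="all",
)
```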

Also, when you talk about the filesystem cache, I am assuming that you are talking about what can fit in memory + pagefile (with swapping enabled). So if I have understood it right, for me to get a relatively equivalent setup without using a RamDisk, I will need to disable the pagefile, use the preload option, and make sure that the overall memory allocation is sufficient to hold all the index data + the JVM allocation + the memory needed for other apps + X times the thread stack size. Now coming to the difficult part: how do I calculate this limit :blush:, as this is the amount of total RAM that I will need overall on each node? It was easier to calculate with the RamDisk option, where I just had to worry about the index size to allocate the additional memory, apart from the current allocation.
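To put rough numbers on that back-of-the-envelope calculation (every figure below is a hypothetical placeholder, not a recommendation):

```python
# Hypothetical per-node RAM budget, in GB. All numbers are made up for
# illustration; only the ~15 GB index size comes from the post above.
index_data = 15       # full index data to keep resident in memory
jvm_allocation = 8    # Elasticsearch JVM heap (-Xmx)
other_apps = 4        # the web/app workloads sharing the server
thread_overhead = 1   # X times the thread stack size, plus other off-heap use

total_ram = index_data + jvm_allocation + other_apps + thread_overhead
print(f"Rough RAM needed per node: {total_ram} GB")  # -> 28 GB
```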

Which version are we talking about here? This is only true of versions that have reached the end of their supported life (2.4 and earlier), and even then it was a best-effort thing. Elasticsearch only ever guaranteed writing to a single shard, and this confusion was resolved by removing the option in 5.0.0 - see #19454.

This option was introduced in 5.0.0 and, like write_consistency, it is a best-effort option, not a guarantee. The docs say:

> It is important to note that this setting greatly reduces the chances of the write operation not writing to the requisite number of shard copies, but it does not completely eliminate the possibility, because this check occurs before the write operation commences. Once the write operation is underway, it is still possible for replication to fail on any number of shard copies but still succeed on the primary.

It would greatly surprise me if Windows were to spill the filesystem cache into the pagefile. Linux doesn't. The pages are already on disk, so it doesn't make sense to swap them out like that, they can just be dropped if the memory is needed for something else. The pagefile is basically the opposite of the filesystem cache: one copies infrequently-used memory pages to disk, and the other copies frequently-used disk pages to memory.

I do not see how this is different from what you need to do with a ramdisk-backed node. A ramdisk has the effect of preventing the OS from swapping out infrequently-used index file pages, limiting its options under memory pressure, which seems like it'd potentially be worse for performance.
