Reattaching Cold Data

If you keep cold data on removable hard disks (for long-term offline storage), what is the best way to re-attach one of these disks to the cold data node without causing trouble? For example, I was wondering whether the following configuration/process would work (in a Windows deployment):

  • Cold Data Node (OS + Elasticsearch) on C:\ drive and removable media on separate D:\ partition.

    • Month 1 is saved to the D:\ drive, then the disk is detached and stored in a filing cabinet.
    • A new disk is mounted as D:\ and Month 2 is saved to it, and so on for subsequent months.
    • Some time in the future, an audit requires the Month 1 data. Can we simply re-mount the Month 1 disk as D:\ and expect to search that data without issue?

I don't think you can do that; the only reliable way to back up data from Elasticsearch is the Snapshot and Restore API.

While you can create snapshots on hard disks, the repository needs to be a shared file system (NFS, for example) that every node can access, and every time you add or remove a file system repository you need a full cluster restart, since every node needs the repository path in its configuration.

Correct, that is not supported.

Searchable snapshots (Searchable snapshots | Elasticsearch Guide [8.0] | Elastic) would be an option.

Alternatively, could I set up my Logstash pipelines to write both to the ES cluster (kept for x number of days, then deleted) and to removable media as raw data? Then, if/when needed, another Logstash pipeline would ingest that historic/archived data into a temporary index, so it wouldn't conflict with any existing indexed data. Do you see a problem with that?
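
Roughly, I'm picturing a single pipeline along these lines (the beats input, hosts, index name and archive path are just placeholders):

```
input {
  beats {
    port => 5044
  }
}

output {
  # Short-retention copy in the cluster (deleted after x days).
  elasticsearch {
    hosts => ["https://es01:9200"]
    index => "logs-%{+YYYY.MM}"
  }

  # Raw archive copy written to the removable D:\ drive.
  file {
    path => "D:/archive/logs-%{+YYYY.MM.dd}.ndjson"
    codec => json_lines
  }
}
```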

You can do that, but it can also create new problems.

Will you stop ingesting when removing the media? Depending on how your Logstash pipelines are configured, if it cannot write to one output it can block the other outputs.

Using pipeline-to-pipeline communication with the distributor pattern could help in this case, but it would be better to use something like Kafka to store your data first: one pipeline would read from it and ingest into Elasticsearch, and another would read from it and write to a file. When you need to swap the drive, you can stop just the pipeline that writes to the file.
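
For illustration only, a rough sketch of the pipeline-to-pipeline layout, with one intake pipeline fanning out to two downstream pipelines (all names, ports, hosts and paths are made up, and the downstream pipelines would also need persistent queues configured in pipelines.yml so a stalled archive pipeline does not immediately backpressure the intake):

```
# --- intake pipeline ---
input {
  beats { port => 5044 }
}
output {
  # Fan out to both downstream pipelines.
  pipeline { send_to => ["to-es", "to-archive"] }
}

# --- "to-es" pipeline ---
input {
  pipeline { address => "to-es" }
}
output {
  elasticsearch {
    hosts => ["https://es01:9200"]
    index => "logs-%{+YYYY.MM}"
  }
}

# --- "to-archive" pipeline ---
input {
  pipeline { address => "to-archive" }
}
output {
  file {
    path => "D:/archive/logs-%{+YYYY.MM.dd}.ndjson"
    codec => json_lines
  }
}
```

Even with persistent queues, the intake will eventually block once the archive pipeline's queue fills up, which is why a real broker like Kafka is the safer option here.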

One question: why not use snapshots on a cloud service? It is way easier and probably cheaper. Do you need to store the data on-premises?

Cloud is not an option for our offline environment, only on-prem.

To answer your other question, I was not planning to stop ingesting when removing media. The same pipelines would continue to ingest in this scenario; we'd just remove the media as it fills up and label it with the first and last date written.

Sounds like I should start looking into Kafka.

You might not be planning to, but as Leandro mentioned, Logstash will block all outputs if one of them is not available.

I think that the best option for your case is to use Kafka as a message queue and run two completely independent pipelines.

To put your data in Kafka, you could use Logstash with just your inputs and the kafka output.
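
For example, something along these lines (the beats input, broker addresses and topic name are placeholders):

```
input {
  beats { port => 5044 }
}

output {
  # Everything goes to Kafka first; no parsing in this pipeline.
  kafka {
    bootstrap_servers => "kafka01:9092,kafka02:9092"
    topic_id => "logs-raw"
    codec => json
  }
}
```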

Once your data is in Kafka, you would set up two different pipelines: one that reads from Kafka, parses your messages if needed, and sends them to Elasticsearch, and another that just reads from Kafka and writes the raw messages to files on your external hard drive. Just note that the consumer group configuration in those kafka inputs needs to be different.
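
A rough sketch of those two consumers, reusing the made-up topic from above; the important part is the different group_id in each kafka input:

```
# --- pipeline 1: Kafka -> Elasticsearch ---
input {
  kafka {
    bootstrap_servers => "kafka01:9092,kafka02:9092"
    topics => ["logs-raw"]
    group_id => "es-indexer"
    codec => json
  }
}
filter {
  # Parsing/enrichment for the indexed copy would go here.
}
output {
  elasticsearch {
    hosts => ["https://es01:9200"]
    index => "logs-%{+YYYY.MM}"
  }
}

# --- pipeline 2: Kafka -> archive files on the removable drive ---
input {
  kafka {
    bootstrap_servers => "kafka01:9092,kafka02:9092"
    topics => ["logs-raw"]
    group_id => "archive-writer"
    # Keep the raw payload untouched for the archive copy.
    codec => plain
  }
}
output {
  file {
    path => "D:/archive/logs-%{+YYYY.MM.dd}.log"
    codec => line { format => "%{message}" }
  }
}
```

Because the consumer groups differ, each pipeline gets its own copy of every record and tracks its own offsets, so stopping the archive pipeline to swap a drive does not affect indexing, and it resumes from where it left off once restarted.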

It would be even better if you could separate those tasks onto different servers: one server running the pipeline that writes into Kafka and the pipeline that reads from Kafka and sends to Elasticsearch, and another server running just the pipeline that reads from Kafka and writes to your external hard drive.

With this configuration, when it is time to swap the drive you just stop the Logstash service, change the drive, and start it again.
