Faster to index raw data or load a static data dump

I want to leverage ES for analysis of static data. I was using a subset and it worked as planned: I was able to query, build graphs in Kibana, etc. It is quite useful. That said, since the data is static, should I index all of the raw data every time I need to boot up my cluster, or can I restore a dump? Search for analysis matters most to me, so that I can query the data effectively, but if it is most effective to ingest the data again, I would need to speed that process up. Since I need the full dataset in memory, I need to parse and consume 5 TB of metrics, which takes time.

If I create the initial Elasticsearch DB the way I like it, keeping all of the data, would it be better to dump the entire thing to disk as a data dump of sorts, or to reprocess it each time I need to turn on the cluster?

Ideally I want to turn the cluster on and off ad hoc, when it is not needed for this specific use case, to preserve it. I can turn the cluster off and on and it would keep all of its data, but if I need to disassemble anything or move hardware, I was not sure whether it would be better to keep the ES DB files or the raw data on a single HDD for cold storage.

I'm pretty sure you want Elasticsearch snapshots, with the snapshot stored in S3 or a local block device like an external HDD. Snapshot restores are more efficient than re-ingestion / re-indexing and are a "data dump" the way you describe.
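As a rough sketch of what that looks like against the snapshot REST API (not a definitive recipe): the repository name `cold_storage`, the snapshot name `metrics_full`, the index pattern `metrics-*`, the mount point `/mnt/es_backups` and the unsecured `localhost:9200` URL are all made up for illustration. A shared-filesystem (`fs`) repository also needs its location listed under `path.repo` in `elasticsearch.yml` on every node.

```python
import json
from urllib import request

ES_URL = "http://localhost:9200"  # assumption: local cluster without auth/TLS


def repo_request(repo, location):
    """PUT /_snapshot/<repo>: register a shared-filesystem repository."""
    body = {"type": "fs", "settings": {"location": location}}
    return "PUT", f"/_snapshot/{repo}", body


def snapshot_request(repo, snapshot, indices):
    """PUT /_snapshot/<repo>/<snapshot>: snapshot the given indices."""
    body = {"indices": ",".join(indices), "include_global_state": False}
    path = f"/_snapshot/{repo}/{snapshot}?wait_for_completion=true"
    return "PUT", path, body


def send(method, path, body, base_url=ES_URL):
    """Send one JSON request to the cluster and return the decoded reply."""
    req = request.Request(
        base_url + path,
        data=json.dumps(body).encode("utf-8"),
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    calls = (
        repo_request("cold_storage", "/mnt/es_backups"),
        snapshot_request("cold_storage", "metrics_full", ["metrics-*"]),
    )
    for call in calls:
        print(call)  # inspect first; use send(*call) against a live cluster
```

The request builders are kept separate from `send` so the payloads can be checked before anything touches the cluster.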

Thanks, fantastic, especially since it is quicker to restore. I'll do that then. Are there any special rules when restoring the data? I presume it can be loaded into a new cluster (with a different set of nodes and masters)?

Confused about why you need to restore - do your analyses change the data, or do the data sets change? Why not just load the data and leave it on disk (stop the nodes/VMs), or, if on a cloud like AWS, snapshot the disks so you can build new clusters very quickly? Both restore or clone a system in minutes, not the hours it might take to load 5TB of snapshots.

Yes. See docs on restoring which explain what you might want to do if the new cluster is different to the old one. There is one (fairly sensible) caveat: Elasticsearch version differences between clusters, see here for details.
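For illustration only, a restore into a differently sized cluster might be built like the sketch below. The repository, snapshot and index names are hypothetical; `index_settings` is the restore API option for overriding index settings at restore time, used here to drop replicas in case the new cluster has fewer data nodes.

```python
import json


def restore_request(repo, snapshot, indices, replicas=None):
    """POST /_snapshot/<repo>/<snapshot>/_restore for the given indices."""
    body = {
        "indices": ",".join(indices),
        # Leave the new cluster's own settings and templates untouched.
        "include_global_state": False,
    }
    if replicas is not None:
        # Override the replica count stored in the snapshot.
        body["index_settings"] = {"index.number_of_replicas": replicas}
    return "POST", f"/_snapshot/{repo}/{snapshot}/_restore", body


def as_console(method, path, body):
    """Render a request in the form accepted by the Kibana Dev Tools console."""
    return f"{method} {path}\n{json.dumps(body, indent=2)}"


if __name__ == "__main__":
    call = restore_request("cold_storage", "metrics_full", ["metrics-*"], replicas=0)
    print(as_console(*call))
```

The new cluster has to register the same repository before it can see the snapshot.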

That works well too but a quick side note - shut down the VM writing to the disk first :). From our snapshot docs:

You cannot back up an Elasticsearch cluster by simply copying the data directories of all of its nodes. Elasticsearch may be making changes to the contents of its data directories while it is running; copying its data directories cannot be expected to capture a consistent picture of their contents. If you try to restore a cluster from such a backup, it may fail and report corruption and/or missing files. Alternatively, it may appear to have succeeded though it silently lost some of its data. The only reliable way to back up a cluster is by using the snapshot and restore functionality.

(You can snapshot "on the fly" in AWS - don't do that in this instance.)

I was thinking that after the initial use is complete, I can dump it to cold storage and repurpose the machines. Then, if we need to revisit the project after it ends, I can spin everything back up. That would also let me clear the ES DB and prep it for follow-on use.

If I don't need the dataset after expiration, shifting it to cold storage / tape allows me to free up all SSD storage.

@Emanuil - Agreed; I was just saying to use EBS snaps to build new VMs, stopping the nodes first though, or else the data won't be synced. It's a variation on copying files but much better, as the whole server/cluster is snapped at once and should recover - i.e. I stop all nodes, stop the VMs, then snap the disks.

Then build new VMs from those disks (ideally a single disk, so OS, code and data are all together) and bring up the masters and data nodes as in a full cluster restart, ideally with delayed rebalancing/shard movement. It should come up just as if there had been a total power loss (though with a graceful shutdown).

But it is important not to snap while running, or else the data won't be synced and the cluster will be confused.
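One way to get the "delayed rebalancing/shard movement" part is sketched below, assuming the usual full-cluster-restart pattern: restrict allocation to primaries and flush before stopping the nodes, then reset allocation once every node has rejoined. The 5m timeout is an arbitrary example value, and the calls are only built here, not sent.

```python
ALLOWED_MODES = ("all", "primaries", "new_primaries", "none", None)


def allocation_request(mode):
    """PUT /_cluster/settings; mode=None resets allocation to its default."""
    if mode not in ALLOWED_MODES:
        raise ValueError(f"unsupported allocation mode: {mode!r}")
    body = {"persistent": {"cluster.routing.allocation.enable": mode}}
    return "PUT", "/_cluster/settings", body


def delayed_timeout_request(timeout="5m"):
    """PUT /_all/_settings: wait this long before reallocating shards of a departed node."""
    body = {"settings": {"index.unassigned.node_left.delayed_timeout": timeout}}
    return "PUT", "/_all/_settings", body


def shutdown_plan():
    """Requests to send before stopping the nodes and snapping the disks."""
    return [
        delayed_timeout_request("5m"),
        allocation_request("primaries"),
        ("POST", "/_flush", None),
    ]


def startup_plan():
    """Requests to send after all nodes of the rebuilt cluster have joined."""
    return [allocation_request(None)]


if __name__ == "__main__":
    for step in shutdown_plan() + startup_plan():
        print(step)
```

Because the allocation setting is stored as `persistent`, it survives the shutdown and is still in force when the cloned VMs boot, which is why the reset step matters.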

Snapshots seem the best solution, then, especially as it sounds like you're not on a cloud.