Is Rally a good choice to create a data lake?

rschirin · August 30, 2021, 1:59pm

hey there,
I regularly have to create clusters from scratch and test their performance once the indices contain at least 20 days of data. Having a data lake would let me avoid waiting for that period, so I started thinking about using Rally.

I tried to create a custom track from an already existing index on a test cluster.
The operation completed correctly, but I would like to know whether, from your point of view, this is a good way to create a data lake.

thanks

warkolm · August 30, 2021, 9:12pm

What exactly do you mean by data lake here?

rschirin · August 31, 2021, 6:20am

A set of data (documents) that is available whenever I need to fill a new cluster.
With a custom track saved wherever I want, I could load it into the cluster and speed things up, since I am fairly sure Rally will be faster than real day-by-day ingestion (i.e. I expect Rally to take 1-2 days to bulk-index 20 days of documents).

Christian_Dahlqvist · August 31, 2021, 6:24am

By far the fastest way to load large volumes of data into a cluster for testing is to restore a saved snapshot. This naturally means that the data will cover a fixed time period rather than the last X days, but that might be good enough for some types of performance testing, which you can perform using Rally.

rschirin · August 31, 2021, 7:55am

Thanks, this could be a good idea. But I have one doubt: Rally compresses the track (and then has to uncompress it, I know), while a snapshot does not. Is that a downside of the snapshot approach?

Christian_Dahlqvist · August 31, 2021, 7:59am

Indices are already compressed (although not quite as well as compressed JSON data) and you save a lot of time and resources as data does not need to be indexed. I have previously used this method to quickly restore terabytes of data to clusters for performance testing where indexing the full data set would have taken a lot longer.

rschirin · August 31, 2021, 8:13am

Using Rally I have this scenario:

150 million docs
44 GB compressed
700 GB uncompressed (I suppose including 1 replica)

Could I get a similar result with a snapshot as well?
Apart from that, can you explain the sentence "save time as data does not need to be indexed"?

Christian_Dahlqvist · August 31, 2021, 8:16am

Restoring a snapshot is a LOT faster and more efficient than indexing the same amount of data. If you want to build up even larger data volumes you can even restore the same snapshot multiple times if you rename the indices when you restore.

You can naturally also combine the two methods: restore a background data set using snapshots and, at the same time, index new data using Rally.
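
As a rough illustration of restoring the same snapshot several times under renamed indices, here is a minimal Python sketch that only builds the request path and body for the Elasticsearch snapshot restore API (`POST /_snapshot/<repository>/<snapshot>/_restore` with `indices`, `rename_pattern` and `rename_replacement`). It does not talk to a cluster. The repository name `perf_backups`, the snapshot name `baseline_20d`, the `logs-*` index pattern and the `copyN-` prefix are all hypothetical placeholders, not something taken from this thread.

```python
import json
import re


def build_restore_requests(repository, snapshot, index_pattern, copies):
    """Return one (path, body) pair per copy of the snapshot to restore.

    Each copy gets a different prefix so the restored indices do not
    collide with each other or with indices already in the cluster.
    """
    requests = []
    for copy_number in range(1, copies + 1):
        path = f"/_snapshot/{repository}/{snapshot}/_restore"
        body = {
            "indices": index_pattern,
            "include_global_state": False,
            # Capture the whole original index name ...
            "rename_pattern": "(.+)",
            # ... and prepend a per-copy prefix (hypothetical naming scheme).
            "rename_replacement": f"copy{copy_number}-$1",
        }
        requests.append((path, body))
    return requests


def preview_renamed_index(index_name, body):
    """Locally preview the name an index would get after the rename.

    Elasticsearch applies the rename with Java regex syntax ($1); this
    helper translates that to Python's \\g<1> form purely for the preview.
    """
    replacement = re.sub(r"\$(\d+)", r"\\g<\1>", body["rename_replacement"])
    return re.sub(body["rename_pattern"], replacement, index_name)


if __name__ == "__main__":
    for path, body in build_restore_requests(
        "perf_backups", "baseline_20d", "logs-*", copies=3
    ):
        print("POST", path)
        print(json.dumps(body, indent=2))
        print("example:", preview_renamed_index("logs-2021.08.01", body))
```

Each printed body could then be sent with whatever HTTP client or Elasticsearch client you already use. Three restores of a 20-day snapshot would give roughly three times the data volume, although every copy covers the same time range.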

rschirin · August 31, 2021, 8:49am

OK, I will give the snapshot idea a try as well.
Thanks

