Is Rally a good choice to create a data lake?

Hey there,
I regularly have to create clusters from scratch and test their performance once the indices contain at least 20 days of data. Having a data lake would let me avoid waiting that long, so I started thinking about using Rally.

I tried creating a custom track from an existing index on a test cluster.
The operation completed correctly, but I would like to know whether, from your point of view, this is a good way to create a data lake.

thanks

What exactly do you mean by data lake here?

I mean having a set of data (documents) available to fill a new cluster whenever I need it.
With a custom track saved wherever I want, I could load it into the cluster and speed things up, since I am fairly sure Rally will be faster than real day-by-day ingestion (i.e. Rally would take 1-2 days to bulk-index 20 days of documents).

By far the fastest way to load large volumes of data into a cluster for testing is to restore a saved snapshot. This naturally means that the data will cover a fixed time period rather than the last X days, but that might be good enough for some types of performance testing, which you can perform using Rally.
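
As a rough illustration, a restore can be triggered with a single call to the snapshot restore endpoint (`POST /_snapshot/<repository>/<snapshot>/_restore`). The sketch below only builds the request and keeps the HTTP call in a separate function. The repository name `perf_repo`, the snapshot name `baseline_20d`, the index pattern `logs-*`, and the local URL are all made-up placeholders, and it assumes an unsecured test cluster with the repository already registered.

```python
import json
import urllib.request


def build_restore_request(base_url, repository, snapshot, indices):
    """Return (url, body) for a snapshot restore call; performs no I/O."""
    url = f"{base_url.rstrip('/')}/_snapshot/{repository}/{snapshot}/_restore"
    body = {
        # Comma-separated list of indices (or patterns) to restore.
        "indices": ",".join(indices),
        # Restore only index data, not cluster-wide settings or templates.
        "include_global_state": False,
    }
    return url, body


def send_restore(url, body):
    """POST the restore request; needs a reachable cluster (assumed unsecured)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


if __name__ == "__main__":
    # Placeholder names; replace with your own repository, snapshot and indices.
    url, body = build_restore_request(
        "http://localhost:9200", "perf_repo", "baseline_20d", ["logs-*"]
    )
    print(url)
    print(json.dumps(body, indent=2))
    # send_restore(url, body)  # uncomment when pointing at a real test cluster
```

Once the restore has finished, Rally can be pointed at the cluster to run the actual benchmark against the restored indices.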

Thanks, this could be a good idea. But I have one doubt: Rally compresses the track (and has to uncompress it again, I know), while a snapshot is not compressed. Is that a downside of the snapshot approach?

Indices are already compressed (although not quite as well as compressed JSON data), and you save a lot of time and resources because the data does not need to be indexed again. I have previously used this method to quickly restore terabytes of data to clusters for performance testing where indexing the full data set would have taken a lot longer.

With Rally I have this scenario:

150 million docs
44 GB compressed
700 GB uncompressed (I suppose including 1 replica)

Could I get a similar result with a snapshot as well?
Apart from that, can you explain the sentence "save time as data does not need to be indexed"?

Restoring a snapshot is a LOT faster and more efficient than indexing the same amount of data. If you want to build up even larger data volumes you can even restore the same snapshot multiple times if you rename the indices when you restore.
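
To make the rename trick concrete, the restore API accepts `rename_pattern` and `rename_replacement` in the request body, so each restore of the same snapshot can land under a different index name. The sketch below only generates the request bodies. The `copy` prefix and the `logs-*` pattern are made-up placeholders, and it assumes plain indices (no data streams) and that index aliases are not needed on the copies.

```python
import json


def build_renamed_restores(indices, copies, prefix="copy"):
    """Build one restore body per copy, each renaming the restored indices."""
    bodies = []
    for n in range(1, copies + 1):
        bodies.append({
            "indices": indices,
            # Capture the whole original index name...
            "rename_pattern": "(.+)",
            # ...and prepend a per-copy prefix, e.g. copy1-<original name>.
            "rename_replacement": f"{prefix}{n}-$1",
            # Skip aliases so the copies do not all share the original aliases.
            "include_aliases": False,
            "include_global_state": False,
        })
    return bodies


if __name__ == "__main__":
    # Each body would be sent to POST /_snapshot/<repository>/<snapshot>/_restore.
    for body in build_renamed_restores("logs-*", copies=3):
        print(json.dumps(body))
```

Restoring three renamed copies this way would roughly triple the data volume on disk, so check the available disk space before scaling up the number of copies.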

You can naturally also combine the two methods: restore a background data set using snapshots and at the same time index new data using Rally.

OK, I will give the snapshot idea a try as well :grinning:
Thanks