hey there,
I regularly have to create clusters from scratch and test their performance once the indices contain at least 20 days of data. Having a data lake would let me avoid waiting for that period, so I started to think about using Rally.
I tried to create a custom track from an already existing index on a test cluster.
The operation completed correctly, but I would like to know whether, from your point of view, this could be a good way to create a data lake, i.e. to have a set of data (documents) available to fill a new cluster whenever I need it.
With a custom track saved wherever I want, I could load it into the cluster and speed up the operation, since I am quite sure that Rally will be faster than real day-by-day ingestion (e.g. I expect Rally to take 1-2 days to bulk index 20 days of documents).
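For reference, a custom track like the one described above is typically generated with Rally's `create-track` subcommand. The following is only a minimal sketch that assembles that command line in Python; the track name, host, index names and output path are made-up placeholders, not values from this thread.

```python
import shlex


def create_track_command(track_name, target_hosts, indices, output_path):
    """Build the argv list for `esrally create-track`.

    All argument values are supplied by the caller; nothing here
    contacts a cluster.
    """
    return [
        "esrally",
        "create-track",
        f"--track={track_name}",
        f"--target-hosts={target_hosts}",
        f"--indices={','.join(indices)}",
        f"--output-path={output_path}",
    ]


# Hypothetical values, for illustration only.
cmd = create_track_command(
    track_name="datalake-20d",
    target_hosts="127.0.0.1:9200",
    indices=["logs-000001", "logs-000002"],
    output_path="/data/tracks",
)

# Print the command so it can be reviewed before running it,
# e.g. with subprocess.run(cmd, check=True).
print(shlex.join(cmd))
```

The generated track directory can then be copied anywhere and replayed later against a fresh cluster.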
By far the fastest way to load large volumes of data into a cluster for testing is to restore a saved snapshot. This naturally means that the data will cover a fixed time period rather than the last X days, but that might be good enough for some types of performance testing, which you can perform using Rally.
thanks, this could be a good idea. But I have one doubt: Rally compresses the track (and then has to uncompress it, I know), while a snapshot does not do that. Is that a downside of the snapshot approach?
Indices are already compressed (although not quite as well as compressed JSON data), and you save a lot of time and resources because the data does not need to be indexed. I have previously used this method to quickly restore terabytes of data to clusters for performance testing, where indexing the full data set would have taken a lot longer.
Restoring a snapshot is a LOT faster and more efficient than indexing the same amount of data. If you want to build up even larger data volumes, you can restore the same snapshot multiple times, as long as you rename the indices on each restore.
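As a rough illustration of restoring the same snapshot several times under different index names, here is a sketch that builds the restore requests using the `rename_pattern` and `rename_replacement` options of the snapshot restore API. The repository name, snapshot name, index pattern and `copyN-` prefix are assumptions chosen for the example, and the `send` helper assumes an unsecured cluster reachable over plain HTTP.

```python
import json
import urllib.request


def restore_requests(repository, snapshot, indices, copies):
    """Return (path, body) pairs, one restore request per copy.

    Each copy gets a different prefix so the restored indices do not
    collide with each other.
    """
    requests = []
    for n in range(1, copies + 1):
        path = f"/_snapshot/{repository}/{snapshot}/_restore?wait_for_completion=true"
        body = {
            "indices": indices,
            "include_global_state": False,
            # Capture the whole original index name ...
            "rename_pattern": "(.+)",
            # ... and prepend a per-copy prefix (hypothetical naming scheme).
            "rename_replacement": f"copy{n}-$1",
        }
        requests.append((path, body))
    return requests


def send(base_url, path, body):
    """POST one restore request; assumes no authentication or TLS is needed."""
    req = urllib.request.Request(
        base_url + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Hypothetical names; only prints the requests, nothing is sent.
for path, body in restore_requests("my_repo", "snapshot_20d", "logs-*", copies=3):
    print("POST", path)
    print(json.dumps(body, indent=2))
```

To actually run the restores, each pair could be passed to `send("http://localhost:9200", path, body)` one after another.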
You can naturally also combine the two methods: restore a background data set using snapshots and, at the same time, index new data using Rally.