Eventually I did this by writing the data to large files. However, compressing such large files is too slow.
Is there a big difference in source-file performance between using the original file documents.json and the compressed file documents.json.bz2?
Have a look at the rally-eventdata-track. Unlike other tracks, it does not rely on data in files but instead generates data at runtime based on a set of probability distributions. This makes it possible to generate very large amounts of data with just the track configuration. You can use it as is with a modified config, or use it as a base for generating your own track that handles your particular type of data.
This blog post describes how it was used to generate 4 TB of indexed data for a set of storage benchmarks. This video also discusses this track and its use.
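To make that concrete, invoking the eventdata track usually looks something like the sketch below. This is a hedged illustration only: the challenge name, host, and flag values are assumptions, so check the track's README for the names that apply to your Rally version.

```shell
# Run a challenge from the eventdata track repository against a cluster.
# Challenge name, host, and pipeline below are illustrative, not prescriptive.
esrally race \
  --track-repository=eventdata \
  --track=eventdata \
  --challenge=elasticlogs-1bn-load \
  --target-hosts=localhost:9200 \
  --pipeline=benchmark-only
```

Because the data is generated at runtime, scaling the benchmark up or down is a matter of track parameters rather than preparing ever-larger source files.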
This seems to be an extension based on the event-data log type, but my scenario requires testing with our actual business logs. Can rally-eventdata-track scale up the data volume with a custom index mapping?
Also, does rally-eventdata-track support customizing a track to generate terabytes of data?
By the way, does source-file support configuring more than one file? That would make it convenient to increase or decrease the document count according to different requirements.
If the event format created by the rally-eventdata-track works for you, it is relatively easy to create a new challenge that generates a very large amount of data. You may also be able to alter the mappings used if necessary. If you need a specific event format and mappings, you will probably need to customize the track or generate files.
I am still not sure I fully understand what you are looking to test. Could you please elaborate on what you want to test and achieve? Are you looking to index into a set of time-based indices and see how the cluster performs with large amounts of data, or are you going to index into a single index that will grow very large? What is your use case?
We want to test sending 2 TB of specific data to an index to see how large the cluster can grow. I have now solved this problem, but a puzzle came up during testing.
First question: does the final throughput result include replicas? I think the result covers only primary shards, and it seems to be calculated from the samples as sum(bulk_size) / time_period. Is that right?
I have read How Write throughput is calculated in Rally - #2 by dliappis
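The sample-based calculation described above can be sketched in a few lines of Python. The sample values are invented purely for illustration, and Rally's real implementation is more nuanced, as the linked post explains; this only shows the shape of sum(bulk_size) / time_period.

```python
# Sketch of a sample-based throughput calculation: total documents sent,
# divided by the wall-clock window covered by the samples.
samples = [
    # (timestamp in seconds, bulk size in documents) -- invented values
    (0.0, 5000),
    (1.0, 5000),
    (2.0, 5000),
    (3.0, 5000),
]

def throughput_docs_per_s(samples):
    """Return documents per second over the sampled window."""
    total_docs = sum(size for _, size in samples)
    time_period = samples[-1][0] - samples[0][0]
    return total_docs / time_period

print(throughput_docs_per_s(samples))  # 20000 docs over 3 s
```

Note that this counts only what the client sends; it says nothing about the extra shard-level write work the cluster does internally.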
Second question: when index_append starts, I use iostat to monitor the I/O, and sometimes the read bytes are zero. Why? bulk_indexing_client_num is 64.