Eventually I did this by writing the data to large files. However, compressing such large files is too slow.
Is there a big difference in source-file performance between using the original file documents.json and the compressed file documents.json.bz2?
Have a look at the rally-eventdata-track. Unlike other tracks, it does not rely on data in files but instead generates data at runtime based on a set of probability distributions. This makes it possible to generate very large amounts of data with just the track configuration. You can use it as is with a modified config, or use it as a base for generating your own track that handles your particular type of data.
This blog post describes how it was used to generate 4 TB of indexed data for a set of storage benchmarks. This video also discusses this track and its use.
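To make that concrete, invoking the eventdata track usually looks something like the sketch below. This is a hedged illustration only: the challenge name, host, and flag values are assumptions, so check the track's README for the names that apply to your Rally version.

```shell
# Run a challenge from the eventdata track repository against a cluster.
# Challenge name, host, and pipeline below are illustrative, not prescriptive.
esrally race \
  --track-repository=eventdata \
  --track=eventdata \
  --challenge=elasticlogs-1bn-load \
  --target-hosts=localhost:9200 \
  --pipeline=benchmark-only
```

Because the data is generated at runtime, scaling the benchmark up or down is a matter of track parameters rather than preparing ever-larger source files.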
This seems to be an extension based on the event-data log type, but my scenario requires testing with our actual business logs. Can rally-eventdata-track scale up the data volume with a custom index mapping?
Also, does rally-eventdata-track support customizing a track to generate terabytes of data?
By the way, does source-file support configuring more than one file? That would make it convenient to increase or decrease the document count according to different requirements.
If the event format created by the rally-eventdata-track works for you, it is relatively easy to create a new challenge that generates a very large amount of data. You may also be able to alter the mappings used if necessary. If you need a specific event format and mappings, you will probably need to customize the track or generate files.
I am still not sure I fully understand what you are looking to test. Could you please elaborate on what you want to test and achieve? Are you looking to index into a set of time-based indices and see how the cluster performs with large amounts of data, or are you going to index into a single index that will grow very large? What is your use case?
We want to test sending 2 TB of specific data to an index to see how large the cluster can grow. I have now solved this problem, but a puzzle came up during testing.
First question: does the final throughput result include replicas? I think the result covers only primary shards, and it seems to be calculated from the samples as sum(bulk_size) / time_period. Is that right?
I have read How Write throughput is calculated in Rally - #2 by dliappis
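The sample-based calculation described above can be sketched in a few lines of Python. The sample values are invented purely for illustration, and Rally's real implementation is more nuanced, as the linked post explains; this only shows the shape of sum(bulk_size) / time_period.

```python
# Sketch of a sample-based throughput calculation: total documents sent,
# divided by the wall-clock window covered by the samples.
samples = [
    # (timestamp in seconds, bulk size in documents) -- invented values
    (0.0, 5000),
    (1.0, 5000),
    (2.0, 5000),
    (3.0, 5000),
]

def throughput_docs_per_s(samples):
    """Return documents per second over the sampled window."""
    total_docs = sum(size for _, size in samples)
    time_period = samples[-1][0] - samples[0][0]
    return total_docs / time_period

print(throughput_docs_per_s(samples))  # 20000 docs over 3 s
```

Note that this counts only what the client sends; it says nothing about the extra shard-level write work the cluster does internally.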
Second question: when index_append starts, I use iostat to monitor the I/O, and sometimes the read bytes are zero. Why? bulk_indexing_client_num is 64.