Track data on the load driver nodes

Hi all,

We have built a data generation tool that is distributed across 3 nodes. After data generation we have 3 files containing documents, say documents_1.json, documents_2.json, and documents_3.json. Now suppose we have a distributed load test driver with one benchmark coordinator and 3 worker nodes. Do I need to copy all 3 document files to each of the 3 worker nodes? If so, what if the document files are huge (terabytes)? Does every worker node need enough disk space for all 3 files? Is there any workaround for this?

Thanks,
Akhil

Hello,

Yes, the document corpora will occupy space on each load driver node.
To avoid populating them manually on the load driver machines, you can store the (compressed) corpora in a network location accessible from your machines and use the base-url and source-file properties to specify the location, as per the docs. When you start Rally, it downloads any corpora that are not already present and uncompresses them.
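
When you describe a self-hosted corpus in the track, Rally also wants to know how many documents each file holds and, ideally, its compressed and uncompressed sizes. A minimal sketch for collecting those numbers is below. It assumes one JSON document per line, and the property names are as I remember them from the track reference, so please check them against the docs for your Rally version. corpus_entry is just a helper name made up for this post.

```shell
# Sketch: print one entry for the "documents" list of a Rally corpus.
# $1 = uncompressed file (one JSON document per line), $2 = compressed file.
# Property names are assumed from the Rally track reference; verify them.
corpus_entry() {
    doc_count=$(wc -l < "$1" | tr -d ' ')      # one document per line
    raw_bytes=$(wc -c < "$1" | tr -d ' ')      # size before compression
    packed_bytes=$(wc -c < "$2" | tr -d ' ')   # size after compression
    printf '{"source-file": "%s", "document-count": %s, "compressed-bytes": %s, "uncompressed-bytes": %s}\n' \
        "$(basename "$2")" "$doc_count" "$packed_bytes" "$raw_bytes"
}
```

For example, corpus_entry documents_1.json documents_1.json.bz2 prints the entry for the first file. The base-url itself is set once per corpus and points at the directory that holds the compressed files.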

Having Rally "stream" the document corpora would introduce a potential bottleneck: it could easily saturate the network interface (unless you have a dedicated interface for the route to the location of the docs) and would raise the CPU and I/O requirements on the load drivers, and thus taint the results.

If you are looking for a manual way to distribute large JSON files, here are a few tips:

  • Such files compress very well. It's advisable to use something like pbzip2 -v -k -m10000 documents.json, which takes advantage of all CPUs and saves time.
  • Distribute the files to the right directories, which should be faster now that they are compressed.
  • Uncompress the files, again utilizing all CPUs, with pbzip2 -v -d -k -m10000 documents.json.bz2
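
The three steps above could be scripted roughly as follows. This is only a sketch: plan_distribution is a made-up helper, the host names and target directory are placeholders, and it assumes passwordless ssh/scp to the workers, pbzip2 installed on every machine, and paths without spaces. It only prints the commands, so you can review them before piping the output to sh.

```shell
# Sketch: print the commands that compress a corpus once, copy the archive
# to each worker, and decompress it there. Nothing is executed here.
# $1 = corpus file, $2 = target directory on the workers, rest = worker hosts.
plan_distribution() {
    corpus=$1
    dest=$2
    shift 2
    archive=$(basename "$corpus").bz2
    # Compress locally, keeping the original (-k) and using all CPUs.
    echo "pbzip2 -v -k -m10000 $corpus"
    for host in "$@"; do
        # Copy the compressed archive, then decompress it on the worker.
        echo "scp $corpus.bz2 $host:$dest/"
        echo "ssh $host pbzip2 -v -d -k -m10000 $dest/$archive"
    done
}
```

Something like plan_distribution documents.json /data/corpus worker1 worker2 worker3 | sh would then run the whole thing, with /data/corpus replaced by the directory where Rally expects the corpus on your workers.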
