Yes, the document corpora will occupy space on each load driver node.
To avoid populating them manually on the load driver machines, you can store the (compressed) corpora in a network location accessible from your machines and use the
source_file properties to specify the location, as per the docs. When you start Rally, it will download the corpora if they are not already present and uncompress them.
Having Rally "stream" the document corpora would introduce a potential bottleneck and thus taint the results: it could easily saturate the network interface (unless you have a dedicated network interface for the TCP/IP route towards the location of the docs), and it would also raise the CPU and IO requirements on the load driver.
If you are looking for a manual way to distribute large JSON files, here are a few tips:
- such files compress very well. It's advisable to use something like
pbzip2 -v -k -m10000 documents.json which takes advantage of all CPUs and saves time.
- distribute the files to the right directories, which should be faster now that they are compressed.
- uncompress the files, again utilizing all CPUs with
pbzip2 -v -d -k -m10000 documents.json.bz2
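
The three steps above could be scripted roughly as follows. This is only a sketch under assumptions: the use of scp and ssh, the host names, and the target directory are hypothetical and not something Rally prescribes, and the remote decompression step assumes pbzip2 is installed on each load driver. The local helpers fall back to plain bzip2 when pbzip2 is missing, which is slower but produces the same format.

```shell
#!/usr/bin/env bash
# Sketch: compress a corpus, copy it to each load driver, decompress it there.
# Host names, paths, and the scp/ssh transport are illustrative assumptions.

# Prefer pbzip2 (multi-core); fall back to bzip2 if it is not installed.
bz_tool() {
  if command -v pbzip2 >/dev/null 2>&1; then
    echo pbzip2
  else
    echo bzip2
  fi
}

# Compress a file in place, keeping the original (-k).
compress_corpus() {
  "$(bz_tool)" -k "$1"
}

# Decompress a .bz2 file, keeping the archive (-k).
decompress_corpus() {
  "$(bz_tool)" -d -k "$1"
}

# Copy the archive to a directory on each host, then decompress it remotely.
# Usage: distribute_corpus <archive> <remote_dir> <host>...
distribute_corpus() {
  local archive="$1" target_dir="$2"
  shift 2
  local host
  for host in "$@"; do
    scp "$archive" "$host:$target_dir/" || return 1
    ssh "$host" "cd '$target_dir' && pbzip2 -d -k '$(basename "$archive")'" || return 1
  done
}
```

For example, after `compress_corpus documents.json`, a call such as `distribute_corpus documents.json.bz2 /path/to/corpus driver1 driver2` would push the archive to two hypothetical hosts; the directory has to be wherever your Rally setup expects the corpus.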