****** Use this pipeline only if you are aware of the tradeoffs. ******
*************************** Watch your step! ***************************
[INFO] Racing on track [company], challenge [index-and-query] and car ['external'] with version [6.2.3].
[WARNING] indexing_total_time is 24166 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] refresh_total_time is 5519 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[ERROR] Cannot race. Error in track preparator (('Could not download [https://s3.amazonaws.com/FOLDER_PATH/company_data.json.bz2] to [/home/ec2-user/.rally/benchmarks/data/rally-tutorial/company_data.json.bz2] (HTTP status: 403)', None))
Getting further help:
Check the log files in /home/ec2-user/.rally/logs for errors.
However, I am able to download the data file manually through the CLI.
I am using an AWS Elasticsearch domain to test performance and have created my own track. It works fine when I download the data file manually and place it in the company folder. But when I use the "base-url" property so that Rally downloads the data file automatically, the download fails. The command I am using is as follows:
CMD>> esrally --pipeline=benchmark-only --track-path=/home/ec2-user/.rally/benchmarks/rally-track/company --target-hosts=https://vpc-test-rally-es-XXXXXX.us-east-1.es.amazonaws.com
on the Rally instance and report back if this is working?
For the record, Rally invokes a normal urllib3 GET request to download the track, so there's really no magic involved here; it will honor the http_proxy env var, if defined.
I observed that the error you are receiving (in the initial comment) is:
Racing on track [company], challenge [index-and-query] and car ['external'] with version [6.2.3].
i.e. Rally fails to download your own company track (sorry, I was under the impression that Rally had failed to download the default geonames track, which is why I asked you to curl exactly that).
So I presume that your company track resides in an S3 bucket of your own.
Rally downloads files over HTTP, so that is the kind of URL you should define as base-url; is your bucket (or object) publicly accessible, though? That is, if you run curl -O https://<url_of_your_s3_bucket>/... (you'll find the URL in the S3 properties), does this get you the company_data file?
The aws cli tools (aws s3 cp etc.) use the s3://bucket-name/path scheme, and IAM Roles and Instance Profiles can be adjusted to grant an instance access to a bucket. That may explain why you are able to grab the file using aws s3 cp. It doesn't automatically grant access to the object via http, though.
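If curl is not handy, the same check can be sketched in Python. This is only a rough illustration, not what Rally does internally: it uses the standard library's urllib.request rather than urllib3, and the names probe and classify_status are made up for this example. It sends an unauthenticated request to the URL you pass in and reports how the server answered; a 403 here would match the error in the Rally output above.

```python
import urllib.error
import urllib.request


def classify_status(status):
    """Map an HTTP status code to a short diagnosis for an anonymous download."""
    if 200 <= status < 300:
        return "public"      # the object can be fetched without credentials
    if status == 403:
        return "forbidden"   # the server refuses unauthenticated access
    if status == 404:
        return "not found"   # wrong bucket, path, or file name
    return "other"


def probe(url, timeout=10):
    """Send an unauthenticated HEAD request and return the HTTP status code.

    No AWS credentials are attached, so IAM roles or instance profiles
    that make `aws s3 cp` work have no effect on this request.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx responses; the code is what we want here.
        return err.code
```

To try it, call classify_status(probe(url)) with the full https URL of your data file. If it reports "forbidden", the object is most likely not readable anonymously over http, and its permissions would need to change before Rally can download it.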