Rally fails to download data file from AWS S3

It reports the following error:


****** Use this pipeline only if you are aware of the tradeoffs. ******
*************************** Watch your step! ***************************


[INFO] Racing on track [company], challenge [index-and-query] and car ['external'] with version [6.2.3].

[WARNING] indexing_total_time is 24166 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] refresh_total_time is 5519 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.

[ERROR] Cannot race. Error in track preparator (('Could not download [https://s3.amazonaws.com/FOLDER_PATH/company_data.json.bz2] to [/home/ec2-user/.rally/benchmarks/data/rally-tutorial/company_data.json.bz2] (HTTP status: 403)', None))

Getting further help:



[INFO] FAILURE (took 1 second)

However, I am able to download the data file manually through the AWS CLI.

I am using an AWS Elasticsearch domain to test performance and have created my own track. It works fine when I download the data file manually and place it in the company folder. But when I use the "base-url" property so that Rally downloads the data file automatically, the download fails. The command I am using is as follows:
CMD>> esrally --pipeline=benchmark-only --track-path=/home/ec2-user/.rally/benchmarks/rally-track/company --target-hosts=https://vpc-test-rally-es-XXXXXX.us-east-1.es.amazonaws.com

Here is the log snapshot:

Hello,

You mentioned:

It is working fine when I download the data file manually and place it to company folder.

Did you try to download the file manually (e.g. using curl or wget) in the same instance where you are running Rally?

Can you please execute:

curl -O http://benchmarks.elasticsearch.org.s3.amazonaws.com/corpora/geonames/documents-2.json.bz2

on the Rally instance and report back if this is working?

For the record, Rally invokes a normal urllib3 GET request to download the track, so there's really no magic involved here; it will honor the http_proxy env var, if defined.
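To reproduce the same kind of request outside Rally, here is a minimal sketch using Python's standard library (urllib.request as a stand-in, not Rally's actual download code). The function name anonymous_status and the injectable opener parameter are hypothetical, chosen only for illustration. Like the request described above, it sends no AWS credentials and picks up http_proxy from the environment if it is set.

```python
import sys
import urllib.error
import urllib.request


def anonymous_status(url, opener=urllib.request.urlopen):
    """Return the HTTP status code of a plain, unauthenticated GET to url.

    opener is injectable so the function can be exercised without network
    access; by default it is urllib.request.urlopen.
    """
    request = urllib.request.Request(url, method="GET")
    try:
        with opener(request) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx/5xx responses; err.code is the status.
        return err.code


if __name__ == "__main__" and len(sys.argv) > 1:
    # Pass the full URL of the data file as the first argument.
    print(anonymous_status(sys.argv[1]))
```

A 200 means the file is reachable over plain HTTP; a 403 means the object is not readable without credentials, which matches the error Rally reports.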

Dimitris

I am able to download documents-2.json.bz2. The output is as follows:

-rw-rw-r-- 1 ec2-user ec2-user 264698741 Nov 1 08:35 documents-2.json.bz2

I am also able to download the data file manually with the AWS CLI. Here is a sample command:

aws s3 cp s3://BUCKET_NAME/FOLDER_PATH/company_data.json.bz2 company_data.json.bz2

Hi,

I observed that the error you are receiving (in the initial comment) is:

Racing on track [company], challenge [index-and-query] and car ['external'] with version [6.2.3].

i.e. Rally fails to download your own company track (sorry, I was under the impression Rally failed to download the default, geonames, track, that's why I asked you to curl exactly that).

So I presume that your company track resides in an S3 bucket of your own.
Rally tries to download things using an HTTP URL, which is what you should define as base-url. Is your bucket (or object) publicly accessible, though? That is, if you run curl -O https://<url_of_your_s3_bucket>/... (you'll find the URL in the S3 properties), does this get you the company_data file?

The AWS CLI tools (aws s3 cp etc.) use the s3://bucket-name/path scheme, and IAM Roles and Instance Profiles can be adjusted to grant access to a bucket from within an instance. This may explain why you are able to grab the file using aws s3 cp; it doesn't automatically grant access to the object via HTTP, though.

Regards,
Dimitris

When I try to download the data file with the curl command, it throws the following error:
e.g. > curl -O https://s3.amazonaws.com/BUCKET_NAME/FOLDER_PATH/company_data.json

Error :
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>AccessDenied</Code><Message>Access Denied</Message><RequestId>7867ADC755A3F2EC</RequestId><HostId>VG4Lhh2B4veFd8bhtUk1E8Ew/Kg/CCPCSGDdydMhm5ArinXBBaGnY9bLXAWFXs5ZgEsqPPWOJlQ=</HostId></Error>

Given the above error, do I need to make my object public?

curl -O https://s3.amazonaws.com/BUCKET_NAME/FOLDER_PATH/company_data.json

This doesn't look like a correct HTTP URL for an S3 bucket (the s3:// scheme is used by the AWS CLI only).

Please refer to "Buckets overview" in the Amazon Simple Storage Service documentation to see how to get the URL of the bucket and object, and how to make it public.
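As a rough illustration of the mapping between the two address forms, the sketch below converts an s3:// URI (as used by aws s3 cp) into the virtual-hosted-style HTTPS URL that AWS documents for S3 objects. The function name s3_uri_to_https and the optional region parameter are hypothetical, and BUCKET_NAME and FOLDER_PATH are the placeholders from this thread, not real values. It assumes the object key contains no "?" or "#" characters.

```python
from urllib.parse import quote, urlsplit


def s3_uri_to_https(s3_uri, region=None):
    """Convert s3://bucket/key into a virtual-hosted-style HTTPS URL."""
    parts = urlsplit(s3_uri)
    if parts.scheme != "s3" or not parts.netloc:
        raise ValueError(f"not an s3:// URI: {s3_uri!r}")
    bucket = parts.netloc
    key = parts.path.lstrip("/")
    # With a region, the host is bucket.s3.<region>.amazonaws.com;
    # without one, the global endpoint form is used.
    if region:
        host = f"{bucket}.s3.{region}.amazonaws.com"
    else:
        host = f"{bucket}.s3.amazonaws.com"
    # quote() percent-encodes unsafe characters but keeps "/" separators.
    return f"https://{host}/{quote(key)}"


if __name__ == "__main__":
    print(s3_uri_to_https("s3://BUCKET_NAME/FOLDER_PATH/company_data.json.bz2"))
```

The track's base-url would then be this URL without the trailing file name. The URL form alone does not grant access: an unauthenticated request still returns 403 until the object is readable over HTTP.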
