Rally failed for downloading data file from aws S3

mahesh_varak89 · October 31, 2018, 2:02pm

Rally reports the following error:

****** Use this pipeline only if you are aware of the tradeoffs. ******
*************************** Watch your step! ***************************

[INFO] Racing on track [company], challenge [index-and-query] and car ['external'] with version [6.2.3].

[WARNING] indexing_total_time is 24166 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.
[WARNING] refresh_total_time is 5519 ms indicating that the cluster is not in a defined clean state. Recorded index time metrics may be misleading.

[ERROR] Cannot race. Error in track preparator (('Could not download [https://s3.amazonaws.com/FOLDER_PATH/company_data.json.bz2] to [/home/ec2-user/.rally/benchmarks/data/rally-tutorial/company_data.json.bz2] (HTTP status: 403)', None))

Getting further help:

Check the log files in /home/ec2-user/.rally/logs for errors.
Read the documentation at https://esrally.readthedocs.io/en/1.0.1/
Ask a question on the forum at https://discuss.elastic.co/c/elasticsearch/rally
Raise an issue at https://github.com/elastic/rally/issues and include the log files in /home/ec2-user/.rally/logs.

[INFO] FAILURE (took 1 second)

However, I am able to download the data file manually through the CLI.

I am using an AWS Elasticsearch domain to test performance and have created my own track. It is working fine when I download the data file manually and place it to company folder. But when I use the "base-url" property so that Rally downloads the data file automatically, the download fails. The command I am using is as follows:
CMD>> esrally --pipeline=benchmark-only --track-path=/home/ec2-user/.rally/benchmarks/rally-track/company --target-hosts=https://vpc-test-rally-es-XXXXXX.us-east-1.es.amazonaws.com

Here is the log snapshot:

dliappis · November 1, 2018, 8:31am

Hello,

You mentioned:

It is working fine when I download the data file manually and place it to company folder.

Did you try to download the file manually (e.g. using curl or wget) in the same instance where you are running Rally?

Can you please execute:

curl -O http://benchmarks.elasticsearch.org.s3.amazonaws.com/corpora/geonames/documents-2.json.bz2

on the Rally instance and report back if this is working?

For the record, Rally invokes a normal urllib3 GET request to download the track, so there's really no magic involved here; it will honor the http_proxy env var, if defined.
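The kind of request described above can be reproduced outside Rally with a short script. The following is a minimal sketch, not Rally's actual code: it uses Python's standard urllib.request instead of urllib3, sends a plain GET without any AWS credentials, and returns the HTTP status. The function name anonymous_status and the injectable opener parameter are made up for illustration.

```python
import urllib.error
import urllib.request


def anonymous_status(url, opener=urllib.request.urlopen):
    """Return the HTTP status of an unauthenticated GET to `url`.

    Hypothetical helper, not part of Rally. No AWS credentials or
    signatures are sent, so a private S3 object is expected to answer 403.
    The default opener picks up proxy settings such as http_proxy from
    the environment.
    """
    try:
        with opener(url) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx responses; the status is on the exception.
        return err.code
```

Calling anonymous_status with the data file URL from the Rally error message, on the same instance that runs Rally, should show whether the object is reachable without credentials.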

Dimitris

mahesh_varak89 · November 1, 2018, 8:41am

I am able to download documents-2.json.bz2. The output is as follows:

-rw-rw-r-- 1 ec2-user ec2-user 264698741 Nov 1 08:35 documents-2.json.bz2

I am also able to download the data file manually with the AWS CLI. Here is a sample command:

aws s3 cp s3://BUCKET_NAME/FOLDER_PATH/company_data.json.bz2 company_data.json.bz2

dliappis · November 1, 2018, 9:11am

Hi,

I observed that the error you are receiving (in the initial comment) is:

Racing on track [company], challenge [index-and-query] and car ['external'] with version [6.2.3].

i.e. Rally fails to download your own company track (sorry, I was under the impression Rally failed to download the default, geonames, track, that's why I asked you to curl exactly that).

So I presume that your company track resides in an S3 bucket of your own.
Rally tries to download things using an http URL, which is what you should be defining as base-url; is your bucket (or object) publicly accessible, though? i.e. if you try to curl -O https://<url_of_your_s3_bucket>/... (you'll find the URL in the S3 properties), does this get you the company_data file?

The aws cli tools (aws s3 cp etc.) use the s3://bucket-name/path schema and IAM Roles and Instance Profiles can be adjusted to grant access to a bucket from within an instance, which may explain why you are able to grab the file using aws s3 cp; this doesn't automatically grant access to the object via http, though.
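To make the difference between the two address forms concrete, here is a small sketch that rewrites an s3:// URI into the virtual-hosted-style HTTPS URL that a plain HTTP client would request. The helper name s3_uri_to_https and the bucket name my-bucket are illustrative only. Buckets outside the default endpoint may need a region-specific hostname, and keys containing "?" or "#" are not handled. Rewriting the URL does not grant access: the object must still be readable without credentials.

```python
from urllib.parse import quote, urlparse


def s3_uri_to_https(s3_uri):
    """Map s3://bucket/key to https://bucket.s3.amazonaws.com/key.

    Hypothetical helper for illustration only.
    """
    parsed = urlparse(s3_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("expected an s3://bucket/key URI, got %r" % s3_uri)
    key = parsed.path.lstrip("/")
    # Percent-encode the key but keep "/" so the folder structure survives.
    return "https://%s.s3.amazonaws.com/%s" % (parsed.netloc, quote(key))


print(s3_uri_to_https("s3://my-bucket/FOLDER_PATH/company_data.json.bz2"))
```

If curl against the resulting URL still returns Access Denied, the remaining problem is the object's permissions rather than the URL format.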

Regards,
Dimitris

mahesh_varak89 · November 1, 2018, 9:32am

When I try to download the data file with curl, it throws the following error:
e.g. > curl -O https://s3.amazonaws.com/BUCKET_NAME/FOLDER_PATH/company_data.json

Error :
<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>AccessDenied</Code><Message>Access Denied</Message><RequestId>7867ADC755A3F2EC</RequestId><HostId>VG4Lhh2B4veFd8bhtUk1E8Ew/Kg/CCPCSGDdydMhm5ArinXBBaGnY9bLXAWFXs5ZgEsqPPWOJlQ=</HostId></Error>

According to the above error, do I need to make my object public?

dliappis · November 1, 2018, 9:57am

curl -O https://s3.amazonaws.com/BUCKET_NAME/FOLDER_PATH/company_data.json

This doesn't look like a correct http URL for an S3 bucket (the s3:// schema is used by the aws cli command only).

Please refer to: Buckets overview - Amazon Simple Storage Service

to see how to get the URL of the bucket+object and how to make it public.

system · November 29, 2018, 9:57am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
