Could not clone from 'https://github.com/elastic/rally-tracks'


#1

I've been working with esrally recently. I ran into several problems and fixed some of them, but here is one I'm not quite sure about.

2016-11-04 03:50:02,686 root ERROR Cannot run subcommand [race].
Traceback (most recent call last):
File "/usr/local/python3.5.2/lib/python3.5/site-packages/esrally/rally.py", line 437, in dispatch_sub_command
racecontrol.run(cfg)
File "/usr/local/python3.5.2/lib/python3.5/site-packages/esrally/racecontrol.py", line 141, in run
raise e
File "/usr/local/python3.5.2/lib/python3.5/site-packages/esrally/racecontrol.py", line 138, in run
pipeline(cfg)
File "/usr/local/python3.5.2/lib/python3.5/site-packages/esrally/racecontrol.py", line 41, in __call__
self.target(cfg)
File "/usr/local/python3.5.2/lib/python3.5/site-packages/esrally/racecontrol.py", line 101, in from_distribution
return benchmark(cfg, mechanic.create(cfg, metrics_store, distribution=True), metrics_store)
File "/usr/local/python3.5.2/lib/python3.5/site-packages/esrally/racecontrol.py", line 67, in benchmark
t = track.load_track(cfg)
File "/usr/local/python3.5.2/lib/python3.5/site-packages/esrally/track.py", line 206, in load_track
repo = TrackRepository(cfg)
File "/usr/local/python3.5.2/lib/python3.5/site-packages/esrally/track.py", line 308, in __init__
git.clone(src=self.tracks_dir, remote=self.url)
File "/usr/local/python3.5.2/lib/python3.5/site-packages/esrally/utils/git.py", line 35, in clone
raise exceptions.SupplyError("Could not clone from '%s' to '%s'" % (remote, src))
esrally.exceptions.SupplyError: Could not clone from 'https://github.com/elastic/rally-tracks' to '/home/hadoop/.rally/benchmarks/tracks/default'
2016-11-04 03:50:02,691 rally.main INFO Attempting to shutdown internal actor system.
2016-11-04 03:50:02,768 rally.main INFO Shutdown completed.

I tried to visit 'https://github.com/elastic/rally-tracks' with curl, and it works fine.
The rally version is:

[hadoop@200server bin]$ ./esrally --version
esrally 0.4.3

I run Rally with this command:

[hadoop@200server bin]$ ./esrally --pipeline=from-distribution --distribution-version=5.0.0

I've checked my elasticsearch on port 39200, and it's working.

Any comments from you are appreciated!


(Daniel Mitterdorfer) #2

Hi @xihuanbanku,

I am eager to hear about any problems you encounter so I can improve the experience in the future, so don't hesitate to share them.

This looks like a git-related problem. Did you see anything on standard output in the terminal? In any case, you can issue the same command that Rally issues:

git clone https://github.com/elastic/rally-tracks /home/hadoop/.rally/benchmarks/tracks/default

Can you please execute this command and share the output with me? I guess the error message from git will reveal what's wrong.

Can you also run git --version?

Daniel


#3

Hi, Daniel
Thanks so much for your rapid reply.

After I tried this:

git clone https://github.com/elastic/rally-tracks /home/hadoop/.rally/benchmarks/tracks/default

I got this:

[hadoop@200server bin]$ git clone 'https://github.com/elastic/rally-tracks' /home/hadoop/.rally/benchmarks/tracks/default
Cloning into '/home/hadoop/.rally/benchmarks/tracks/default'...
fatal: Unable to find remote helper for 'https'

I figured out myself that the problem is in my git installation (I updated git from 1.x to 2.9.3, but I'm afraid I missed some configuration). Anyway, I could use 'git://github.com/elastic/rally-tracks' to work around it.
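The message `fatal: Unable to find remote helper for 'https'` means that git could not find its `git-remote-https` helper program. A likely cause, though not confirmed in this thread, is a git build from source without curl support. Below is a minimal Python sketch to check for the helper. The function names are made up for illustration.

```python
import os
import subprocess


def has_remote_helper(exec_path, protocol="https"):
    """Return True if git's exec path contains the helper for `protocol`."""
    # git delegates https:// URLs to an executable named git-remote-https
    # (git-remote-https.exe on Windows) that lives in its exec path.
    candidates = ("git-remote-" + protocol, "git-remote-" + protocol + ".exe")
    return any(os.path.isfile(os.path.join(exec_path, name)) for name in candidates)


def git_exec_path():
    """Ask the installed git where it keeps its helper programs."""
    result = subprocess.run(["git", "--exec-path"], stdout=subprocess.PIPE,
                            check=True, universal_newlines=True)
    return result.stdout.strip()
```

Calling `has_remote_helper(git_exec_path())` on the affected machine should return `False` if the helper is missing.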

Fortunately, I got this on my screen now:

Downloading data from [http://benchmarks.elasticsearch.org.s3.amazonaws.com/corpora/geonames/documents.json.bz2] (189 MB) to [/home/hadoop/.rally/benchmarks/data/geonames/documents.json.bz2] ...

So far so good. :grinning:


(Daniel Mitterdorfer) #4

Hi @xihuanbanku,

this looks good now! Just ask when anything else comes up.

Daniel


#5

@danielmitterdorfer Pretty sure, I will.


#6

Hi @danielmitterdorfer,

The good news is that I got my final score. But I'm a little confused by some of the records.
For example:

| Min Throughput | index-append | 30291.9 | docs/s |
| Median Throughput | index-append | 43255.2 | docs/s |
| Max Throughput | index-append | 44741.3 | docs/s |

Does this mean that the maximum indexing speed of my computer is 44741.3 docs/s? I ask because I configured "in-memory" in rally.ini.
Or maybe you could comment on the configuration items in rally.ini. I checked https://esrally.readthedocs.io/en/latest/configuration.html, but found nothing. Maybe I looked at the wrong page?

Another question: now I want to test the performance of my es cluster. I used the following command:

./esrally --pipeline=benchmark-only --target-hosts=192.168.35.12:8200 --offline

The log file told me this:

esrally.exceptions.SystemSetupError: Cannot find track data for distribution version 2.3.5

BTW, the cluster cannot reach the Internet. So how can I run Rally on this offline machine?

Frank


(Daniel Mitterdorfer) #7

Hi @xihuanbanku,

Yes, this is indeed the maximum indexing throughput that your system has achieved during the benchmark.

rally.ini is not really meant for you to edit directly but only via esrally configure (or esrally configure --advanced-config) :wink: Having said that, you can edit it directly, but do not remove any key as Rally considers all keys that are stored there mandatory.

"in-memory" doesn't have anything to do with the Elasticsearch cluster that is benchmarked but with your metrics store. You can choose to use a separate Elasticsearch instance here. Then Rally will store the metrics there instead of just "storing" metrics in-memory. The advantage of that is that you can compare metrics across multiple races (i.e. invocations of Rally).

The cluster that is actually benchmarked is launched by Rally internally and all its contents are destroyed afterwards (unless you tell Rally not to do so by specifying --preserve-install, but that will need several GB of disk space).

On to the other command you've tried:

./esrally --pipeline=benchmark-only --target-hosts=192.168.35.12:8200 --offline

The cluster on 192.168.35.12 does not need access to the Internet (you also should not run Rally on the same machine but on a different one). Only the machine where Rally is running needs access to the Internet in order to download track data and meta-data. --offline was originally meant to prevent Rally from attempting to download any track data when you develop your own tracks, but the most recent changes in Rally have made this flag almost obsolete in my opinion. Btw, on the command line you should also have seen a hint that you should disable offline mode (because Rally needs to download track data once). So I suggest you run Rally without the --offline flag:

./esrally --pipeline=benchmark-only --target-hosts=192.168.35.12:8200

Daniel


#8

Hi @danielmitterdorfer,

Okay, my bad, I just edited rally.ini myself. But the most important thing is that I have to append this to it to run Rally; otherwise, it does not work.

[source]
local.src.dir = /home/hadoop/.rally/src
remote.repo.url=https://github.com/elastic/elasticsearch.git

Is this a bug? Is this section missing from rally.ini by default?

If my system can achieve that speed, how can I get the es configuration? I know that Rally erases not only the data it tests with but also the configuration in elasticsearch.yml. Or maybe the default settings for es are enough?

I think you misunderstood me. In fact, 192.168.35.12 is the machine that is running Rally. es was already installed there, maybe a month ago, but the indexing speed is only 5000 docs/s (I wrote a Java program to test it). So I want to test it again with Rally. But the machine is offline, and I got the log message above. Maybe I should download the files from the Internet and transfer them to the following directory on 192.168.35.12?
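To check whether a given rally.ini already contains the section shown above, a small Python sketch with the standard `configparser` module could look like this. The helper name `missing_source_keys` is hypothetical, and the key list covers only the two keys from this post, not everything Rally may require.

```python
import configparser

# The two keys quoted in this post; Rally may expect more keys elsewhere.
REQUIRED_SOURCE_KEYS = ("local.src.dir", "remote.repo.url")


def missing_source_keys(ini_text):
    """Return the [source] keys from this thread that are absent in ini_text."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    if not parser.has_section("source"):
        return list(REQUIRED_SOURCE_KEYS)
    return [key for key in REQUIRED_SOURCE_KEYS
            if not parser.has_option("source", key)]


example = """
[source]
local.src.dir = /home/hadoop/.rally/src
remote.repo.url=https://github.com/elastic/elasticsearch.git
"""
print(missing_source_keys(example))  # prints [] because both keys are present
```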

~/.rally/benchmarks/data


(Daniel Mitterdorfer) #9

Hi @xihuanbanku,

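The [source] section is normally added automatically by Rally, so this seems to be a bug. It would be great if you could help me reproduce it.

To keep the Elasticsearch installation (and with it the configuration) after the race, just run for example: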

esrally --pipeline=from-distribution --distribution-version=5.0.0 --preserve-install=yes

At the end of the race (before the score is printed), Rally will print something along the lines of:

[INFO] Keeping benchmark candidate including index at [/Users/dm/.rally/benchmarks/races/2016-11-07-09-00-05/local/tracks/geonames/append-no-conflicts/install] (will need several GB).

In this directory, the whole installation (including the populated index) is stored. I urge you to delete it when you don't need it anymore as it eats up significant disk space (and Rally will - intentionally - not delete this directory for you).

Regarding the 5000 docs/s that you measured with your Java program, this raises several questions:

  1. Do you use the same data set? Because it makes a huge difference whether one document is 10 bytes or 10 kilobytes.
  2. Did you test under similar conditions? (Same number of clients?)
  3. Did you account for proper warmup, etc.?

I got you now. This is not the intended setup for Rally. It's doable but you have to be careful and you will lose some convenience. You can run Rally once on a machine that is connected to the Internet. Then copy the following folders to the target machine (i.e. 192.168.35.12 in your case):

~/.rally/benchmarks/data
~/.rally/benchmarks/tracks
~/.rally/benchmarks/distributions

But this means that you don't get any updates of tracks (meta-data), ES distributions or data. If it's possible you should use a dedicated machine for the load test driver.
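One way to move the three folders is to pack them into a single archive on the connected machine and unpack it on the target. The following Python sketch assumes the folder layout listed above. The function name and the archive name in the usage note are made up for illustration.

```python
import os
import tarfile

# Folder names taken from the post above, relative to ~/.rally/benchmarks.
FOLDERS = ("data", "tracks", "distributions")


def pack_benchmarks(benchmarks_dir, archive_path):
    """Pack the existing Rally folders into one uncompressed tar archive."""
    packed = []
    with tarfile.open(archive_path, "w") as archive:
        for name in FOLDERS:
            folder = os.path.join(benchmarks_dir, name)
            if os.path.isdir(folder):
                # arcname keeps the paths relative so the archive can be
                # unpacked into ~/.rally/benchmarks on the target machine.
                archive.add(folder, arcname=name)
                packed.append(name)
    return packed
```

For example, `pack_benchmarks(os.path.expanduser("~/.rally/benchmarks"), "rally-offline.tar")` returns the list of folders that were found and packed.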

Daniel


#10

Hi @danielmitterdorfer,

I'm afraid I cannot reproduce it now :sweat:. But I will pay attention to this in my new test environment.

  1. Yes, I used the same data "documents.json", 2.6G
  2. Yes, both use only one client.
  3. What kind of warmup? How to do this?

I tried, and it worked. The final score is 30000 docs/s, so I can index 30k docs per second with Rally. But when I run my Java program again against the same es cluster (actually only one client) and the same data, the result is still 5000 docs/s.
I browsed Rally's source code, but I'm not familiar with Python. So I hope you can tell me the main steps Rally takes to index data into es. I'd like to show you mine (Java program):

  1. Read data(documents.json) line by line.
  2. Create a index request and add it to a bulk request.
  3. When the line count reaches 10000, submit the bulk request.
  4. Loop over.
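For comparison, the four steps above can be sketched in Python as follows. This is not the Java program from this post. `submit` is a hypothetical callback standing in for whatever sends the bulk request.

```python
from itertools import islice


def bulks(lines, bulk_size=10000):
    """Yield lists of at most bulk_size non-empty lines from an iterable."""
    iterator = (line.rstrip("\n") for line in lines if line.strip())
    while True:
        bulk = list(islice(iterator, bulk_size))
        if not bulk:
            return
        yield bulk


def index_file(path, submit, bulk_size=10000):
    """Read path line by line and hand each bulk to the submit callback."""
    total = 0
    with open(path, encoding="utf-8") as source:
        for bulk in bulks(source, bulk_size):
            submit(bulk)  # hypothetical callback that sends one bulk request
            total += len(bulk)
    return total
```

Note that the last bulk is usually smaller than `bulk_size` and still has to be submitted, which is easy to miss when the flush is tied only to the line count reaching the limit.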

Or have you ever tested es with Java? If so, could you give me an example? :blush:


(Daniel Mitterdorfer) #11

Hi @xihuanbanku,

Every Java application should be properly warmed up before you start measuring. "Warmup" just means that you give the application some time right after startup and don't consider the samples that you took during that time. So it's just about how you treat the samples you take. Rally has implemented two flavors of warmup:

  • Based on an iteration count: You can tell Rally in the track specification with warmup-iterations how often you want to run an operation but don't consider the results in reporting. This executes the operation but labels it differently. We use this warmup mode for queries.
  • Based on a warmup time period: This tells Rally to wait for a specific time period until it switches from warmup mode to measurement mode (parameter warmup-time-period, in seconds). Sure, this is machine-dependent and basically geared towards our nightly benchmarking system but it's a pragmatic start. We use this warmup mode for bulk indexing.
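If you collect samples in your own benchmark, the two warmup flavors boil down to discarding early samples. Here is a minimal Python sketch of that idea. The function names are hypothetical and this is not Rally's implementation.

```python
def drop_warmup_iterations(samples, warmup_iterations):
    """Iteration-count warmup: discard the first warmup_iterations samples."""
    return samples[warmup_iterations:]


def drop_warmup_period(samples, warmup_time_period):
    """Time-period warmup: keep samples taken after the warmup has elapsed.

    Each sample is a (timestamp_in_seconds, value) tuple, and the warmup
    period is counted from the timestamp of the first sample.
    """
    if not samples:
        return []
    start = samples[0][0]
    return [(t, v) for t, v in samples if t - start >= warmup_time_period]
```

For instance, `drop_warmup_period([(0.0, 10), (60.0, 40), (130.0, 44)], 120)` keeps only the last sample.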

You can see an example walk through in the documentation: http://esrally.readthedocs.io/en/latest/adding_tracks.html

The core implementation is in driver.py but I admit I fear it's pretty hard to understand without explanation. So here's an attempt (I just explain the bulk indexing implementation):

First of all, Rally starts N client processes where N is the number of clients you specify in track.json (which is 8 if you didn't change it). I use processes as multi-threading in Python is not well suited here. You can use 8 indexing threads in Java (you seem to use only one thread and I guess this is the main difference).

Each client process reads bulks of 5000 documents from the file and issues one bulk request per bulk (so that is already a difference: you use a bulk size of 10,000 if I understood you correctly).

Bulk requests are sent via the Python Elasticsearch client which uses the HTTP protocol (and I guess you use the transport client in Java, so this is another difference).

W.r.t. throughput measurement: This is a throughput benchmark, so I just take the time stamp when Rally issues the request and another one when I receive the response and take the difference. The throughput per client is then bulk size / (tstop - tstart). You can find the benchmarking loop in execute_schedule().
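The per-client formula can be sketched in a few lines of Python. This only illustrates bulk size / (tstop - tstart) and is not the code in execute_schedule(). `send_bulk` is a hypothetical callable that blocks until the response arrives.

```python
import time


def timed_bulk(send_bulk, bulk, clock=time.perf_counter):
    """Send one bulk and return (throughput in docs/s, elapsed seconds)."""
    t_start = clock()
    send_bulk(bulk)  # hypothetical: returns once the bulk response is received
    t_stop = clock()
    elapsed = t_stop - t_start
    # Guard against a zero interval on very coarse clocks.
    throughput = len(bulk) / elapsed if elapsed > 0 else float("inf")
    return throughput, elapsed
```

A bulk of 5000 documents that takes 0.5 seconds therefore counts as 10000 docs/s for that client.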

The tricky part is that you are interested in the throughput over all clients and not just per client. That means that you need to aggregate the individual throughput measurements and this can be quite tricky (I fixed several bugs in this area). You can find the implementation of this aggregation in the function calculate_global_throughput() which calculates the throughput in documents per second.
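To illustrate why aggregation needs care, here is one simple way to compute a throughput over all clients: divide the total number of documents by the wall-clock span that covers all requests. This is an assumption for illustration only and does not claim to match what calculate_global_throughput() does.

```python
def global_throughput(samples):
    """Return docs/s over all clients.

    samples is an iterable of (t_start, t_stop, doc_count) tuples collected
    from every client; all timestamps must come from the same clock.
    """
    samples = list(samples)
    if not samples:
        return 0.0
    first_start = min(s[0] for s in samples)
    last_stop = max(s[1] for s in samples)
    total_docs = sum(s[2] for s in samples)
    span = last_stop - first_start
    return total_docs / span if span > 0 else float("inf")
```

Two clients that each index 5000 documents in the same second yield 10000 docs/s, whereas the same two requests issued one after the other yield 5000 docs/s. Averaging the per-client values would report 5000 docs/s in both cases.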

I hope that helps you understand better how Rally measures throughput.

We have some Java-based benchmarks but they were intended for benchmarking the transport client against the REST client. You can see the benchmarking code in the Elasticsearch repository. I also wrote a blog post about the benchmark which you might be interested in.

Daniel


#12

Hi @danielmitterdorfer,

I've read your blog post, and as you said, it is really helpful for me. I can see from the comparison that the indexing speeds of the transport client and the REST client are not far from each other. I also tried the new REST client in es 5.0, which uses async HTTP requests to index data. The request rate is really high (about 20k requests/s in my environment). But I found a new problem. After the requests, I checked the data in es and only 1000+ docs were indexed, which means a lot of requests failed. I got a concurrent.TimeoutException; it seems like my es cluster (actually only one server) could not handle that many requests at the same time.
Do you know why I got this exception? Which parameter should I change to make sure my cluster can handle as many requests as possible?


(Daniel Mitterdorfer) #13

Hi @xihuanbanku,

based on what you describe it's a bit hard to tell but it seems you've overwhelmed the cluster. You should check the HTTP return code for each bulk request. If there was an error you need to inspect each bulk item response and handle it appropriately and retry.
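As a rough illustration of inspecting bulk item responses, here is a Python sketch that works on the parsed JSON body of a bulk response. It relies on the `errors` flag and the per-item `status` field of the bulk API. The function names are made up, and the retry policy is left to the caller.

```python
def failed_bulk_items(bulk_response):
    """Return (position, action, status, error) for every failed bulk item."""
    failures = []
    if not bulk_response.get("errors"):
        return failures
    for position, item in enumerate(bulk_response.get("items", [])):
        # Each item has a single key naming the action ("index", "create", ...).
        for action, result in item.items():
            status = result.get("status", 0)
            if status >= 300:
                failures.append((position, action, status, result.get("error")))
    return failures


def is_retryable(status):
    """429 means the request was rejected, so retrying later may succeed."""
    return status == 429
```

Items whose status passes `is_retryable` can be resent after a pause. Other failures usually need a closer look at the `error` field.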

There is also no single knob that you can turn to make your cluster faster. The Definitive Guide contains a few tips on how to improve indexing performance which you can check.

Daniel

P.S.: I think that this question would (A) be more appropriate in the Elasticsearch instead of the Rally forum and (B) deserve a new topic. :wink: So it would be great if you could mark my answer that has helped you resolve the original problem as a solution and post new questions also in a new topic in the appropriate forum. Thanks!


#14

Hi @danielmitterdorfer,

Yes, you are right. You just reminded me that I'm in the Rally forum :joy:. I've already asked you so many questions that are not related to Rally :joy:.
Many thanks for all your help. :grin:

