Rally.driver ERROR Could not clear scroll

I'm trying to run the set of built-in benchmarks overnight on EC2 cluster. Using the script :

#!/bin/bash
Tracks=( geonames geopoint logging nyc_taxis percolator pmc )
Challenges=( append-no-conflicts append-no-conflicts-index-only append-fast-no-conflicts append-fast-with-conflicts )

Cars=( default )

#fresh metrics store & logs
curl -XDELETE localhost:9200/*
sudo rm -f .rally/logs/*
sudo rm -rf /var/log/rally
sduo mkdir /var/log/rally

for t in ${Tracks[@]}
do
sudo mkdir -p /var/log/rally/$t -m ugo+rwx
for ch in ${Challenges[@]}
do
sudo mkdir -p /var/log/rally/$t/$ch -m ugo+rwx
for c in ${Cars[@]}
do
echo Running Rally Track=$t Challenge=$ch Car=$c
curl -XDELETE http://[host]:9200/*
esrally --track=$t --challenge=$ch --pipeline=benchmark-only --distribution-version=2.4.2 --target-hosts=[host]:9200 --client-options=timeout:10,request_timeout:10 --report-file=/var/log/rally/$t/$ch/$c --report-format=csv
done
done
done

At some point overnight, the .rally/logs/rall_out.log shows a series of errors like:
2017-04-06 08:32:39,898 rally.driver ERROR Could not clear scroll. This will lead to excessive resource usage in Elastic
search and will skew your benchmark results.

and the race - while still running - seems to be doing pretty much nothing so I'll kill it off and start over.

Any thoughts to try to prevent this?

Hi @mhohara,

I was probably too dramatic with the wording. The most likely cause is that the scroll has timed out and Elasticsearch already cleared it automatically. It is expected that Rally will proceed on any errors. However, it should show up at the end in the "error rate" line of the respective in case there were errors (like 500s, connection errors and the like). Rally captures all these response details per request if you have configured a dedicated Elasticsearch metrics store.

Nevertheless it is something you should look into. I guess that you run on a smaller instance when this happens. If that's the case you should clone the default rally-tracks, configure Rally to use your own track repository clone and lower the target throughput; additionally I might also consider raising the timeout of scrolls in Rally.

Daniel

I was running on m4.2xl, and I'm definitely getting mixed/inconsistent/poor results. Tests don't complete reliably. I bumped up to a c3.8xl, and still some hiccups but I'll try some more tests. Any suggestions on a minimum AWS configuration that you'd think would perform more reliably?

Hi @mhohara,

I usually only benchmark on bare-metal to reduce background system noise as much as possible so I have not really much experience benchmarking on EC2. The i2.xlarge instance type is very similar to what I use for our benchmarks so you might give these a try.

For which track do these problems happen? Some of the benchmarks are really tough for the system to handle with default settings so you could try to benchmark with the 4gheap car instead.

Daniel

I've been having problems getting past the logging test, which runs so long I can't babysit it - and the .rally/logs/rally_out.log didn't help. Since it was AWS, the relevant data was no longer on my shell. I happened to catch it now, there was a File Not Found on the .rally/logs/rally_out.log file, and logging stopped working at that point. Hence by the time I get back to it in the morning, there's no indication that its done anything since no logs are present. The file actually is there, although there are only two lines in it after the fault, and they don't help.
2017-04-11 01:27:25,626 rally.driver INFO LoadGenerator[6] reached join point [JoinPoint(9)].
2017-04-11 01:27:25,626 rally.driver INFO LoadGenerator[2] reached join point [JoinPoint(9)].

The logging test is actually still running though. Screen grab looks like:

[INFO] Racing on track [logging], challenge [append-no-conflicts] and car [external]

Running index-append [100% done]
Running force-merge [100% done]
Running index-stats [100% done]
Running node-stats [100% done]
Running default [100% done]
Running term [100% done]
Running range [100% done]
Running hourly_agg [100% done]
Running scroll [ 0% done]--- Logging error ---
Traceback (most recent call last):
File "/usr/lib64/python3.5/logging/handlers.py", line 72, in emit
self.doRollover()
File "/usr/lib64/python3.5/logging/handlers.py", line 393, in doRollover
self.rotate(self.baseFilename, dfn)
File "/usr/lib64/python3.5/logging/handlers.py", line 113, in rotate
os.rename(source, dest)
FileNotFoundError: [Errno 2] No such file or directory: '/home/ec2-user/.rally/logs/rally_out.log' -> '/home/ec2-user/.
ally/logs/rally_out.log.2017-04-10'
Call stack:
File "/usr/local/bin/esrally", line 11, in
sys.exit(main())
File "/usr/local/lib/python3.5/site-packages/esrally/rally.py", line 590, in main
success = dispatch_sub_command(cfg, sub_command)
File "/usr/local/lib/python3.5/site-packages/esrally/rally.py", line 414, in dispatch_sub_command
race(cfg)
File "/usr/local/lib/python3.5/site-packages/esrally/rally.py", line 362, in race
with_actor_system(lambda c: racecontrol.run(c), cfg)
File "/usr/local/lib/python3.5/site-packages/esrally/rally.py", line 369, in with_actor_system
actors = actor.bootstrap_actor_system(try_join=already_running, prefer_local_only=not already_running)
File "/usr/local/lib/python3.5/site-packages/esrally/actor.py", line 175, in bootstrap_actor_system
"Convention Address.IPv4": "%s:1900" % coordinator_ip
File "/usr/local/lib/python3.5/site-packages/thespian/actors.py", line 621, in init
systemBase, capabilities, logDefs)
File "/usr/local/lib/python3.5/site-packages/thespian/actors.py", line 659, in _startupActorSys
systemBase = sbc(self, logDefs=logDefs)
File "/usr/local/lib/python3.5/site-packages/thespian/system/multiprocTCPBase.py", line 27, in init
super(ActorSystemBase, self).init(system, logDefs)
File "/usr/local/lib/python3.5/site-packages/thespian/system/multiprocCommon.py", line 84, in init
super(multiprocessCommon, self).init(system, logDefs)
File "/usr/local/lib/python3.5/site-packages/thespian/system/systemBase.py", line 333, in init
logDefs)
File "/usr/local/lib/python3.5/site-packages/thespian/system/multiprocCommon.py", line 102, in _startAdmin
admin.start()
File "/usr/lib64/python3.5/multiprocessing/process.py", line 105, in start
self._popen = self._Popen(self)
File "/usr/lib64/python3.5/multiprocessing/context.py", line 267, in _Popen
return Popen(process_obj)
File "/usr/lib64/python3.5/multiprocessing/popen_fork.py", line 20, in init
self._launch(process_obj)
File "/usr/lib64/python3.5/multiprocessing/popen_fork.py", line 74, in _launch
code = process_obj._bootstrap()
File "/usr/lib64/python3.5/multiprocessing/process.py", line 254, in _bootstrap
self.run()
File "/usr/lib64/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.5/site-packages/thespian/system/multiprocCommon.py", line 201, in startAdmin
admin.run()
File "/usr/local/lib/python3.5/site-packages/thespian/system/admin/convention.py", line 618, in run
r = self.transport.run(self.handleIncoming, delay.remaining())
File "/usr/local/lib/python3.5/site-packages/thespian/system/transport/wakeupTransportBase.py", line 69, in run
rval = self._runWithExpiry(incomingHandler)
File "/usr/local/lib/python3.5/site-packages/thespian/system/transport/TCPTransport.py", line 1250, in _runWithExpiry
r = incomingHandler(rEnv)
File "/usr/local/lib/python3.5/site-packages/thespian/system/admin/adminCore.py", line 106, in handleIncoming
self.unrecognized)(envelope)
File "/usr/local/lib/python3.5/site-packages/thespian/system/admin/convention.py", line 806, in h_PendingActor
return super(ConventioneerAdmin, self).h_PendingActor(envelope)
File "/usr/local/lib/python3.5/site-packages/thespian/system/admin/globalNames.py", line 31, in h_PendingActor
return super(GlobalNamesAdmin, self).h_PendingActor(envelope)
File "/usr/local/lib/python3.5/site-packages/thespian/system/admin/adminCore.py", line 311, in h_PendingActor
if sourceHash else None))
File "/usr/local/lib/python3.5/site-packages/thespian/system/multiprocCommon.py", line 345, in _startChildActor
child.start()
File "/usr/lib64/python3.5/multiprocessing/process.py", line 105, in start
self._popen = self._Popen(self)
File "/usr/lib64/python3.5/multiprocessing/context.py", line 267, in _Popen
return Popen(process_obj)
File "/usr/lib64/python3.5/multiprocessing/popen_fork.py", line 20, in init
self._launch(process_obj)
File "/usr/lib64/python3.5/multiprocessing/popen_fork.py", line 74, in _launch
code = process_obj._bootstrap()
File "/usr/lib64/python3.5/multiprocessing/process.py", line 254, in _bootstrap
self.run()
File "/usr/lib64/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.5/site-packages/thespian/system/multiprocCommon.py", line 566, in startChild
am.run()
File "/usr/local/lib/python3.5/site-packages/thespian/system/actorManager.py", line 81, in run
r = self.transport.run(self.handleMessages)

Hi @mhohara,

thanks for the detailed report. I've opened https://github.com/elastic/rally/issues/263 and plan to address this soon.

Daniel

Hi @mhohara,

I've just released Rally 0.5.3. After the upgrade, Rally will write one log file per race and will not attempt to rollover thus avoiding the problem you stumbled upon. Please upgrade and let me know whether all is well now.

Daniel

Thanks Dan. I had another overnight pass hiccup in the same way. I’ve upgraded now, and I’ll let you know if I run into the problem again.

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.