IO / Disk "tear down" for Elasticsearch

But running on the same machine is kind of shooting yourself in the foot :slight_smile:, right? I checked the network - between machines it is 4 Gb/s, which is way, way more bandwidth than I generate.

Also, is there a way to fix the exception?

Again, I am trying to follow the same command, running the same load, but it is 20 times less than what @dliappis has just generated.

Something very basic is wrong, but I don't know what exactly. I am very new to Rally.

Thanks

If you are looking to simulate a production scenario and are looking for realism, it is indeed a very bad thing to do. Given that this is a very artificial load and scenario anyway, I do not see a problem with trying it and seeing what it gives.

You are running an old version of Rally. Can you switch to the latest (2.0.1)?
Also, which version of Elasticsearch are you targeting?

I installed Rally on the same host as ES.

user@oleg-elastic1:~/.rally/logs$ esrally --version
esrally 2.0.1

Elasticsearch version:
**elasticsearch-7.8.1**

Running this command:
esrally --track=pmc --track-params="bulk_size:2000,bulk_indexing_clients:16" --target-hosts=localhost:9200 --pipeline=benchmark-only

avg-cpu: %user %nice %system %iowait %steal %idle
6.10 0.00 2.12 7.47 0.00 84.31

Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 3775.00 0.00 42.52 0.00 4674.00 0.00 55.32 0.00 1.90 6.27 0.00 11.53 0.25 95.20

load statistics are the same:
iowait - 5.10
MBs - 42
w/s - 3500

Traceback (most recent call last):
File "/home/user/.local/lib/python3.8/site-packages/esrally/async_connection.py", line 130, in perform_request
raw_data = yield from response.text()
File "/home/user/.local/lib/python3.8/site-packages/async_timeout/init.py", line 45, in exit
self._do_exit(exc_type)
File "/home/user/.local/lib/python3.8/site-packages/async_timeout/init.py", line 92, in _do_exit
raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError
2020-09-16 11:16:25,190 -not-actor-/PID:26645 elasticsearch WARNING POST http://localhost:9200/_bulk [status:N/A request:60.060s]
Traceback (most recent call last):
File "/home/user/.local/lib/python3.8/site-packages/esrally/async_connection.py", line 129, in perform_request
response = yield from self.session.request(method, url, data=body, headers=headers, timeout=request_timeout)
File "/home/user/.local/lib/python3.8/site-packages/aiohttp/client.py", line 504, in _request
await resp.start(conn)
File "/home/user/.local/lib/python3.8/site-packages/aiohttp/client_reqrep.py", line 847, in start
message, payload = await self._protocol.read() # type: ignore # noqa
File "/home/user/.local/lib/python3.8/site-packages/aiohttp/streams.py", line 591, in read
await self._waiter
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/user/.local/lib/python3.8/site-packages/esrally/async_connection.py", line 130, in perform_request
raw_data = yield from response.text()
File "/home/user/.local/lib/python3.8/site-packages/async_timeout/init.py", line 45, in exit
self._do_exit(exc_type)
File "/home/user/.local/lib/python3.8/site-packages/async_timeout/init.py", line 92, in _do_exit
raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError
2020-09-16 11:16:30,101 -not-actor-/PID:26644 elasticsearch WARNING POST http://localhost:9200/_bulk [status:N/A request:60.008s]
Traceback (most recent call last):
File "/home/user/.local/lib/python3.8/site-packages/esrally/async_connection.py", line 129, in perform_request
response = yield from self.session.request(method, url, data=body, headers=headers, timeout=request_timeout)
File "/home/user/.local/lib/python3.8/site-packages/aiohttp/client.py", line 504, in _request
await resp.start(conn)
File "/home/user/.local/lib/python3.8/site-packages/aiohttp/client_reqrep.py", line 847, in start
message, payload = await self._protocol.read() # type: ignore # noqa
File "/home/user/.local/lib/python3.8/site-packages/aiohttp/streams.py", line 591, in read
await self._waiter
asyncio.exceptions.CancelledError

Running the command you sent yesterday:
esrally --pipeline=benchmark-only --track=eventdata --track-repository=eventdata --challenge=bulk-update --track-params=bulk_size:10000,bulk_indexing_clients:16 --target-hosts=localhost:9200 --client-options="timeout:240" --kill-running-processes

The logs show no exceptions.

performance is the same:
avg-cpu: %user %nice %system %iowait %steal %idle
3.88 0.00 0.89 5.79 0.00 89.45

Device r/s w/s rMB/s wMB/s rrqm/s wrqm/s %rrqm %wrqm r_await w_await aqu-sz rareq-sz wareq-sz svctm %util
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 0.00 3804.00 0.00 41.15 0.00 5383.00 0.00 58.59 0.00 0.85 2.33 0.00 11.08 0.21 78.00

load statistics are the same:
iowait - 5.10
MBs - 42
w/s - 3500

How can this be? :slight_smile:

Is there something I need to check?

Also, here is the resource consumption of the machine:

ES (Java) is loaded, but the Rally processes are consuming almost no resources.

However, I see many esrally processes:

What could the problem be?

This tells us that at least one bulk request took longer than 60s --> Elasticsearch is struggling.
Note that you can increase the timeout via client-options, but you should probably lower the number of bulk clients; it's best to leave both bulk_size and the number of indexing clients at their defaults, and start increasing them little by little once you have a good grasp of where your resources go.
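For example, a run that keeps the track defaults and only raises the client timeout could look something like this (the 240s timeout is just an illustrative value; adjust it to your environment):

esrally --pipeline=benchmark-only --track=pmc --target-hosts=localhost:9200 --client-options="timeout:240"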

Rally having many processes with little CPU usage is fine; since you've specified 16 clients, you'll see at least min(num_of_cpus, 16) Rally processes.

To get a reasonable idea of your server's resource usage I suggest the USE method and the commands mentioned in Linux Performance Analysis in 60,000 Milliseconds | by Netflix Technology Blog | Netflix TechBlog, or alternatively install Metricbeat: Metricbeat quick start: installation and configuration | Metricbeat Reference [7.9] | Elastic
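For reference, the quick checklist from that Netflix article boils down to running commands along these lines (from memory, so double-check against the article itself):

uptime
dmesg | tail
vmstat 1
mpstat -P ALL 1
pidstat 1
iostat -xz 1
free -m
sar -n DEV 1
sar -n TCP,ETCP 1
top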

When you are running iostat, are you certain you are running it against the disk that Elasticsearch writes to? I don't know if you have attached NVMe/SSD scratch disks; is Elasticsearch writing to those?

sda is the disk that comes with the VM - this is where ES is installed (it is mostly idle).
sdb is where the data is written - it is an attached drive. I am trying to make this drive "work hard".

My problem is that I can't reach a fraction of what you are generating. 40 MB/s and around 3,500 writes per second is very low throughput for reaching disk saturation, and after trying plenty of different options (even using multiple machines) I still see the same load.

What are the points to check to get a load similar to what you've got?

(The only change I've made is disabling the OS page cache; when it was enabled it was 300 writes per second, which is nothing.)

Record disk utilization during an indexing run. In one of your data points above it was close to 100%. Elasticsearch does a lot of quite small random reads and writes, so you are unlikely to reach the throughput specified by the manufacturer, which is often measured using large sequential loads.

Ok, could you please share how to do that? Is there a specific Rally job / data set I need to run?
Again, I think I am doing something very basic wrong. I saw that I am indexing millions of documents, or even tens of millions.

How can it be that running the same command we get a 15x difference? In general, I have only one machine, I am trying to "kill it with I/O load", and I can't get to a fraction of its capacity. Sounds like I am doing something wrong? :-) I understand the CPU difference (32 CPUs vs 16), but we are still running 16 Rally processes and those processes are doing almost nothing - resources are utilized at most 20%.

Could you guys share more details on where the difference is in configuration/version/setup, etc.? Or, put another way - how do I generate a load similar to or higher than what @dliappis shared?

Look at the disk utilization while you are running, e.g. using iostat. Some of the data points indicated the storage is quite heavily utilised and possibly the bottleneck.

I am running iostat during all the tests I am doing these days. It didn't reach higher than 15% iowait at the peak, and writes per second were 500; after I disabled page caching it became around 3,000 writes per second.
Again, it is 40 MB/s - which is very low for load testing - and since a very basic smoke test let @dliappis generate 600 MB/s, I want to understand how to get the same result or even higher.

I asked about disk utilization, not iowait.

Could you please point me to which iostat statistic you are referring to?

Where was it almost 100%?

Thanks

The rightmost column in the iostat output, ideally tracked over time.
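For example, one way to track it over time is to sample iostat at a fixed interval during the run and log the output (the device name and the 5-second interval here are just placeholders):

iostat -x -t sdb 5 | tee iostat-sdb.log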


Ok, thank you @Christian_Dahlqvist for pointing me to %util; I will take a closer look at that statistic.
But still, I want to put more load on the system. 70-80% of the resources are free.
How can I get a higher load? How can I reach something close to what @dliappis had?

I do not understand the purpose of this exercise. If you just want to see what throughput the disk can support, I would recommend using fio, which allows you to simulate different access patterns. Given that you are not simulating any realistic use case or scenario, I do not understand why you are using Elasticsearch in this test.
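As a rough sketch (the parameters are illustrative and not tuned for your hardware; point --filename at a file on the data disk), a random-write fio test could look like:

fio --name=randwrite --filename=/data/fio-testfile --rw=randwrite --bs=4k --direct=1 --ioengine=libaio --iodepth=32 --numjobs=4 --size=10G --runtime=120 --time_based --group_reporting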

For the record, %util in iostat is not a useful metric for flash-memory-based storage due to the parallelism used in those devices; details here. It is a useful metric for rotational disks.

Did you try the same benchmark on a different machine? Did you try the index-and-query-logs-fixed-daily-volume challenge using e.g. esrally --track-repository=eventdata --track=eventdata --challenge=index-and-query-logs-fixed-daily-volume --track-params="number_of_days:10,daily_logging_volume:100GB" --pipeline=benchmark-only?

During execution you must record all load metrics (iostat -x, mpstat -P ALL, sar -n DEV, vmstat) to see if you are hitting machine bottlenecks.
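For instance, you could start those collectors in the background for the duration of the run and redirect each to its own file (the 5-second interval is arbitrary):

iostat -x 5 > iostat.log &
mpstat -P ALL 5 > mpstat.log &
sar -n DEV 5 > sar-net.log &
vmstat 5 > vmstat.log &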


Also, agreed with Christian that fio is a much better tool for testing maximum disk performance under different access patterns.


Hello, I really appreciate the detailed answers.
Running the command, I got the same exception mentioned before - I can't execute it.

I installed ES and Rally on the same machine (Google Cloud):
c2-standard-60 (60 vCPUs, 240 GB memory), 3TB disk

Running this command:
esrally --pipeline=benchmark-only --track=eventdata --track-repository=eventdata --challenge=bulk-update --track-params=bulk_size:10000,bulk_indexing_clients:32 --target-hosts=localhost:9200 --client-options="timeout:240" --kill-running-processes

I left it to run for quite some time - 24 hours, as you recommended.
After running for around 1 hour:
I do see writes of around 400 MB/s from time to time in iostat, but iowait% is very low (2-3%) and %util is 30-40%.

@Christian_Dahlqvist
@dliappis

I just wanted to explain the purpose of these tests:
We are comparing different disk/storage options for ES, and my task is to reach the disk/I/O bottleneck. The task is not just to kill the disk, but to bring ES to the limit of I/O and disk. I hope this makes it clearer why I am running these experiments.

Thanks

Hello,
I have been running Rally for around 15 hours.

esrally --pipeline=benchmark-only --track=eventdata --track-repository=eventdata --challenge=bulk-update --track-params=bulk_size:10000,bulk_indexing_clients:32 --target-hosts=localhost:9200 --client-options="timeout:240" --kill-running-processes

ES:

I do get a much higher load after some time.

I am running the bulk-update challenge as-is for the first time, but I do want to test it with the changes you recommended above. Is it possible to pass id_seq_low_id_bias as true and the probability as a parameter? Could you please give me an example of how to deal with the probability tweaks and what the logic behind them is?

Thanks in advance.