How to performance-benchmark Elasticsearch on a large single-node bare-metal server?

Hello Elasticsearch users

I'm trying to run a few performance benchmarks against the current Elasticsearch 6.5.2 (GA) on a large single-node bare-metal server. I can see that Elasticsearch provides some 8 standard benchmark tracks, ranging from Geonames to NOAA-based data.

Currently I have the following two servers:

For the Elasticsearch node

  • 2-socket AMD EPYC 7601 server (Elasticsearch node): 32 cores per processor × 2 with HT = 64 threads per processor × 2 processors = 128 hardware threads
  • 256 GB DRAM
  • 25 GbE Mellanox NIC
  • 2 × 1.5 TB NVMe drives (local to the server; one drive for data, one for logs)

For the Rally load driver

  • 1-socket AMD EPYC 7601 server (Rally driver node): 32 cores per processor × 2 with HT = 64 hardware threads
  • 256 GB DRAM
  • 25 GbE Mellanox NIC
  • 1 × 1.5 TB NVMe drive (local to the server)

I need some pointers; if anyone out there could point me in the right direction, that'd be great.

  1. Of the 8 benchmarks listed, which one is the most CPU-bound, so that I can crunch more data on the available cores/threads?
  2. How much data will the most CPU-bound of these 8 benchmarks generate? I can add more NVMe drives of similar capacity to the Elasticsearch node if needed.
  3. Currently I'm running the benchmarks with their defaults and the specific "challenges" associated with them, but I'm not able to raise the CPU usage at all. Could someone give me tips on additional parameters, such as "cars" or any other settings, that would make the benchmarks put a more aggressive CPU load on the Elasticsearch server?

Currently I'm using jdk1.8.0_191 with:
A) A 24 GB heap (-Xms24g -Xmx24g).
B) Since this is an AMD EPYC 7601 32-core processor with HT switched on, giving a total of 128 threads across 2 physical sockets in the Elasticsearch server, my NUMA layout is shown below; each NUMA node gets roughly 32 GB of memory (256 GB / 8 NUMA nodes).

C) To make sure Java handles the NUMA nodes correctly, I've also set the -XX:+UseNUMA flag in the jvm.options file.
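For reference, the relevant lines in the Elasticsearch node's config/jvm.options then look like this (values exactly as described above):

-Xms24g
-Xmx24g
-XX:+UseNUMA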

numactl -H

available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7 64 65 66 67 68 69 70 71
node 0 size: 32607 MB
node 0 free: 28085 MB
node 1 cpus: 8 9 10 11 12 13 14 15 72 73 74 75 76 77 78 79
node 1 size: 32767 MB
node 1 free: 28670 MB
node 2 cpus: 16 17 18 19 20 21 22 23 80 81 82 83 84 85 86 87
node 2 size: 32767 MB
node 2 free: 28845 MB
node 3 cpus: 24 25 26 27 28 29 30 31 88 89 90 91 92 93 94 95
node 3 size: 32767 MB
node 3 free: 28799 MB
node 4 cpus: 32 33 34 35 36 37 38 39 96 97 98 99 100 101 102 103
node 4 size: 32767 MB
node 4 free: 28719 MB
node 5 cpus: 40 41 42 43 44 45 46 47 104 105 106 107 108 109 110 111
node 5 size: 32767 MB
node 5 free: 28763 MB
node 6 cpus: 48 49 50 51 52 53 54 55 112 113 114 115 116 117 118 119
node 6 size: 32767 MB
node 6 free: 28848 MB
node 7 cpus: 56 57 58 59 60 61 62 63 120 121 122 123 124 125 126 127
node 7 size: 32767 MB
node 7 free: 28725 MB
node distances:
node 0 1 2 3 4 5 6 7
0: 10 16 16 16 32 32 32 32
1: 16 10 16 16 32 32 32 32
2: 16 16 10 16 32 32 32 32
3: 16 16 16 10 32 32 32 32
4: 32 32 32 32 10 16 16 16
5: 32 32 32 32 16 10 16 16
6: 32 32 32 32 16 16 10 16
7: 32 32 32 32 16 16 16 10

Any help, pointers, or direction would be greatly appreciated.

This is for our internal testing, to see how our processors perform.

Thanks
Sylvester

What do you want to get from the benchmark? What is your use-case?

I totally agree with @Christian_Dahlqvist; it will serve you best if you clarify what your use case is. You mentioned that you want to see the performance of your processors, but I guess there's something Elasticsearch-specific you want to do here, otherwise you'd use one of the standard Linux CPU benchmarking utilities. Some additional inline answers below:

This depends on the type of operation you want to stress your CPUs with (e.g. bulk indexing or search) and the number of Rally clients.

For the bulk use case, among the standard tracks nyc_taxis has the largest corpus (~74 GiB); for each track you can see the amount of uncompressed data in the corresponding section of track.json in the rally-tracks repo. Additionally, http_logs has a fair bit of data, although its per-document size is somewhat small.

You can then specify the number of indexing clients that Rally will use via the bulk_indexing_clients track parameter (again, please refer to each track's README file for more details), as in the sketch below.
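As a rough sketch (the host address is a placeholder, and the exact parameter names and values a track supports are listed in its README), a benchmark-only run against an already running cluster could look like:

esrally --track=nyc_taxis --challenge=append-no-conflicts --pipeline=benchmark-only --target-hosts=<es-host>:9200 --track-params="bulk_indexing_clients:16"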

If, on the other hand, you want to stress things on the search side, then for example for the nyc_taxis track you'll need to tweak the number of clients, the target-throughput and the iterations (see here), again via track parameters; an illustrative schedule entry is shown below.
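For illustration only (the numbers are made up, and whether these knobs are exposed as track parameters or edited directly in the track differs per track), a search operation in a track's schedule is shaped roughly like this:

{
  "operation": "default",
  "clients": 8,
  "warmup-iterations": 50,
  "iterations": 100,
  "target-throughput": 100
}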

See the above answer on how to check this based on the uncompressed-bytes property in the track.json file of each track.

Note that 1) depending on the number of replicas you configure for your Elasticsearch indices, the actual bytes on disk can be a multiple of this, and 2) Lucene compresses data, so the actual bytes used by Elasticsearch won't be the same as the uncompressed size of the track.
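If you want to pin the replica count explicitly (on a single node, replica shards cannot be allocated alongside their primaries anyway), a plain index-settings call does it; the index name here is just an example:

curl -XPUT "http://localhost:9200/nyc_taxis/_settings" -H 'Content-Type: application/json' -d '{"index": {"number_of_replicas": 0}}'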

See the above answers regarding tuning the bulk indexing clients and/or the clients/target-throughput/iterations for queries. Don't forget to check your load driver server itself for CPU/disk/network saturation. Finally, if you are using Rally in daemonized mode and launching Elasticsearch via Rally, then depending on your amount of RAM you can specify a different car (see https://github.com/elastic/rally-teams/tree/master/cars/v1), e.g. 16gheap, or set heap_size directly by passing --car-params="heap_size:'16g'".
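For example, assuming Rally downloads and launches Elasticsearch 6.5.2 itself (car settings do not apply with --pipeline=benchmark-only):

esrally --distribution-version=6.5.2 --track=nyc_taxis --car-params="heap_size:'16g'"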


Hi @Christian_Dahlqvist and @dliappis, thanks for your responses. The goal/use case is that we're trying to do platform-centric benchmarking on a dual-socket AMD EPYC 7601 server and find its capabilities beyond what the standard Linux CPU benchmarking utilities show. We chose Elasticsearch as our search platform and are running the benchmarks using your tracks/races with Rally. Since these new EPYC processors have a lot of cores and threads combined, we want to measure the CPU and its performance in terms of:

  1. Percentage of processor time while the search engine platform performs the various "challenges".
  2. System calls per second (if any such measurement is available through your report).

From your Markdown report:

In your track/race's Markdown output, "Median CPU usage" is shown as a percentage; for example, for Geopoint/append-no-conflicts-index-only I got 1193.53%. The metric is described as being based on a one-second sample period, and I have a total of 128 threads/cores, so how does that translate into 1193.53%, given that the report also shows a "Total indexing time" of 25.9801 minutes? Can you please clarify how cpu_utilization_1s, documented as "CPU usage in percent of the Elasticsearch process based on a one second sample period. The maximum value is N * 100% where N is the number of CPU cores available.", is actually calculated here? That would help a lot, as it's the only metric in the Markdown report that points directly at CPU cycle usage.

I'm also looking into your other valuable suggestions on nyc_taxis and multiple replicas. (The caveat is that I have this one big server where the Elasticsearch engine runs and no second server for a replica; can I use the same server to host multiple replicas? Please advise.)

Other information

  1. Currently my heap size is set to 24 GB in the jvm.options file itself.
  2. The load server (where Rally runs) is another single-socket AMD EPYC 7601 32-core machine, so with HT it has 64 threads, and it also has 256 GB of DRAM. When I measured its CPU/memory/network/disk usage, utilization was minimal.
  3. Whenever I run a benchmark I also clear the following on the load server:
  • echo 3 > /proc/sys/vm/drop_caches (clears the page cache, dentries and inodes)
  • swapoff -a && swapon -a (resets swap space)

So if you could let me know how the "Median CPU usage" percentage is derived, that will help me reason about these 128 cores/threads; any other parameters/events that measure the CPU would be awesome as well.

Thanks and much appreciated.

Best Regards
Sylvester

cpu_utilization_1s is a metric that uses psutil.cpu_percent(1).

Quoting the doc page:

Return a float representing the current system-wide CPU utilization as a percentage. When interval is > 0.0 compares system CPU times elapsed before and after the interval (blocking).

So Rally stores the 1-second samples (either in its in-memory store or in an Elasticsearch metrics store, depending on what you configured) per Elasticsearch PID that it launched, and at the end calculates the median CPU usage from those samples. Because the measurement is per process and summed over all the cores the process uses, it can exceed 100%; the maximum is N * 100% for N hardware threads, so a median of ~1193% on your 128-thread machine means the Elasticsearch process kept roughly 12 hardware threads busy on average.
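If you want to reproduce that kind of measurement by hand on the Elasticsearch node, a one-liner along these lines works (assuming psutil is installed and that pgrep matches exactly one Elasticsearch java process; both are assumptions about your setup):

python3 -c "import psutil, sys; print(psutil.Process(int(sys.argv[1])).cpu_percent(interval=1))" "$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch)"

As noted above, the per-process value can exceed 100% on a multi-core machine, which is why numbers like 1193% are possible.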

My previous answer hopefully covers this. Let me suggest that, instead of relying on Rally's median CPU reporting alone, you install something like Metricbeat on your target node and collect detailed metrics with its system module; you'll benefit from the wealth of information it brings.
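A minimal way to get going, assuming you install Metricbeat from the standard packages (the output still needs to be pointed at a monitoring cluster in metricbeat.yml):

metricbeat modules list            # the system module is enabled by default
metricbeat modules enable system   # enable it explicitly if it is not
metricbeat -e                      # run in the foreground to check that events are flowing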
