Quantifying Elasticsearch Storage Performance

Hi Everyone!

I'm posting the results of some tests that I performed to compare a cluster running SSD versus HDD storage. I hope that the community finds it useful. Caveat: The environment, scripts, and tests were built shortly before Rally was released, so I plan to integrate Rally into future testing.

Overview

This topic was written specifically to quantify search performance on SSD versus HDD storage for time-series data, using the pre-built modules/plugins for data from ArcSight connectors.

Best practices for search workloads [https://www.elastic.co/guide/en/cloud-enterprise/current/ece-planning-hw.html] dictate that solid state drives (SSDs) should be used. However, given the cost of using SSDs (especially on a large cluster), it is helpful to be able to quantify how much faster Elasticsearch is when using SSD storage.

How the tests were conducted

A 3-node cluster was indexing events generated by a shell script at a rate of about 1,600 events per second (EPS), while executing about 225 searches per second from a shell script over a time window containing about 80GB of event data. All tests were executed within Amazon Web Services (AWS) Elastic Compute Cloud (EC2). The specific configuration of each EC2 instance used for testing is described in the table below.

How Elasticsearch nodes were configured

The following parameters were applied to the Elasticsearch nodes prior to testing:

Added the following to limits.conf:

elasticsearch - nofile 65536

Executed the following from the command line:

sysctl -w vm.max_map_count=262144

curl -X DELETE 'http://localhost:9200/_all'

Added/updated the following in elasticsearch.yml:

cluster.name: essearchtest
node.name: ${HOSTNAME}
network.host: 172.31.18.187
discovery.zen.ping.unicast.hosts: ["172.31.18.187", "172.31.30.180", "172.31.19.138"]
discovery.zen.minimum_master_nodes: 2
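
After restarting the nodes with these settings, a quick sanity check (not part of the original test procedure) is to confirm that all three nodes joined the cluster and that cluster health is reported:

curl -s 'http://172.31.18.187:9200/_cat/nodes?v'
curl -s 'http://172.31.18.187:9200/_cluster/health?pretty'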

How Logstash was configured

The following parameters were applied to the Logstash node prior to testing:

Executed the following from the command line:

./logstash --modules arcsight --setup -M "arcsight.var.inputs=smartconnector" -M "arcsight.var.input.smartconnector.port=5000" -M "arcsight.var.elasticsearch.hosts=172.31.18.187:9200" -M "arcsight.var.elasticsearch.username=elastic" -M "arcsight.var.elasticsearch.password=changeme" -M "arcsight.var.kibana.host=172.31.17.6:5601" -M "arcsight.var.kibana.username=elastic" -M "arcsight.var.kibana.password=changeme"
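
As an alternative to passing -M flags on the command line, the same module settings can live in logstash.yml under the modules key. The sketch below simply maps the arcsight.var.* flags from the command above to var.* entries under the module name; it was not part of the original setup:

modules:
  - name: arcsight
    var.inputs: smartconnector
    var.input.smartconnector.port: 5000
    var.elasticsearch.hosts: 172.31.18.187:9200
    var.elasticsearch.username: elastic
    var.elasticsearch.password: changeme
    var.kibana.host: 172.31.17.6:5601
    var.kibana.username: elastic
    var.kibana.password: changeme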

How time-series data was generated

A shell script was used to generate raw time-series data that was sent to an ArcSight connector, which in turn forwarded the data to Logstash as CEF.
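
The generator script itself is not included in this post, but a minimal sketch of the idea looks like the following. The host name connector-host, port 514, and the message format are assumptions for illustration only; the real script produced richer events at the roughly 1,600 EPS rate described above.

#!/bin/bash
# Hypothetical sketch: emit one raw syslog-style event per loop iteration
# and send it to the ArcSight connector, which parses it and forwards it
# to Logstash as CEF (host name, port, and format are assumptions)
while true
do
msg="$(date '+%b %d %H:%M:%S') testhost testapp[$$]: login from 10.0.0.$((RANDOM%254+1)) device A$((RANDOM%99))"
echo "$msg" | nc -w1 connector-host 514
sleep .01
done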

How searches were generated

The dsl_leadingwildcard.sh shell script was used to generate a steady search load against the cluster.

Here is the code behind the dsl_leadingwildcard.sh shell script. The script continuously builds new searches using a leading-wildcard query with a randomized field value. Each query is temporarily stored in a file, which is then read by the curl command that sends the search to the cluster.

#!/bin/bash
# Continuously send leading-wildcard searches with a randomized field value
while true
do
# Build the query body in a temporary file
echo '
{
  "query": {
' > tempquery
echo '    "wildcard" : { "deviceExternalId" : "*A'"$((RANDOM%99))"'?" }' >> tempquery
echo '  }
}
' >> tempquery
# Send the search to the cluster (-s suppresses curl's progress output)
curl -X GET "172.31.18.187:9200/_search" -H 'Content-Type: application/json' -d @tempquery -s
sleep .1
done
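
For reference, each pass through the loop leaves a query body like the following in tempquery (the number after the "A" varies from 0 to 98):

{
  "query": {
    "wildcard" : { "deviceExternalId" : "*A42?" }
  }
}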

Analyzing cluster statistics

The SSD (io1) test was executed first at around 21:00. The nodes were then migrated to magnetic HDD (sc1) storage and retested at around 22:45. As the Kibana charts below show, while overall performance was similar between the two runs, the second run on magnetic storage produced more erratic performance for both indexing and search operations.

Figure 2: Overview dashboard showing both the SSD and HDD test runs

Figure 3: Advanced dashboard showing both the SSD and HDD test runs

Figure 4: Cluster overview dashboard showing stats for the indices

While the monitoring dashboards in Kibana were certainly helpful, it was difficult to measure short-duration spikes in search completion time. I needed more detailed data, preferably from a more direct source.

Solution

When curl is used to send a search to the cluster, the response includes the search completion time in milliseconds (the took field). To track the completion time of every search run during testing, and to capture short-duration spikes, I piped the output of the curl command into grep and wrote the time each search "took" to a log file. The new dsl_leadingwildcard.sh script looked like this:

#!/bin/bash
while true
do
echo '
{
  "query": {
' > tempquery
echo '    "wildcard" : { "deviceExternalId" : "*A'"$((RANDOM%99))"'?" }' >> tempquery
echo '  }
}
' >> tempquery
curl -X GET "172.31.18.187:9200/_search" -H 'Content-Type: application/json' -d @tempquery -s | grep -o '"took":[0-9]*' >> /root/dsl_wildcard_hdd.log
sleep .1
done

I wrote one log during the SSD test and a separate log during the HDD test, then used Microsoft Excel to create the charts below. Search times are shown in milliseconds on the y-axis, while the x-axis shows the line number from the log.
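
Each line in the log has the form "took":37. As a small convenience (not part of the original workflow), the values can be stripped into a numbered, comma-separated file before importing into Excel, for example:

cut -d: -f2 /root/dsl_wildcard_hdd.log | nl -s, -w1 > dsl_wildcard_hdd.csv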


Figure 5: Search completion time using SSDs


Figure 6: Search completion time using HDDs

Conclusion

While Elasticsearch v6.3 performed well using both storage types, the HDD storage was shown to be up to 10 times slower than SSD at various points during testing. It took up to 4.5 seconds to search approximately 16 million events (80GB) using HDD storage. Additionally, both the Kibana dashboards and metrics from the command line showed that using HDD storage resulted in larger latency fluctuations for both searching and indexing while under load. Based on the test results, it’s easy to see why SSDs are recommended for nodes with search workloads.
