Maximize read/write throughput

We are performing a load test on our Elasticsearch cluster and trying to maximize read/write throughput.

Here are the details of our elasticsearch cluster:

  • Master nodes : 3
  • Data nodes : 6
  • Indexes : 1
  • Primary Shards : 2
  • No of Replicas : 2

Data Nodes Hardware Configuration:

  • CPU : 16 cores (1:1 vcpu commit)
  • RAM : 112 GB
  • Network : 1 GBps
  • Disk Size : 1 TB (SSD)
  • Throughput per disk : 200 MB/s (We are using the P30 Azure Premium SSD)

Master Nodes Hardware Configuration:

  • CPU : 8 cores (1:1 vcpu commit)
  • RAM : 28 GB
  • Network: 1 GBps

Client topology:

  • Client nodes: 12
  • Write threads per node: 20
  • Read threads per node: 40

Client Nodes Hardware Configuration:

  • CPU : 16 cores (1:1 vcpu commit)
  • RAM : 112 GB
  • Network : 1 GBps

Bulk Processor settings:

  • Average document size : 1 KB
  • Concurrent Requests : 0
  • Bulk Actions: -1
  • Bulk Size: 15 MB
  • Back Off Policy: Constant Back Off with a delay of 2 seconds and 3 maximum retries
  • Flush Interval: 5 seconds
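
For reference, here is a minimal sketch of how these settings map onto the Java high-level REST client's BulkProcessor builder. This assumes a 7.x client; package paths (TimeValue in particular) move between versions, and construction of the RestHighLevelClient is omitted:

```java
import org.elasticsearch.action.bulk.BackoffPolicy;
import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.unit.ByteSizeUnit;
import org.elasticsearch.common.unit.ByteSizeValue;
import org.elasticsearch.common.unit.TimeValue; // org.elasticsearch.core.TimeValue in newer clients

public class BulkProcessorConfig {

    public static BulkProcessor build(RestHighLevelClient client) {
        BulkProcessor.Listener listener = new BulkProcessor.Listener() {
            @Override
            public void beforeBulk(long executionId, BulkRequest request) {
                // Invoked just before each bulk request is executed.
            }

            @Override
            public void afterBulk(long executionId, BulkRequest request, BulkResponse response) {
                // Invoked after each bulk completes; check response.hasFailures() here.
            }

            @Override
            public void afterBulk(long executionId, BulkRequest request, Throwable failure) {
                // Invoked when the whole bulk request failed to execute.
            }
        };

        return BulkProcessor.builder(
                (request, bulkListener) ->
                        client.bulkAsync(request, RequestOptions.DEFAULT, bulkListener),
                listener)
            .setConcurrentRequests(0)                                 // Concurrent Requests: 0
            .setBulkActions(-1)                                       // Bulk Actions: -1 (no action-count limit)
            .setBulkSize(new ByteSizeValue(15, ByteSizeUnit.MB))      // Bulk Size: 15 MB
            .setFlushInterval(TimeValue.timeValueSeconds(5))          // Flush Interval: 5 s
            .setBackoffPolicy(BackoffPolicy.constantBackoff(          // Constant back-off: 2 s delay, 3 retries
                    TimeValue.timeValueSeconds(2), 3))
            .build();
    }
}
```

Documents are added with bulkProcessor.add(...), and the processor batches them into BulkRequests according to the thresholds above. Note that with Concurrent Requests set to 0 each flush is executed synchronously, so a client waits for the previous bulk to complete before sending the next one; this is one of the knobs we expect to experiment with during the load test.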

Is there a question related to this? Are you facing any issues?

The question here is: what changes should I make to the above settings to achieve maximum read and write throughput? I also wanted to understand the difference between BulkProcessor and BulkRequest, and how using one over the other affects the performance of ES.

What is the use case? What does the expected workload look like? What are the requirements? How many queries per second are you expecting? What type of queries? How much data do you expect to hold in the cluster?

It is impossible to give any reasonable recommendations without knowing a lot more about the use case and the requirements you are optimizing towards.

Perhaps a quick read through this can help:

https://www.elastic.co/guide/en/elasticsearch/reference/current/how-to.html

Also, there are a number of key settings which are discussed here, here and here.

These should help with some best practice settings.

And really, what @Christian_Dahlqvist is saying is that to get the metrics you are looking for, you are going to need to set up your cluster and use case and benchmark it. Then, when there are specific tuning questions, perhaps folks can be more helpful.


With respect to general guidelines, I would also recommend this guide on tuning for indexing speed and this one on tuning for search speed.


@Christian_Dahlqvist, @stephenb - We have the following use case:

We are running a load test in which we ingest 1 KB documents for 24 hours, targeting an index size of 1.8 TB at the end of the 24 hours. For this ingestion we are using 12 clients.

We are also sending read requests while ingesting. We are sending the following types of queries:

  1. Simple query - Query on a field:value where field and value are chosen randomly.
  2. Boolean queries - Queries with AND, OR

The type of query is randomly selected from the above. We are targeting a read throughput of 20,000 queries/second.
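
For concreteness, a rough sketch of the two query shapes we randomize between, using the Java high-level REST client's QueryBuilders (the index and field names here are placeholders, not our real mapping):

```java
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

public class QueryShapes {

    // 1. Simple query: a single field:value match on a randomly chosen field and value.
    public static SearchRequest simpleQuery(String field, String value) {
        return new SearchRequest("my-index")
            .source(new SearchSourceBuilder()
                .query(QueryBuilders.termQuery(field, value)));
    }

    // 2. Boolean query: fieldA AND (fieldB OR fieldC).
    public static SearchRequest booleanQuery() {
        return new SearchRequest("my-index")
            .source(new SearchSourceBuilder()
                .query(QueryBuilders.boolQuery()
                    .must(QueryBuilders.termQuery("fieldA", "valueA"))       // AND clause
                    .must(QueryBuilders.boolQuery()                          // OR group
                        .should(QueryBuilders.termQuery("fieldB", "valueB"))
                        .should(QueryBuilders.termQuery("fieldC", "valueC"))
                        .minimumShouldMatch(1))));
    }
}
```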

In order to achieve the above performance, what should the ideal settings be for our cluster as well as our clients? Please refer to the original post for the hardware configurations we are using.

We have gone through the links you have provided. In most of them, performance is measured for either indexing or querying alone. However, our aim is to get benchmarks for a use case where querying and indexing happen simultaneously.

We ran one load test in which we were able to index the 1.8 TB, but the read throughput was close to negligible. One thing we observed was that CPU usage was spiking a lot; it was close to 90-95% for the entire duration of the load test.

What is the total expected index size queried? How many events per second are you expecting to index? Is the 1.8 TB per 24 hours the ingested data volume, the primary index size on disk, or the total index size on disk?

Are your searches hitting all indices and shards or just a subset? How many documents do the searches return on average?

Given that it sounds like you have a reasonably large data set that cannot necessarily be fully cached on the nodes, I would suspect this will be limited by disk I/O. As you are continuously indexing a good amount of data and have a very high query rate, there is a lot of competition for disk I/O.

Indexing can be very I/O intensive, but it generally writes reasonably large chunks at a time, which is nice and efficient for disks. Queries, on the other hand, result in a lot of small random reads as index files are loaded and the documents to be returned are retrieved from different parts of the disk. Given the nature of the mixed load, it is IOPS, not peak disk throughput, that is likely to be limiting. Disk throughput is generally measured with favorable large sequential operations, so IMHO you are unlikely to get anywhere near that figure with this type of load.

If we look at the disks you have been using, they only support 5,000 IOPS, which is much lower than most locally attached SSDs. Before I saw this I was expecting you to need quite a large cluster to handle this, but given your disk performance I suspect the cluster would need to be immense on this type of storage (possibly multiple clusters due to node count). It does, however, look like Azure offers Ultra disks with significantly better performance, so that is most likely what would be required for this type of use case, and it would probably change the picture.
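
As a rough, illustrative back-of-envelope (the reads-per-query figure is an assumption and will vary with mappings, caching and merge activity): 6 data nodes × 5,000 IOPS gives roughly 30,000 random reads per second for the whole cluster, while 20,000 queries/second at even ~5 random reads each would call for on the order of 100,000 reads per second, before any indexing, merging or translog I/O is accounted for.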

You should be able to see the impact on disk I/O by running iostat -x while you are benchmarking with a mixed load. I would expect you to see close to 100% disk utilization with quite a high iowait, which would indicate that the disk is indeed the bottleneck.

@Christian_Dahlqvist: I tried running iostat -x while the load test was running and got the output below.

I don't think that the disk is the bottleneck.

The write throughput after the test was started:

The read throughput after the test was started:

We are able to index at the speed we want, but we are not able to query: the read throughput is really low. Is there anything that could be an issue here, or anything you feel should be changed in the settings? Let me know if you need anything else from my side to debug this.

Thanks,
Suril

How many queries per second were you attempting when that was captured? How much data did you have on disk at that time? Did you make sure you have the same amount of data to query as you expect to have in the live cluster?
