Limited read and write speed using Elasticsearch Hadoop

Thijsvdp · September 12, 2022, 10:19am

Hi all,

We are seeing unexpectedly low read and write speeds on our ES cluster when using the Elasticsearch-Hadoop connector. We have read a lot about the connector's configuration options and experimented with many values, with limited success. This is the first time we are managing an ES cluster ourselves, and we are not sure whether we have allocated too few resources or whether something else is causing the issue.

Some specifications about the data in the main index:

Our application is for searching patent data.
Total documents: 82,575,718
Total index size: 800 GB
1 replica
Average document size: ~5 KB
In total we have ~50 fields, of which 3 are nested
Our main document describes a group of sub-documents in a family <-> member relationship.
Weekly, ~300K documents are indexed
Weekly, ~1M documents are updated

Some specifications about the cluster:

We are running the cluster on Kubernetes in EKS
3 nodes
EBS gp3 storage, which allows IOPS and throughput to be customized.
2 CPU, 8GB RAM, 4GB of heap per node

Why do we need the Elasticsearch-Hadoop connector?
We run some expensive analytics on all documents on a regular basis. The results of each run are stored in Elasticsearch again so that we can expose them to our users. In these applications, we read all data (only the relevant fields) from ES and write the results back to ES.

What we have observed:

Indexing rate ~200-300 docs/second
CPU is fully utilized during writing
Reading all the data takes more than 8 hours; I stopped the job at that point because it was taking too long.
With default settings the search rate is about 1.5 per sec. I am not 100% sure what this means for the scroll API that ES-Hadoop uses.
When setting es.input.max.docs.per.partition to a non-default value (say 10,000), the search rate of the index increases (from 1.5 to 15 per sec) and CPU is more heavily utilized. Reading all data also becomes about 2 times faster. We tested this on a development cluster with less data.
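
To illustrate why this setting raises parallelism, here is a rough sketch of how many Spark input partitions a full read might produce. It assumes documents are spread evenly over the primary shards and that the connector splits each shard into roughly docs_in_shard / es.input.max.docs.per.partition slices (at least one). The function name and the even-spread assumption are illustrative and not taken from the connector itself, and the numbers are those of the main index (150 shards, as mentioned later in the thread), not of the development cluster.

```python
def estimate_input_partitions(total_docs, primary_shards, max_docs_per_partition=None):
    """Rough estimate of the Spark input partitions for a full-index read."""
    if max_docs_per_partition is None:
        # Assumption: without the setting, one partition per primary shard.
        return primary_shards
    # Assumption: documents are spread evenly over the primary shards.
    docs_per_shard = total_docs // primary_shards
    # Each shard is split into slices, never fewer than one.
    slices_per_shard = max(1, docs_per_shard // max_docs_per_partition)
    return primary_shards * slices_per_shard


one_per_shard = estimate_input_partitions(82_575_718, 150)
sliced = estimate_input_partitions(82_575_718, 150, 10_000)
print(one_per_shard, sliced)  # 150 8250
```

Under these assumptions the setting turns 150 partitions into several thousand small ones, which would explain both the higher search rate and the higher CPU load.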

What we have tried:

For writing, changed es.batch.size.entries, but the default gave the best performance.
For writing, temporarily disabled refreshes via refresh_interval, which helped.
For reading, decreased es.scroll.size from the default (1000) to 200, which doubled read performance.
For reading, changed es.input.max.docs.per.partition to 10,000 in order to increase the parallelism.
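
As a sketch, the settings listed above could be collected like this. The connector option names are the ones from the list; the host, index name, and field names are hypothetical placeholders, and es.read.field.include is my assumption for how "only relevant fields" would be selected.

```python
import json

ES_INDEX = "patents"  # hypothetical index name

# Connector options for reading (values as tried above).
read_options = {
    "es.nodes": "elasticsearch:9200",               # hypothetical address
    "es.read.field.include": "family_id,abstract",  # hypothetical field names
    "es.scroll.size": "200",
    "es.input.max.docs.per.partition": "10000",
}

# Connector options for writing; 1000 entries is the connector default.
write_options = {
    "es.nodes": "elasticsearch:9200",  # hypothetical address
    "es.batch.size.entries": "1000",
}


def refresh_interval_body(value):
    """JSON body for the index settings API (PUT /<index>/_settings)."""
    return json.dumps({"index": {"refresh_interval": value}})


disable_refresh = refresh_interval_body("-1")  # before the bulk write
restore_refresh = refresh_interval_body(None)  # null resets to the default
print(disable_refresh)
print(restore_refresh)
```

In PySpark the dictionaries would be passed along the lines of spark.read.format("org.elasticsearch.spark.sql").options(**read_options).load(ES_INDEX), and the two JSON bodies would be sent to the index settings endpoint before and after the write.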

Questions we have:

Is the current read speed that we are seeing (more than 8 hours to get some fields of all 80 million documents) normal?
Is the current write speed that we are seeing (200-300 docs/second) normal?
Are there any recommendations on the node sizes/configurations that we can change to improve reading and writing?
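
To put the numbers above in perspective, here is a back-of-the-envelope calculation. It only uses the figures from this post; the full-index rewrite is a hypothetical worst case, since the size of the analytics output is not stated.

```python
TOTAL_DOCS = 82_575_718
READ_HOURS = 8                       # job was stopped here, so the real time is longer
WEEKLY_DOCS = 300_000 + 1_000_000    # weekly indexed + updated documents


def hours_to_write(docs, docs_per_second):
    """Hours needed to write `docs` documents at a constant rate."""
    return docs / docs_per_second / 3600


# The read did not finish in 8 hours, so this is an upper bound on the rate.
read_rate_upper_bound = TOTAL_DOCS / (READ_HOURS * 3600)
print(f"read rate: below {read_rate_upper_bound:.0f} docs/s")

for rate in (200, 300):
    weekly = hours_to_write(WEEKLY_DOCS, rate)
    full = hours_to_write(TOTAL_DOCS, rate)  # hypothetical full rewrite
    print(f"{rate} docs/s: weekly load {weekly:.1f} h, full index {full / 24:.1f} days")
```

So the observed read rate is below roughly 2,900 docs/s, and at the observed write rate the weekly load fits in a couple of hours, while writing a result for every document would take days.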

Thanks in advance!

Christian_Dahlqvist · September 12, 2022, 10:32am

How many primary shards does the index have?

Have you tried increasing the amount of resources available to Elasticsearch? If CPU is fully utilized during indexing, it is possible that CPU, or maybe even your storage (Elasticsearch is often very I/O intensive), is the bottleneck. What does iostat look like while the cluster is under load?

Thijsvdp · September 12, 2022, 12:42pm

We currently have 150 shards, but only because we expected the total size to be bigger. We are in the process of reducing this to 50 and will observe the performance afterwards.
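
For reference, a quick sketch of what the two shard counts mean per shard and per node. It assumes the 800 GB total includes the replica copy and that the 150 and 50 refer to primary shards; both are assumptions on my side.

```python
def primary_shard_size_gb(total_index_gb, replicas, primary_shards):
    """Average size of one primary shard, assuming the total includes replicas."""
    primary_data_gb = total_index_gb / (1 + replicas)
    return primary_data_gb / primary_shards


def shard_copies_per_node(primary_shards, replicas, nodes):
    """Average number of shard copies (primaries + replicas) each node holds."""
    return primary_shards * (1 + replicas) / nodes


for shards in (150, 50):
    size = primary_shard_size_gb(800, 1, shards)
    per_node = shard_copies_per_node(shards, 1, 3)
    print(f"{shards} primaries: ~{size:.1f} GB per shard, ~{per_node:.0f} copies per node")
```

With these assumptions, 150 primaries gives shards of under 3 GB with about 100 shard copies on each of the 3 nodes, while 50 primaries gives shards of about 8 GB.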

To test whether I/O is the bottleneck, I increased the EBS limits, but this did not increase the write speed. The statistics also do not show any sign of reaching the EBS limits so far. I will check iostat once I am able to install it on the pods.

Regarding the resources, we were thinking the following per node:

4 CPU
32 GB of RAM
16 GB of heap
IOPS 5000 (EBS)
Throughput 250 MB/s (EBS)
In total we would have 3 nodes.

What do you think of these configurations? Any suggestions to improve? Once we are done with benchmarking we'll come back with the results.

Keith_Massey · September 12, 2022, 1:14pm

Another thing that can impact performance is the number of MapReduce tasks that are writing to Elasticsearch. You want a balance: enough tasks that you take advantage of the full cluster and write to as many shards as possible, but not so many that you overwhelm your Elasticsearch cluster. Are you writing from map tasks or reduce tasks? And how many?

Thijsvdp · September 12, 2022, 1:38pm

I am not sure what you mean by asking whether I am writing from map tasks or reduce tasks. I am aware that the number of tasks writing to ES should be limited. To control this, I repartition to a sensible number of partitions right before writing the Spark DataFrame or RDD to ES. So far, I have found that a sensible number is the number of shards (which is 3 on our development cluster). I have tried slightly more than that, but it seems to cause some throttling on our cluster.
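
A minimal sketch of that repartition-then-write step, assuming the PySpark DataFrame API and the connector's Spark SQL data source. The function name, index name, and option values are placeholders, not our actual job.

```python
def write_to_es(df, index, num_partitions, es_options):
    """Write a Spark DataFrame to Elasticsearch with a capped number of writers.

    Each partition becomes one task that sends bulk requests, so the
    partition count bounds how many tasks write concurrently.
    """
    (
        df.repartition(num_partitions)          # e.g. the number of shards
        .write
        .format("org.elasticsearch.spark.sql")  # ES-Hadoop Spark SQL source
        .options(**es_options)                  # e.g. es.nodes, es.batch.size.entries
        .mode("append")
        .save(index)
    )


# Hypothetical usage on the development cluster (3 shards):
# write_to_es(results_df, "patents-dev", 3, {"es.nodes": "elasticsearch:9200"})
```

Note that the number of tasks actually running at once is also limited by the executor cores available to the Spark job.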

Unfortunately, I have not yet gotten to the point of actually writing to the big preprod cluster at a larger scale, since reading all the data does not even complete within 8 hours. But I am curious to see how this ratio of tasks to shards holds for the bigger cluster. We currently have 150 shards on the big cluster, and I am interested to see whether 150 tasks can write to ES at the same time. Do you have any information or ideas on this?

Keith_Massey · September 12, 2022, 10:07pm

> I am not sure what you mean by asking whether I am writing from map tasks or reduce tasks.

That was a MapReduce question; I hadn't realized you were using Spark. It sounds like you're already limiting the number of Spark tasks that concurrently write to Elasticsearch, though.

Thijsvdp · September 13, 2022, 8:27am

@Keith_Massey Yeah, sorry for the confusion.

I am curious though about the relationship between shards and CPU. According to the documentation and some other discussions each shard runs a search on a single thread. Does this mean that if you have, say 6 shards and 12 threads/vCPU available, it only uses 6/12 threads? Or can it use multiple per shard?

And what is the relation between shards and CPU for writing? Say I have 6 shards, and 12 threads/vCPU available, can I write using 12 Spark tasks? Or will that be useless and overload the cluster?

And a last question: what is the relationship between available RAM and shard size?
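
For context on the thread questions, this is a small sketch of the default thread pool sizes as I understand them from the Elasticsearch thread pool documentation: the search pool is sized at (allocated processors * 3) / 2 + 1 and the write pool at the number of allocated processors. The function names are mine, and how many processors a pod is actually allocated depends on its CPU limits, so treat the numbers as an estimate.

```python
def default_search_pool_size(allocated_processors):
    """Default size of the fixed search thread pool on one node."""
    return (allocated_processors * 3) // 2 + 1


def default_write_pool_size(allocated_processors):
    """Default size of the fixed write (bulk/index) thread pool on one node."""
    return allocated_processors


for cpus in (2, 4):  # current and proposed CPUs per node
    print(
        f"{cpus} CPU: search threads = {default_search_pool_size(cpus)}, "
        f"write threads = {default_write_pool_size(cpus)}"
    )
```

If this is right, a 2 CPU node has only 2 write threads, which would fit with the CPU being fully utilized during writing.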

