We are experiencing unexpectedly low read and write speeds on our ES cluster using the Elasticsearch-Hadoop connector. We have read a lot about the different configuration options in the connector and experimented with the values, with limited success. This is the first time we are managing an ES cluster, and we are not sure whether we have allocated enough resources or whether something else is causing the issue.
Some specifications about the data in the main index:
Our application searches patent data.
Total documents 82575718
Total index size 800GB
Average document size ~5 KB
In total we have ~50 fields of which 3 are nested
Our main document describes a group of sub-documents in a family <-> member relationship.
Weekly, ~300K documents are indexed
Weekly, ~1M documents are updated
Some specifications about the cluster:
We are running the cluster on Kubernetes in EKS
EBS gp3 storage, which allows customizing IOPS and throughput.
2 CPU, 8GB RAM, 4GB of heap per node
Why we need the Elasticsearch-Hadoop connector:
We run some expensive analytics on all documents on a regular basis. The results of each run are stored in Elasticsearch again so that we can expose them to our users. In these applications, we read all data (only the relevant fields) from ES and write the results back to ES.
What we have observed:
Indexing rate ~200-300 docs/second
CPU is fully utilized during writing
Reading all the data took more than 8 hours, at which point we stopped it because it was taking too long.
With default settings the search rate is about 1.5 per sec. I am not 100% sure what this would mean for the scroll API that ES-Hadoop is using
When setting es.input.max.docs.per.partition to a non-default value (say 10,000), the search rate of the index increases (from 1.5 to 15 per sec) and CPU is more heavily utilized. Reading all data is also about 2 times faster. We tested this on a development cluster with less data.
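To make the parallelism effect concrete, here is a back-of-the-envelope calculation (plain Python, our numbers plugged in) of how es.input.max.docs.per.partition changes the number of input partitions ES-Hadoop creates. As far as we understand, the connector splits each shard into roughly ceil(docs_in_shard / max.docs.per.partition) slices; the uniform per-shard distribution below is an assumption:

```python
import math

# Our production numbers; per-shard doc distribution assumed uniform.
total_docs = 82_575_718
shards = 150
max_docs_per_partition = 10_000

docs_per_shard = total_docs / shards
# Each shard is split into roughly this many input slices:
slices_per_shard = math.ceil(docs_per_shard / max_docs_per_partition)
total_partitions = shards * slices_per_shard
print(slices_per_shard, total_partitions)  # 56 slices/shard, 8400 partitions
```

Without the setting there is one input partition per shard (150 in our case), which would explain why smaller partitions increase parallelism and CPU utilization.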
What we have tried:
For writing, changed es.batch.size.entries, but the default showed best performance.
For writing, disabled the refresh_interval temporarily, which helped
For reading, decreased es.scroll.size from the default (1000) to 200, which doubled read performance.
For reading, set es.input.max.docs.per.partition to 10,000 in order to increase parallelism.
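For reference, this is roughly how we pass the read settings from Spark. This is only a configuration sketch: the node address, index name, and field list are placeholders, and es.read.field.include is shown as an illustration of restricting the read to relevant fields.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.read
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes", "elasticsearch:9200")            # placeholder address
      .option("es.scroll.size", "200")                     # down from the default 1000
      .option("es.input.max.docs.per.partition", "10000")  # more, smaller input partitions
      .option("es.read.field.include", "field_a,field_b")  # only the relevant fields
      .load("patents"))                                    # placeholder index name
```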
Questions we have:
Is the current read speed we are seeing (more than 8 hours to fetch a few fields of all 80 million documents) normal?
Is the current write speed that we are seeing (200-300 docs/second) normal?
Are there any recommendations on the node sizes/configurations that we can change to improve reading and writing?
Have you tried increasing the amount of resources available to Elasticsearch? If CPU is fully utilized during indexing, it is possible that CPU, or maybe even your storage (Elasticsearch is often very I/O intensive), is the bottleneck. What does iostat look like while the cluster is under load?
We currently have 150 shards, but that is because we thought the total size would be bigger. We are actually reducing it to 50 now, and will observe the performance then.
To test whether I/O is the bottleneck, I increased the limits for EBS, but it did not improve writing speed. The statistics also show no sign of reaching any of the EBS limits so far. I will try to verify with iostat once I am able to install it on the pods.
Regarding the resources, we were thinking the following per node:
32 GB of RAM
16 GB of heap
IOPS 5000 (EBS)
Throughput 250 MB/s (EBS)
In total we would have 3 nodes.
What do you think of these configurations? Any suggestions to improve? Once we are done with benchmarking we'll come back with the results.
Another thing that can impact performance is the number of mapreduce tasks that are writing to Elasticsearch. You want a balance of enough so that you are taking advantage of the full cluster and writing to as many shards as possible, but not so many that you overwhelm your Elasticsearch cluster. Are you writing from map tasks or reduce tasks? And how many?
I am not sure what you mean by whether I am writing from map tasks or reduce tasks. I am aware that the number of tasks writing to ES should be limited. To control this, I repartition to a sensible number of partitions right before writing the Spark DataFrame or RDD to ES. So far, a sensible number has turned out to be the number of shards (which is 3 on our development cluster). I have tried a bit more than that, but it seems to cause some throttling on our cluster.
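Concretely, the write side currently looks roughly like this. Again only a sketch: the DataFrame, node address, and index name are placeholders for our real ones.

```python
num_shards = 3  # development cluster

(results_df
    .repartition(num_shards)                  # cap write parallelism at one task per shard
    .write
    .format("org.elasticsearch.spark.sql")
    .option("es.nodes", "elasticsearch:9200") # placeholder address
    .option("es.batch.size.entries", "1000")  # the default, which performed best for us
    .mode("append")
    .save("analytics_results"))               # placeholder index name
```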
Unfortunately I have not yet gotten to actually writing to the big cluster in preprod at a larger scale, since reading all the data does not even complete within 8 hours. But I am curious to see how this tasks-to-shards ratio holds up on the bigger cluster. We currently have 150 shards there, and I am interested to see whether 150 tasks can write to ES at the same time. Do you have any information/ideas on this?
I am curious, though, about the relationship between shards and CPU. According to the documentation and some other discussions, each shard runs a search on a single thread. Does this mean that if you have, say, 6 shards and 12 threads/vCPUs available, a search only uses 6 of the 12 threads? Or can it use multiple threads per shard?
And what is the relationship between shards and CPU for writing? Say I have 6 shards and 12 threads/vCPUs available: can I write using 12 Spark tasks, or would that be useless and overload the cluster?
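To make my mental model concrete (please correct me if this is wrong), this is the arithmetic I am assuming for the search side, based on the documented default search thread pool size of int(allocated_processors * 3 / 2) + 1:

```python
def search_pool_size(vcpus: int) -> int:
    """Default Elasticsearch search thread pool size:
    int(allocated_processors * 3 / 2) + 1."""
    return vcpus * 3 // 2 + 1

pool = search_pool_size(12)
print(pool)  # 19
# My assumption: a single search over 6 shards occupies at most 6 of these
# 19 threads on a node, and the remaining threads are free to serve
# concurrent searches (e.g. the parallel scroll slices ES-Hadoop opens).
```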
And last question, what is the relationship between RAM available and shard size?