I have experienced some weird behavior on my development cluster while reading an index using the Elasticsearch-Hadoop connector. I am benchmarking the read performance by running some scripts that read the full index while tracking some metrics. All seemed fine at first, but after a couple of tests the read performance started to degrade and the results became less and less reproducible.
I am not sure what the issue is, but I noticed that 2 out of 3 nodes show 100% RAM usage all the time (even when idle). It is also these nodes that show almost zero read I/O. I am not sure what is causing this or how to resolve it.
Here is a screenshot of the cluster statistics when idle:
Well, actually the issue is that benchmarking a read of the whole index in Spark using the Elasticsearch-Hadoop connector gave extremely inconsistent results for the same settings. I would expect the total time to be more or less the same for both runs, or even faster for the later run because of caching.
I tried to find an answer as to why this is the case and thought maybe the RAM was too full. But since you say this looks totally normal, that is probably not the problem.
Let me explain a bit more about what I observed in the first run (first screenshot) and the last run (second screenshot). Both runs use the exact same ES-Hadoop settings.
First run:
Steady search rate of ~3
Similarly high CPU usage of ~75% on all 3 nodes.
Average read I/O of 250 on all 3 nodes.
The full read was done in ~18 minutes.
Last run:
Very bumpy search rate with an average of ~1
Low CPU usage of ~10% on all 3 nodes.
Very low read I/O on one node, and 0 read I/O on the others.
The full read took more than 30 minutes, at which point I stopped it.
This morning I restarted the cluster and ran the tests again, and I cannot reproduce the behavior above anymore. I am not sure what happened, but any suggestions about what could cause such situations would be very helpful for my understanding.
Another update on the situation. I have continued the investigation by repeating the same experiment with the default configuration four times, and the same thing is starting to happen again.
First experiment:
total time 17 mins
search rate starts at 3 and drops to 1 towards the end.
Could you possibly provide some more detail here on the use case? Just a quick observation: that is an extremely high IOPS value compared to an extremely low search rate for the majority of use cases (I believe). For someone to help, it would be useful to know:
How are these nodes hosted? Self-hosted on-prem? Self-hosted Cloud (AWS/GCP/Azure)? Elastic Cloud?
What type of disk is backing the nodes?
Are these nodes only running Elasticsearch, or are there other processes also running on them?
More detail about your use case(s). If possible, example queries that are being run as part of your use case.
How are your indices set up? I see each node uses roughly ~35GB of disk space; is this just one index with 3 shards (1 shard per node), or some other setup?
Yeah, I am happy to provide the extra information. Note that I am benchmarking by reading fields of ALL the data using the Elasticsearch-Hadoop connector with Spark. This is (I think) why the search rate is low and the IOPS is high: Elasticsearch-Hadoop uses the scroll API to batch through the data and read it into Spark.
Regarding the extra information:
We have 3 nodes hosted on EKS. Each node currently has a minimum of 600m and a maximum of 2 vCPU, 8GB RAM, and a 4GB heap.
The storage backing the nodes is EBS GP3 SSD volumes, which are configurable in terms of IOPS and throughput.
The 3 Elasticsearch nodes are 3 different pods running in EKS. However, there are multiple pods running on the same instance. I will check whether increasing the resource requirements helps with this issue.
What I am doing is reading the whole index (only the relevant fields) from Elasticsearch, performing expensive computations on the data in Spark, and writing some calculated values back to the same index. At this stage I am benchmarking the read speed on this cluster. The query being fired through the scroll API is:
{"query": {"match_all": {}}}
We have one main index containing patent-related data, with some nested fields and text included. This is the development cluster, so it holds a slice of the actual data we use in production. This main index on the development cluster is ~60 GB in total. We currently have 3 primary shards, one on each node. The tests above were run with 1 replica.
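For reference, the read side of the benchmark looks roughly like the PySpark sketch below. The host, index name, and field names are placeholders (not our real settings), but the connector options are the ones I have been experimenting with.
# Minimal PySpark sketch of the read side of the benchmark.
# Host, index and field names below are placeholders.
from pyspark.sql import SparkSession

# The elasticsearch-hadoop jar needs to be on the Spark classpath (e.g. via --jars).
spark = SparkSession.builder.appName("es-read-benchmark").getOrCreate()

df = (
    spark.read.format("org.elasticsearch.spark.sql")
    .option("es.nodes", "elasticsearch.internal.example")  # placeholder host
    .option("es.port", "9200")
    .option("es.query", '{"query": {"match_all": {}}}')    # same match_all query as above
    .option("es.read.field.include", "field_a,field_b")    # placeholder field list
    .option("es.scroll.size", "1000")                       # scroll batch size per request
    .load("patents-dev")                                     # placeholder index name
)

# Force a full read so the whole index is pulled through the scroll API.
print(df.count())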
Given the use case, I'd recommend looking at a few things:
See if you can use the point in time API; the scroll API documentation seems to recommend it for large searches, which appears to be your case (a rough sketch of this follows after this list).
See about adjusting the size parameter of the search. By default it is 10, which might be pretty small for your use case, but if your docs are large you might not want to go to the maximum of 10,000, as this could require more RAM.
See about increasing the available system RAM. Elasticsearch is able to leverage the underlying OS/filesystem caching to speed up searching. If you're low on RAM and unable to cache much of the filesystem data, it will take longer (and more IOPS) to search the data directly from disk.
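To illustrate the point in time suggestion outside of the connector, a rough sketch against the plain REST API (using the Python requests library; host, index name, and page size are placeholders) could look like this:
# Rough sketch of a point-in-time + search_after read over the whole index.
# Host and index name are placeholders; error handling omitted.
import requests

ES = "http://localhost:9200"
INDEX = "my-index"

# Open a point in time that is kept alive between requests.
pit_id = requests.post(f"{ES}/{INDEX}/_pit", params={"keep_alive": "5m"}).json()["id"]

search_after = None
while True:
    body = {
        "size": 1000,
        "query": {"match_all": {}},
        "pit": {"id": pit_id, "keep_alive": "5m"},
        "sort": [{"_shard_doc": "asc"}],  # tie-breaker sort used for search_after paging
    }
    if search_after is not None:
        body["search_after"] = search_after
    hits = requests.post(f"{ES}/_search", json=body).json()["hits"]["hits"]
    if not hits:
        break
    search_after = hits[-1]["sort"]
    # ... hand the batch over to whatever consumes it ...

# Clean up the point in time when done.
requests.delete(f"{ES}/_pit", json={"id": pit_id})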
The Elasticsearch-Hadoop connector for Spark uses the scroll API. I don't think the connector implements the point in time API.
I have been testing different sizes, and this indeed changes the speed to some extent. However, my main problem is that the performance degrades the more often I run the experiment. By the fourth run the performance is so poor that it just seems to hang.
I have increased the RAM, but that does not seem to help much. One of the odd things I observe is that on the fourth run there is hardly any IOPS compared to the first run.
One of the odd things I observe is that on the fourth run there is hardly any IOPS compared to the first run.
This could actually be the filesystem cache working here.
I'd check what the hot threads show while you're running the tests. They're kind of annoying to read, but they should show you what the system is doing at a more detailed level.
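Something along these lines (a quick sketch, placeholder host, no authentication) run while the read job is active would capture a few samples to compare:
# Quick sketch: sample hot threads a few times while the read job is running.
# Placeholder host; add authentication if your cluster needs it.
import time
import requests

ES = "http://localhost:9200"

for i in range(5):
    resp = requests.get(f"{ES}/_nodes/hot_threads", params={"threads": 999})
    with open(f"hot_threads_{i}.txt", "w") as f:
        f.write(resp.text)
    time.sleep(30)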
The other question: while we've mainly been looking at the Elasticsearch side, have you looked at the Spark side to make sure there isn't anything going on at that end of things? It is weird to me that Elasticsearch is slowing down for no apparent reason. I'm curious whether the Spark end isn't getting the same "cleanup" between tests, or something similar that could be slowing down what it can process.
Regarding your comment that the problem could be on the Spark/Elasticsearch-Hadoop side: I think this is unlikely, but it is indeed good to rule it out completely. Tomorrow morning I will write some code that uses the scroll API directly, without Spark, to fetch all the data and see if I can reproduce this behavior.
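Roughly, I am planning something like the sketch below (host and index name are placeholders), so the data goes through the same scroll API but without Spark or the connector in between:
# Sketch of the planned Spark-free test: scroll through the whole index with plain REST calls.
# Host and index name are placeholders; error handling omitted.
import time
import requests

ES = "http://localhost:9200"
INDEX = "my-index"

start = time.time()
total = 0

# Open the initial scroll.
resp = requests.post(
    f"{ES}/{INDEX}/_search",
    params={"scroll": "5m"},
    json={"size": 1000, "query": {"match_all": {}}},
).json()
scroll_id = resp["_scroll_id"]
hits = resp["hits"]["hits"]

while hits:
    total += len(hits)
    resp = requests.post(
        f"{ES}/_search/scroll",
        json={"scroll": "5m", "scroll_id": scroll_id},
    ).json()
    scroll_id = resp["_scroll_id"]
    hits = resp["hits"]["hits"]

# Clean up the scroll context and report the timing.
requests.delete(f"{ES}/_search/scroll", json={"scroll_id": scroll_id})
print(f"read {total} docs in {time.time() - start:.1f}s")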
This could actually be the filesystem cache working here.
I was thinking the same. However, I would expect the job to finish earlier in that case. Or could it be that as the RAM fills up with filesystem cache (all nodes are near 100%), the process somehow gets slower?
Also, another thing to mention: when I restart the cluster, the problems persist. However, when I leave the cluster untouched overnight and start the job again in the morning, everything seems to work great again.
No, it shouldn't. The memory occupied by the filesystem cache is still effectively available and will be freed up if needed. This sounds quite normal.
I have seen this type of behaviour when resources like CPU cycles and IOPS are tied to credits and not consistently available. EBS GP3 SSD volumes should, as far as I know, have a fixed floor and not be subject to this. Is it possible that CPU allocation is metered and subject to throttling?
I have double-checked whether anything could be limiting it. However, our cluster is running on Kubernetes, and we have assigned 4-6 vCPU to each of the pods. Also, the underlying EC2 instances are of type *.2xlarge, which don't have CPU credits.
Could it be that the thread pool somehow gets exhausted after some time, and/or that some processes keep running? I am trying to understand the hot_threads API. On each of the nodes I notice null_Worker-* threads. I am not sure what these are, but I haven't seen them before.
Another thing we observed which might be of interest: if we examine GET /_nodes/hot_threads?threads=999 during the initial runs, we can actually see some [search] threads consuming CPU, as expected. However, as we repeat the experiment a couple of times, the performance degrades and the total time increases a lot. If we then examine GET /_nodes/hot_threads?threads=999, we don't see any [search] threads anymore. If I also request the idle threads with GET /_nodes/hot_threads?threads=999&ignore_idle_threads=false, I can see a lot of [search] threads like this:
0.0% [cpu=0.0%, other=0.0%] (0s out of 500ms) cpu usage by thread 'elasticsearch[elasticsearch-0][search][T#8]'
10/10 snapshots sharing following 13 elements
java.base@18.0.1.1/jdk.internal.misc.Unsafe.park(Native Method)
java.base@18.0.1.1/java.util.concurrent.locks.LockSupport.park(LockSupport.java:341)
java.base@18.0.1.1/java.util.concurrent.LinkedTransferQueue$Node.block(LinkedTransferQueue.java:470)
java.base@18.0.1.1/java.util.concurrent.ForkJoinPool.unmanagedBlock(ForkJoinPool.java:3464)
java.base@18.0.1.1/java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3435)
java.base@18.0.1.1/java.util.concurrent.LinkedTransferQueue.awaitMatch(LinkedTransferQueue.java:669)
java.base@18.0.1.1/java.util.concurrent.LinkedTransferQueue.xfer(LinkedTransferQueue.java:616)
java.base@18.0.1.1/java.util.concurrent.LinkedTransferQueue.take(LinkedTransferQueue.java:1286)
app//org.elasticsearch.common.util.concurrent.SizeBlockingQueue.take(SizeBlockingQueue.java:152)
java.base@18.0.1.1/java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1062)
java.base@18.0.1.1/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1122)
java.base@18.0.1.1/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
java.base@18.0.1.1/java.lang.Thread.run(Thread.java:833)
Given the low CPU usage, low IO traffic, and lack of any active threads, it does rather look like Elasticsearch is just waiting to be given some work to do and the bottleneck is elsewhere.
You could try using the REST request tracer to log every single request and the corresponding response. That would give a good indication whether the slowness is within Elasticsearch or without.
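If it helps, the tracer can be toggled dynamically through the cluster settings API, roughly like this (placeholder host; the trace lines end up in the Elasticsearch server logs):
# Sketch: enable and later disable the REST request tracer via the cluster settings API.
# Placeholder host; the TRACE output appears in the Elasticsearch server logs.
import requests

ES = "http://localhost:9200"

# Enable tracing of every inbound REST request and its response.
requests.put(
    f"{ES}/_cluster/settings",
    json={"persistent": {"logger.org.elasticsearch.http.HttpTracer": "TRACE"}},
)

# ... run the benchmark and inspect the logs ...

# Reset the logger afterwards (the tracer is very verbose).
requests.put(
    f"{ES}/_cluster/settings",
    json={"persistent": {"logger.org.elasticsearch.http.HttpTracer": None}},
)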
@DavidTurner Thanks for the suggestion! We examined the requests and indeed found that response times within ES are quick for the scroll API. We found, however, that the requests are coming in too slowly: something is limiting the rate at which API calls to ES can be made. We are not sure yet what it is, but we are examining the different components of the infrastructure. I will update when I know more.
Alright, we have finally tracked down the issue. I will quickly summarize everything discussed here for people running into similar issues.
We were seeing extreme performance degradation when reading a full index all at once using Spark (via the Elasticsearch-Hadoop connector). At first everything looks good and performance is fine; after a while the search rate starts to drop and there is hardly any evidence of read I/O.
Changing read parameters such as es.scroll.size and es.input.max.docs.per.partition did change search rates, but did NOT mitigate the overall performance degradation.
We observed that the read performance through Spark came back after the cluster had been idle for a while.
We were able to confirm that the problem was not on the Elasticsearch side by using the REST request tracer (thanks @DavidTurner). This showed that Elasticsearch's request-response times were great.
We eventually found that the ingress controller of our Kubernetes cluster was underperforming, causing a bottleneck. This is probably caused by Spark opening a lot of connections, but we still need to verify that.