I have a problem with a Spark job I created that consumes data from Kafka, writes it to Cassandra and then indexes it into Elasticsearch. Even when there is no data streaming through Kafka, the CPU load is sky high.
When I remove the 'save to ES' section, CPU is normal.
I have created a demo app and attached a link to it here.
Can you post a dump of your CPU usage, and potentially connect to the JVM
using jstack to see what causes the threads to spin the CPU? The demo is
great; however, as this is a runtime behavior, it's kinda hard to reproduce
it outside your environment.
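One common way to connect "which thread is burning CPU" with a jstack dump is to match the OS thread ids against the `nid=0x...` field in each thread header. The snippet below is only a rough sketch of that idea, assuming a HotSpot-style dump saved to a text file; the helper names `split_thread_dump` and `find_hot_threads` are made up for illustration and are not part of any tool.

```python
import re

# Header line of a thread in a HotSpot-style jstack dump, for example:
# "main" #1 prio=5 os_prio=0 tid=0x... nid=0x1a2b runnable [0x...]
_HEADER = re.compile(r'^"(?P<name>[^"]*)".*\bnid=(?P<nid>0x[0-9a-fA-F]+)')


def split_thread_dump(dump_text):
    """Map each native thread id (as an int) to (thread name, stack lines)."""
    threads = {}
    current = None
    for line in dump_text.splitlines():
        match = _HEADER.match(line)
        if match:
            # nid is printed in hex; OS tools usually show it in decimal.
            current = int(match.group("nid"), 16)
            threads[current] = (match.group("name"), [line])
        elif current is not None:
            if line.strip():
                threads[current][1].append(line)
            else:
                # A blank line ends the current thread's stack.
                current = None
    return threads


def find_hot_threads(dump_text, os_thread_ids):
    """Return only the dump entries whose native id is in os_thread_ids."""
    threads = split_thread_dump(dump_text)
    return {tid: threads[tid] for tid in os_thread_ids if tid in threads}
```

On Linux, the decimal thread ids would typically come from something like `top -H -p <pid>`, and the dump from `jstack <pid> > dump.txt`; passing the busiest ids to `find_hot_threads` then shows which stacks are spinning.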
I took a quick look at the stack trace, but the vast majority of threads belong to Cassandra or the Cassandra connector. That's not to say that Cassandra is to blame here or that the Elasticsearch connector has no impact; rather, it's hard to understand what is causing the JVM to eat all the CPU, since all these apps look like they are running within the same VM.
Can you try running each app in a separate VM? This will help isolate the CPU-hungry process. Furthermore, while writing data to Elasticsearch, can you run the hot threads API to see how Elasticsearch behaves?
You can also try using Marvel or other monitoring plugins to understand the impact that indexing has on Elasticsearch.
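For reference, the hot threads check is a plain HTTP GET against the `_nodes/hot_threads` endpoint. Below is a small sketch using only the Python standard library; the node address `http://localhost:9200` and the helper names are assumptions for illustration, so adjust them to your cluster.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def hot_threads_url(base_url, threads=3, interval="500ms", kind="cpu"):
    """Build the URL for the nodes hot threads API.

    threads:  how many hot threads to report per node
    interval: sampling interval between the two thread snapshots
    kind:     what to sample: "cpu", "wait" or "block"
    """
    query = urlencode({"threads": threads, "interval": interval, "type": kind})
    return f"{base_url.rstrip('/')}/_nodes/hot_threads?{query}"


def fetch_hot_threads(base_url, **params):
    """Return the plain-text hot threads report from the cluster."""
    with urlopen(hot_threads_url(base_url, **params), timeout=10) as response:
        return response.read().decode("utf-8")


# Example (assumes a node listening on localhost:9200):
# print(fetch_hot_threads("http://localhost:9200", threads=5))
```

Running it once while the job is indexing and once while the RDDs are empty gives two reports to compare, which should show whether the Elasticsearch node itself is busy during the idle period.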
Where does the data in Elasticsearch come from? Can you minimize your example, for instance by eliminating Cassandra and indexing only to Elasticsearch? Also, can you first index the data from HDFS or the file system, and then add Kafka and see whether it makes a difference?
There are a lot of moving parts, and it's unclear whether a certain component is eating the CPU or whether it's their interaction that is causing the issue...
As for the small demo I provided: it writes a small amount of data to a Kafka topic and then consumes it, writes it to Cassandra and then indexes it into Elasticsearch.
If I remove the part of indexing to Elasticsearch, CPU is normal!
Also, what bothers me more is that after the data has been consumed (after a few seconds), the CPU stays very high although no data is being processed at all!
I can run the hot threads API, but is it relevant when there is no data at all? (the RDDs are empty)
I will try running the indexing part in a separate JVM.
Interesting. It looks like there might be a tipping point: potentially the network layer/Netty might cause the issue if multiple instances are running within the same JVM.
This is just a hunch. Either way, even without this issue, I would strongly recommend running each server / long-running application in its own JVM / space, simply for better control/performance.