Elasticsearch spark very high CPU

Hi,

I have a problem with a Spark job I created that consumes data from Kafka, writes it to Cassandra and then indexes it into Elasticsearch. Even when there is no data streaming through Kafka, the CPU load is sky high.
When I remove the 'save to ES' section, CPU is normal.
I have created a demo app and attached a link to it here.

Thanks,

Israel

Can you post a thread dump of the process? You could connect to the JVM
using jstack to see what causes the threads to spin the CPU. The demo is
great; however, as this is runtime behavior, it's kind of hard to reproduce
it outside your environment.
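
For anyone following along, a dump captured with `jstack <pid>` can be summarized quickly to see which component owns the busy threads. What follows is only a minimal Python sketch, not part of the demo app: the function name `summarize_runnable` is made up for illustration, and it assumes the usual HotSpot layout in which a quoted thread name is followed by a `java.lang.Thread.State:` line.

```python
import re
from collections import Counter

# Matches the quoted thread name at the start of a jstack thread header.
THREAD_HEADER = re.compile(r'^"(?P<name>[^"]+)"')
# Matches the state line that jstack prints under each Java thread header.
THREAD_STATE = re.compile(r'^\s*java\.lang\.Thread\.State:\s+(?P<state>\w+)')


def summarize_runnable(dump_text):
    """Count RUNNABLE threads grouped by thread-name prefix.

    The prefix is the thread name with a trailing number (and the
    separator in front of it) removed, so "pool-1-thread-3" and
    "pool-1-thread-7" are counted together as "pool-1-thread".
    """
    counts = Counter()
    current = None  # name of the thread whose state line we expect next
    for line in dump_text.splitlines():
        header = THREAD_HEADER.match(line)
        if header:
            current = header.group("name")
            continue
        state = THREAD_STATE.match(line)
        if state and current is not None:
            if state.group("state") == "RUNNABLE":
                prefix = re.sub(r'[\s\-#_]*\d+$', '', current)
                counts[prefix] += 1
            current = None
    return counts
```

Usage might look like `summarize_runnable(open("dump.txt").read()).most_common(5)`, where `dump.txt` (a hypothetical file name) holds the jstack output. Note that RUNNABLE threads blocked in native I/O calls also show up here, so the counts are a hint about where to look rather than a CPU measurement.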

Thanks,

It happens on 3 different machines (OS X, Ubuntu and Windows 7).

Hi Costin,

I have attached 3 jstack snapshots taken while the demo process is running and the CPU is very high.

Thanks,

Israel

I took a quick look at the stack traces, but the vast majority of threads belong to Cassandra or the Cassandra connector. That's not to say that Cassandra is to blame here or that the Elasticsearch connector has no impact; rather, it's hard to understand what is causing the JVM to eat all the CPU, since all these apps look like they are running within the same VM.

Can you try running each app in a separate VM? This will help isolate the CPU-hungry process. Furthermore, while writing data to Elasticsearch, can you run the hot threads API to see how Elasticsearch behaves?
You can also try using Marvel or other monitoring plugins to understand the impact that indexing has on Elasticsearch.
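
For reference, the hot threads API is exposed at `/_nodes/hot_threads` and returns a plain-text report. A minimal Python sketch for polling it could look like the following; the address `http://localhost:9200`, the helper names and an unsecured cluster are assumptions for illustration, not details taken from the demo.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def hot_threads_url(base_url, threads=3, interval="500ms", kind="cpu"):
    """Build the URL for the nodes hot threads API.

    threads:  how many of the busiest threads to report per node
    interval: sampling interval between the two snapshots taken
    kind:     what to sample: "cpu", "wait" or "block"
    """
    query = urlencode({"threads": threads, "interval": interval, "type": kind})
    return base_url.rstrip("/") + "/_nodes/hot_threads?" + query


def fetch_hot_threads(base_url="http://localhost:9200", **params):
    """Return the plain-text hot threads report (assumes no authentication)."""
    with urlopen(hot_threads_url(base_url, **params), timeout=10) as response:
        return response.read().decode("utf-8")
```

Calling `print(fetch_hot_threads())` while the Spark job is writing, and again while it is idle, gives two reports that can be compared side by side.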

Where does the data in Elasticsearch come from? Can you minimize your example by eliminating, for example, Cassandra and indexing only to Elasticsearch? Also, can you first index the data from HDFS or the file system, and then add Kafka and see whether it makes a difference?

There are a lot of moving parts and it's unclear whether a certain component is eating the CPU or whether it's their interaction that is causing the issue...

Hi Costin,

Thank you for your help.

As for the small demo I have provided: it writes a small amount of data to a Kafka topic, then consumes it, writes it to Cassandra and then indexes it into Elasticsearch.

If I remove the part that indexes to Elasticsearch, CPU is normal!

Also, what bothers me more is that after the data has been consumed (after a few seconds), the CPU stays very high although no data is being processed at all!
I can run the hot threads API, but is it relevant when there is no data at all? (The RDDs are empty.)
I will try running the indexing part in a separate JVM.

Thanks,

Israel

Try first with a version that simply reads from Kafka and writes to Elasticsearch without Cassandra in between; use as few parts as possible.

Will do. Thanks.

Once I moved Cassandra out of this JVM, everything went back to normal. I still don't know why... Anyway, thanks for the help.

I mean the Cassandra server itself. I am still using the Cassandra connector to write to Cassandra from the same JVM.

Interesting. It looks like there might be a tripping point - potentially the network layer/Netty might cause the issue if multiple instances are running within the same JVM.
This is just a hunch. Either way, even without this issue, I would strongly recommend running each server / long-running application in its own JVM / space, simply for better control and performance.