Reading a big index from Spark and memory usage

We use Spark to read a full index of 80 GB and then write the data as a Parquet file.

I see memory consumption climbing to 24 GB per Spark worker.

Hence my question: during the scroll search (which takes about 1 hour), is the data stored in the JVM heap on the es4Hadoop side?

We store one batch of documents (the number of documents specified in es.scroll.size) in heap memory in es-hadoop. That's typically nowhere near 24 GB.
Without having seen your code, here are some thoughts:

  • What are you specifying for your executor memory size? Is it possible that the garbage collector is just waiting until you've used all the memory?
  • What is your batch size (es.scroll.size)? The default went from 50 to 1000 in 8.0, so if you just recently upgraded to 8.0, you might be seeing more memory use (and better performance) than before. If your average document size is unusually large, you might want to reduce it (see the sketch after this list for how the setting is passed).
  • Is it possible you're not letting go of memory somewhere? Can you reproduce the memory leak in a code snippet you can paste here?
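
For reference, here is a minimal sketch of how the batch size can be set on the Spark side. This assumes the Spark SQL integration of elasticsearch-hadoop; the host, index name, and numbers are placeholders, not values from this thread:

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    // es.* keys set on the SparkConf are picked up by the connector.
    val conf = new SparkConf()
      .set("es.nodes", "es-host:9200")   // placeholder host
      .set("es.scroll.size", "1000")     // docs per scroll batch held in heap, per task

    val spark = SparkSession.builder().config(conf).getOrCreate()
    val df = spark.read.format("org.elasticsearch.spark.sql").load("my-index")  // placeholder index

    // Rough upper bound for the scroll buffers on one executor (assumed numbers):
    //   es.scroll.size * average doc size * concurrent tasks per executor
    // e.g. 1000 docs * ~2 KB * 4 cores ≈ ~8 MB, far below a 20+ GB heap.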

Well, my code is pretty straightforward: read from ES, then write Parquet.

Our Elasticsearch index has 3 shards, is 80 GB in size, and contains 70,000,000 docs.

Spark runs on YARN with 3 executors of 4 cores / 22 GB memory each, and the job takes about 1 hour.

I tune es4Hadoop with these settings (a sketch of how they are passed to the Spark reader follows just below):

es.scroll.size=4000
es.input.max.docs.per.partition=5000000
es.scroll.keepalive=60m
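
Something along these lines, in sketch form (host, index name, and output path are placeholders here):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("es-index-to-parquet")
      .getOrCreate()

    // Read the whole index through the scroll API with the tuned settings.
    val df = spark.read
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes", "es-host:9200")                    // placeholder host
      .option("es.scroll.size", "4000")
      .option("es.input.max.docs.per.partition", "5000000")
      .option("es.scroll.keepalive", "60m")
      .load("my-index")                                      // placeholder index name

    // Write everything out as Parquet.
    df.write.mode("overwrite").parquet("/path/to/output")    // placeholder path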

So this gives 12 tasks, which all run in parallel. We monitor this with the very nice Elastic APM agent.

I was just wondering what is in the 20 GB heap, and where the ES data goes during the long search operation.

So I am going to test with

es.scroll.size=500

Wow! I can see the heap profile is totally different with es.scroll.size=500.

But it took 2 hours, so I have to figure out the best parameters to balance:

  • overall run time
  • search parallelism, via es.input.max.docs.per.partition
  • heap usage, via es.scroll.size

Thank you for the clear explanation!

I'm surprised it takes that long (although I don't know anything about your hardware or data, so maybe it's expected). Have you tried just reading all the data from Elasticsearch and then dropping it, to isolate whether the slowness is reading from Elasticsearch or writing to Parquet? My assumption is that reading from Elasticsearch is the slow part, but I could be wrong.
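
One way to do that, as a sketch (same connector options as your job; host and index name are placeholders): keep the read side, but force a full scan and discard the rows instead of writing Parquet.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("es-read-benchmark").getOrCreate()

    // Read-only pass to time the Elasticsearch side alone.
    val readOnly = spark.read
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes", "es-host:9200")    // placeholder host
      .option("es.scroll.size", "500")
      .load("my-index")                      // placeholder index name

    val start = System.nanoTime()
    // Counting on the RDD (rather than the DataFrame) avoids column pruning,
    // so the full document source is still fetched and then discarded.
    val n = readOnly.rdd.count()
    println(s"Read $n docs in ${(System.nanoTime() - start) / 1e9} s")

If that read-only pass already takes close to 2 hours, the bottleneck is the scroll itself rather than the Parquet write.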

I will on Monday.

Or maybe there are some benchmark tools I could run to test reading from ES with Spark, so I could compare results?
