Reading a big index from Spark and memory usage

We use Spark to read a full index of 80 GB and then write the data as a Parquet file.

I see memory consumption climbing to 24 GB per Spark worker.

Hence my question: during the scroll search (which takes about 1 hour), is the data stored in the JVM heap on the es4Hadoop side?

We store one batch of documents (the number of documents specified in es.scroll.size) in heap memory in es-hadoop. That's typically nowhere near 24 GB.
Without having seen your code, here are some thoughts:

  • What are you specifying for your executor memory size? Is it possible that the garbage collector is just waiting until you've used all the memory?
  • What is your batch size (es.scroll.size)? The default went from 50 to 1000 in 8.0, so if you just recently upgraded to 8.0, you might be seeing more memory use (and better performance) than before. If your average document size is unusually large, you might want to reduce it (see the sketch after this list for how the setting is passed).
  • Is it possible you're not letting go of memory somewhere? Can you reproduce the memory leak in a code snippet you can paste here?
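
For reference, here is a minimal sketch of how the batch size can be set on the Spark side. This assumes the Spark SQL integration of elasticsearch-hadoop; the host, index name, and numbers are placeholders, not values from this thread:

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    // es.* keys set on the SparkConf are picked up by the connector.
    val conf = new SparkConf()
      .set("es.nodes", "es-host:9200")   // placeholder host
      .set("es.scroll.size", "1000")     // docs per scroll batch held in heap, per task

    val spark = SparkSession.builder().config(conf).getOrCreate()
    val df = spark.read.format("org.elasticsearch.spark.sql").load("my-index")  // placeholder index

    // Rough upper bound for the scroll buffers on one executor (assumed numbers):
    //   es.scroll.size * average doc size * concurrent tasks per executor
    // e.g. 1000 docs * ~2 KB * 4 cores ≈ ~8 MB, far below a 20+ GB heap.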

Well, my code is pretty straightforward: read from ES, then write Parquet.

Our Elasticsearch index has 3 shards, is 80 GB in size, and contains 70,000,000 docs.

Spark runs on YARN with 3 executors of 4 cores / 22 GB memory each, and the job takes about 1 hour.

I tune es4Hadoop with these settings (a sketch of how they are passed to the Spark reader follows just below):

es.scroll.size=4000
es.input.max.docs.per.partition=5000000
es.scroll.keepalive=60m
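
Something along these lines, in sketch form (host, index name, and output path are placeholders here):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("es-index-to-parquet")
      .getOrCreate()

    // Read the whole index through the scroll API with the tuned settings.
    val df = spark.read
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes", "es-host:9200")                    // placeholder host
      .option("es.scroll.size", "4000")
      .option("es.input.max.docs.per.partition", "5000000")
      .option("es.scroll.keepalive", "60m")
      .load("my-index")                                      // placeholder index name

    // Write everything out as Parquet.
    df.write.mode("overwrite").parquet("/path/to/output")    // placeholder path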

So this gives 12 tasks, which all run in parallel. We monitor this with the very nice Elastic APM agent.

I was just wondering what is in the 20 GB heap, and where the ES data goes during the long search operation.

So I am going to test with

es.scroll.size=500

Wow! I can see the heap profile is totally different with es.scroll.size=500.

But it took 2 hours, so I have to figure out the best parameters to balance:

  • overall run time
  • search parallelism, via es.input.max.docs.per.partition
  • heap usage, via es.scroll.size

Thank you for the clear explanation!

I'm surprised it takes that long (although I don't know anything about your hardware or data, so maybe it's expected). Have you tried just reading all the data from Elasticsearch and then dropping it, to isolate whether the slowness is reading from Elasticsearch or writing to Parquet? My assumption is that reading from Elasticsearch is the slow part, but I could be wrong.
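
One way to do that, as a sketch (same connector options as your job; host and index name are placeholders): keep the read side, but force a full scan and discard the rows instead of writing Parquet.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("es-read-benchmark").getOrCreate()

    // Read-only pass to time the Elasticsearch side alone.
    val readOnly = spark.read
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes", "es-host:9200")    // placeholder host
      .option("es.scroll.size", "500")
      .load("my-index")                      // placeholder index name

    val start = System.nanoTime()
    // Counting on the RDD (rather than the DataFrame) avoids column pruning,
    // so the full document source is still fetched and then discarded.
    val n = readOnly.rdd.count()
    println(s"Read $n docs in ${(System.nanoTime() - start) / 1e9} s")

If that read-only pass already takes close to 2 hours, the bottleneck is the scroll itself rather than the Parquet write.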

I will on Monday.

Or maybe there are some benchmark tools I could run to test reading from ES with Spark, so I could compare results?
