We use Spark to read a full 80 GB index and then write the data out as a Parquet file.
I see memory consumption climbing to 24 GB per Spark worker.
Hence my question: during the scroll search (about 1 hour), is the data stored in the JVM heap on the es4Hadoop side?
We store one batch of documents (the number of documents specified in es.scroll.size) in heap memory in es-hadoop. That's typically nowhere near 24 GB.
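For a rough sense of scale (a back-of-the-envelope estimate, assuming documents of around 1 KB): with the current default of es.scroll.size=1000, one batch is only on the order of 1 MB of heap per task.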
Without having seen your code, here are some thoughts:
Have you changed the scroll size (es.scroll.size)? The default went from 50 to 1000 in 8.0, so if you just recently upgraded to 8.0 you might be seeing more memory use (and better performance) than before. If your average document size is unusually large, you might want to reduce it.

Well, my code is pretty straightforward: read from ES, then write Parquet.
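Something like this minimal sketch (the host, index name, and output path are placeholders, not our real values):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("es-to-parquet")
      .getOrCreate()

    // Read the whole index through the es-hadoop Spark SQL data source.
    val df = spark.read
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes", "es-host:9200")   // placeholder host
      .load("my-index")                     // placeholder index name

    // Write everything out as Parquet.
    df.write.parquet("hdfs:///data/my-index.parquet")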
Our Elasticsearch index has 3 shards, weighs 80 GB, and holds 70,000,000 docs.
Spark runs on YARN with 3 executors of 4 cores / 22 GB memory each, for about 1 hour.
I tuned es4Hadoop with:
es.scroll.size=4000
es.input.max.docs.per.partition=5000000
es.scroll.keepalive=60m
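On the reader, those settings look something like this (same sketch as above, with placeholder host and index name):

    val df = spark.read
      .format("org.elasticsearch.spark.sql")
      .option("es.nodes", "es-host:9200")                   // placeholder host
      .option("es.scroll.size", "4000")                     // docs fetched per scroll request
      .option("es.input.max.docs.per.partition", "5000000") // caps docs per Spark partition
      .option("es.scroll.keepalive", "60m")                 // keep scroll contexts alive for the whole read
      .load("my-index")                                     // placeholder index name

At roughly 1.2 KB per document (80 GB / 70,000,000 docs), a 4000-doc scroll batch should only be around 5 MB per task.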
So this gives 12 tasks (3 executors × 4 cores), which all run in parallel. We monitor this with the very nice Elastic APM agent.
I was just wondering: what is in the 20 GB heap, and where does the ES data go during the long search operation?
So I am going to test with es.scroll.size=500.
Whoa! I can see the heap is totally different with es.scroll.size=500, but the job took 2 hours, so I have to figure out the best parameters to balance es.input.max.docs.per.partition and es.scroll.size (a smaller scroll size means less heap per batch, but more scroll round trips to Elasticsearch).
Thank you for the clear explanation!
I'm surprised it takes that long (although I don't know anything about your hardware or data, so maybe it's expected). Have you tried just reading all the data from Elasticsearch and then dropping it, to isolate whether the slowness is reading from Elasticsearch or writing to Parquet? My assumption is that reading from Elasticsearch is the slow part, but I could be wrong.
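One way to time that (a sketch; df is the DataFrame read from Elasticsearch as above):

    val start = System.nanoTime()
    // Materialize every row but do nothing with it, so no Parquet write is involved.
    df.rdd.foreach(_ => ())
    val secs = (System.nanoTime() - start) / 1e9
    println(f"full read from Elasticsearch took $secs%.1f s")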
I will on Monday.
Or maybe there are benchmark tools I could run to test reading from ES with Spark, so I could compare results?