Is Spark useful for reindexing?

Hi all,
I'm at the beginning of writing an application to reindex all my data (in order to update my mapping, add new fields, etc.). I would like to know the best way to go in terms of speed, since I have terabytes of data. Is the Reindex API the right solution? Or would it be useful to use Spark to parallelize the task and gain performance? Any pointers/links on how to do such a thing? Any help is greatly appreciated!

Three people on my engineering and ops teams have put a lot of effort into tuning bulk indexing and have had limited success. The best indexing rate we can get is ~70k documents per second. I started a separate thread to discuss that issue: Spark Bulk Import Performance Benchmarks.
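For context, the usual write path from Spark is the elasticsearch-spark (elasticsearch-hadoop) connector's saveToEs. This is not the exact code from the benchmark thread, just a minimal sketch; the host names, the HDFS path, the index/type name, and the batch settings are placeholders you would tune for your own cluster:

```scala
// Generic sketch of bulk-writing to Elasticsearch from Spark with the
// elasticsearch-spark (elasticsearch-hadoop) connector. Not the exact code
// from the benchmark thread; hosts, paths, index/type and batch settings
// below are placeholders.
import org.apache.spark.{SparkConf, SparkContext}
import org.elasticsearch.spark._

object PushToEs {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("push_to_es")
      .set("es.nodes", "es-node-1,es-node-2")   // placeholder hosts
      .set("es.port", "9200")
      .set("es.batch.size.entries", "5000")     // docs per bulk request
      .set("es.batch.size.bytes", "5mb")        // bytes per bulk request
      .set("es.batch.write.refresh", "false")   // don't refresh after each bulk

    val sc = new SparkContext(conf)

    // Each Map becomes one JSON document; in practice you'd read real records
    // from HDFS/S3 and build richer documents.
    val docs = sc.textFile("hdfs:///path/to/source")   // placeholder path
      .map(line => Map("raw" -> line))

    docs.saveToEs("myindex/mytype")                    // placeholder index/type
    sc.stop()
  }
}
```

The main throughput knobs on the connector side are the per-bulk batch sizes (es.batch.size.entries / es.batch.size.bytes) and how many Spark partitions are writing concurrently.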

What version of Elasticsearch are you using?

This article is a little old, but it might help.

Thank you for your answer, Jonathan. We're using ES 2.4.2 but will eventually upgrade to 5.0. So, based on your experience, would you say that it's worthwhile to use Spark, or do you get the best performance using something more basic like "scroll to retrieve batches of documents from the old index, and the bulk API to push them into the new index" (ref.: reindexing your data)?

Yes, I would start off with the technique from "reindexing your data", because it has less setup and overhead than Spark.
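The loop itself is simple: open a scroll on the old index and push each page into the new index with a bulk request until the scroll runs dry. Here's a minimal sketch assuming the 2.x Java transport client (since you're on 2.4.2); the host, index names, and page size are placeholders, and error handling/retries are left out:

```scala
// Minimal scroll-then-bulk reindex loop, assuming the ES 2.x transport client.
// "old_index", "new_index", the host and the page size are placeholders;
// retries and failure handling are omitted for brevity.
import java.net.InetAddress
import org.elasticsearch.client.transport.TransportClient
import org.elasticsearch.common.transport.InetSocketTransportAddress
import org.elasticsearch.common.unit.TimeValue
import org.elasticsearch.index.query.QueryBuilders

object ScrollReindex {
  def main(args: Array[String]): Unit = {
    val client = TransportClient.builder().build()
      .addTransportAddress(
        new InetSocketTransportAddress(InetAddress.getByName("localhost"), 9300))

    val keepAlive = new TimeValue(60000)   // keep the scroll context alive for 1 minute

    // Open a scroll over the old index, pulling documents in pages.
    var resp = client.prepareSearch("old_index")
      .setQuery(QueryBuilders.matchAllQuery())
      .setScroll(keepAlive)
      .setSize(1000)                       // docs per scroll page
      .get()

    while (resp.getHits.getHits.nonEmpty) {
      // Re-index the current page into the new index with one bulk request.
      val bulk = client.prepareBulk()
      resp.getHits.getHits.foreach { hit =>
        bulk.add(client.prepareIndex("new_index", hit.getType, hit.getId)
          .setSource(hit.getSourceAsString))
      }
      val bulkResp = bulk.get()
      if (bulkResp.hasFailures) println(bulkResp.buildFailureMessage())

      // Fetch the next page of the scroll.
      resp = client.prepareSearchScroll(resp.getScrollId).setScroll(keepAlive).get()
    }

    client.close()
  }
}
```

The Reindex API you mentioned does essentially the same scroll-and-bulk loop server-side, so it's also worth trying before writing your own client.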

And in terms of performance, how does it compare? Approximately the same?

I don't know the performance of the scroll method, but I can say I've spent 6 months optimizing Spark and only get ~70k docs per second. I have 100 TB of data and ended up sampling it before indexing into Elasticsearch because of the expensive indexing rate.

Do you have a lot of Spark experience?

Unfortunately no, I'm only a beginner. But your 70K/sec is quite impressive to me, since I only get something like 70K/min! Of course it depends on a lot of things (e.g. my average document size is 10 KB; what's yours?). Sorry for the silly question, but what do you mean by "I ended up sampling it before indexing ... due to the expensive indexing rate"? Thanks again for your tips.

Yeah, if you're a beginner with Spark and don't have the luxury of time, I'd use the non-Spark method.

70k per second for a 20-node Elasticsearch cluster is not super impressive at all. That works out to 6,048,000,000 documents indexed per day (70k × 86,400 seconds), and when you have TBs of data your import will run for several days. So what I did was sample my original data set down to 1/25th. My use case is data analysis, and I have the luxury of producing rough estimates.
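The sampling step itself is a one-liner in Spark. A hypothetical sketch, reusing a docs RDD like the one in the earlier snippet, with the fraction and seed as placeholders:

```scala
// Hypothetical sketch: keep roughly 1/25th of the RDD (sampling without
// replacement) before writing to Elasticsearch. The fixed seed just makes
// the sample reproducible; fraction, seed and index name are placeholders.
val sampled = docs.sample(withReplacement = false, fraction = 1.0 / 25, seed = 42L)
sampled.saveToEs("myindex/mytype")
```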

I understand. And, if I may ask again, what is your average document size?

Very small. Around 160 bytes.

Thanks again @jspooner. I took a look at the code you kindly provided here (https://gist.github.com/jspooner/ccba83a5a8f36fe1276350ef838be38d). Is the Spark code only the 19-line-long Scala program called push_to_es.scala?

push_to_es.scala is only a snippet of the code.
