I would like your opinion on the following challenge I'm facing:
We have an index with about 200M records.
We would like to fetch all the records periodically and send them to a service for internal use.
I tried the scan/scroll method (roughly the sketch after the list below) and reached a speed of about 15 minutes per 10M records. I played with the parameters but couldn't improve on that. I wonder what the best next step would be:
using a framework like Spark to run a distributed query?
using multiprocessing in Python to parallelize the scan/scroll?
querying individual shards directly to achieve parallelism?
splitting the index into multiple indices to achieve parallelism?
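For reference, here is roughly what I'm running now; a minimal sketch using the Python client (host and index names are placeholders):

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # placeholder host

# Plain scan/scroll over the whole index; this is the run that
# takes ~15 min per 10M records for us.
count = 0
for hit in helpers.scan(
    es,
    index="my-index",                        # placeholder index name
    query={"query": {"match_all": {}}},
    size=5000,    # batch size per scroll request
    scroll="5m",  # keep the scroll context alive between batches
):
    count += 1
```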
I hope some of you have faced the same issue; I would gladly hear your opinions.
I think I used too many words and confused you.
We want to get one field out of an index of about 200M records.
We would like to download it in a reasonable time, ~2 hours.
What is the best way to do this?
If you want quick streaming access to your full data set, Kafka is probably a better option than Elasticsearch, which is not optimized for this kind of full export the way Kafka is. If you were searching for and extracting subsets of data based on complex criteria, it might be a different story.
If you explain a bit more about the data set, e.g. how frequently it is added to and whether it is updated, we might be able to give better advice. How large are the documents? You mention needing only one field; how large a portion of each document is it?
OK, let's forget about the streaming.
I need to get all the data from an index. The index has about 200M records with 5 fields per doc,
and I need only one of them; let's call the field "name".
The data is periodically updated.
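The only tweak I've made for that so far is source filtering, so each hit carries just that one field (the field name is ours):

```python
# Request body for the scan: return only the "name" field per document,
# which keeps the payload of each hit small.
body = {
    "_source": ["name"],
    "query": {"match_all": {}},
}
# Each hit then exposes the value as hit["_source"]["name"].
```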
Then I would recommend looking into using a sliced scroll to increase the parallelism. How many primary shards does the index you are reading from have?
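The slice definition goes in the body of the search request that opens the scroll. A rough sketch of one slice (index and field names are placeholders):

```python
body = {
    "slice": {"id": 0, "max": 4},  # this worker reads slice 0 of 4
    "_source": ["name"],
    "query": {"match_all": {}},
}
# Each slice id (0..max-1) can be scanned independently, in parallel.
```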
So I checked the sliced scroll option using multiprocessing.
The thing is, I'm not sure exactly how Elasticsearch splits the slices across the shards (I found a formula in the docs, but it isn't clear to me), so I'm a bit afraid of overloading the cluster.
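For what it's worth, this is roughly the shape of what I tried; a minimal sketch (host, index, and field names are placeholders, and the slice count is just a guess at our primary shard count):

```python
import multiprocessing as mp

from elasticsearch import Elasticsearch, helpers

NUM_SLICES = 4  # rule of thumb: at most the number of primary shards


def pull_slice(slice_id):
    # One client per process; clients are not safely shared across forks.
    es = Elasticsearch("http://localhost:9200")  # placeholder host
    body = {
        "slice": {"id": slice_id, "max": NUM_SLICES},
        "_source": ["name"],
        "query": {"match_all": {}},
    }
    # Stream this slice to its own file instead of holding 200M values in RAM.
    with open(f"names_slice_{slice_id}.txt", "w") as out:
        for hit in helpers.scan(es, index="my-index", query=body,
                                size=5000, scroll="5m"):
            out.write(hit["_source"]["name"] + "\n")


if __name__ == "__main__":
    # One worker per slice; each opens its own scroll context.
    with mp.Pool(NUM_SLICES) as pool:
        pool.map(pull_slice, range(NUM_SLICES))
```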