Speed Up Query of huge indices

arkash20 · May 30, 2021, 7:03am

Hi,

I would like your opinion on the following challenge:
We have an index with about 200M records.
We would like to fetch all the records periodically and send them to a service for an internal purpose.
I tried the scroll and scan methods and reached a speed of about 15 minutes per 10M records. I tried tuning the parameters but couldn't improve on it. I wonder what the next best practice would be:

Using a service like Spark to run a distributed query?
Using multiprocessing in Python with the scan/scroll method?
Querying the shards individually, and achieving parallelism that way?
Splitting the index into multiple indices, and achieving parallelism that way?

I hope some of you have faced the same issue, and I will gladly hear your opinions.

thanks in advance,

warkolm · May 31, 2021, 6:47am

Elasticsearch is not designed to be a streaming service like this. Have you considered using something like Kafka?

Otherwise the only faster way would be to try scaling up your cluster hardware.

arkash20 · May 31, 2021, 6:50am

Hi!

I think I used too many words and confused you.
We want to get one field out of an index of about 200M records.
We would like to download it in a reasonable time, roughly 2 hours.
What is the best way to do this?
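
To put the numbers from this thread side by side, here is a small back-of-the-envelope calculation. It only uses the figures already mentioned (200M records, 15 minutes per 10M records, a 2-hour target) and assumes the current rate stays constant over the whole export; the variable names are just for illustration.

```python
# Figures quoted in the thread.
TOTAL_DOCS = 200_000_000          # records in the index
MEASURED_DOCS = 10_000_000        # records fetched in one measured run
MEASURED_MINUTES = 15             # time that run took
TARGET_MINUTES = 2 * 60           # desired total export time

# Time for the full export at the measured rate (assumes the rate is constant).
current_minutes = TOTAL_DOCS / MEASURED_DOCS * MEASURED_MINUTES

# Throughput now, and throughput needed to hit the target.
current_docs_per_sec = MEASURED_DOCS / (MEASURED_MINUTES * 60)
required_docs_per_sec = TOTAL_DOCS / (TARGET_MINUTES * 60)

# How much faster the export has to become.
speedup = current_minutes / TARGET_MINUTES

print(f"full export at current rate: {current_minutes / 60:.1f} h")
print(f"current:  {current_docs_per_sec:,.0f} docs/s")
print(f"required: {required_docs_per_sec:,.0f} docs/s")
print(f"needed speedup: {speedup:.1f}x")
```

So the full export would take about 5 hours at the current rate, and the 2-hour goal needs roughly a 2.5x improvement.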

Christian_Dahlqvist · May 31, 2021, 6:55am

If you want quick stream access to your full data set, Kafka is probably a better option, as Elasticsearch is not optimized for this the way Kafka is. If you were searching for and extracting subsets of data based on complex criteria it might be a different story.

If you explain a bit more about the data set, e.g. how frequently it is added to and whether it is updated or not, we might be able to give better advice. How large are the documents? You mention needing only one field - how large a portion of the document is this?

arkash20 · May 31, 2021, 7:14am

OK, let's forget about the streaming.
I need to get all the data from an index. The index has about 200M records, with 5 fields per doc,
and I need only one of them; let's call the field "name".
The data is periodically updated.

Christian_Dahlqvist · May 31, 2021, 7:19am

Then I would recommend looking into using a sliced scroll to increase the parallelism. How many primary shards does the index you are reading from have?

Which version of Elasticsearch are you using?
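
For readers following along, a sliced scroll run in parallel might look roughly like the sketch below. It assumes the official `elasticsearch` Python client (7.x) and its `helpers.scan` function; the URL, the index name `my-index`, the page size and the helper names `build_slice_query`, `export_slice` and `export_all` are placeholders made up for illustration, not anything from this thread. The slice count of 12 is also just an example value to tune.

```python
from multiprocessing import Pool

ES_URL = "http://localhost:9200"  # assumption: adjust to your cluster
INDEX = "my-index"                # assumption: hypothetical index name
FIELD = "name"                    # the single field to export
NUM_SLICES = 12                   # example value; tune for your cluster


def build_slice_query(slice_id, max_slices, field=FIELD):
    """Search body for one slice that returns only `field` per document."""
    return {
        "slice": {"id": slice_id, "max": max_slices},
        "_source": [field],
        "query": {"match_all": {}},
    }


def export_slice(slice_id, max_slices):
    """Scroll through one slice and return how many documents were read."""
    # Imported here so every worker process creates its own client.
    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import scan

    client = Elasticsearch(ES_URL)
    count = 0
    for hit in scan(
        client,
        index=INDEX,
        query=build_slice_query(slice_id, max_slices),
        size=5000,      # documents per scroll page (assumption, tune this)
        scroll="5m",    # how long each scroll context is kept alive
    ):
        value = hit["_source"].get(FIELD)
        # Hand `value` to the downstream service here.
        count += 1
    return count


def export_all(num_slices=NUM_SLICES):
    """Run one worker process per slice and return the total document count."""
    jobs = [(slice_id, num_slices) for slice_id in range(num_slices)]
    with Pool(num_slices) as pool:
        return sum(pool.starmap(export_slice, jobs))
```

Call `export_all()` from under an `if __name__ == "__main__":` guard so that multiprocessing can start the workers safely. Requesting only the needed field via `_source` keeps the responses small, which usually matters as much as the parallelism.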

arkash20 · May 31, 2021, 7:24am

Elasticsearch version: 7.12.0.
The index has 12 shards.

I tried the sliced scroll option using multiprocessing.
The thing is, I'm not sure exactly how Elasticsearch splits the slices across the shards (I found a formula in the docs, but it isn't clear), so I'm a bit afraid of overloading the cluster.
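
As far as I understand the 7.x docs, slices are split across the shards first and then, within a shard, by a hash of `_id` modulo the slice count. The docs give the example of 2 shards and 4 slices, where slices 0 and 2 go to the first shard and slices 1 and 3 to the second. The function below is only a toy model of that assignment, not Elasticsearch code: `shards_for_slice` is a made-up name, and the branch for fewer slices than shards is my assumption about the behaviour rather than something the docs spell out.

```python
def shards_for_slice(slice_id, max_slices, num_shards):
    """Toy model: which shard numbers does a given slice read from?"""
    if not 0 <= slice_id < max_slices:
        raise ValueError("slice_id must be in [0, max_slices)")
    if max_slices >= num_shards:
        # At least as many slices as shards: each slice is served by one shard.
        # Slices that share a shard split its documents by a hash of _id.
        return [slice_id % num_shards]
    # Fewer slices than shards (assumption): each shard is read in full
    # by exactly one slice.
    return [shard for shard in range(num_shards) if shard % max_slices == slice_id]


# The example from the docs: 2 shards, 4 slices.
for slice_id in range(4):
    print(slice_id, shards_for_slice(slice_id, 4, 2))

# The setup in this thread: 12 shards, 12 slices.
for slice_id in range(12):
    print(slice_id, shards_for_slice(slice_id, 12, 12))
```

Under this model, 12 slices on 12 shards means each worker reads exactly one whole shard, so the load is at most one scroll per shard at a time. The docs also warn that using more slices than shards makes the first calls slow, because a per-slice filter has to be computed on each shard.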

system · June 28, 2021, 7:24am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
