To give a brief overview: I have an index "allreportingdataindex" (example) with over 750,000 records. Each document has about 20 columns, of which about 15 hold array values. The ES instance is installed on a server and we access it over HTTP.
Now when I try to do a match_all query, it takes about 17 minutes to get all the data. If I only fetch one column of ID/number type, it takes 8 minutes. I need to fetch all the data within 5 seconds. Is this possible? What do I do?
I'm not sure you can do that. It may also depend on your hardware (SSD) and the network.
But you can use the size and from parameters, which by default let you display up to 10,000 records to your users. If you want to change this limit, you can change the index.max_result_window setting, but be aware of the consequences (i.e. memory).
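As a sketch, raising that limit is a dynamic index settings update. The index name is taken from the question and the localhost endpoint is an assumption; adjust both for your cluster:

```shell
# Raise the from+size window on the example index (assumes ES on localhost:9200).
# The default is 10,000; a larger window lets one search return more hits,
# at the cost of more heap on the coordinating node.
curl -X PUT "localhost:9200/allreportingdataindex/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index": {"max_result_window": 200000}}'
```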
So this is basically a reporting project: I need all the data upfront so I can pass it on the client side to the DataTables JS library. I am already using the scroll API and getting all the data, but my main requirement is to get it faster.
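For context, a scroll-based fetch like the one described is essentially the loop below. This is a minimal sketch: the method names follow the official elasticsearch-py client (search/scroll/clear_scroll), and the StubClient is a hypothetical in-memory stand-in so the loop logic can run without a live cluster.

```python
def scroll_all(client, index, page_size=10000):
    """Fetch every document from `index` by paging with the scroll API."""
    docs = []
    resp = client.search(index=index, scroll="2m",
                         body={"size": page_size, "query": {"match_all": {}}})
    scroll_id = resp["_scroll_id"]
    hits = resp["hits"]["hits"]
    while hits:
        docs.extend(hits)
        resp = client.scroll(scroll_id=scroll_id, scroll="2m")
        scroll_id = resp["_scroll_id"]
        hits = resp["hits"]["hits"]
    client.clear_scroll(scroll_id=scroll_id)  # free server-side scroll context
    return docs


class StubClient:
    """Hypothetical in-memory stand-in for elasticsearch.Elasticsearch."""
    def __init__(self, docs):
        self._docs = docs
        self._pos = 0
        self._size = 0

    def search(self, index, scroll, body):
        self._pos, self._size = 0, body["size"]
        return self.scroll(scroll_id="stub", scroll=scroll)

    def scroll(self, scroll_id, scroll):
        page = self._docs[self._pos:self._pos + self._size]
        self._pos += self._size
        return {"_scroll_id": "stub", "hits": {"hits": page}}

    def clear_scroll(self, scroll_id):
        pass


all_docs = scroll_all(StubClient([{"_id": i} for i in range(25)]),
                      "allreportingdataindex", page_size=10)
```

Each scroll round trip adds latency, which is why the thread below looks at running fewer, larger, or parallel requests instead.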
Will using an SSD make a big difference to performance? Also, this ES instance is hosted on a VM.
I did some experiments on my laptop. To put the findings in context, here is my setup:
Single node cluster with 8GB heap
20 indices / 36 shards / 15GB total data
Specific index used for the experiment: 2 shards / 6.5M docs / 1.37GB index size / best_compression codec / max_result_window = 1,000,000
Nothing else was querying / ingesting during tests
match_all query timing increases linearly with the size parameter:
size = 50K took 2.4 seconds
size = 100K took 4.6 seconds
size = 200K took 9.2 seconds
msearch with 2 match_all queries, with preference set to _shards:0 and _shards:1. Timing increased linearly with size, but remained comparable to the single match_all in the previous test:
size = 50K for each query, took 2.5 seconds (100K docs)
size = 100K took 5.15 seconds (200K docs)
size = 200K took 10.1 seconds (400K docs and response size 250MB)
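The msearch request used in that experiment can be sketched as below. The NDJSON header/body structure and the _shards:N preference are standard Elasticsearch _msearch features; the index name and shard count follow the experiment above, and would need adjusting for another cluster:

```python
import json

def build_msearch_body(index, num_shards, size):
    """Return the NDJSON body for a _msearch request, one query per shard."""
    lines = []
    for shard in range(num_shards):
        # Header line: target index plus a preference pinning the query to one shard.
        lines.append(json.dumps({"index": index, "preference": f"_shards:{shard}"}))
        # Body line: plain match_all limited to `size` hits.
        lines.append(json.dumps({"size": size, "query": {"match_all": {}}}))
    return "\n".join(lines) + "\n"  # msearch bodies must end with a newline

body = build_msearch_body("allreportingdataindex", num_shards=2, size=200000)
```

The body is then POSTed to /_msearch with Content-Type: application/x-ndjson, and the two per-shard searches run concurrently.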
With index size < 1GB, setting index.max_result_window to a high value as David suggested will reduce your round trips.
msearch will allow you to run multiple queries concurrently.
Multiple queries for msearch can be constructed by shard, by scroll with slices, or by a natural partitioning key like a timestamp.
If you have multiple nodes in the cluster, more shards will be a better choice.
Instead of msearch, you can run the same queries using multiple threads in your app. This allows you to utilize multiple client nodes, so a single coordinating node does not have to aggregate all the results.
Since you are fetching all fields, retrieving _source may be faster than fetching 20 fields as doc values.
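The sliced-scroll option mentioned above can be sketched as follows. Each slice is an independent scroll that a separate thread or worker can drain concurrently; the slice/id/max structure is the Elasticsearch sliced scroll API, while the sizes here are illustrative:

```python
def sliced_scroll_bodies(num_slices, size):
    """One search body per slice; issue each with ?scroll=2m from its own worker."""
    return [
        {
            # Slice `id` out of `max` total slices; together they cover the index.
            "slice": {"id": slice_id, "max": num_slices},
            "size": size,
            "query": {"match_all": {}},
        }
        for slice_id in range(num_slices)
    ]

bodies = sliced_scroll_bodies(num_slices=4, size=10000)
```

Keeping `max` at or below the number of shards generally avoids extra splitting work on the server side.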
I really don't understand why you would try to do this, and I don't see any way for it to be possible. You are not using any query, simply returning a whole copy of 800 MB of data, and you want that to complete in 5 s. Can you actually transfer 800 MB in 5 s across the network?
If you are feeding a UI, I suggest you use a scroll and have the JavaScript load the data bit by bit as required, e.g. as the user scrolls. I can't imagine the UI wants to hold 800 MB of data in memory either. You could definitely return a 'page' of results in 5 s.
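A per-page request for that approach can be sketched with the standard from/size scheme; the page-to-offset arithmetic below is the usual convention, and for very deep pages search_after or a scroll would be the better fit:

```python
def page_request(page, page_size):
    """Build a search body for the given zero-based page of results."""
    return {
        "from": page * page_size,   # offset of the first hit on this page
        "size": page_size,
        "query": {"match_all": {}},
        "sort": ["_doc"],           # a stable sort keeps pages consistent
    }

req = page_request(page=3, page_size=100)
```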