Fetch a million docs in seconds

Hi,

In brief, I have an index "allreportingdataindex" (example) and it has over 750,000 records. Each document has about 20 columns and about 15 columns with array values. The ES instance is installed on a server and we access it over HTTP.

Now when I try to do a "match_all" it takes about 17 minutes to get all the data. If I only try to get 1 column of ID/number type, it takes 8 minutes. I need to fetch all the data within 5 seconds. Is this possible? What do I do?

Please help! :)

Not sure you can do it. Maybe it also depends on your hardware (SSD) and the network?

But you can use:

  • the size and from parameters to display up to 10000 records (the default limit) to your users. If you want to change this limit, you can change the index.max_result_window setting, but be aware of the consequences (i.e. memory).
  • the search_after feature to do deep pagination.
  • the Scroll API if you want to extract a resultset to be consumed by another tool later.
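For reference, a minimal sketch of the search_after loop from the second bullet. It assumes the documents can be sorted on a unique field (the field name `id` is hypothetical) and it takes the HTTP call as a `search` function that you supply, so it is not tied to any particular client library.

```python
from typing import Callable, Iterator


def iter_hits(search: Callable[[dict], dict],
              page_size: int = 10_000,
              sort_field: str = "id") -> Iterator[dict]:
    """Yield every hit, one page at a time, using search_after.

    `search` is a caller-supplied function (an assumption of this sketch)
    that sends the request body to the index and returns the parsed
    JSON response.
    """
    body = {
        "size": page_size,
        "query": {"match_all": {}},
        # search_after needs a deterministic sort; `sort_field` is a placeholder.
        "sort": [{sort_field: "asc"}],
    }
    while True:
        hits = search(body)["hits"]["hits"]
        if not hits:
            return
        yield from hits
        # Resume after the sort values of the last hit on this page.
        body = {**body, "search_after": hits[-1]["sort"]}
```

Each round trip stays under the default 10000-record window, so index.max_result_window does not need to change.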

What did you do so far?

So this is basically a reporting project, so I need all the data upfront to pass it on to the client side for datatable js. I am already using the scroll API and I am getting all the data, but my main requirement is to get it faster.

Will using an SSD make a big difference to performance? Also, this ES instance is hosted on a VM.

Please suggest some way

If you can make sure the full index fits in the OS page cache you may also see an improvement in performance.

Yes, an SSD makes a big difference.

But this data keeps updating, at least twice a day. Also, will caching this in the OS page cache affect the server throughput/performance?

If you cannot cache it all, you need very fast disks. I doubt you will get down to seconds though.

Saurabh,
What is the size of data (match_all response or index) in bytes?

@Vinayak_Sapre Hi!

So the data is as below:
813 MB = 813000000 Bytes (in decimal)
813 MB = 852492288 Bytes (in binary)

Saurabh,

I did some experiments on my laptop. To put the findings in context, here is my setup:

  1. Single node cluster with 8GB heap
  2. 20 indices / 36 shards / 15GB total data
  3. Specific index used for experiment: 2 shards / 6.5M docs / 1.37GB index size / best_compression codec / max_result_window = 1,000,000
  4. Nothing else was querying / ingesting during tests

match_all query timing increases linearly with the size parameter:
size = 50K took 2.4 seconds
size = 100K took 4.6 seconds
size = 200K took 9.2 seconds
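As a rough illustration of what that linear trend implies, the sketch below averages the three timings into a per-document cost and scales it to 750,000 documents. This is only an extrapolation from one laptop and a different index, so treat the result as an order-of-magnitude guess, not a prediction for the server in question.

```python
# Timings reported above: (size parameter, seconds taken).
measured = [(50_000, 2.4), (100_000, 4.6), (200_000, 9.2)]


def seconds_per_doc(samples):
    """Average fetch cost per document across all samples."""
    total_docs = sum(n for n, _ in samples)
    total_secs = sum(t for _, t in samples)
    return total_secs / total_docs


def estimate_seconds(doc_count, samples=measured):
    """Scale the per-document cost linearly to `doc_count` documents."""
    return doc_count * seconds_per_doc(samples)


print(f"{estimate_seconds(750_000):.1f} s")
```

On these numbers a single match_all over 750,000 documents lands at roughly 35 seconds, well above the 5 second target.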

msearch query with 2 match_all queries, with preference set to _shards:0 and _shards:1. Timing increased linearly with size, but remained comparable to the single match_all in the previous test:
size = 50K for each query, took 2.5 seconds (100K docs)
size = 100K took 5.15 seconds (200K docs)
size = 200K took 10.1 seconds (400K docs and response size 250MB)
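A sketch of how the msearch request body for this test could be built. It only produces the newline-delimited JSON (one header line and one body line per query); sending it to the _msearch endpoint is left out. The index name is the example one from the question, and the helper name `msearch_body` is made up for illustration.

```python
import json


def msearch_body(index, shard_ids, size):
    """Build an NDJSON msearch payload with one match_all per shard."""
    lines = []
    for shard in shard_ids:
        # Header line: which index to hit, pinned to a single shard.
        lines.append(json.dumps({"index": index,
                                 "preference": f"_shards:{shard}"}))
        # Body line: the query itself.
        lines.append(json.dumps({"size": size,
                                 "query": {"match_all": {}}}))
    # The msearch payload must end with a newline.
    return "\n".join(lines) + "\n"


print(msearch_body("allreportingdataindex", [0, 1], 50_000))
```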

  • With index size < 1GB, setting index.max_result_window to a high value as David suggested will reduce your round trips.
  • msearch will allow you to run multiple queries concurrently.
  • Multiple queries for msearch can be constructed by shards, by scroll with slices, or by a natural partitioning key like timestamp.
  • If you have multiple nodes in the cluster, more shards will be a better choice.
  • Instead of msearch, you can run the same queries using multiple threads in your app. This will allow you to utilize multiple client nodes, and the client node will not have to aggregate all results.
  • Since you are fetching all fields, retrieving _source may be better than fetching 20 doc values.
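To make the "scroll with slices" plus "multiple threads" idea concrete, here is a minimal sketch. It builds one sliced request body per worker and runs them in a thread pool. The function that actually drains a slice (`fetch_slice`) is assumed to be supplied by you, for example a scroll loop over HTTP; both helper names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def slice_body(slice_id: int, max_slices: int, size: int = 10_000) -> dict:
    """Request body for one slice of a sliced scroll."""
    return {
        "slice": {"id": slice_id, "max": max_slices},
        "size": size,
        "query": {"match_all": {}},
    }


def fetch_all(fetch_slice: Callable[[dict], List[dict]],
              max_slices: int) -> List[dict]:
    """Drain every slice concurrently and concatenate the results.

    `fetch_slice` is caller-supplied (an assumption of this sketch): it
    takes a request body and returns all documents for that slice.
    """
    bodies = [slice_body(i, max_slices) for i in range(max_slices)]
    with ThreadPoolExecutor(max_workers=max_slices) as pool:
        parts = list(pool.map(fetch_slice, bodies))
    return [doc for part in parts for doc in part]
```

Because each thread handles its own slice, no single request has to carry the whole result set.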

I really don't understand why you would try and do this and I don't see any way for this to be possible. You are not using any query, simply returning a whole copy of 800 MB of data and you want it to complete in 5s. Can you actually transfer 800 MB in 5s across the network?
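The network question can be checked with simple arithmetic. The sketch below uses the 813 MB figure quoted earlier in the thread; the 1 Gbit/s link speed is purely an assumption for illustration, and protocol overhead and JSON serialization time are ignored.

```python
def required_mbit_per_s(size_mb: float, seconds: float) -> float:
    """Throughput needed to move `size_mb` megabytes in `seconds`."""
    return size_mb * 8 / seconds


def transfer_seconds(size_mb: float, link_mbit_per_s: float) -> float:
    """Best-case time to move `size_mb` megabytes over a given link."""
    return size_mb * 8 / link_mbit_per_s


print(required_mbit_per_s(813, 5))    # throughput needed for the 5 s target
print(transfer_seconds(813, 1000))    # best case on an assumed 1 Gbit/s link
```

Moving 813 MB in 5 s needs about 1.3 Gbit/s sustained, so on a 1 Gbit/s link the transfer alone takes about 6.5 s before Elasticsearch does any work.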

If you are feeding a UI, I suggest you use a scroll and have the JavaScript load the data bit by bit as required, e.g. as the user scrolls. I can't imagine the UI wants 800 MB of data to hold in memory either. You could definitely return a 'page' of results in 5 s.

