To give a brief overview: I have an index "allreportingdataindex" (example) with over 750,000 records. Each document has about 20 columns, of which about 15 hold array values. The ES instance is installed on a server and we access it over HTTP.
Now when I try to do a match_all query, it takes about 17 minutes to get all the data. If I only fetch one column of ID/number type, it takes 8 minutes. I need to fetch all the data within 5 seconds. Is this possible? What do I do?
I'm not sure you can do that. It may also depend on your hardware (SSD) and the network.
But you can use the size and from parameters, which by default let you display up to 10,000 records to your users. If you want to change this limit, you can change the index.max_result_window setting, but be aware of the consequences (i.e. memory).
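As a sketch, raising that limit is a dynamic index settings update. The index name is taken from the question and the localhost endpoint is an assumption; adjust both for your cluster:

```shell
# Raise the from+size window on the example index (assumes ES on localhost:9200).
# The default is 10,000; a larger window lets one search return more hits,
# at the cost of more heap on the coordinating node.
curl -X PUT "localhost:9200/allreportingdataindex/_settings" \
  -H 'Content-Type: application/json' \
  -d '{"index": {"max_result_window": 200000}}'
```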
So this is basically a reporting project: I need all the data upfront so I can pass it on the client side to the DataTables JS library. I am already using the scroll API and getting all the data, but my main requirement is to get it faster.
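For context, a scroll-based fetch like the one described is essentially the loop below. This is a minimal sketch: the method names follow the official elasticsearch-py client (search/scroll/clear_scroll), and the StubClient is a hypothetical in-memory stand-in so the loop logic can run without a live cluster.

```python
def scroll_all(client, index, page_size=10000):
    """Fetch every document from `index` by paging with the scroll API."""
    docs = []
    resp = client.search(index=index, scroll="2m",
                         body={"size": page_size, "query": {"match_all": {}}})
    scroll_id = resp["_scroll_id"]
    hits = resp["hits"]["hits"]
    while hits:
        docs.extend(hits)
        resp = client.scroll(scroll_id=scroll_id, scroll="2m")
        scroll_id = resp["_scroll_id"]
        hits = resp["hits"]["hits"]
    client.clear_scroll(scroll_id=scroll_id)  # free server-side scroll context
    return docs


class StubClient:
    """Hypothetical in-memory stand-in for elasticsearch.Elasticsearch."""
    def __init__(self, docs):
        self._docs = docs
        self._pos = 0
        self._size = 0

    def search(self, index, scroll, body):
        self._pos, self._size = 0, body["size"]
        return self.scroll(scroll_id="stub", scroll=scroll)

    def scroll(self, scroll_id, scroll):
        page = self._docs[self._pos:self._pos + self._size]
        self._pos += self._size
        return {"_scroll_id": "stub", "hits": {"hits": page}}

    def clear_scroll(self, scroll_id):
        pass


all_docs = scroll_all(StubClient([{"_id": i} for i in range(25)]),
                      "allreportingdataindex", page_size=10)
```

Each scroll round trip adds latency, which is why the thread below looks at running fewer, larger, or parallel requests instead.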
Will using an SSD make a big difference to performance? Also, this ES instance is hosted on a VM.
I did some experiments on my laptop. To put the findings in context, here is my setup:
Single node cluster with 8GB heap
20 indices / 36 shards / 15GB total data
Specific index used for the experiment: 2 shards / 6.5M docs / 1.37GB index size / best_compression codec / max_result_window = 1,000,000
Nothing else was querying / ingesting during tests
match_all query timing increases linearly with the size parameter:
size = 50K took 2.4 seconds
size = 100K took 4.6 seconds
size = 200K took 9.2 seconds
msearch with 2 match_all queries, with preference set to _shards:0 and _shards:1. Timing increased linearly with size, but remained comparable to the single match_all in the previous test:
size = 50K for each query, took 2.5 seconds (100K docs)
size = 100K took 5.15 seconds (200K docs)
size = 200K took 10.1 seconds (400K docs and response size 250MB)
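The msearch request used in that experiment can be sketched as below. The NDJSON header/body structure and the _shards:N preference are standard Elasticsearch _msearch features; the index name and shard count follow the experiment above, and would need adjusting for another cluster:

```python
import json

def build_msearch_body(index, num_shards, size):
    """Return the NDJSON body for a _msearch request, one query per shard."""
    lines = []
    for shard in range(num_shards):
        # Header line: target index plus a preference pinning the query to one shard.
        lines.append(json.dumps({"index": index, "preference": f"_shards:{shard}"}))
        # Body line: plain match_all limited to `size` hits.
        lines.append(json.dumps({"size": size, "query": {"match_all": {}}}))
    return "\n".join(lines) + "\n"  # msearch bodies must end with a newline

body = build_msearch_body("allreportingdataindex", num_shards=2, size=200000)
```

The body is then POSTed to /_msearch with Content-Type: application/x-ndjson, and the two per-shard searches run concurrently.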
With index size < 1GB, setting index.max_result_window to a high value as David suggested will reduce your round trips.
msearch will allow you to run multiple queries concurrently.
Multiple queries for msearch can be constructed by shard, by scroll with slices, or by a natural partitioning key like a timestamp.
If you have multiple nodes in the cluster, more shards will be a better choice.
Instead of msearch, you can run the same queries using multiple threads in your app. This allows you to utilize multiple client nodes, so a single coordinating node does not have to aggregate all the results.
Since you are fetching all fields, retrieving _source may be faster than fetching 20 fields as doc values.
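The sliced-scroll option mentioned above can be sketched as follows. Each slice is an independent scroll that a separate thread or worker can drain concurrently; the slice/id/max structure is the Elasticsearch sliced scroll API, while the sizes here are illustrative:

```python
def sliced_scroll_bodies(num_slices, size):
    """One search body per slice; issue each with ?scroll=2m from its own worker."""
    return [
        {
            # Slice `id` out of `max` total slices; together they cover the index.
            "slice": {"id": slice_id, "max": num_slices},
            "size": size,
            "query": {"match_all": {}},
        }
        for slice_id in range(num_slices)
    ]

bodies = sliced_scroll_bodies(num_slices=4, size=10000)
```

Keeping `max` at or below the number of shards generally avoids extra splitting work on the server side.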
I really don't understand why you would try to do this, and I don't see any way for it to be possible. You are not using any query, simply returning a whole copy of 800 MB of data, and you want that to complete in 5 s. Can you actually transfer 800 MB in 5 s across the network?
If you are feeding a UI, I suggest you use a scroll and have the JavaScript load the data bit by bit as required, e.g. as the user scrolls. I can't imagine the UI wants to hold 800 MB of data in memory either. You could definitely return a 'page' of results in 5 s.
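A per-page request for that approach can be sketched with the standard from/size scheme; the page-to-offset arithmetic below is the usual convention, and for very deep pages search_after or a scroll would be the better fit:

```python
def page_request(page, page_size):
    """Build a search body for the given zero-based page of results."""
    return {
        "from": page * page_size,   # offset of the first hit on this page
        "size": page_size,
        "query": {"match_all": {}},
        "sort": ["_doc"],           # a stable sort keeps pages consistent
    }

req = page_request(page=3, page_size=100)
```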