Is it possible to stream 50k-100k documents in < 1 sec?

Hi,

We have a use case in our project where Elasticsearch is used for search, and we need to send anywhere from 50,000-100,000 documents for further processing before sending them on to the client.
The response documents need to contain only a single field (the document ID) from Elasticsearch.

With the HTTP REST client, the transfer took 10+ seconds. Is there any other way to
transfer at least 50,000 documents in < 1 sec?
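For context, a minimal sketch of how we request only the document IDs today (the index name and query are placeholders; the helper just builds the search body we send over HTTP, with the scroll API):

```python
def build_id_only_search(page_size=10_000):
    """Build a search body that returns only document IDs.

    Disabling _source avoids fetching and serializing document
    bodies, which is most of the transfer cost when only the
    IDs are needed.
    """
    return {
        "size": page_size,            # max hits per scroll page
        "_source": False,             # skip document bodies entirely
        "query": {"match_all": {}},   # placeholder for the real query
        "sort": ["_doc"],             # cheapest sort order for scrolling
    }

body = build_id_only_search()
# Sent as POST /my-index/_search?scroll=2m, then followed up with
# POST /_search/scroll using the returned _scroll_id until the
# hits come back empty.
```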

Any thoughts?

Ingest performance will depend on the size and complexity of the documents, the mappings used, the bulk size, and the size and specification of the cluster and hardware. I have seen clusters index several hundred thousand documents per second, so it is possible. If you have very high peaks in throughput, you may need a larger cluster than expected.

If you tell us a bit more about the data we can probably give a better answer.

@Christian_Dahlqvist Thanks for the info.
The problem I am describing is not with ingestion of records but with search.

The search query selects on average 50,000-80,000 records, and I would like to
know how fast I could transfer the whole result set for further processing.

In our case, the search response took around 10+ seconds. Each document contains
at most 20 JSON key-value pairs (~2 KB). The hardware is 2 AWS c4.8xlarge instances,
and the index holds at most 5 million documents.

I remember writing that answer for a different thread on my mobile, so I am sorry it somehow ended up on the wrong thread without me noticing.

Retrieving large volumes of documents is not necessarily what Elasticsearch was designed and optimized for, and it results in a lot of random disk reads if your indices are not fully cached by the operating system. In that case I would expect the response time to be limited more by disk speed than by CPU.

In order to optimize this, it would help if you could answer the following questions:

  • How large is your total data set?
  • How many indices/shards/replicas do you currently have in place? How much space on disk do these occupy?
  • If you run your queries so they return only a small number of documents, e.g. 1,000, how long do they take to execute?
  • How many concurrent queries of the type you mentioned do you need to support?
  • When retrieving large amounts of data it is recommended to use a scroll query. What settings are you currently using for yours?
  • Given that you are using c4 instances I assume you may be using EBS. If so, what is the size and specification of your EBS volumes?
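If a single scroll turns out to be the bottleneck, a sliced scroll lets several workers pull through the same result set in parallel. A minimal sketch (the index name and query are placeholders) of building the per-slice request bodies:

```python
def build_sliced_scroll_bodies(num_slices, page_size=10_000):
    """Build one search body per slice so that num_slices workers
    can scroll through the same result set concurrently.

    Elasticsearch partitions the result set by the "slice" clause:
    each worker sees only the hits for its slice id.
    """
    bodies = []
    for slice_id in range(num_slices):
        bodies.append({
            "slice": {"id": slice_id, "max": num_slices},
            "size": page_size,
            "_source": False,            # IDs only, no document bodies
            "query": {"match_all": {}},  # placeholder for the real query
            "sort": ["_doc"],
        })
    return bodies

# Each body is sent as POST /my-index/_search?scroll=2m by its own
# worker, which then keeps calling POST /_search/scroll until no
# more hits are returned.
bodies = build_sliced_scroll_bodies(4)
```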