Fastest way to receive large results


(lasseschou) #1

I'm querying a large index in ElasticSearch and retrieving a large list of documents like the one below:

{
  "took": 5764,
  "timed_out": false,
  "_shards": {
     "total": 30,
     "successful": 30,
     "failed": 0
  },
  "hits": {
     "total": 163453,
     "max_score": null,
     "hits": [
        {
           "_index": "my_index",
           "_type": "my_type",
           "_id": "aQNyN188wutp91L-OkulDw",
           "_score": null,
           "fields": {
              "id": [
                 "c7f85s365g4f2g46e2a6h3820"
              ],
              "data": [
                 [
                    "6|76|76|883|470|#ex > li:[3]",
                    "14|775|0|0|2863|null",
                    "7|6521|6521|822|475|#ex > li:[3]"
                 ]
              ]
           },
           "sort": [
              1436399116802,
              1
           ]
        },
         ...
      ]
   }
}

This list can contain hundreds of thousands of results. I'm only interested in the "id" and "data" fields. The data is sent to my application server (connected over a 1GB link), where it is parsed and forwarded to the client. Currently, no matter what I do, I see "took" times around 5-6 seconds, but retrieving and parsing the documents more than doubles the total time. I want to shave off as many milliseconds as possible.
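For concreteness, the parsing step on the app server amounts to something like the Python sketch below. The function name is mine, and the sample dict just mirrors the response shape shown above:

```python
# Sketch: pull only the "id" and "data" fields out of a parsed search
# response, ignoring the metadata (_index, _type, _score, sort) entirely.

def extract_hits(response):
    """Yield (id, data) pairs from an Elasticsearch search response."""
    for hit in response["hits"]["hits"]:
        fields = hit["fields"]
        # "fields" values come back as arrays, even for single values
        yield fields["id"][0], fields["data"][0]

# Sample response mirroring the document above
sample = {
    "hits": {
        "total": 1,
        "hits": [
            {
                "_index": "my_index",
                "fields": {
                    "id": ["c7f85s365g4f2g46e2a6h3820"],
                    "data": [["6|76|76|883|470|#ex > li:[3]"]],
                },
            }
        ],
    }
}

print(list(extract_hits(sample)))
```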

Ideas:

  • GZIP: Currently using gzip. Is this good or bad? (see below)
  • Streaming: I'm trying to stream the results using ElasticSearch.Net's .Search feature. But it still takes many seconds before the stream object is created. Could this be due to the gzip, or is this simply because of serializing the response? Can you tell me if search results can really be streamed from ElasticSearch?
  • Scan-and-scroll: Would it make sense to retrieve the documents in chunks and then parse them incrementally on the app server?
  • Disabling unneeded fields: I'm almost 100% sure this isn't possible with ElasticSearch's core features, but why can't I disable the unneeded _id, _index, _type and _score fields?
  • Alternative to JSON: It would be great if I could bypass the expensive JSON serializer inside ES and just output the results as a byte stream or text stream
  • Other ideas? Would be most helpful.
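On the fourth idea above: Elasticsearch does have a `filter_path` query-string parameter (response filtering, added around 1.6 as far as I know) that strips the response envelope server-side, so only `hits.hits.fields` ever crosses the wire. A hedged sketch of such a request, with the index/type names taken from the example document:

```python
# Sketch: request only the two stored fields, and ask Elasticsearch to
# drop the rest of the response envelope via filter_path (check that
# your ES version supports this parameter).
body = {
    "query": {"match_all": {}},   # placeholder query
    "fields": ["id", "data"],     # only the stored fields we need
    "size": 1000,
}
params = {"filter_path": "hits.hits.fields"}

# With the requests library (needs a live cluster, so commented out):
# import requests
# resp = requests.post("http://localhost:9200/my_index/my_type/_search",
#                      params=params, json=body)
print(params["filter_path"])
```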

Thanks so much for your input!


(Colin Goodheart-Smithe) #2

How many results are you trying to retrieve?

The normal search API is not designed for deep pagination, and asking for a large size can be quite slow as described here. If you don't care about the score (as you suggested above), then I would recommend using the scan-scroll API and retrieving the results in smallish chunks.

HTH

