Fastest way to receive large results


(lasseschou) #1

I'm querying a large index in ElasticSearch and retrieving a large list of documents like the one below:

{
  "took": 5764,
  "timed_out": false,
  "_shards": {
     "total": 30,
     "successful": 30,
     "failed": 0
  },
  "hits": {
     "total": 163453,
     "max_score": null,
     "hits": [
        {
           "_index": "my_index",
           "_type": "my_type",
           "_id": "aQNyN188wutp91L-OkulDw",
           "_score": null,
           "fields": {
              "id": [
                 "c7f85s365g4f2g46e2a6h3820"
              ],
              "data": [
                 [
                    "6|76|76|883|470|#ex > li:[3]",
                    "14|775|0|0|2863|null",
                    "7|6521|6521|822|475|#ex > li:[3]"
                 ]
              ]
           },
           "sort": [
              1436399116802,
              1
           ]
        },
         ...
      ]
   }
}

This list can contain hundreds of thousands of results. I'm only interested in the "id" and "data" fields. The data is sent to my application server (connected over a 1GB link), where it is parsed and forwarded to the client. Currently, no matter what I do, I see "took" times around 5-6 seconds, but retrieving and parsing the documents more than doubles the total time. I want to shave off as many milliseconds as possible.
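For concreteness, the parsing step on the app server amounts to something like the Python sketch below. The function name is mine, and the sample dict just mirrors the response shape shown above:

```python
# Sketch: pull only the "id" and "data" fields out of a parsed search
# response, ignoring the metadata (_index, _type, _score, sort) entirely.

def extract_hits(response):
    """Yield (id, data) pairs from an Elasticsearch search response."""
    for hit in response["hits"]["hits"]:
        fields = hit["fields"]
        # "fields" values come back as arrays, even for single values
        yield fields["id"][0], fields["data"][0]

# Sample response mirroring the document above
sample = {
    "hits": {
        "total": 1,
        "hits": [
            {
                "_index": "my_index",
                "fields": {
                    "id": ["c7f85s365g4f2g46e2a6h3820"],
                    "data": [["6|76|76|883|470|#ex > li:[3]"]],
                },
            }
        ],
    }
}

print(list(extract_hits(sample)))
```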

Ideas:

  • GZIP: Currently using gzip. Is this good or bad? (see below)
  • Streaming: I'm trying to stream the results using ElasticSearch.Net's .Search feature. But it still takes many seconds before the stream object is created. Could this be due to the gzip, or is this simply because of serializing the response? Can you tell me if search results can really be streamed from ElasticSearch?
  • Scan-and-scroll: Would it make sense to retrieve the documents in chunks and then parse them incrementally on the app server?
  • Disabling unneeded fields: I'm almost 100% sure this isn't possible with ElasticSearch's core features, but why can't I disable the unneeded _id, _index, _type and _score fields?
  • Alternative to JSON: It would be great if I could bypass the expensive JSON serializer inside ES and just output the results as a byte stream or text stream
  • Other ideas? Would be most helpful.
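On the fourth idea above: Elasticsearch does have a `filter_path` query-string parameter (response filtering, added around 1.6 as far as I know) that strips the response envelope server-side, so only `hits.hits.fields` ever crosses the wire. A hedged sketch of such a request, with the index/type names taken from the example document:

```python
# Sketch: request only the two stored fields, and ask Elasticsearch to
# drop the rest of the response envelope via filter_path (check that
# your ES version supports this parameter).
body = {
    "query": {"match_all": {}},   # placeholder query
    "fields": ["id", "data"],     # only the stored fields we need
    "size": 1000,
}
params = {"filter_path": "hits.hits.fields"}

# With the requests library (needs a live cluster, so commented out):
# import requests
# resp = requests.post("http://localhost:9200/my_index/my_type/_search",
#                      params=params, json=body)
print(params["filter_path"])
```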

Thanks so much for your input!


(Colin Goodheart-Smithe) #2

How many results are you trying to retrieve?

The normal search API is not designed for deep pagination, and asking for a large size can be quite slow as described here. If you don't care about the score (as you suggested above), then I would recommend using the scan-scroll API and retrieving the results in smallish chunks.

HTH

