Sort by _doc as fast as search_type=scan


(weibin.wu) #1

Hi ES:

I upgraded my cluster from 1.4 to 5.1.

However, I found that search_type=scan is deprecated, and the documentation recommends sorting by _doc instead.

I tried sorting by _doc when searching, but it is 5 times slower than search_type=scan was in 1.4.

My use case is to extract all the documents from one index in bulk format.

Any ideas?
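
For readers unsure what the export step looks like, here is a minimal sketch. It assumes "bulk format" means the newline-delimited action/source pairs accepted by the _bulk API, and that each hit carries the usual _index, _type, _id and _source keys. The function name to_bulk_lines is hypothetical, not something from the original post.

```python
import json


def to_bulk_lines(hits):
    """Turn search hits into bulk-API lines (one action line, one source line).

    Assumption: every hit has _index, _type, _id and _source, as returned
    by a search with _source enabled.
    """
    for hit in hits:
        action = {
            "index": {
                "_index": hit["_index"],
                "_type": hit["_type"],
                "_id": hit["_id"],
            }
        }
        # json.dumps emits no newlines by default, so each call is one line.
        yield json.dumps(action)
        yield json.dumps(hit["_source"])
```

Joining the yielded lines with "\n" (plus a trailing newline) should give a body that can be written to a file or sent to _bulk.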


(Martijn Van Groningen) #2

Are there any changes to the mappings between 1.4 and 5.1? Also, do your 1.4 and 5.1 clusters contain the same number of documents and indices?

Can you also share the exact search requests that you used for ES 1.4 and ES 5.1?


(weibin.wu) #3

The mapping in 5.x:

{
  "aliases": {},
  "mappings": {
    "changeling-models-logling": {
      "properties": {
        "id": {
          "type": "keyword"
        },
        "klass": {
          "type": "keyword"
        },
        "modifications": {
          "type": "text"
        },
        "modified_at": {
          "type": "date"
        },
        "modified_by": {
          "fields": {
            "keyword": {
              "ignore_above": 256,
              "type": "keyword"
            }
          },
          "type": "text"
        },
        "modified_fields": {
          "type": "text"
        },
        "oid": {
          "type": "keyword"
        }
      }
    }
  },
  "settings": {
    "index": {
      "number_of_replicas": "1",
      "number_of_shards": "5"
    }
  }
}

The mapping in 1.4:

{
  "aliases": {},
  "mappings": {
    "changeling/models/logling": {
      "properties": {
        "id": {
          "type": "string"
        },
        "klass": {
          "type": "string"
        },
        "modifications": {
          "type": "string"
        },
        "modified_at": {
          "type": "date",
          "format": "dateOptionalTime"
        },
        "modified_by": {
          "type": "string"
        },
        "modified_fields": {
          "type": "string",
          "analyzer": "keyword"
        },
        "oid": {
          "type": "string"
        }
      }
    }
  },
  "settings": {
    "index": {
      "number_of_replicas": "1",
      "number_of_shards": "5",
      "refresh_interval": "60s"
    }
  },
  "warmers": {}
}

The query I use to dump the data is:

POST /myteksi-changeling_changeling_models_loglings_2015_04_5x/_search?_source=true&scroll=2m&sort=_doc%3Aasc
{
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "modified_at": {
              "from": "2015-04-01",
              "to": "2015-04-01",
              "time_zone": "+08:00"
            }
          }
        }
      ]
    }
  }
}

POST /_search/scroll?scroll=2m
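
The two requests above form a scroll loop: one initial search sorted by _doc, then repeated calls to /_search/scroll until a page comes back empty. A rough sketch of that loop follows. The send callable is a hypothetical stand-in for whatever HTTP client is in use; it is assumed to take (method, path, params, body) and return the parsed JSON response. The sort and scroll_id are passed in the request body here rather than in the URL as in the post, and size=1000 is an arbitrary choice.

```python
def scroll_all(send, index, query, size=1000, keep_alive="2m"):
    """Yield every hit matching `query`, reading in index order (sort by _doc).

    `send` is a caller-supplied function (hypothetical, not a library API):
    send(method, path, params, body) -> parsed JSON response as a dict.
    """
    body = {"size": size, "sort": ["_doc"], "query": query}
    page = send("POST", "/%s/_search" % index, {"scroll": keep_alive}, body)

    # An empty hits array signals that the scroll is exhausted.
    while page["hits"]["hits"]:
        for hit in page["hits"]["hits"]:
            yield hit
        # Always pass the most recent _scroll_id back to the cluster.
        page = send(
            "POST",
            "/_search/scroll",
            None,
            {"scroll": keep_alive, "scroll_id": page["_scroll_id"]},
        )
```

Throughput of a dump like this usually depends on the page size, so it may be worth trying a few values of size when comparing against the 1.4 numbers.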


(Ryan Ernst) #4

One difference I see is the refresh interval. In 1.4 you have it set to 60 seconds, but in 5.0 you leave it at the default, which is 1 second. That may or may not cause issues (more merging in the background, so less CPU available for searches).
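
To rule this difference out, the 5.x index could be given the same refresh interval as the 1.4 one through the update index settings API. The sketch below only builds the request; the helper name refresh_interval_request is hypothetical, and sending it is left to whatever HTTP client is in use.

```python
import json


def refresh_interval_request(index, interval="60s"):
    """Build (method, path, body) for changing an index's refresh interval.

    Targets the update index settings endpoint: PUT /<index>/_settings.
    """
    path = "/%s/_settings" % index
    body = {"index": {"refresh_interval": interval}}
    return "PUT", path, json.dumps(body)


# Example with the index name from the thread.
method, path, body = refresh_interval_request(
    "myteksi-changeling_changeling_models_loglings_2015_04_5x"
)
print(method, path)
print(body)
```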

Also, your request specifies a sort on doc, when it should be _doc. I'm confused about how this does not raise an error. Can you confirm that you passed in sort=doc and it did not give you an error?


(weibin.wu) #5

I just saw that the 5.x documentation uses this request for scroll:
POST /_search?scroll=2m

But I am using
POST /_search/scroll?scroll=2m

Does it matter?


(Ryan Ernst) #6

The latter would be used if you had an index named scroll.

