Scroll API or search_after giving duplicate records

Hi,
I would like to query a huge amount of data (more than 500k records). To achieve this, I used the scroll API and search_after (with the modified time).
In both cases I get the same records (duplicates) multiple times.

{
  "size": "10000",
  "scroll": "100s",
  "body": {
    "query": {
      "bool": {
        "must": [
          {"query_string": {"query": "*"}}
        ]
      }
    }
  }
}

I also tried search_after:

{
    "from":0,
    "size": 10000,
    "query": {
        "match" : {
            "title" : "elasticsearch"
        }
    },
    "search_after": [1463538857],
    "sort": [
        {"date": "asc"}
    ]
}

I tried both approaches but still get duplicates. The number of records returned matches the total record count, but because some of the records are duplicates, other records must have been missed.

Kindly help me, and thanks in advance.

ES version: 6.8.0

Hey,

search_after can return duplicate entries while documents are being indexed or updated. A scroll search, however, uses a point-in-time snapshot. Is there any chance you can reproduce that behaviour reliably with a small dataset that you can share, plus all the queries?
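
For the search_after case, one thing worth checking is the sort. If the sort key is not unique per document (for example, many documents share the same timestamp), documents with equal sort values can be repeated or skipped at page boundaries. Below is a minimal sketch of paging with a tiebreaker. The field names are assumptions: modifiedTime is taken as the sort field, docId is a hypothetical keyword field holding one unique value per document, and the sort values in the demo are made up.

```javascript
// Build a search_after request body whose sort ends in a unique tiebreaker.
// ASSUMPTIONS: "modifiedTime" is the sort field; "docId" is a hypothetical
// keyword field with one unique value per document.
function buildPageRequest(lastSortValues) {
  const body = {
    size: 1000,
    query: { match_all: {} },
    sort: [
      { modifiedTime: "asc" },
      { docId: "asc" } // tiebreaker: makes the sort order total
    ]
  };
  // The first page has no cursor; later pages resume after the last hit.
  if (lastSortValues) {
    body.search_after = lastSortValues;
  }
  return body;
}

// The "sort" array of the last hit on a page is the cursor for the next page.
function nextCursor(hits) {
  return hits.length > 0 ? hits[hits.length - 1].sort : null;
}

// Demo with made-up hits that share the same timestamp.
const page = [
  { _id: "a", sort: [1581514974662, "a"] },
  { _id: "b", sort: [1581514974662, "b"] }
];
const next = buildPageRequest(nextCursor(page));
console.log(JSON.stringify(next.search_after));
```

Without the second sort key, both hits in the demo would carry the same cursor value, so the next page could not tell which of them had already been returned.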

--Alex

Hi @spinscale

Below are the sample data and the query (scroll):

{
  "took": 8,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 500909,
    "max_score": 1.0,
    "hits": [
      {
        "_index": "my_index",
        "_type": "simpleindex",
        "_id": "13213",
        "_score": 1.0,
        "_source": {
          "name": 2019,
          "type": "type-1",
          "dept": 2,
          "id": "",
          "status": "deleted",
          "modifiedTime": "2020-02-12T13:42:54.662Z"
        }
      }
    ]
  }
}

and the query is

{
  "size": "10000",
  "scroll": "100s",
  "body": {
    "query": {
      "bool": {
        "must": [
          {"query_string": {"query": "*"}}
        ]
      }
    }
  }
}

We connect to ES with the JavaScript client (https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/current/client-usage.html, version 15.2.0),
and the usage is

response = await client.scroll({
  scrollId: result._scroll_id,
  scroll: "30S"
})
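
To narrow down where the duplicates come from, it may help to drain the whole scroll in one loop and record every _id that comes back more than once. The sketch below assumes a client object exposing search and scroll with the scrollId parameter as in the call above, and a single index (so _id is unique). It passes the _scroll_id from the most recent response on each iteration, which is what the scroll documentation asks for. makeFakeClient is a hypothetical in-memory stand-in so the sketch runs without a cluster; it is not part of any library.

```javascript
// Drain a scroll, always passing the most recent _scroll_id, and record
// any _id that is returned more than once.
async function scrollAll(client, searchParams) {
  const seen = new Set();
  const duplicates = [];
  let response = await client.search(searchParams);
  while (response.hits.hits.length > 0) {
    for (const hit of response.hits.hits) {
      if (seen.has(hit._id)) {
        duplicates.push(hit._id);
      } else {
        seen.add(hit._id);
      }
    }
    response = await client.scroll({
      scrollId: response._scroll_id, // ID from the latest response
      scroll: searchParams.scroll
    });
  }
  return { total: seen.size, duplicates };
}

// HYPOTHETICAL stand-in for the real client: serves the given ids in pages.
function makeFakeClient(ids, pageSize) {
  let offset = 0;
  const nextPage = async () => {
    const hits = ids.slice(offset, offset + pageSize).map((_id) => ({ _id }));
    offset += pageSize;
    return { _scroll_id: "fake-scroll-id", hits: { hits } };
  };
  return { search: nextPage, scroll: nextPage };
}

// Demo: the id "2" is deliberately served twice.
const demo = scrollAll(makeFakeClient(["1", "2", "3", "2", "4"], 2), {
  index: "my_index",
  scroll: "30s",
  size: 2
});
demo.then((result) => console.log(result));
```

If the duplicates list stays empty when this runs against the real index, the repeated records are more likely introduced by the surrounding application code than by the scroll itself.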

Hi @spinscale,
can you please look into this?

Hey,

First, this is a completely volunteer-driven forum, so there is no guarantee of an answer, not even when you ping people directly, and especially not when only 9 hours have passed since you wrote your post. If you need support with SLAs, take a look at Elastic subscriptions.

Second, this is not an example I can reproduce locally, so it's hard to tell whether there is a problem with the requests.

Can you reproduce this behaviour if you use the Dev Tools console instead of the JavaScript client? If you can, please share the requests you executed and the responses you got.

--Alex
