Scroll api or search_after giving duplicate records

arjun_kumar · February 13, 2020, 4:22pm

Hi,
I would like to query a large amount of data (more than 500k records). To achieve this, I used the scroll API and search_after (with the modified time).
In both cases I get the same records (duplicates) multiple times.

{
  "size": "10000",
  "scroll": "100s",
  "body": {
    "query": {
      "bool": {
        "must": [
          {"query_string": {"query": "*"}}
        ]
      }
    }
  }
}

I also tried search_after:

{
    "from":0,
    "size": 10000,
    "query": {
        "match" : {
            "title" : "elasticsearch"
        }
    },
    "search_after": [1463538857],
    "sort": [
        {"date": "asc"}
    ]
}

I tried both approaches but still get duplicates. The number of records returned matches the total record count, but since some of them are duplicates, some records must have been missed.

Kindly help me, and thanks in advance.

ES version: 6.8.0

spinscale · February 13, 2020, 4:48pm

Hey,

search_after can return duplicate entries while documents are being indexed or updated. A scroll search, however, uses a point-in-time snapshot. Is there any chance you can reproduce that behaviour reliably with a small dataset that you can share, plus all the queries?
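For reference, a minimal sketch of search_after paging where each request is built from the sort values of the last hit on the previous page. It assumes a unique tiebreaker field in the sort so that documents sharing the same date are not skipped or repeated at a page boundary. The helper name `nextPageBody` and the field `tie_breaker_id` are hypothetical and not taken from the index in this thread.

```javascript
// Base request: sort on the date plus a unique tiebreaker.
// "tie_breaker_id" is an assumed field holding one unique value per document.
const baseBody = {
  size: 10000,
  query: { match: { title: "elasticsearch" } },
  sort: [{ date: "asc" }, { tie_breaker_id: "asc" }]
};

// Hypothetical helper: returns the body for the next page.
// lastHit is the final hit of the previous page, or null for the first page.
function nextPageBody(base, lastHit) {
  const body = Object.assign({}, base);
  if (lastHit) {
    // hit.sort holds one value per sort clause, in the same order.
    body.search_after = lastHit.sort;
  }
  return body;
}
```

Even with a tiebreaker, this only guards against ties in the sort key. Documents that are indexed or updated between requests can still shift between pages.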

--Alex

arjun_kumar · February 13, 2020, 6:17pm

Hi @spinscale

Below are the sample data and the query (scroll):

{
  "took": 8,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 500909,
    "max_score": 1.0,
    "hits": [
      {
        "_index": "my_index",
        "_type": "simpleindex",
        "_id": "13213",
        "_score": 1.0,
        "_source": {
          "name": 2019,
          "type": "type-1",
          "dept": 2,
          "id": "",
          "status": "deleted",
          "modifiedTime": "2020-02-12T13:42:54.662Z"
        }
      }
    ]
  }
}

and the query is:

{
  "size": "10000",
  "scroll": "100s",
  "body": {
    "query": {
      "bool": {
        "must": [
          {"query_string": {"query": "*"}}
        ]
      }
    }
  }
}

We connect to ES with the JavaScript client (https://www.elastic.co/guide/en/elasticsearch/client/javascript-api/current/client-usage.html, version 15.2.0),
and the usage is:

response = await client.scroll({
  scrollId: result._scroll_id,
  scroll: "30S"
})
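One way to check this client-side is a loop like the sketch below. It assumes a client whose `search` and `scroll` calls resolve to the response body directly, as in the usage above. It passes the `_scroll_id` from the most recent response on every call and records any document key it sees twice. The function name `scrollAll` and the shape of its return value are illustrative only.

```javascript
// Sketch: drain a scroll and report duplicate documents.
// client is expected to expose async search(params) and scroll(params).
async function scrollAll(client, searchParams, keepAlive) {
  const seen = new Set();
  const duplicates = [];

  let response = await client.search(searchParams);
  while (response.hits.hits.length > 0) {
    for (const hit of response.hits.hits) {
      const key = hit._index + "/" + hit._id;
      if (seen.has(key)) {
        duplicates.push(key);
      } else {
        seen.add(key);
      }
    }
    // Use the scroll id from the latest response, not the one
    // returned by the initial search.
    response = await client.scroll({
      scrollId: response._scroll_id,
      scroll: keepAlive
    });
  }
  return { unique: seen.size, duplicates: duplicates };
}
```

If `duplicates` comes back empty while `unique` equals `hits.total`, the duplication is more likely happening somewhere else in the application code than in the scroll itself.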

arjun_kumar · February 14, 2020, 3:30am

Hi @spinscale,
Can you please look into this?

spinscale · February 14, 2020, 9:05am

Hey,

First, this is a completely volunteer-driven forum, so there is no guarantee of an answer. That also applies when you ping people directly, and especially when they have not answered within 9 hours of your post. If you need support with SLAs, take a look at Elastic subscriptions.

Second, this is not an example I can reproduce locally, so it's hard to find if there is a problem with the requests.

Can you reproduce this behaviour if you do not use the JavaScript client, but use the dev-tools console instead? If you can, please share the requests you executed and the responses you got.

--Alex

system · March 13, 2020, 9:06am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
