Scroll API is dropping records

jerry_jacob · June 11, 2024, 10:42am

ES version: 7.5.1

{
  "query": {
    "range": {
      "columnName": {
        "gt": "2024-05-28T16:00:25Z",
        "lte": "2024-05-28T19:00:24Z"
      }
    }
  }
}

I have a program that runs the query above to pull records via the scroll API. Recently I noticed that the count pulled by the program is sometimes lower than the count I get when I run the same query directly against the ES index. The behaviour is sporadic, not consistent. I have seen GC running while the program was losing records, but records are not dropped every time GC runs. Is anyone aware of a particular scenario in which the scroll API drops records?

Carlos_D · June 13, 2024, 8:06am

hey @jerry_jacob !

Is the index receiving new documents? Per the Scroll API docs:

Scroll API reflects the state of the index at the time of the initial search request. Subsequent indexing or document changes only affect later search and scroll requests.

How many results is your query returning? It is possible that you need to track total hits to ensure they are being counted accurately if you have more than 10,000 results.
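To make the count comparison concrete, here is a minimal Python sketch of a count-only request body that sets `track_total_hits`, plus a helper that reads the total from the response. The function names are made up for illustration, and it assumes the 7.x+ response shape where `hits.total` is an object with `value` and `relation`. A `relation` of `gte` means the value is only a lower bound.

```python
def exact_count_body(query):
    """Build a count-only search body that asks ES for an exact total."""
    return {"query": query, "size": 0, "track_total_hits": True}


def read_total(response):
    """Return (value, is_exact) from a search response.

    Assumes hits.total is {"value": N, "relation": "eq" | "gte"};
    "gte" means the real count may be higher than N.
    """
    total = response["hits"]["total"]
    return total["value"], total["relation"] == "eq"


def missing_records(expected_total, pulled):
    """Number of documents the scroll loop did not deliver (0 if none)."""
    return max(expected_total - pulled, 0)
```

If `read_total` reports an inexact total, the direct query and the scroll count are not really comparable until the count request is rerun with `track_total_hits` enabled.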

jerry_jacob · June 13, 2024, 10:24am

hello @Carlos_D
Yes, new documents are being indexed. The issue happens both when the number of records (Hits.Hits) is below 10,000 and when it is above 10,000. In fact, I have seen scroll return 0 results in some cases, which makes me think the issue is not related to the scroll API itself.

Note: I am unable to reproduce this behaviour consistently in the environment, so I suspect that a background process running at certain times is causing the issue. On that front I checked details like indexing time, merge time, and fetch time, but nothing seems out of the ordinary.

Carlos_D · June 13, 2024, 11:01am

If the index is not read-only, it is expected that scroll and a separate query return different results, for the reason mentioned: scroll keeps the view of the index as it was when the scrolling started.

Are you getting search failures, as in shard failures for your requests? That could explain the difference as well.
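Shard failures do not always show up in the server logs, but they are reported in each search and scroll response. A minimal sketch of a per-page check follows. The helper name is hypothetical, and it assumes the standard `_shards` and `timed_out` fields of the search response.

```python
def page_problems(response):
    """Return a list of reasons why this page of results may be partial."""
    problems = []
    shards = response.get("_shards", {})
    failed = shards.get("failed", 0)
    if failed:
        # Hits from failed shards are silently absent from the page.
        problems.append(f"{failed} of {shards.get('total', '?')} shards failed")
    if response.get("timed_out"):
        problems.append("search timed out; results may be partial")
    return problems
```

Calling this on the initial search response and on every scroll page, and logging any non-empty result, would show whether the missing records line up with partial responses. Setting `allow_partial_search_results=false` on the initial search request should make Elasticsearch return an error instead of a partial page.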

jerry_jacob · June 13, 2024, 12:35pm

hello @Carlos_D, thank you for your suggestions.
I do read from and write to the same index. I agree that records inserted while the read is in progress will not be available in scroll, but in my case records that were inserted more than 1 hr ago are missing from the search results, and I have not changed the default refresh interval.
I have also checked the Elasticsearch logs, and I do not see any shard-related errors for the index that is giving me issues (though I do see some shard issues for other indices).

Christian_Dahlqvist · June 13, 2024, 12:41pm

The version you are using is very old. I would recommend upgrading to at least 7.17 and using search_after with a point-in-time (PIT), as recommended for consistency in the docs.
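A rough Python sketch of the search_after + PIT loop, not a definitive implementation. It assumes a PIT id has already been obtained (on 7.10+ via `POST /<index>/_pit?keep_alive=1m`) and that `search` is a caller-supplied function that sends a body to `POST /_search` and returns the parsed JSON. The function and parameter names here are hypothetical.

```python
def iter_hits_with_pit(search, pit_id, query, sort, size=1000, keep_alive="1m"):
    """Yield every hit for `query`, paging with search_after inside one PIT.

    `search` is a caller-supplied callable: body dict -> response dict.
    `sort` should end in a unique tiebreaker such as {"_shard_doc": "asc"}
    so that no two documents share the same sort values.
    """
    search_after = None
    while True:
        body = {
            "size": size,
            "query": query,
            "sort": sort,
            "pit": {"id": pit_id, "keep_alive": keep_alive},
        }
        if search_after is not None:
            body["search_after"] = search_after
        response = search(body)
        hits = response["hits"]["hits"]
        if not hits:
            return  # an empty page means the result set is exhausted
        yield from hits
        # Resume after the sort values of the last hit on this page.
        search_after = hits[-1]["sort"]
        # The PIT id may change between requests; use the latest one.
        pit_id = response.get("pit_id", pit_id)
```

The PIT should be closed with `DELETE /_pit` once the loop finishes. The tiebreaker matters: if several documents share the same sort values and nothing breaks the tie, paging with search_after can skip documents at page boundaries.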

jerry_jacob · June 14, 2024, 9:40am

Thank you @Christian_Dahlqvist for the suggestion. I will check whether an ES migration is possible.

rishab_kumar22 · July 26, 2024, 4:11pm

@Christian_Dahlqvist, we are also facing the same issue of records getting dropped with the scroll API. We can consistently reproduce the issue when a sort on a date/float column is added to the scroll API request.

ES Version: 8.9.1

Below is the request

curl --location --request GET 'http://localhost:9200/test_index/_search?scroll=5m&typed_keys=true' \
--header 'Content-Type: application/json' \
--data '{
	"query": {
		"bool": {
			"filter": [
				{
					"terms": {
						"SomeAttribute.keyword": [
							"100"
						]
					}
				}
			]
		}
	},
	"size": 1000,
	"sort": [
		{
			"DateColumn": {
				"order": "desc",
				"unmapped_type": "date"
			}
		},
		{
			"_uid": {
				"order": "asc",
				"unmapped_type": "keyword"
			}
		}
	],
	"_source": {
		"includes": [
			"Field1",
			"Field2"
		]
	}
}'

Can you please help here?

Christian_Dahlqvist · July 26, 2024, 7:00pm

The same recommendation applies to you. Use search_after with PIT instead of scroll, as recommended in the docs I linked to.

rishab_kumar22 · July 29, 2024, 4:44am

Thank you @Christian_Dahlqvist. We will try using PIT. Any pointers on what might be causing this issue? It happens only when we sort on a date or number column.
