Hello folks,
Using a non-unique key as the sort field (in this example, the timestamp) can skip records in a subsequent request when the 10,000-hit cutoff of the previous request falls inside a run of documents that share the same timestamp.
Take an example where, within the first 10,000 results, half of the hits have timestamp=t1 and half have timestamp=t2, and the next 10,000 records contain some more entries (say 500) with timestamp=t2. With the current implementation, the sort value carried into the next query would be t2, so Elasticsearch skips those 500 entries in the next chunk. This can yield far fewer results than expected: the counts from the _count API and from a scroll agree with each other, but the search query returned an order of magnitude fewer hits on some of the test runs.
An example query --
curl -X GET "localhost:9200/indexname*/_search?pretty" -H 'Content-Type: application/json' -d'
{
"size": 10000,
"_source": ["abc.123", "abc.124"],
"query": {
"range": {
"@timestamp": {
"gte": "now-1d/d",
"lt": "now/d"
}
}
},
"sort": [
{"@timestamp": {"order": "asc", "format": "strict_date_optional_time_nanos"}}
]
}
'
The next query would be something like:
curl -X GET "localhost:9200/_search?pretty" -H 'Content-Type: application/json' -d'
{
"size": 10000,
"query": {
"range": {
"@timestamp": {
"gte": "now-4h/h",
"lt": "now/h"
}
}
},
"sort": [
{"@timestamp": {"order": "asc", "format": "strict_date_optional_time_nanos"}}
],
"search_after": [
"2022-06-22T03:00:28.000Z"
]
}
'
Using a PIT does not seem to mitigate this either: its implicit tiebreaker is only applied after the timestamp, and the timestamp comparison wins, so the skipping still occurs and every matching record still has to be walked. Appreciate any ideas and thoughts on next steps.
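For comparison, the same toy setup (plain Python, not Elasticsearch) resuming after the full (timestamp, doc_id) tuple, i.e. the effect of sorting on an explicit unique tiebreaker whose value is also fed back into search_after. doc_id here is a hypothetical stand-in for any unique per-document field:

```python
# Toy counterpart: sort on (timestamp, doc_id) and resume after the full
# tuple, so a page boundary can safely split a group of equal timestamps.

def paginate_with_tiebreaker(records, page_size):
    records = sorted(records)  # (timestamp, doc_id) tuples
    seen, after = [], None
    while True:
        page = [r for r in records if after is None or r > after][:page_size]
        if not page:
            break
        seen.extend(page)
        after = page[-1]  # carry both sort values forward
    return seen

# Same data as before: 5 docs at t1 and 10 at t2, pages of 10.
docs = [("t1", i) for i in range(5)] + [("t2", i) for i in range(10)]
print(len(paginate_with_tiebreaker(docs, page_size=10)))  # 15, nothing lost
```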
ES version: 7.16.2