Recommendation to use search after instead of scrolling

Hello everybody

If you want to read an index completely, the documentation says that search_after should be used instead of scroll once you page through more than 10,000 documents.

We have now tested this and unfortunately noticed a slowdown by a factor of 10. We sort on the _id meta field. If we sort on a numeric technical ID instead, the speed is the same as with scroll.

Unfortunately, we do not have a numeric technical ID field in our use case.

Why is sorting on a text field so much slower? Or are we still not using something properly?

Many thanks and best regards

What does your query look like exactly? What is the mapping?

Do you mean this?

We no longer recommend using the scroll API for deep pagination. If you need to preserve the index state while paging through more than 10,000 hits, use the search_after parameter with a point in time (PIT).
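For reference, here is a rough sketch of what the request bodies for that recommendation could look like. It assumes an Elasticsearch version that supports PIT and the `_shard_doc` sort tiebreaker, and that a PIT has already been opened with `POST /<index>/_pit?keep_alive=1m`. The class and method names are made up for illustration, and the JSON is assembled by hand only to keep the example free of dependencies.

```java
/** Hypothetical helper: builds search bodies for paging with a PIT and search_after. */
final class PitSearchAfterBodies {

    /**
     * Body for the first page. There is no search_after yet.
     * Assumption: pitId needs no JSON escaping (PIT ids are opaque encoded strings).
     */
    static String firstPage(String pitId, int size) {
        return "{\"size\":" + size
                + ",\"pit\":{\"id\":\"" + pitId + "\",\"keep_alive\":\"1m\"}"
                + ",\"sort\":[{\"_shard_doc\":\"asc\"}]}";
    }

    /**
     * Body for every later page. lastSortJson is the "sort" array of the
     * last hit of the previous page, copied verbatim, e.g. "[4294967298]".
     */
    static String nextPage(String pitId, int size, String lastSortJson) {
        String first = firstPage(pitId, size);
        // Drop the closing brace, append search_after, close the object again.
        return first.substring(0, first.length() - 1)
                + ",\"search_after\":" + lastSortJson + "}";
    }
}
```

These bodies would be sent to `POST /_search` without an index in the path, since the PIT already identifies the index. Sorting on `_shard_doc` alone avoids sorting on `_id`, which may be worth trying when no numeric field is available.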

Deep pagination does not mean extracting the whole result set.
If your goal is to extract the whole result set, you should IMHO use the scroll API.
This might not solve the problem here though.

What is your use case?

Hi, and thanks for the fast answer.

Yes, exactly.

This is our query:

{"size":5000,"sort":[{"_id.keyword":{"order":"asc"}}]}

And the query for the next chunk:

{"size":5000,"sort":[{"_id.keyword":{"order":"asc"}}],"search_after":["25acabb7-6a31-470a-a658-7e6a0eeb3f99"]}
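As an illustration of what these two queries do together, here is a minimal in-memory sketch of the search_after cursor. It is not how Elasticsearch implements it. `SearchAfterSketch` and its methods are hypothetical names, and the sketch assumes the sort key is unique and the page size is at least 1.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory model of paging with search_after over sorted ids. */
final class SearchAfterSketch {

    /** Returns up to size ids that sort strictly after 'after' (null means start). */
    static List<String> page(List<String> sortedIds, String after, int size) {
        List<String> out = new ArrayList<>();
        for (String id : sortedIds) {
            if (after != null && id.compareTo(after) <= 0) {
                continue; // already delivered on an earlier page
            }
            out.add(id);
            if (out.size() == size) {
                break;
            }
        }
        return out;
    }

    /** Reads everything by feeding the last id of each page into the next request. */
    static List<String> readAll(List<String> sortedIds, int size) {
        List<String> all = new ArrayList<>();
        String after = null;
        while (true) {
            List<String> chunk = page(sortedIds, after, size);
            if (chunk.isEmpty()) {
                break; // an empty page signals the end
            }
            all.addAll(chunk);
            after = chunk.get(chunk.size() - 1);
        }
        return all;
    }
}
```

Because each page starts strictly after the previous cursor value, a sort key with duplicates could skip documents, which is why a unique field or a tiebreaker is needed.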

Our mapping:

{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        },
        ...
      }
    }
  }
}

With 100,000 documents we get the following times:

scroll: 5.1 s
search_after: 49 s

If I sort on a numeric field, search_after takes about 5.0 s.
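To put those totals in per-request terms, here is a small back-of-the-envelope calculation. It assumes the time is spread evenly over the pages, which the measurements do not confirm. `PagingCost` and its methods are hypothetical names used only for this sketch.

```java
/** Hypothetical helper for rough per-page cost estimates. */
final class PagingCost {

    /** Number of non-empty pages needed to read docCount documents. */
    static long pages(long docCount, int pageSize) {
        return (docCount + pageSize - 1) / pageSize; // ceiling division
    }

    /** Average seconds per page, assuming the total time is evenly distributed. */
    static double secondsPerPage(double totalSeconds, long pages) {
        return totalSeconds / pages;
    }

    /** How many times slower one total is compared to another. */
    static double slowdown(double slowSeconds, double fastSeconds) {
        return slowSeconds / fastSeconds;
    }
}
```

With 100,000 documents and a size of 5000 this gives 20 pages, so roughly 2.45 s per request with search_after on _id versus about 0.26 s with scroll, a slowdown of about 9.6.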

The use case is that we want to iterate over all documents in the index in chunks of a given size. For each chunk we then do some read-only processing in Java.

Please format your code, logs or configuration files using </> icon as explained in this guide and not the citation button. It will make your post more readable.

Or use markdown style like:

```
CODE
```

That's a huge difference indeed.
I'd maybe try decreasing the size for search_after. But for your use case, I'd use the scroll API anyway, as your goal is to extract everything.

Pinging @jimczi as he might know what is happening.

Sorry, fixed.

I have tested it with different sizes. Our test (first parameter = document count, second parameter = chunk size):

    @ParameterizedTest
    @CsvSource({
            "17, 3",
            "10, 2",
            "0, 1",
            "1, 1",
            "13, 2",
            "10, 3",
            "100001, 50",
            "100001, 100",
            "100001, 500",
            "100001, 1000",
            "100001, 5000",
            "100001, 20000",
    })
    void newTestScrollClosing(int givenTestObjects, int chunksize) {
        client.createIndexByFilename(indexName);

        List<TestObject> testObjectList = IntStream.range(0, givenTestObjects).boxed()
                .map(i -> new TestObject(UUID.randomUUID().toString(), i))
                .collect(Collectors.toList());
        client.insertBulk(testObjectList, WriteRequest.RefreshPolicy.IMMEDIATE);

        List<TestObject> results = new ArrayList<>();

        for (ElasticChunkResult<TestObject> result = client.newStartIterate(chunksize); !result
                .isEmpty(); result = client.newNextChunk(result)) {
            results.addAll(result.getObjects());
        }

        assertThat(results).hasSize(givenTestObjects);
        assertThat(results).containsExactlyInAnyOrderElementsOf(testObjectList);
    }

Results with search_after:

[Image: test results]

Results with _scroll:

[Image: test results]

Java code for the query and search_after:

    public ElasticChunkResult<T> newStartIterate(int viewSize) {
        SearchRequest searchRequest = new SearchRequest(indexname);
        SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
        searchSourceBuilder.size(viewSize);
        searchSourceBuilder.sort(SortBuilders.fieldSort("_id").order(SortOrder.ASC));
        searchRequest.source(searchSourceBuilder);

        SearchResponse response;
        try {
            response = client.search(searchRequest, RequestOptions.DEFAULT);
        } catch (IOException e) {
            throw new RuntimeException("Error during SearchRequest", e);
        }

        List<T> hits = convertHits(response);

        return new ElasticChunkResult<>(determineLastObject(response, hits), hits, viewSize);
    }

    public ElasticChunkResult<T> newNextChunk(ElasticChunkResult<T> elasticChunkResult) {

        if (elasticChunkResult.isEmpty()) {
            return elasticChunkResult;
        }

        SearchRequest searchRequest = new SearchRequest(indexname);
        SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
        searchSourceBuilder.size(elasticChunkResult.getChunkSize());
        searchSourceBuilder.sort(SortBuilders.fieldSort("_id").order(SortOrder.ASC));
        searchSourceBuilder.searchAfter(new Object[] { elasticChunkResult.getLastObject() });
        searchRequest.source(searchSourceBuilder);

        SearchResponse response;
        try {
            response = client.search(searchRequest, RequestOptions.DEFAULT);
        } catch (IOException e) {
            throw new RuntimeException("Error during SearchRequest", e);
        }

        List<T> hits = convertHits(response);

        return new ElasticChunkResult<>(determineLastObject(response, hits), hits, elasticChunkResult.getChunkSize());
    }

    private Object determineLastObject(SearchResponse response, List<T> hits) {
        Object lastObject = null;

        if (!hits.isEmpty()) {
            lastObject = response.getHits().getAt(response.getHits().getHits().length - 1).getId();
        }
        return lastObject;
    }
