Recommendation to use search_after instead of scrolling

Hello everybody

If you want to read an index completely, the documentation says that from 10,000 documents onwards the search_after function should be used instead of scroll.

We have now tested this and unfortunately noticed a slowdown by a factor of 10. We sort on the _id meta field. If we instead sort on a technical numeric id, the speed is the same as with scrolling.

Unfortunately, we do not have a technical number field in our use case.

Why is sorting on a text field so much slower? Or are we still doing something wrong?

Many thanks and best regards

What does your query look like exactly? What is the mapping?

Do you mean this?

> We no longer recommend using the scroll API for deep pagination. If you need to preserve the index state while paging through more than 10,000 hits, use the search_after parameter with a point in time (PIT).

Deep pagination does not mean extracting the whole result set.
If your goal is to extract the whole result set, you should IMHO use the scroll API.
This might not solve the problem here though.
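For reference, the pattern the documentation recommends combines search_after with a PIT. A minimal sketch of the request sequence (the index name `my-index` and the 1m keep-alive are illustrative; requires Elasticsearch 7.10+, and the docs suggest sorting on the `_shard_doc` tiebreaker when you only need a stable iteration order, not a meaningful sort):

```
POST /my-index/_pit?keep_alive=1m

POST /_search
{
  "size": 5000,
  "pit": { "id": "<pit id from the previous call>", "keep_alive": "1m" },
  "sort": [{ "_shard_doc": { "order": "asc" } }],
  "search_after": ["<sort values of the last hit of the previous page>"]
}

DELETE /_pit
{ "id": "<pit id>" }
```

The first page is the same search without `search_after`; each following page passes the sort values of the last hit of the previous response.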

What is your use case?

Hi, and thanks for the fast answer.

Yes, exactly.

This is our query:

```
{"size":5000,"sort":[{"_id.keyword":{"order":"asc"}}]}
```

And the next-chunk query:

```
{"size":5000,"sort":[{"_id.keyword":{"order":"asc"}}],"search_after":["25acabb7-6a31-470a-a658-7e6a0eeb3f99"]}
```

Our mapping:

```
{
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        },
        ...
      }
    }
  }
}
```

With 100,000 documents we get the following times:

scroll: 5.1 s
search_after: 49 s

If I use a number field for sorting, search_after takes ~5.0 s.

The use case is that we want to iterate over all documents of the index in a given chunk size, then do some magic in Java for each chunk. Read-only.
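For comparison, the fast numeric variant we tested looks like this (the field name `counter` is illustrative; our test documents carry an integer field, and `search_after` then takes the last numeric sort value instead of the id string):

```
{"size":5000,"sort":[{"counter":{"order":"asc"}}],"search_after":[12345]}
```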

Please format your code, logs, or configuration files using the </> icon as explained in this guide, and not the citation button. It will make your post more readable.

Or use markdown style like:

```
CODE
```

This is the icon to use if you are not using markdown format:

That's a huge difference indeed.
I'd maybe try decreasing the size for search_after. But for your use case I'd use the scroll API anyway, as your goal is to extract everything.
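A minimal scroll loop looks roughly like this (the index name `my-index` and the 1m keep-alive are illustrative):

```
POST /my-index/_search?scroll=1m
{ "size": 5000 }

POST /_search/scroll
{ "scroll": "1m", "scroll_id": "<scroll id from the previous response>" }

DELETE /_search/scroll
{ "scroll_id": "<scroll id>" }
```

You repeat the `/_search/scroll` call until a page comes back empty, then clear the scroll context.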

Pinging @jimczi as he might know what is happening.

Sorry, fixed.

I have tested it with different sizes:
(first param = document count, second param = chunk size)

code:

    @ParameterizedTest
    @CsvSource({
            "17, 3",
            "10, 2",
            "0, 1",
            "1, 1",
            "13, 2",
            "10, 3",
            "100001, 50",
            "100001, 100",
            "100001, 500",
            "100001, 1000",
            "100001, 5000",
            "100001, 20000",
    })
    void newTestScrollClosing(int givenTestObjects, int chunksize) {
        client.createIndexByFilename(indexName);

        List<TestObject> testObjectList = IntStream.range(0, givenTestObjects).boxed()
                .map(i -> new TestObject(UUID.randomUUID().toString(), i))
                .collect(Collectors.toList());
        client.insertBulk(testObjectList, WriteRequest.RefreshPolicy.IMMEDIATE);

        List<TestObject> results = new ArrayList<>();

        for (ElasticChunkResult<TestObject> result = client.newStartIterate(chunksize); !result
                .isEmpty(); result = client.newNextChunk(result)) {
            results.addAll(result.getObjects());
        }

        assertThat(results).hasSize(givenTestObjects);
        assertThat(results).containsExactlyInAnyOrderElementsOf(testObjectList);
    }

Results with search_after:

(screenshot omitted)

Results with _scroll:

(screenshot omitted)

Java code for the query and search_after:

    public ElasticChunkResult<T> newStartIterate(int viewSize) {
        SearchRequest searchRequest = new SearchRequest(indexname);
        SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
        searchSourceBuilder.size(viewSize);
        searchSourceBuilder.sort(SortBuilders.fieldSort("_id").order(SortOrder.ASC));
        searchRequest.source(searchSourceBuilder);

        SearchResponse response;
        try {
            response = client.search(searchRequest, RequestOptions.DEFAULT);
        } catch (IOException e) {
            throw new RuntimeException("Error during SearchRequest", e);
        }

        List<T> hits = convertHits(response);

        return new ElasticChunkResult<>(determineLastObject(response, hits), hits, viewSize);
    }

    public ElasticChunkResult<T> newNextChunk(ElasticChunkResult<T> elasticChunkResult) {

        if (elasticChunkResult.isEmpty()) {
            return elasticChunkResult;
        }

        SearchRequest searchRequest = new SearchRequest(indexname);
        SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
        searchSourceBuilder.size(elasticChunkResult.getChunkSize());
        searchSourceBuilder.sort(SortBuilders.fieldSort("_id").order(SortOrder.ASC));
        searchSourceBuilder.searchAfter(new Object[] { elasticChunkResult.getLastObject() });
        searchRequest.source(searchSourceBuilder);

        SearchResponse response;
        try {
            response = client.search(searchRequest, RequestOptions.DEFAULT);
        } catch (IOException e) {
            throw new RuntimeException("Error during SearchRequest", e);
        }

        List<T> hits = convertHits(response);

        return new ElasticChunkResult<>(determineLastObject(response, hits), hits, elasticChunkResult.getChunkSize());
    }

    private Object determineLastObject(SearchResponse response, List<T> hits) {
        Object lastObject = null;

        if (!hits.isEmpty()) {
            lastObject = response.getHits().getAt(response.getHits().getHits().length - 1).getId();
        }
        return lastObject;
    }