Elasticsearch scan of all IDs slows down dramatically (exponentially?)

I'm trying to dump all IDs from an index. The dump starts with an ETA of under ten minutes, but then progressively slows down, ultimately taking over an hour to finish.

Configuration: single node, two shards, no replicas, running on an AWS r5.4xlarge in a Docker container with a 30GB heap. ES 7.8, custom document IDs. Docs range in size from a few tens of bytes to 1-2KB.

I would expect total delivery time to be linear in the shard size, but it seems to get exponentially slower (I also tried this on a much bigger shard (45GB) and it looked like it would never finish).

Docker stats:

CONTAINER ID        NAME                CPU %               MEM USAGE / LIMIT   MEM %               NET I/O             BLOCK I/O           PIDS
56adac7cf174        es             184.92%             32.24GiB / 120GiB   26.87%              132GB / 46.3GB      83.3GB / 1.22TB     180

Indices status:

% curl localhost:9200/_cat/indices?v

health status index                  uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   index-name             nCRb7R1-QKmJwJEhqPLJtA   2   0   16753650          702     14.1gb         14.1gb

With this code:

from elasticsearch.helpers import scan
from tqdm import tqdm

def get_ids(index):
    body = {'query': {'match_all': {}}}
    # ES_CLIENT is an Elasticsearch() instance created elsewhere
    for doc in scan(ES_CLIENT, query=body, index=index, _source=False):
        yield doc['_id']

with open("es-ids.txt", "w") as fd:
    for doc_id in tqdm(get_ids("index-name"), total=16753724):
        fd.write(doc_id + "\n")

36%|███     |  6042901/16753724 [07:33<22:11, 8043.51it/s]
70%|█████   | 11644101/16753724 [18:59<14:24, 5908.57it/s]
85%|██████  | 14171201/16753724 [29:18<16:15, 2647.56it/s]
90%|███████ | 15072301/16753724 [35:56<18:29, 1515.32it/s]

Note the rapidly decreasing iteration rate. It's as if, in any given interval, it takes the same amount of time to get halfway through the remaining docs.
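To make that concrete, here's a small script (checkpoint numbers transcribed from the tqdm output above) that computes the throughput between successive checkpoints; the per-interval rate roughly halves each time:

```python
# (doc count, elapsed seconds) transcribed from the tqdm progress lines above
checkpoints = [
    (6_042_901, 7 * 60 + 33),
    (11_644_101, 18 * 60 + 59),
    (14_171_201, 29 * 60 + 18),
    (15_072_301, 35 * 60 + 56),
]

rates = []
prev_docs, prev_t = 0, 0
for docs, t in checkpoints:
    # docs retrieved in this interval divided by seconds elapsed in it
    rates.append((docs - prev_docs) / (t - prev_t))
    prev_docs, prev_t = docs, t

for r in rates:
    # roughly: ~13340, ~8165, ~4083, ~2264 docs/s
    print(f"{r:.0f} docs/s")
```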

Tried this again on a three-node cluster, using AWS elasticsearch service (each node an r5.large).

Similar results, though the overall spool time was slower (this might be due to 3 x r5.large versus a single r5.4xlarge). There is still increasing lag as the process nears the end.

No significant difference on a 2-node AWS ES cluster (2 x r4.2xlarge).

Bumping the scan 'size' parameter to 5000 from its default (500) improves overall throughput and reduces the slowdown, but the slowdown is still present.
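One plausible reason the larger page size helps: each scroll page is one round-trip to the cluster, so a 10x larger page means 10x fewer requests. A quick sketch of the arithmetic, using the docs.count from the _cat/indices output above:

```python
import math

DOC_COUNT = 16_753_650  # docs.count from _cat/indices above

def scroll_requests(page_size: int, doc_count: int = DOC_COUNT) -> int:
    """Number of scroll round-trips needed to page through doc_count hits."""
    return math.ceil(doc_count / page_size)

print(scroll_requests(500))    # default scan() page size
print(scroll_requests(5000))   # bumped value from above
```

That's ~33,500 round-trips at the default versus ~3,350 at size=5000, which would explain the throughput gain but not the progressive slowdown itself.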
