Elasticsearch scan of all IDs slows down dramatically (exponentially?)

I'm trying to dump all IDs from an index. The dump starts with an ETA of under ten minutes, but then progressively slows down, ultimately taking over an hour to finish.

Configuration: single node, two shards, no replicas, running on an AWS r5.4xlarge in a Docker container with a 30GB heap. ES 7.8, custom document IDs. Docs range in size from a few tens of bytes to 1-2KB.

I would expect total delivery time to be linear in the shard size, but it seems to get exponentially slower (I also tried this on a much bigger shard (45GB) and it looked like it would never finish).

Docker stats:

CONTAINER ID        NAME                CPU %               MEM USAGE / LIMIT   MEM %               NET I/O             BLOCK I/O           PIDS
56adac7cf174        es             184.92%             32.24GiB / 120GiB   26.87%              132GB / 46.3GB      83.3GB / 1.22TB     180

Indices status:

% curl localhost:9200/_cat/indices?v

health status index                  uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   index-name             nCRb7R1-QKmJwJEhqPLJtA   2   0   16753650          702     14.1gb         14.1gb

With this code:

from elasticsearch.helpers import scan
from tqdm import tqdm

def get_ids(index):
    body = {'query': {'match_all': {}}}
    # ES_CLIENT is an Elasticsearch() instance created elsewhere
    for doc in scan(ES_CLIENT, query=body, index=index, _source=False):
        yield doc['_id']

with open("es-ids.txt", "w") as fd:
    for doc_id in tqdm(get_ids("index-name"), total=16753724):
        fd.write(doc_id + "\n")

36%|███     |  6042901/16753724 [07:33<22:11, 8043.51it/s]
70%|█████   | 11644101/16753724 [18:59<14:24, 5908.57it/s]
85%|██████  | 14171201/16753724 [29:18<16:15, 2647.56it/s]
90%|███████ | 15072301/16753724 [35:56<18:29, 1515.32it/s]

Note the rapidly decreasing iteration rate. It's as if, in any given interval, it takes the same amount of time to get halfway through the remaining docs.
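To make that concrete, here's a small script (checkpoint numbers transcribed from the tqdm output above) that computes the throughput between successive checkpoints; the per-interval rate roughly halves each time:

```python
# (doc count, elapsed seconds) transcribed from the tqdm progress lines above
checkpoints = [
    (6_042_901, 7 * 60 + 33),
    (11_644_101, 18 * 60 + 59),
    (14_171_201, 29 * 60 + 18),
    (15_072_301, 35 * 60 + 56),
]

rates = []
prev_docs, prev_t = 0, 0
for docs, t in checkpoints:
    # docs retrieved in this interval divided by seconds elapsed in it
    rates.append((docs - prev_docs) / (t - prev_t))
    prev_docs, prev_t = docs, t

for r in rates:
    # roughly: ~13340, ~8165, ~4083, ~2264 docs/s
    print(f"{r:.0f} docs/s")
```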

Tried this again on a three-node cluster, using AWS elasticsearch service (each node an r5.large).

Similar results, though the overall spool time was slower (this might be due to 3 x r5.large versus a single r5.4xlarge). There is still increasing lag as the process nears the end.

No significant difference on a 2-node AWS ES cluster (2 x r4.2xlarge).

Bumping the scan 'size' parameter to 5000 from its default (500) improves overall throughput and reduces the slowdown, but the slowdown is still present.
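One plausible reason the larger page size helps: each scroll page is one round-trip to the cluster, so a 10x larger page means 10x fewer requests. A quick sketch of the arithmetic, using the docs.count from the _cat/indices output above:

```python
import math

DOC_COUNT = 16_753_650  # docs.count from _cat/indices above

def scroll_requests(page_size: int, doc_count: int = DOC_COUNT) -> int:
    """Number of scroll round-trips needed to page through doc_count hits."""
    return math.ceil(doc_count / page_size)

print(scroll_requests(500))    # default scan() page size
print(scroll_requests(5000))   # bumped value from above
```

That's ~33,500 round-trips at the default versus ~3,350 at size=5000, which would explain the throughput gain but not the progressive slowdown itself.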
