Hi Clint,
The parallel scrolls do speed up the retrieval, but parallelizing too
much (or increasing the scroll size to 1000 -- docs are about 5k
each) just blocks the IO, with the Linux kernel complaining:
INFO: task kjournald:2337 blocked for more than 120 seconds.
INFO: task java:10643 blocked for more than 120 seconds.
Trying various combinations (number of parallel scrolls, scroll size,
clients on the same/different machine from the “src” server, etc.), the
best I can get is about 25G per hour, which is a 3x improvement over the
original single-scroll performance. I am using the “http” client, but
the network does not appear to be the bottleneck.
I am still not sure how to make a copy from one cluster to another fully
restartable to mitigate failures – scan-scroll does not support
“sort”, so I cannot persist the “last written _id” and restart from there.
The best I can do is a full restart of the failed “parallel” scroll.
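One workable pattern under that constraint — a hypothetical sketch, on the assumption that each parallel scroll covers an independent partition of the index — is to make the whole partition the unit of retry:

```python
# Hypothetical sketch: treat each parallel scroll as an all-or-nothing
# unit of work, since scan-scroll has no sort/cursor to resume from.
def run_restartable(partitions, copy_partition, max_attempts=3):
    """copy_partition(p) copies one partition end-to-end (e.g. one
    parallel scan-scroll); on failure that partition is restarted
    from scratch, but already-finished partitions are not redone."""
    done = []
    for p in partitions:
        for attempt in range(1, max_attempts + 1):
            try:
                copy_partition(p)
                done.append(p)
                break
            except IOError:
                if attempt == max_attempts:
                    raise
    return done
```

Here `copy_partition` and the partitioning scheme itself are assumptions for illustration — 0.19.x has no built-in way to slice a scroll, so the partitions would have to come from your own routing or query filters.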
Additionally, when I ask for a scroll with a specified size, it gives
back more results than asked for. Repro steps:
- POST /index_name/_search?search_type=scan&scroll=10m&size=1
- GET /_search/scroll?scroll=10m&scroll_id=XYZ
- BUG => observe that the number of documents in [“hits”][“hits”] is
greater than 1 (it looks equal to the number of shards). Not a big
deal, just something to be aware of.
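That quirk follows from how scan works: each scroll fetch returns up to `size` documents from every shard, not `size` in total. A small sketch of a drain loop, with the HTTP call stubbed out as a callable:

```python
def drain_scan_scroll(fetch_batch):
    """Pull batches until an empty hits list signals the end of a scan
    scroll. fetch_batch stands in for a GET /_search/scroll call and
    returns the next ["hits"]["hits"] list."""
    docs = []
    while True:
        hits = fetch_batch()
        if not hits:  # scan scrolls end with an empty page
            break
        docs.extend(hits)
    return docs

# With size=1 on a 5-shard index, each fetch can return up to
# 1 doc x 5 shards = 5 docs, matching the repro above.
```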
On Jun 15, 4:09 am, Clinton Gormley cl...@traveljury.com wrote:
Hiya
> The ES is at 0.19.2.

OK, the improvement in scrolling was added in 0.19.0:
https://github.com/elasticsearch/elasticsearch/issues/1579

> For the purposes of this test I don’t even want to use the “destination”
> reindexing ES so no other issues are introduced; rather, we scroll-search
> and write the output to disk (on the same thread, but it can be made
> multi-threaded), each doc into its own file.

I'd also try just pulling data and throwing it away, to exclude the
"write" part from the tests - EC2 I/O is not renowned for performance.

> raw_copy: 17.6G
> scroll_size=10: 1.8G
> scroll_size=50: 1.5G
> 2x scroll_size=50: 3.3G
> 10x scroll_size=50: 4.8G

So with 10x you're still getting an increase in throughput. Does it
fall off with more processes?

Also, do you have very large docs? size=50 is quite small for a scanned
scroll. I'd be doing 1,000 - 5,000, but if your docs ARE very large then
you may get worse performance. I see that your performance seems to get
worse from 10 to 50, so large docs may be the reason.

> I am not sure how close ES scroll can get to the true “cp -r” speed,
> but currently it appears to be about 10x slower for a single scroll;
> moreover, a parallelized scroll is faster than a single scroll.

You don't mention if you're running ES and the client on the same
machine (which would obviously impact throughput). Also, are you using
a java client or http? Something you may want to try is enabling
compression on the http requests (no idea if that will help or not).

> On a somewhat related subject, running a single scroll like this ended
> up with an error about 5 hours into it – the result returned from
> search-scroll had a string length of zero (could be the underlying
> json / rest library, the network, etc.). It appears to me that,
> design-wise, copying data from one cluster to another requires
> maintaining a truth table keeping track of copied _ids so restarts
> are possible – is this the common consensus?

I think the easiest is probably to enable timestamps and to use range
filters on those.

clint
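Clint's suggestion (enable timestamps, then filter on ranges) can be sketched roughly like this — the window size and checkpointing scheme are assumptions for illustration, not anything built into ES:

```python
from datetime import datetime, timedelta

def time_windows(start, end, step):
    """Split [start, end) into fixed windows. Each window becomes one
    scan-scroll with a range filter on the timestamp field, so a failed
    copy restarts from the last fully-copied window instead of from
    zero."""
    windows, lo = [], start
    while lo < end:
        hi = min(lo + step, end)
        windows.append((lo, hi))
        lo = hi
    return windows
```

After each window finishes, persist its upper bound somewhere durable; on restart, resume from the last persisted bound rather than rescanning the whole index.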