Hi Clint,
ES is at 0.19.2. The machine is a “test” VM running CentOS 6.2 under the
Xen hypervisor, with 32G of RAM and two 4-core Xeon L5420 CPUs (all 8
cores and all the RAM are allocated to that VM, and no other VMs are
running there for the purposes of this test).
The “source” ES instance is running in that VM with 12G of RAM and two
indexes of 8 shards each, with all of its data mounted off an external
USB drive (that’s a backup copy of our production data).
For the purposes of this test I don’t even want to use a “destination”
reindexing ES, so no other issues are introduced; instead we scroll-
search and write the output to disk (on the same thread, though it
could be made multi-threaded), each doc into its own file.
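For reference, the scroll-and-write loop is essentially the following (a minimal sketch; `fetch_scroll` stands in for the HTTP call to /_search/scroll and is injectable, so the loop itself is exercised without a cluster -- the function name and layout are mine, not an ES API):

```python
import json
import os

def drain_scroll(fetch_scroll, first_scroll_id, out_dir):
    """Repeatedly call the search-scroll API (via fetch_scroll) and write
    each returned doc into its own file, named after its _id."""
    os.makedirs(out_dir, exist_ok=True)
    scroll_id = first_scroll_id
    written = 0
    while True:
        body = fetch_scroll(scroll_id)   # GET /_search/scroll?scroll=10m&scroll_id=...
        resp = json.loads(body)
        hits = resp["hits"]["hits"]
        if not hits:                     # an empty page means the scroll is exhausted
            break
        for hit in hits:
            path = os.path.join(out_dir, hit["_id"] + ".json")
            with open(path, "w") as f:
                json.dump(hit["_source"], f)
            written += 1
        scroll_id = resp["_scroll_id"]   # always use the id returned with each page
    return written
```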
For comparison, below are numbers for a raw copy (cp -r from the
external drive to an internal drive), then for scrolling and writing to
disk from a single index with scroll sizes 10 and 50, then for 2
scrolls and 10 scrolls in parallel, each with size 50 and distinct id
ranges. All data was written for 10 minutes:
raw_copy: 17.6G
scroll_size=10: 1.8G
scroll_size=50: 1.5G
2 x scroll_size=50: 3.3G
10 x scroll_size=50: 4.8G
Even though these tests use an ES data source on external USB, I see
very similar write numbers on EC2 (with four of their drives in RAID0).
I am not sure how close ES scroll can get to the true “cp -r” speed,
but currently it appears to be about 10x slower for a single scroll;
moreover, parallelized scrolls are faster than a single scroll.
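The parallel runs above are just N independent scrolls, one per id range, fanned out over threads -- something like this sketch (each drain function opens its own scan+scroll over its range and returns how many docs it wrote; the structure is mine, not an ES feature):

```python
from concurrent.futures import ThreadPoolExecutor

def run_parallel_scrolls(drain_fns):
    """Run one scroll per id-range partition concurrently.

    Each element of drain_fns is a zero-argument callable that drains one
    scroll and returns the number of docs it wrote; the total is summed."""
    with ThreadPoolExecutor(max_workers=len(drain_fns)) as pool:
        return sum(pool.map(lambda fn: fn(), drain_fns))
```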
On a somewhat related subject, running a single scroll like this ended
up with an error about 5 hours in – the result returned from search-
scroll had a string length of zero (it could be the underlying
json/rest library, the network, etc.). It appears to me that, design-
wise, copying data from one cluster to another requires maintaining a
truth table that keeps track of copied _ids so that restarts are
possible – is this the common consensus?
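To make the restart idea concrete, here is a minimal sketch of what I mean by a “truth table” -- just an append-only file of copied _ids that is reloaded on startup (the class and file format are mine, not anything ES provides):

```python
import os

class CopiedIds:
    """Append-only log of _ids that were successfully copied, so a
    restarted copy job can skip docs it has already written."""

    def __init__(self, path):
        self.path = path
        self.seen = set()
        if os.path.exists(path):  # reload previously copied ids on restart
            with open(path) as f:
                self.seen = set(line.strip() for line in f if line.strip())
        self._log = open(path, "a")

    def already_copied(self, doc_id):
        return doc_id in self.seen

    def mark(self, doc_id):
        self.seen.add(doc_id)
        self._log.write(doc_id + "\n")
        self._log.flush()  # so the record survives a crash mid-run
```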
Another question – is there a way to query by an _id range (or some
other way to partition the data) so I can partition scrolls in a
generic way? Right now I have an internal id that is specific to my
schema, so it just happened to work for me, as I can partition on the
range:
{
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      },
      "filter": {
        "and": [
          {
            "range": {
              "internal_id": {
                "gte": "10000000",
                "lte": "1000000000"
              }
            }
          }
        ]
      }
    }
  }
}
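For what it’s worth, queries like the one above can be generated generically for any numeric id field; a sketch (`partition_queries` is my helper, and `internal_id` is my schema-specific field, so both are assumptions, not ES features):

```python
def partition_queries(field, lo, hi, n):
    """Split the numeric id range [lo, hi] into n contiguous,
    non-overlapping sub-ranges and build one filtered range query
    (0.19-style syntax) per slice, for use by parallel scrolls."""
    step, rem = divmod(hi - lo + 1, n)
    queries, start = [], lo
    for i in range(n):
        # spread the remainder over the first `rem` slices
        end = start + step - 1 + (1 if i < rem else 0)
        queries.append({
            "query": {
                "filtered": {
                    "query": {"match_all": {}},
                    "filter": {
                        "range": {field: {"gte": str(start), "lte": str(end)}}
                    },
                }
            }
        })
        start = end + 1
    return queries
```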
Thanks!
-- Andy
On Jun 14, 1:31 am, Clinton Gormley cl...@traveljury.com wrote:
Hi Andy
I am trying to copy data from one cluster to another -- using the
scroll API to get a scroll_id (/_search?search_type=scan&scroll=10m&size=ABC)
and then the search-scroll API (/_search/scroll?scroll=10m&scroll_id=XYZ)
to fetch data from one cluster and bulk insert it into another cluster.
Everything works fine, but the search-scroll part is quite slow
(around 8G/hour), which results in re-indexing at about 10-50 documents
per second (as opposed to the 500-2000 per second that I know the
cluster can handle), and for a 2T cluster this is going to take quite a
while. I have tried changing the scroll size, but it does not appear to
make a difference -- rather than returning small chunks often, it
returns larger chunks less often.
What size= are you using? What version of ES are you on? (I know there
was an improvement in scan/scroll speed a few versions ago).
Are there any ways to improve it? Should I run scroll by _id ranges in
parallel?
I'd think that would help. But it'd be good to figure out what is
slowing this down
clint
Thanks,
-- Andy