Slow scrolling speed

Hi

I am trying to copy data from one cluster to another -- using the scroll
API to get a scroll_id (/_search?search_type=scan&scroll=10m&size=ABC)

and then the search-scroll API (/_search/scroll?scroll=10m&scroll_id=XYZ)
to fetch data from one cluster and bulk insert it into another
cluster.
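For reference, the shape of my copy loop is roughly this (a minimal Python sketch, not my actual client; `hits_to_bulk_body` is my own helper name, and the commented-out driving loop just paraphrases the REST calls above):

```python
import json

def hits_to_bulk_body(hits, dest_index):
    """Turn one page of scroll hits (resp["hits"]["hits"]) into the
    newline-delimited body expected by the destination's /_bulk endpoint."""
    lines = []
    for hit in hits:
        # One action line per document, then its _source on the next line.
        lines.append(json.dumps({"index": {"_index": dest_index,
                                           "_type": hit["_type"],
                                           "_id": hit["_id"]}}))
        lines.append(json.dumps(hit["_source"]))
    return "\n".join(lines) + "\n"

# Driving loop, in pseudo-REST:
#   scroll_id = POST /src_index/_search?search_type=scan&scroll=10m&size=ABC
#   loop:
#       resp = GET /_search/scroll?scroll=10m&scroll_id=<scroll_id>
#       hits = resp["hits"]["hits"]; stop when empty
#       POST /_bulk with hits_to_bulk_body(hits, "dst_index")
#       scroll_id = resp["_scroll_id"]
```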

Everything works fine, but the search-scroll part is quite slow
(around 8G/hour), which results in re-indexing at about 10-50 documents
per second (as opposed to the 500-2000 per second I know the cluster
can handle), and for a 2T cluster this is going to take quite a while. I
have tried changing the scroll size, but it does not appear to make a
difference -- rather than returning small chunks often, it returns
larger chunks less often.

Are there any ways to improve it? Should I run scroll by _id ranges in
parallel?

Thanks,

-- Andy

Hi Andy

I am trying to copy data from one cluster to another -- using the scroll
API to get a scroll_id (/_search?search_type=scan&scroll=10m&size=ABC)

and then the search-scroll API (/_search/scroll?scroll=10m&scroll_id=XYZ)
to fetch data from one cluster and bulk insert it into another
cluster.

Everything works fine, but the search-scroll part is quite slow
(around 8G/hour), which results in re-indexing at about 10-50 documents
per second (as opposed to the 500-2000 per second I know the cluster
can handle), and for a 2T cluster this is going to take quite a while. I
have tried changing the scroll size, but it does not appear to make a
difference -- rather than returning small chunks often, it returns
larger chunks less often.

What size= are you using? What version of ES are you on? (I know there
was an improvement in scan/scroll speed a few versions ago).

Are there any ways to improve it? Should I run scroll by _id ranges in
parallel?

I'd think that would help. But it'd be good to figure out what is
slowing this down.

clint

Thanks,

-- Andy

Hi Clint,

The ES is at 0.19.2. The machine is a “test” VM running CentOS
6.2 on a Xen hypervisor, with 32G of RAM and two 4-core Xeon L5420
CPUs (all 8 cores and all RAM are allocated to that VM, and no other VMs
are running there for test purposes).

The “source” ES instance runs in that VM with 12G of RAM and 2
indexes of 8 shards each, with all of its data mounted off an external USB
drive (that’s a backup copy of our production data).

For the purposes of this test I don’t even want to use the “destination”
reindexing ES, so no other issues are introduced; rather, we scroll-
search and write the output to disk (on the same thread, but it can be made
multi-threaded), each doc into its own file.

For comparison, below are numbers for a raw copy (cp -r from the external
drive to the internal drive), then scrolling and writing to disk from a
single index with scroll sizes 10 and 50, then 2 scrolls and 10 scrolls in
parallel, each with size 50 and distinct id ranges. All data was written
for 10 minutes:

raw_copy: 17.6G
scroll_size=10: 1.8G
scroll_size=50: 1.5G
2x scroll_size=50: 3.3G
10x scroll_size=50: 4.8G

Even though these tests use ES as a data source on external
USB, I get very similar write numbers on EC2 (with 4 of their drives
in RAID0).

I am not sure how close ES scroll can get to the true “cp -r” speed,
but currently it appears to be about 10x slower for a single scroll;
moreover, a parallelized scroll is faster than a single scroll.

On a somewhat related subject, running a single scroll like this ended up
with an error about 5 hours in -- the result returned from search-
scroll had a string length of zero (could be the underlying JSON/REST
library, network, etc.). It appears to me that, design-wise, copying data
from one cluster to another requires maintaining a truth table that
keeps track of copied _ids so restarts are possible -- is this the
common consensus?

Another question – is there a way to query by _id range (or some other
way to partition the data) so I can partition scrolls in a generic way?
Right now I have an internal id which is specific to my schema, so it
just happens to work for me, as I can partition on the range:
{
  "query": {
    "filtered": {
      "query": {
        "match_all": {}
      },
      "filter": {
        "and": [
          {
            "range": {
              "internal_id": {
                "gte": "10000000",
                "lte": "1000000000"
              }
            }
          }
        ]
      }
    }
  }
}

Thanks!

-- Andy


BTW, I just realized that I have _source compressed in the src ES, so in
the test the destination for the "cp -r" data is also compressed. However,
during the scroll performance write tests the destination is obviously not
compressed -- this makes the real scroll performance numbers even
worse.


Hiya

The ES is at 0.19.2.

OK, the improvement in scrolling was added in 0.19.0:
https://github.com/elasticsearch/elasticsearch/issues/1579

For the purposes of this test I don’t even want to use the “destination”
reindexing ES, so no other issues are introduced; rather, we scroll-
search and write the output to disk (on the same thread, but it can be made
multi-threaded), each doc into its own file.

I'd also try just pulling data and throwing it away, to exclude the
"write" part from the tests - EC2 I/O is not renowned for performance.

raw_copy: 17.6G
scroll_size=10: 1.8G
scroll_size=50: 1.5G
2x scroll_size=50: 3.3G
10x scroll_size=50: 4.8G

So with 10x you're still getting an increase in throughput. Does it
fall off with more processes?

Also, do you have very large docs? size=50 is quite small for a scanned
scroll. I'd be doing 1,000 - 5,000, but if your docs ARE very large then
you may get worse performance. I see that your performance seems to get
worse from 10 to 50, so large docs may be the reason.

I am not sure how close ES scroll can get to the true “cp -r” speed,
but currently it appears to be about 10x slower for a single scroll;
moreover, a parallelized scroll is faster than a single scroll.

You don't mention if you're running ES and the client on the same
machine (which would obviously impact throughput). Also, are you using
a java client or http? Something you may want to try is enabling
compression on the http requests (no idea if that will help or not).

On a somewhat related subject, running a single scroll like this ended up
with an error about 5 hours in -- the result returned from search-
scroll had a string length of zero (could be the underlying JSON/REST
library, network, etc.). It appears to me that, design-wise, copying data
from one cluster to another requires maintaining a truth table that
keeps track of copied _ids so restarts are possible -- is this the
common consensus?

I think the easiest is probably to enable timestamps and to use range
filters on those.
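For example, something along these lines (an untested sketch -- `_timestamp` needs to be enabled in your mapping first, and the dates here are made up):

```json
{
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": {
        "range": {
          "_timestamp": {
            "gte": "2012-06-01T00:00:00",
            "lt": "2012-06-15T00:00:00"
          }
        }
      }
    }
  }
}
```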

clint

Hi Clint,

The parallel scrolls do speed up the retrieval, but parallelizing too
much (or increasing the scroll size to 1000 -- docs are about 5k
each) just blocks the IO, with the Linux kernel complaining:

INFO: task kjournald:2337 blocked for more than 120 seconds.
INFO: task java:10643 blocked for more than 120 seconds.

Trying various combinations (number of parallel scrolls, scroll size,
clients on the same/a different machine from the “src” server, etc.), the
best I can get is about 25G per hour, which is a 3x improvement over the
original single-scroll performance. I am using an HTTP client, but the
network does not appear to be a bottleneck.

I am still not sure how to make copy from one cluster to another fully
restartable to mitigate failures – scan-scroll does not support
“sort”, so I cannot persist “last written _id” and restart from there.
The best I can do is a full restart of the failed “parallel” scroll.

Additionally, when I ask for a scroll with a specified size, it gives
more results back than asked for. Repro steps:

  1. POST /index_name/_search?search_type=scan&scroll=10m&size=1
  2. GET /_search/scroll?scroll=10m&scroll_id=XYZ
  3. BUG=> observe that the number of documents in [“hits”][“hits”] is
    greater than 1 (looks like it equals the number of shards). Not a big
    deal, just something to be aware of.


On Monday, June 18, 2012 1:33:15 PM UTC-7, andym wrote:

I am still not sure how to make copy from one cluster to another fully
restartable to mitigate failures – scan-scroll does not support
“sort”, so I cannot persist “last written _id” and restart from there.

I also wish documents could be scrolled in insertion order... Is this an
impossible request, or something worth creating an issue for?

Hi Andy

I am still not sure how to make copy from one cluster to another fully
restartable to mitigate failures – scan-scroll does not support
“sort”, so I cannot persist “last written _id” and restart from there.
The best I can do is a full restart of the failed “parallel” scroll.

But if you have a timestamp on your docs, then you can filter out
anything that changed before the max timestamp.

Additionally, when I ask for a scroll with a specified size, it gives
more results back than asked for. Repro steps:

  1. POST /index_name/_search?search_type=scan&scroll=10m&size=1
  2. GET /_search/scroll?scroll=10m&scroll_id=XYZ
  3. BUG=> observe that the number of documents in [“hits”][“hits”] is
    greater than 1 (looks like it equals the number of shards). Not a big
    deal, just something to be aware of.

Not a bug - it's the nature of the beast.

Normally when you do a search in ES (for e.g. 10 results), it will request
the top 10 results from each shard (5 x 10 = 50), then find the top 10
docs out of those 50 and return just those, discarding the other 40.

With scan, it just returns all 50 results. So on each request, you will
get a maximum of $size x $number_of_primary_shards documents.
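In other words (a trivial sketch of the arithmetic; the helper name is mine):

```python
def max_docs_per_scroll_page(size, primary_shards):
    """Upper bound on len(resp["hits"]["hits"]) for one scan/scroll page:
    each primary shard contributes up to `size` documents."""
    return size * primary_shards

# Your repro: size=1 against an 8-shard index can return up to 8 docs
# per page, which matches "looks like equal to the number of shards".
```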

clint

Hi Clint,

I am still not sure how to make copy from one cluster to another fully
restartable to mitigate failures – scan-scroll does not support
“sort”, so I cannot persist “last written _id” and restart from there.
The best I can do is a full restart of the failed “parallel” scroll.

But if you have a timestamp on your docs, then you can filter out
anything that changed before the max timestamp.

Your _timestamp suggestion would work nicely in the case where the “src”
cluster is changing while the copying into the “dst” cluster is going on
and the copying itself takes a relatively short time.

My question is more about the case when the cluster copy itself
takes a long time (say 1 week), is interrupted in the middle due
to some underlying failure (network, etc.), and needs to be restarted.
For example, let’s say we have 1B documents in the “src” cluster, each
with {_id, _timestamp} fields, and for simplicity no updates are
happening in the “src” cluster. One starts scroll-copying the docs from
the “src” cluster into the “dst” cluster, and the copying process dies 2
days later with 0.2B docs copied. Given that one cannot specify “sort” in
the scroll, one does not know the range of the 0.2B docs that were copied
(neither the _id range nor the _timestamp range), hence one does not know
which range (_id or _timestamp) to ask for when restarting the scroll.

Now, during copying one could save the “copied” _ids (or _timestamps)
somewhere, and on restart sort the _ids and scroll over the “holes” in the
_id ranges. Given that parallelization speeds up the scrolling, one also
needs to keep track of the “concurrent scrolls” and maintain the minimum
number of parallel scrolls as the “holes” get filled. However, I am
not sure this is the right approach -- I suspect performance with
fragmented scrolls would likely suffer, especially after multiple
restarts, and the “restart” logic becomes more complicated than just a
few API calls.

Alternatively, one can specify a larger scroll duration (e.g. several
days, as in /_search?search_type=scan&scroll=15000m&size=1), then persist
the “last” _scroll_id per write to the “dst” cluster, and on restart,
rather than asking for a new scroll_id, continue the scan-scroll from the
persisted _scroll_id.
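The bookkeeping for that could be as simple as the sketch below (the file format and names are made up; whether the “src” cluster still honors the old _scroll_id after an interruption is exactly my open question):

```python
import json
import os

def save_checkpoint(path, scroll_id, docs_copied):
    """Persist the latest _scroll_id after each successful bulk write,
    so a crashed copy can resume instead of starting over."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"scroll_id": scroll_id, "docs_copied": docs_copied}, f)
    os.replace(tmp, path)  # atomic swap, so the file is never half-written

def load_checkpoint(path):
    """Return the saved state, or None if there is nothing to resume."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f)
```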

Would the latter approach work? Are there any implications of having a
“large” scroll duration? (If scrolls take memory, is there a way to free
the resources other than the timeout? Would a “src” cluster restart free
used scroll resources?)

One more thing I haven't tried to increase scroll speed -- do you
think increasing the number of replicas in the "src" cluster (say to 10)
would increase the scroll performance?

-- Andy