Adding S3 gateway on a local-gateway machine


(mrflip) #1

We have a machine set up currently with a local gateway (full config
at https://gist.github.com/f003c19dd0ce53c654cb )

gateway:
  type: local
index:
  gateway:
    snapshot_interval: -1
    snapshot_on_close: false

We're considering moving the cluster to use the S3 gateway. It's a 16-
machine cluster; when all is done it will hold about 11 indexes, 176
shards x 2 (replicas = 1), each of about 5-15GB actual-on-disk usage.
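
For a sense of scale, here is a rough sizing sketch based on the figures above. It assumes the 5-15GB range applies per shard and that data is spread evenly across nodes; neither is confirmed in the post, and the variable names are purely illustrative.

```shell
#!/usr/bin/env bash
# Rough sizing from the figures in the post (per-shard size is an assumption).
shards=176     # primary shards across all indexes
copies=2       # primary + 1 replica
nodes=16       # machines in the cluster
min_gb=5       # assumed lower bound per shard, in GB
max_gb=15      # assumed upper bound per shard, in GB

primaries_min=$((shards * min_gb))
primaries_max=$((shards * max_gb))
total_min=$((primaries_min * copies))
total_max=$((primaries_max * copies))
per_node_min=$((total_min / nodes))
per_node_max=$((total_max / nodes))

echo "one copy of each shard: ${primaries_min}-${primaries_max} GB"
echo "all copies on disk:     ${total_min}-${total_max} GB"
echo "per node (even spread): ${per_node_min}-${per_node_max} GB"
```

Under these assumptions a single copy of every shard comes to roughly 0.9-2.6 TB, which is the order of magnitude that would have to travel to S3 if the gateway holds one copy per shard.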

Can we switch the cluster over to use the s3 gateway without losing
files?
I know I'll have to trigger a snapshot using, e.g.,
curl -XPOST 'http://localhost:9200/_gateway/snapshot'
My concern is that once I update the config, I'll have to restart each
data node; will it try to initiate recovery from the (empty) s3
gateway, or can I make it adopt the local files already present and
then push them to S3 after going green?
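
The "snapshot after going green" step could be scripted roughly as below. This is only a sketch: it reuses the `_gateway/snapshot` call from above and assumes the `_cluster/health` endpoint returns JSON with a `status` field. The `ES_URL` variable and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Base URL of any node in the cluster (assumed default; override via ES_URL).
ES_URL="${ES_URL:-http://localhost:9200}"

# Ask the cluster to snapshot to its gateway (endpoint taken from the post).
trigger_snapshot() {
  curl -s -XPOST "$ES_URL/_gateway/snapshot"
}

# Print the cluster status (green / yellow / red) parsed from the health JSON.
cluster_status() {
  curl -s "$ES_URL/_cluster/health" |
    sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p'
}

# Trigger a snapshot only when the cluster reports green.
snapshot_when_green() {
  if [ "$(cluster_status)" = "green" ]; then
    trigger_snapshot
  else
    echo "cluster is not green; skipping snapshot" >&2
    return 1
  fi
}
```

Calling `snapshot_when_green` from a shell on one of the nodes would then issue the snapshot request only once the health check reports green.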

Also, are there any non-obvious performance implications for pushing
that much data through s3? Will new nodes recover from their peers or
pull from s3?

thanks,
flip


(Shay Banon) #2

Hi,

There is no way to switch from the local gateway to the s3 gateway
without reindexing the data.

Regarding the overhead of s3, there are basically two costs. The first is
the initial recovery on full cluster startup. If you set the
gateway.recover_after_xxx settings (for example gateway.recover_after_nodes),
then shards will be allocated to the nodes whose local data has the most in
common with what is stored in s3, so the recovery times should be minimal.

The second problem with s3 is more concerning: the need to push the data
to s3. This requires network resources, which are very scarce on ec2 ;),
and it will compete with indexing / searching network operations.
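
To put a rough number on that network cost, here is a back-of-the-envelope sketch. It assumes one copy of each of the 176 shards at 5-15GB (about 880-2640 GB) has to reach S3, and a sustained aggregate upload rate of 200 Mbit/s. That rate is a made-up figure for illustration, not a measured EC2 number, and `upload_hours` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hours needed to push <gb> gigabytes at a sustained <mbps> megabits per second.
# Integer arithmetic, rounded up; 1 GB is taken as 8000 Mbit (decimal units).
upload_hours() {
  local gb=$1 mbps=$2
  local seconds=$(( (gb * 8000 + mbps - 1) / mbps ))
  echo $(( (seconds + 3599) / 3600 ))
}

# Assumed inputs: 880-2640 GB of shard data, 200 Mbit/s sustained to S3.
upload_hours 880 200    # prints 10
upload_hours 2640 200   # prints 30
```

Even with these optimistic assumptions the initial push takes many hours, during which it shares the network with indexing and search traffic.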

-shay.banon

On Fri, Dec 24, 2010 at 1:12 AM, mrflip mrflip@gmail.com wrote:

