gateway:
    type: local

index:
    gateway:
        snapshot_interval: -1
        snapshot_on_close: false
We're considering moving the cluster to use the S3 gateway. It's a 16-
machine cluster; when all is done it will hold about 11 indexes, 176
shards x 2 (replicas = 1), each with about 5-15GB of actual on-disk usage.
Can we switch the cluster over to use the S3 gateway without losing
files?
I know I'll have to trigger a snapshot first, e.g.:
curl -XPOST 'http://localhost:9200/_gateway/snapshot'
My concern is that once I update the config, I'll have to restart each
data node; will it try to initiate recovery from the (empty) S3
gateway, or can I make it adopt the local files that are already present
and then push them to S3 after going green?
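For reference, I assume the config we'd be switching to would look
something like this (the bucket name and credentials are placeholders,
and this presumes the cloud-aws plugin is installed):

cloud:
    aws:
        access_key: <your-access-key>
        secret_key: <your-secret-key>

gateway:
    type: s3
    s3:
        bucket: my-es-gateway-bucket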
Also, are there any non-obvious performance implications of pushing
that much data through S3? Will new nodes recover from their peers or
pull from S3?
There is no way to switch from the local gateway to the S3 gateway without
reindexing the data.
Regarding the overhead of S3, there are basically two costs. The first is
the initial recovery on full cluster startup. If you set the
gateway.recover_after_xxx settings, shards will be allocated to the nodes
whose local data has the most in common with what is stored on S3, so
recovery times should be minimal.
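For example, on a 16-node cluster these settings might look something
like the following (the exact values here are illustrative, not
recommendations):

gateway:
    recover_after_nodes: 14
    recover_after_time: 5m
    expected_nodes: 16

This delays recovery until most of the cluster has joined, so shards can
be allocated to the nodes already holding their data rather than being
pulled down from S3.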
The second problem with S3 is more concerning: the need to push the data
to S3 in the first place. This requires network resources, which are very
scarce on EC2 ;), and it will compete with indexing / searching network
traffic.