Apologies if this has been answered before; I've had a look at the
docs and archives but may have missed something. We've got a single
index on 0.18.6 running on an EC2 cluster that we want to back with
the s3 gateway. We generate a few GB of documents a day at a regular
pace (no huge spikes), and set a TTL of 15 days on all documents. As
such, we're hoping that the size of the data stored in s3 will remain
reasonably constant over time. Two questions:
1 - is that much data too much to hope to push to s3 all the time?
2 - will elasticsearch remove old data from s3 as it becomes unneeded?
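For context, we set the 15 day TTL through the _ttl field in the type mapping, roughly along these lines (index and type names are placeholders):

  curl -XPUT 'http://localhost:9200/myindex/mytype/_mapping' -d '{
    "mytype" : {
      "_ttl" : { "enabled" : true, "default" : "15d" }
    }
  }'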
First, my general recommendation is to use the local gateway on EC2; it's
considerably more lightweight compared to the s3 gateway. You can still
back it up, of course.
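For reference, switching between the two comes down to the gateway settings in elasticsearch.yml; roughly something like this (assuming the cloud-aws plugin is installed, bucket name and credentials are placeholders):

  # local gateway (the default), state recovered from each node's data directory
  gateway.type: local

  # shared s3 gateway
  gateway.type: s3
  gateway.s3.bucket: my-es-gateway-bucket
  cloud.aws.access_key: YOUR_AWS_KEY
  cloud.aws.secret_key: YOUR_AWS_SECRET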
1 - is that much data too much to hope to push to s3 all the time?
It should be fine.
2 - will elasticsearch remove old data from s3 as it becomes unneeded?
Not as data gets deleted. Deleted documents are only marked as deleted, and
later on, as the index performs merges, they will be merged out.
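If the expired documents start taking up noticeable space in the meantime, an optimize call with only_expunge_deletes will merge out just the deleted docs; roughly (index name is a placeholder):

  curl -XPOST 'http://localhost:9200/myindex/_optimize?only_expunge_deletes=true'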
Thanks for the response. Our motivation to use the s3 gateway is that
for some of our environments we have a lot of documents on relatively
small clusters (because load is low), and we found ourselves firing
up more EC2 instances just to get more local disk space, which isn't
very cost effective. We're trialling the s3 gateway for a couple of
weeks to see how it goes; we may investigate other options if we have
problems.
Yes, because recovery on a full cluster restart relies on the latest
snapshotted data, while if it was based on the local gateway by default,
and only on backups in the worst case scenario, then it's a different story.
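With a shared gateway, the snapshot interval controls how often the latest state gets pushed out, and a snapshot can also be triggered explicitly; roughly (the interval shown is only an example):

  # elasticsearch.yml
  index.gateway.snapshot_interval: 10s

  # trigger a gateway snapshot on demand
  curl -XPOST 'http://localhost:9200/_gateway/snapshot'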