Deleting s3 gateway data


(Steve-2) #1

Hi,

Apologies if this has been answered before; I've had a look at the
docs and archives but may have missed something. We've got a single
index on 0.18.6 running on an EC2 cluster that we want to back with
the s3 gateway. We generate a few GB of documents a day at a regular
pace (no huge spikes), and set a TTL of 15 days on all documents. As
such, we're hoping that the size of the data stored in s3 will remain
reasonably constant over time. Two questions:

1 - is that much data too much to hope to push to s3 all the time?

2 - will elasticsearch remove old data from s3 as it becomes unneeded?

Thanks!

Steve
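For a rough sanity check, the steady-state expectation follows from ingest rate times retention; the 3 GB/day figure below is an assumed example, not a number from this thread:

```python
# Back-of-the-envelope steady state: with a constant ingest rate and a
# fixed TTL, data older than the TTL expires about as fast as new data
# arrives, so the index (and hence the gateway snapshot) should plateau
# at roughly rate * retention.

def steady_state_gb(gb_per_day, ttl_days):
    """Approximate plateau size of the index in GB."""
    return gb_per_day * ttl_days

# Assumed example: 3 GB/day with a 15-day TTL.
print(steady_state_gb(3, 15))  # 45 GB, before replicas and merge overhead
```

In practice the plateau will sit somewhat above this because deleted documents linger in segments until merges reclaim them.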


(Shay Banon) #2

First, my general recommendation is to use the local gateway on EC2; it's
considerably more lightweight than the s3 gateway. You can still
back it up, of course.
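For reference, a minimal sketch of the two setups in elasticsearch.yml (key names follow the 0.18-era docs; the bucket name and credentials are placeholders):

```yaml
# Option A: local gateway (the recommendation above) - state is
# recovered from each node's local disk on full cluster restart.
gateway:
  type: local

# Option B: s3 gateway - state is snapshotted to the named bucket.
# gateway:
#   type: s3
#   s3:
#     bucket: my-es-gateway-bucket
# cloud:
#   aws:
#     access_key: <AWS_ACCESS_KEY>
#     secret_key: <AWS_SECRET_KEY>
```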

On Tue, May 8, 2012 at 9:59 PM, Steve <steve1.mclellan@googlemail.com> wrote:

Hi,

Apologies if this has been answered before; I've had a look at the
docs and archives but may have missed something. We've got a single
index on 0.18.6 running on an EC2 cluster that we want to back with
the s3 gateway. We generate a few GB of documents a day at a regular
pace (no huge spikes), and set a TTL of 15 days on all documents. As
such, we're hoping that the size of the data stored in s3 will remain
reasonably constant over time. Two questions:

1 - is that much data too much to hope to push to s3 all the time?

It should be fine.

2 - will elasticsearch remove old data from s3 as it becomes unneeded?

Not immediately as data gets deleted. Deleted documents are only marked as
deleted; later, as the index performs segment merges, they are merged out
and the space is reclaimed.

Thanks!

Steve
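If reclaiming that space sooner matters, the 0.18-era optimize API can expunge deleted documents on demand (the index name is a placeholder, and this can trigger heavy merge I/O, so run it off-peak):

```sh
# Ask Elasticsearch to merge out only the deleted documents,
# rather than performing a full optimize down to one segment.
curl -XPOST 'http://localhost:9200/my_index/_optimize?only_expunge_deletes=true'
```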


(Steve-2) #3

Hi Shay,

Thanks for the response. Our motivation to use the s3 gateway is that
for some of our environments we have a lot of documents on relatively
small clusters (because load is low) and were finding ourselves firing
up more ec2 instances just to get more local disk space, which isn't
very cost effective. We're trialling the s3 gateway for a couple of
weeks to see how it goes; we may investigate other options if we have
problems.

Steve



(Shay Banon) #4

With the s3 gateway, the indexes are still stored in each node's data
location; they are just snapshotted periodically to s3.
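The snapshot cadence was tunable in the 0.18-era shared gateways via a per-index setting along these lines (check the docs for your exact version before relying on the key name):

```yaml
# How often the shared (e.g. s3) gateway snapshots each index's state.
index:
  gateway:
    snapshot_interval: 10s
```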



(Eric Jain) #5

On Tue, May 15, 2012 at 12:54 PM, Shay Banon <kimchy@gmail.com> wrote:

With the s3 gateway, you still have the indexes stored on the nodes data
location, they are just snapshotted periodically to s3.

Is this snapshotting less efficient than other backup options?


(Shay Banon) #6

Yes, because recovery on a full cluster restart relies on the latest
snapshotted data. With the local gateway as the default recovery path, and
backups used only in a worst-case scenario, it's a different story.


