Snapshot Duration / Curator / Index Selection

As noted here --
https://groups.google.com/forum/#!searchin/elasticsearch/snapshot$20duration/elasticsearch/bCKenCVFf2o/TFK-Es0wxSwJ
-- the time it takes to perform a snapshot increases the more snapshots you
take. This can eventually become untenable. So far, the only solutions
seem to be to trim snapshots or to snapshot into a new repository,
resetting the performance.

  1. When I perform snapshots, I want to snapshot all indices. However, all
    of my indices are timestamped logstash-style. The only index that receives
    new documents is today's. I would think Elasticsearch could optimize for
    this and not look through all the snapshots if the index is older than
    today. If there were some mechanism to indicate an index was frozen
    (read-only), then snapshotting could be very fast: query the 'frozenTime'
    for all indices and only try to update the snapshots of unfrozen indices.

  2. I can sort of solve the above problem by just snapshotting today's
    indices, but then restore is cumbersome. Say I retained 60 days' worth of
    data; that means I'd have to retain 60 days' worth of snapshots. And to do
    the restore, I'd have to restore all 60 snapshots. The situation gets
    worse if I want to snapshot multiple times a day.

  3. While not really fixing the crux of the problem, the curator script
    could help here. Right now, you can only trim snapshots --older-than some
    date. But what if there were a --thin option? Say I take snapshots every
    hour; I'm only really interested in that precision of snapshots for the
    past 24 hours. A backup from a month ago at 12pm is not much different to
    me than one from a month ago at 1pm. The proposed --thin option would look
    something like this:

curator snapshot --thin-older-than 1 --retain-copies 1

This would delete all but the last snapshot for each day.
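A rough sketch of that thinning logic, written against the snapshot API
with the elasticsearch-py client, might look like the following (the
repository name "backup" is illustrative; this is a proposal, not an
existing curator feature):

from collections import defaultdict
from datetime import datetime, timedelta

from elasticsearch import Elasticsearch

es = Elasticsearch()
REPO = "backup"                                  # hypothetical repository
cutoff = datetime.utcnow() - timedelta(days=1)   # --thin-older-than 1
retain = 1                                       # --retain-copies 1

# List every snapshot in the repository.
snapshots = es.snapshot.get(repository=REPO, snapshot="_all")["snapshots"]

# Group snapshots older than the cutoff by calendar day.
by_day = defaultdict(list)
for snap in snapshots:
    started = datetime.utcfromtimestamp(snap["start_time_in_millis"] / 1000.0)
    if started < cutoff:
        by_day[started.date()].append((started, snap["snapshot"]))

# Within each day, keep only the newest `retain` snapshots; delete the rest.
for day, entries in by_day.items():
    entries.sort()
    for _, name in entries[:-retain]:
        es.snapshot.delete(repository=REPO, snapshot=name)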

I'd love to hear thoughts on this and how people are currently solving this
problem in an automated way.


Hi! Sorry to keep you waiting, but I've been traveling. I completely
misunderstood how snapshots worked when they first came out, so when I
first wrote Curator's snapshot module it would only snap a complete index
once, and never re-snapshot an index if it appeared in the repository.
Then I learned more deeply about how snapshots worked and rewrote
Curator's snapshot functionality to what it is now. You can refer to this
issue on GitHub
(https://github.com/elasticsearch/curator/issues/174#issuecomment-57056621)
if you like, where I explain how this all works with a thought experiment.
A direct quote:

When I initially coded snapshot behavior, I thought that snapshots would
pile up additional copies of indices, somehow, even though I knew they
supported incremental backups. What I learned is that snapshots capture at
the segment level. As long as the segment is referenced by any other
snapshot in the system, deleting a snapshot will not delete the referenced
data. Subsequent snapshots of the same indices simply back up any new
segments since the last snapshot in that repository (incrementals).

So to answer your question number 1, Elasticsearch does optimize for
this. Though curator references indices in a snapshot, only new,
not-yet-snapshotted segments are added to the new snapshot. It's
completely incremental.

The referenced GitHub issue
(https://github.com/elasticsearch/curator/issues/174#issuecomment-57056621)
answers question number 2 by showing how keeping 60 days of snapshots would
allow you to restore the most recent 60 days of indices from only the
snapshot taken on day 60.
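In other words, one restore call covers the whole retention window. A
minimal sketch with elasticsearch-py (the repository and snapshot names
are placeholders):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# The day-60 snapshot references every daily index that existed when it
# was taken, so a single restore brings all 60 days back. The live
# indices must be closed or deleted before restoring over them.
es.snapshot.restore(
    repository="backup",             # placeholder repository name
    snapshot="snapshot-day-60",      # the most recent snapshot
    body={"indices": "logstash-*"},
    wait_for_completion=True,
)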

The answer to question 3 may be addressed by my suggested use-case for
time-series indices at that same GitHub link:
https://github.com/elasticsearch/curator/issues/174#issuecomment-57056621

I also responded to the thread you linked
(https://groups.google.com/forum/#!searchin/elasticsearch/snapshot$20duration/elasticsearch/bCKenCVFf2o/TFK-Es0wxSwJ):

Snapshots are at the segment level. The more segments stored in the
repository, the more segments will have to be compared to those in each
successive snapshot. With merges taking place continually in an active
index, you may end up with a considerable number of "orphaned" segments
stored in your repository, i.e. segments "backed up," but no longer
directly correlating to a segment in your index. Checking through these
may be contributing to the increased amount of time between snapshots.

Consider pruning older snapshots. "Orphaned" segments will be deleted,
and any segments still referenced will be preserved.

This does not affect you in the same way if you have time-series indices
(e.g. one per day). My suggested use-case, as mentioned previously, is to
have 2 repositories: one for frequent snapshots (before daily index
optimization [force merge to smaller number of segments per shard]), and
one for optimized daily snapshots. In this way you could keep hourly
snapshots of the last few daily indices in one repository, and daily
snapshots of your optimized indices in another. This prevents the
slow-down by reducing the number of segments the repositories must search
through for both hourly and daily snapshots.
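A sketch of that two-repository setup via elasticsearch-py, using
fs-type repositories (the repository names, paths, and index pattern
are illustrative):

from datetime import datetime

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Repository for frequent snapshots of the still-active daily index.
es.snapshot.create_repository(
    repository="hourly-live",
    body={"type": "fs", "settings": {"location": "/backups/hourly"}},
)

# Repository for one-time snapshots of optimized (merged) indices.
es.snapshot.create_repository(
    repository="daily-cold",
    body={"type": "fs", "settings": {"location": "/backups/daily"}},
)

now = datetime.utcnow()
today = "logstash-" + now.strftime("%Y.%m.%d")

# Run hourly: snapshot only today's index into the "live" repository.
es.snapshot.create(
    repository="hourly-live",
    snapshot="hourly-" + now.strftime("%Y.%m.%d-%H"),
    body={"indices": today},
)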

--Aaron


Thanks for the speedy reply.

As for 1, I understand that ES optimizes for storage as snapshots of the
same index accumulate; I just wish it could also optimize for performance.
Right now, with a measly 4.5 gig cluster, the difference between
snapshot 1 and snapshot 24 is 8 minutes. If I kept 15 days' worth of
hourlies, I could see the snapshot time going past an hour.

Your suggestion of having two repositories is very helpful. Do you have
any interest in a PR for the --thin-older-than feature I proposed above? I
believe that would solve the same problem as the two repositories but only
require one. So say you wanted 15 days' worth of dailies and 2 days' worth
of hourlies; you could run something like this every hour:

curator snapshot --all-indices --repository backup
curator snapshot --delete-older-than 15 --repository backup
curator snapshot --thin-older-than 2 --repository backup

As described above, --thin-older-than would remove all but N snapshots
per time-unit, where N defaults to 1.


A “thin” option cannot help with snapshots, because they reference segments. If a segment exists in a time-series index from 1 month ago at 12pm and hasn’t changed by 1 month ago at 1pm, it is only stored once in the repository. Though you have multiple “snapshots,” each segment is only ever backed up once.

See immutability (http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/making-text-searchable.html#_immutability) to understand why a segment, once backed up, doesn’t need to be re-backed up.

This is why, for time-series data, it makes sense to have 2 repositories: one for “live” data where segments are constantly merging (and therefore creating newer, bigger segments while deleting the older ones), and one for “cold” data where the segments will not be changing any more (hopefully you’ll have merged/optimized them to only a few segments per shard by this point).

So, “thinning” is really only good for pruning snapshots filled with rapidly changing segments (the “orphans” I referenced before), and that’s only useful with the --older-than flag that is already there.

--Aaron


I understand that the segments are only backed up once. But anecdotally --
and this has been seen by others on the link I started out with --
snapshots take longer as time goes on. With time-based indexes, only
today's segments should be changing whether I optimize the old ones or
not.

If ES is smart enough to ignore the unchanged segments, then I wouldn't
expect to see the snapshot time grow so linearly. If I just snapshot
today's index it takes 10 seconds. If I snapshot all my indexes, 99% of
which have been optimized and haven't changed since the last time I
snapshotted them, it takes 12 minutes. Something seems wrong there. It
shouldn't take 11 minutes, 50 seconds to determine that the other indexes
haven't changed. I'm using S3, if that helps. Maybe S3 is slow? Maybe ES
reads more than it has to from the repository? Do indexes have some sort
of hash on them so you could easily and cheaply compare the index state
against the snapshot state?

Having the dual repositories does solve the problem, as it caps the
number of snapshots at a much more reasonable level, and that's the
solution I'm going forward with.


Time-series indices can grow to 300 segments per index or more. 30 days of that is a rather large number of segments to test, especially over TCP/IP to Amazon S3. Elasticsearch has to test each segment before it can ignore it.
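A quick sketch of how you might count the segments involved, using
elasticsearch-py (the index pattern is illustrative):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Count live segments per index; each one is a candidate the snapshot
# process must test against the repository's contents.
for index, data in es.indices.segments(index="logstash-*")["indices"].items():
    total = sum(
        len(copy["segments"])
        for shard_copies in data["shards"].values()
        for copy in shard_copies
    )
    print(index, total)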

--Aaron


Right, but it’s still slow after optimizing each day down to 2 segments. I actually noticed no difference in snapshot speed pre/post optimizing old indexes, for what it’s worth.
