How do you run ES with limited data storage space?


(David Reagan) #1

So, I haven't figured out the right search terms to find the answer via
Google yet, I've read a lot of the docs on the subject of Snapshot and
Restore without finding an answer, and I haven't had the time or resources
to test some of my own ideas. Hence, I'm posting this in the hopes that
someone who has already solved this problem will share.

How do you run ES with limited data storage space?

Basically, short of getting more space, what can I do to make the best use
of what I have, and still meet as many of my goals as possible?

My setup is 4 data nodes. Due to lack of resources/money, they are all thin
provisioned VMs, and all my data has to be on NFS/SAN mounts. Storing data
on the actual VM's hard disk would negatively affect other VMs and services.

Our NFS SAN is also low on space. So I only have about 1.5TB to use.
Initially this seemed like plenty, but a couple weeks ago, ES started
complaining about running out of space. Usage on that mount was over 80%.
My snapshot repository had ballooned to over 700GB, and each node's data
mount point was around 150GB.

Currently, I'm only using ES for logs.

For day to day use, I should be fine with 1 month of open indices. Thus,
I've been keeping older indices closed already. So I can't really do much
more when it comes to closing indices.

I also run the optimize command nightly on any logstash index older than a
couple of days.

I'd just delete the really old data, but I have use cases for data up to
1.5 years old. Considering that snapshots of only a few months nearly used
up all my space, and how much space a month of logs is currently taking up,
I'm not sure how I can store that much data.

So, in general, how would you solve my problem? I need to have immediate
access to 1 month's worth of logs (via Kibana), be able to relatively
quickly access up to 6 months of logs (open closed indices?), and access up
to 1.5 years' worth temporarily (restore snapshots to a new cluster on my
desktop?)

Would there be a way to move snapshots off of the NFS SAN to an external
hard drive?

Should I tell Logstash to send logs to a text file that gets logrotated
for a year and a half? Or does ES do a good enough job with compression
that gzipping wouldn't help? If it was just a text file, I could unzip it,
then tell Logstash to read the file into an ES cluster.

ES already compresses stored indices by default, right? So there's nothing
I can do there?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/b694768f-3c71-4b98-a18c-842c95809734%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Mark Walkom) #2

There's not a lot you can do here unless you want to start uploading
snapshots to S3, or something else that is not on your NAS.
ES does compress by default and we are working on using a better algorithm
for future releases which will help, but there's no ETA for that.



(Aaron Mefford) #3

While ES does compress by default, it also stores data in data structures
that increase the size of the data. The net effect is that your data will be
much larger than the equivalent gzipped log file. However, running Logstash
to ingest 1.5 years of logs may well take much longer than you would expect.

There is no reason you shouldn't be able to move snapshots off of your
shared drive onto an external drive or other storage, such as S3.

One thing you should reconsider is what you are trying to do with your
resources. It sounds like it is simply too much. If the budget cannot
budge to accommodate the requirements, then the requirements must budge to
accommodate the budget. Perhaps you can identify some log sources that do
not have the same retention requirements. Perhaps it is some segment of
your logs that is not as important. For instance, is it really important to
keep that Java stack trace from a year ago? Now, I don't know the nature of
your logs, but I do know the nature of logs: there are important log
entries, and there are mundane, repetitive entries. What I am driving at is
that by using ES aliases and cross-index searching, you can segment your
logs into important and unimportant indexes. You can still search across
all the indexes, but you can establish shorter retention policies for the
less important ones, while preserving the precious resources you have for
the important ones.
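
A minimal sketch of what such tiered retention could look like, assuming daily indices whose names end in a Logstash-style `YYYY.MM.DD` date. The prefixes `logstash-important-` and `logstash-verbose-` and the retention periods are hypothetical, chosen only for illustration. The function only decides which indices are past their tier's retention; deleting them is left to whatever tooling you already use.

```python
from datetime import date, timedelta

# Hypothetical retention tiers: index-name prefix -> days to keep.
RETENTION_DAYS = {
    "logstash-important-": 548,  # roughly 1.5 years
    "logstash-verbose-": 30,
}

def expired_indices(index_names, today, retention=RETENTION_DAYS):
    """Return the names of daily indices older than their tier's retention."""
    expired = []
    for name in index_names:
        for prefix, keep_days in retention.items():
            if not name.startswith(prefix):
                continue
            # Assumes the rest of the name is a YYYY.MM.DD date.
            y, m, d = (int(part) for part in name[len(prefix):].split("."))
            if today - date(y, m, d) > timedelta(days=keep_days):
                expired.append(name)
            break
    return expired
```

Indices that match no prefix are never reported, so anything outside the tiers is left alone.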

With some data you can take an RRD-style approach and create indexes that
hold summary information, which will allow you to generate
historical dashboards that still capture the essence of the day, if not the
detail. For instance, while you could not show the individual requests on a
given day, you could still show the request volume over a three-year period.
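
As a rough illustration of the summary idea, the sketch below collapses individual events into one small document per day. It assumes each event carries an ISO 8601 `@timestamp` string; the output field names `day` and `requests` are made up for the example.

```python
from collections import Counter

def daily_request_counts(events):
    """Collapse individual log events into one count per calendar day."""
    counts = Counter()
    for event in events:
        # Assumes an ISO 8601 "@timestamp"; the first 10 characters are the date.
        day = event["@timestamp"][:10]
        counts[day] += 1
    # One small summary document per day, sorted by date.
    return [{"day": day, "requests": n} for day, n in sorted(counts.items())]
```

The summary documents could then be written to a separate, long-lived index while the detailed indices are expired on a shorter schedule.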

While this goes against the nature of the logging efforts, these are some
of the ideas I had while reading about your situation.

Aaron



(David Reagan) #4

So, just in case someone finds it useful, here's what I ended up doing.

I take daily snapshots of each logstash index. Then, when I need more room, I create a tgz file of that day's snapshot, and archive it elsewhere. That lets me delete both the indices and the snapshot.

I save the metadata- and snapshot- files, then read which indices are in the snapshot, and find those directories in the repo/indices dir and add them to the tgz file.
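
The archiving step could be sketched roughly like this. It is not the exact script used, only an illustration that assumes the filesystem repository layout described above: `metadata-<name>` and `snapshot-<name>` files at the top level, and one directory per index under `indices/`. The function name and its parameters are hypothetical, and the list of index names is passed in rather than parsed from the snapshot file.

```python
import os
import tarfile

def archive_snapshot(repo_dir, snapshot_name, index_names, out_path):
    """Bundle one snapshot's files from a filesystem repository into a .tgz."""
    with tarfile.open(out_path, "w:gz") as tar:
        # Top-level per-snapshot files, named as described in the post.
        for prefix in ("metadata-", "snapshot-"):
            fname = prefix + snapshot_name
            tar.add(os.path.join(repo_dir, fname), arcname=fname)
        # Each index in the snapshot has a directory under <repo>/indices.
        for index in index_names:
            rel = os.path.join("indices", index)
            tar.add(os.path.join(repo_dir, rel), arcname=rel)
    return out_path
```

This only copies files into the archive and deletes nothing; removing the snapshot and the indices afterwards is a separate step.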

If I need to access the data later, I can unpack it into a temporary cluster's repository and restore it from there into the temp cluster. Or just do the same in the prod cluster.

Of course, it will be interesting to see how well this works when the snapshot is from 1.x and ES is at 2.x...

And I haven't been doing this long enough to see if it really does what I want. We'll see.

Also, as @Aaron_Mefford said, it'd be better to adjust my requirements to fit my budget. Or the other way around. His other suggestion of removing specific kinds of data is also good. Though I'm going to try and avoid implementing it if possible, it seems complicated... :smile:

