Reducing Disk Space Requirements / Deduplication? Zipping?


(Horst Birne) #1

Hey guys,

First of all, our Elasticsearch setup:

  • 1 node
  • 16 GB RAM
  • 4 CPUs
  • Version 0.9.7
  • 5 shards, 1 replica
  • Log types: Windows event logs, Unix system logs, Cisco device logs,
    firewall logs, etc.
  • About 3 million logs per day

We use Logstash to collect the logs and Kibana to access them.

Today we started feeding our NetFlow data into Elasticsearch. We have a
large production environment, so we ended up inserting about 25,000 logs
per second.

Handling that load was no problem for the system, but the index grows
quickly: after one hour of testing we had 800 MB of data. That works out
to 19.2 GB per day, and with a log retention of 30 days, 576 GB.
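Spelled out, that projection is simple linear extrapolation from the one-hour sample:

```shell
# Extrapolate the observed 800 MB/hour of index growth (decimal units).
mb_per_hour=800
gb_per_day=$(awk "BEGIN { print $mb_per_hour * 24 / 1000 }")
gb_per_30_days=$(awk "BEGIN { print $mb_per_hour * 24 * 30 / 1000 }")
echo "$gb_per_day GB/day, $gb_per_30_days GB over 30 days"
# prints: 19.2 GB/day, 576 GB over 30 days
```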

Since that much data is unacceptable for our system, I am looking for ways
to reduce the disk space requirements.

I've tried the compression method built into Elasticsearch by setting
_source to compress. Unfortunately, that didn't help much.

I also tried the _optimize command, since someone wrote that it would help
reduce disk usage. It had no effect.
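For reference, this is roughly what those two attempts look like in the 0.90-era API (the index and type names are placeholders for whatever your Logstash indices use):

```shell
# Attempt 1: compress the stored _source field (a 0.90.x mapping option;
# it was removed in later versions):
curl -XPUT 'localhost:9200/logstash-2014.04.22/logs/_mapping' -d '{
  "logs": { "_source": { "compress": true } }
}'

# Attempt 2: merge each index down to a single segment. This mainly reclaims
# space held by deleted documents, so little effect on a fresh index is expected:
curl -XPOST 'localhost:9200/logstash-2014.04.22/_optimize?max_num_segments=1'
```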

The goal is to get the 576 GB down to something like 80-100 GB.

The first thing I could do is reduce the number of shards to 2, which would
cut the storage to about 220 GB. But I'd rather not do that, in case we add
more nodes to the system later.

The next thing I thought about was putting a deduplicating file system under
the ES node, but I don't think dedup has much effect on an ES index. Does
anyone have experience with that?

The last and most obvious option is to zip the indices into a tarball or
.zip. I think that's our solution for long-term storage (up to 2 years), but
it's no solution for the active indices (the 30 days), since they would no
longer be searchable from Kibana.

So, do any of you have suggestions for us?

Cheers

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/a5e95978-cadd-4953-98b1-52af9a8c84ce%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Mark Walkom) #2

Reducing shards won't help; you will still have the same amount of data,
it just won't be sharded as much.
It's possible that a dedupe FS will help, since log data can be repetitive,
but you'd really have to try it to see.
Zipping is an option: you could close older indices and then zip them, and
do the reverse to read them when you want. However, that adds a lot of
complexity and delay, which you might be OK with.
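A rough sketch of that close-and-archive cycle (the paths and index names are illustrative, the data directory layout varies by install, and moving files out from under a closed index is risky, so test this on a copy first):

```shell
INDEX="logstash-2014.03.01"
# Default data path for a cluster named "elasticsearch"; adjust to your install.
DATA_DIR="/var/lib/elasticsearch/elasticsearch/nodes/0/indices"

# Close the index so Elasticsearch releases its files, then archive and remove them:
curl -XPOST "localhost:9200/$INDEX/_close"
tar czf "/archive/$INDEX.tar.gz" -C "$DATA_DIR" "$INDEX"
rm -rf "${DATA_DIR:?}/$INDEX"

# Reverse the process to read the index again: unpack, then reopen.
# tar xzf "/archive/$INDEX.tar.gz" -C "$DATA_DIR"
# curl -XPOST "localhost:9200/$INDEX/_open"
```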

Ultimately, you won't get that sort of compression factor from the built-in
ES functionality. We saw about a 30-50% reduction in size when we enabled
compression in 0.90.N, but 70%+ is a big ask. Your best option would
probably be to not store _source.
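Disabling _source has to happen at index-creation time, e.g. via a template, and it means you can no longer view raw documents in results or easily reindex, so it's a real trade-off. A minimal sketch (template name is a placeholder):

```shell
# Apply to every new logstash-* index; _source is dropped at index time.
curl -XPUT 'localhost:9200/_template/no_source' -d '{
  "template": "logstash-*",
  "mappings": {
    "_default_": { "_source": { "enabled": false } }
  }
}'
```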

Regards,
Mark Walkom

Infrastructure Engineer
Campaign Monitor
email: markw@campaignmonitor.com
web: www.campaignmonitor.com



(Jörg Prante) #3

Did you calculate the volume of 25,000 logs per second?

Given an estimated average log entry size of 100 bytes:

25,000 logs/sec
= 2,500,000 bytes/sec
= 9,000,000,000 bytes/hour
= 216,000,000,000 bytes/day
= 6,480,000,000,000 bytes/month
= 6,328,125,000 KB/month
= 6,179,809.57 MB/month
= 6,034.97 GB/month
= 5.89 TB/month

You can expect about 6 TB of input data volume per month. So a 576 GB index
per month is very small.
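The arithmetic above can be checked quickly (the 100-byte average entry size is an estimate, and the month is taken as 30 days):

```shell
# 25,000 logs/sec at ~100 bytes each, over a 30-day month:
bytes_per_month=$(awk 'BEGIN { printf "%.0f", 25000 * 100 * 86400 * 30 }')
tb_per_month=$(awk 'BEGIN {
  printf "%.2f", 25000 * 100 * 86400 * 30 / (1024 * 1024 * 1024 * 1024) }')
echo "$bytes_per_month bytes/month = $tb_per_month TB/month"
# prints: 6480000000000 bytes/month = 5.89 TB/month
```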

Active ES indices are LZF-compressed by default. There is no way to shrink
them significantly further except by reducing input size, reducing
replicas, or using special mappings.

Special mappings may help if you are willing to drop some Kibana features.
E.g., you could use the keyword analyzer for all strings, throw away field
norms, term vectors, etc., plus disable _all and _source. This saves some
space, but the price is less "searchability"; careful testing is required
to verify that your search requirements are still met. And I am sure you
will not reach 80-100 GB a month.
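Combined, those suggestions might look something like the following 0.90-era index template (a sketch under the assumptions above, not a drop-in config; every option here trades away some search functionality):

```shell
curl -XPUT 'localhost:9200/_template/logstash_slim' -d '{
  "template": "logstash-*",
  "mappings": {
    "_default_": {
      "_all":    { "enabled": false },
      "_source": { "enabled": false },
      "dynamic_templates": [{
        "strings_unanalyzed": {
          "match_mapping_type": "string",
          "mapping": {
            "type": "string",
            "index": "not_analyzed",
            "omit_norms": true,
            "term_vector": "no"
          }
        }
      }]
    }
  }
}'
```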

So my personal recommendation is: for log analysis applications, always
plan for TB-scale storage for your Elasticsearch indices.

Jörg



(Horst Birne) #4

Hi,

thanks for your quick response.

@Jörg: The 576 GB I calculated comes from the results we got when we tested
the NetFlow input (we tested for 15 minutes and got about 200 MB of data).

Based on your answers, I will try to adjust the mapping as well as possible
(I think disabling _source and _all will do a good job) and see how it
affects Kibana.

I will update this thread with how much space the new settings save.



(John Arnold (GNS)) #5

Since NetFlow data is not text, you're using Elasticsearch like a
distributed "SQL" database. You should turn off analysis for all of the
NetFlow fields in your ES template, and drop any fields you don't REALLY
really need.
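As a sketch, a template along these lines would keep the NetFlow fields unanalyzed (the field names here are made up; use whatever your NetFlow pipeline actually emits, and simply omit fields you don't need from the documents before they reach ES):

```shell
curl -XPUT 'localhost:9200/_template/netflow' -d '{
  "template": "netflow-*",
  "mappings": {
    "netflow": {
      "properties": {
        "src_addr": { "type": "string", "index": "not_analyzed" },
        "dst_addr": { "type": "string", "index": "not_analyzed" },
        "src_port": { "type": "integer" },
        "dst_port": { "type": "integer" },
        "protocol": { "type": "short" },
        "bytes":    { "type": "long" },
        "packets":  { "type": "long" }
      }
    }
  }
}'
```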

Also, consider using pmacct (pmacct.net) as a pre-aggregator and
"shipper". If you aggregate to something like one-minute buckets, you may
get the compression you're looking for and still have good data granularity.

Also, half a TB per month is pretty cheap... disk is cheap, man. CPU and
memory are expensive, and you're burning a hell of a lot of those at 25k
NetFlow packets/sec...



(system) #6