Is Elasticsearch capable of storing this amount of data?

Hi there,

I'm currently researching whether Elasticsearch is suitable for storing log information from a complex back-end. The plan is to track certain packages through different processes, so we need to store a vast amount of log data.

We're talking about 4,000+ logs/sec, which comes down to 345+ million logs/day and roughly 11 billion logs/month. An index will be created for each day, and after each month the data will be archived/deleted (we gather statistics and store those for an overview of historical performance). All this data is logged from a dozen different processes running on different machines, but that is not the issue.
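
To make the retention side concrete, the monthly cleanup would boil down to something like the sketch below (made-up index names, plain HTTP via Python's requests library; the statistics we keep for the historical overview would be extracted before anything gets deleted):

```python
import datetime
import requests

HOST = "http://localhost:9200"  # placeholder cluster address

def drop_expired_indices(retention_days=31, lookback=60, today=None):
    """Delete daily log indices that have fallen out of the retention window."""
    today = today or datetime.date.today()
    for back in range(retention_days + 1, retention_days + lookback):
        day = today - datetime.timedelta(days=back)
        name = "logs-{:%Y.%m.%d}".format(day)  # e.g. logs-2013.03.11
        resp = requests.delete("{}/{}".format(HOST, name))
        if resp.status_code not in (200, 404):  # 404 = index already gone
            resp.raise_for_status()

drop_expired_indices()
```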

I'm finding it quite hard, however, to find solid benchmarks to rely on when sketching a possible cluster specification for storing this data. I'm very curious about people's opinions on storing this amount of data in Elasticsearch on commodity hardware.

A good alternative would be Cassandra (perhaps combined with Solr). I know these two are totally different database solutions, but both are a possible fit for our use case (though the flexibility of Elasticsearch has the edge right now).

If you need more information, don't hesitate to ask. I'm looking forward to any responses.

Kind regards,
Vincent

Hello Vincent,

This looks doable. Of course, it depends on the hardware and what sort of
queries and latencies you expect, but nothing in your email makes me think
this could not be done.

Otis

Elasticsearch Performance Monitoring - Sematext

Hi Otis,

thanks for your response.

The queries would be pretty straightforward: default searches on fields and some date-range selections (the last 24 hours, for example), but this shouldn't be a problem with daily indexes. The most important issue would be the throughput (it should be able to write/store 4,000+ logs/sec).
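
Something like the sketch below is roughly what I mean by a field search plus a date-range selection over the daily indices (just an illustration against the _search REST endpoint using Python's requests library; the index and field names are placeholders):

```python
import datetime
import json
import requests

HOST = "http://localhost:9200"  # placeholder cluster address
HEADERS = {"Content-Type": "application/json"}

# A "last 24 hours" search only needs to touch the two most recent daily indices.
today = datetime.date.today()
indices = ",".join(
    "logs-{:%Y.%m.%d}".format(today - datetime.timedelta(days=d)) for d in range(2)
)

query = {
    "query": {
        "bool": {
            "must": [
                {"match": {"process": "billing"}},              # plain field search
                {"range": {"@timestamp": {"gte": "now-24h"}}},  # date-range selection
            ]
        }
    },
    "size": 100,
}

resp = requests.get(
    "{}/{}/_search".format(HOST, indices),
    data=json.dumps(query),
    headers=HEADERS,
)
resp.raise_for_status()
print(resp.json()["hits"]["total"])
```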

I'm going to set up a cluster of 4 servers this week and run some tests with it (configuring different shard/replica combinations). I'm pretty curious about the results :)

Regards,
Vincent

I believe these guys (http://loggly.com/) are doing something similar to your use case. They use Solr as far as I know, but I don't see why the same use case wouldn't work for Elasticsearch. With the recent architecture changes in Lucene and Elasticsearch, a lot of what they did two years ago on Solr 3.x should be a lot more straightforward now with ES and Lucene 4.x.

http://www.loggly.com/blog/2010/08/our-solr-system/

If you google for it, you should be able to find some videos with a more in-depth discussion of their architecture.

So: doable, and it's been done already. On the other hand, sustaining 4,000 documents indexed per second is going to require some major tuning and testing.

One important decision for you is whether you need real-time access to the log entries or whether you can afford some latency of, say, a few minutes. If the latter, you can bulk-index the logs and things should scale relatively easily. If you are going to index each log entry separately, you will need a multi-master type of setup, i.e. a large cluster, since no single node is likely to sustain that kind of traffic. In Elasticsearch that's a matter of having more shards and nodes available.
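
To give a feel for the bulk route, here is a minimal sketch against the plain _bulk REST endpoint (Python's requests library; the index name, type name, and fields are placeholders, not anything from your actual schema):

```python
import json
import requests

BULK_URL = "http://localhost:9200/_bulk"  # placeholder cluster address
HEADERS = {"Content-Type": "application/x-ndjson"}

def bulk_index(docs, index="logs-2013.03.11", doc_type="app_log"):
    """Index a whole batch of log documents in a single _bulk request.

    The body is newline-delimited JSON: one action line, then the document
    source, repeated per document, with a trailing newline at the end.
    The type per application is just one way of slicing a daily index.
    """
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc))
    resp = requests.post(BULK_URL, data="\n".join(lines) + "\n", headers=HEADERS)
    resp.raise_for_status()
    return resp.json()

# e.g. flush a buffered batch every few seconds instead of one request per log line
batch = [{"@timestamp": "2013-03-11T17:26:40Z", "process": "billing",
          "message": "package 42 handed off"}]
print(bulk_index(batch)["took"], "ms")
```

Buffering a few seconds' worth of entries and sending one request like this is usually far cheaper than 4,000 individual index calls per second.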

Without knowing too much about your use case, one idea that comes to mind in terms of a logical architecture is to create a new index every 24 hours (or whichever time period you settle on) and use a type for each application. You can manage sharding and replication settings per index, and after creating a new one your old ones effectively become read-only, so you can delete them or back them up when they are no longer needed. Alternatively, you can manage indices per application and achieve some isolation between different applications.
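
Creating each day's index up front with explicit settings is straightforward; roughly like this (another sketch over the REST API, with placeholder shard and replica counts that you would have to tune):

```python
import json
import requests

HOST = "http://localhost:9200"  # placeholder cluster address
HEADERS = {"Content-Type": "application/json"}

def create_daily_index(day, shards=6, replicas=1):
    """Create one day's log index with explicit sharding/replication settings.

    The shard/replica counts are placeholders; the right values depend on
    the hardware and on the indexing and query load you measure.
    """
    body = {"settings": {"number_of_shards": shards, "number_of_replicas": replicas}}
    resp = requests.put("{}/logs-{}".format(HOST, day),
                        data=json.dumps(body), headers=HEADERS)
    resp.raise_for_status()
    return resp.json()

create_daily_index("2013.03.11")
```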

The physical layout of your cluster will very much depend on your querying needs. It sounds to me like you might have the occasional complex query across all indexes and perhaps some faceting for analytics/reporting. Given that only the last 24 hours receive changes, the analytics queries should only affect that part of the cluster. So you probably want some specialized nodes for the indexing traffic, and then offload querying to the replicas.
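
One knob that can help with that kind of separation is shard allocation filtering: tag the nodes and pin the current, write-heavy index to the tagged ones. A rough sketch, assuming the data nodes carry a custom box_type attribute (the attribute value and index name are made up):

```python
import json
import requests

HOST = "http://localhost:9200"  # placeholder cluster address
HEADERS = {"Content-Type": "application/json"}

# Assumption: the nodes meant to take the write load were started with a
# custom attribute such as box_type: indexing in their configuration.
# Shard allocation filtering then keeps today's index on those nodes:
settings = {"index.routing.allocation.include.box_type": "indexing"}
resp = requests.put(
    "{}/logs-2013.03.11/_settings".format(HOST),
    data=json.dumps(settings),
    headers=HEADERS,
)
resp.raise_for_status()
print(resp.json())
```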

Jilles

Hi Jilles,

thank you for some very useful insights!

The plan was indeed to create a new index every 24 hours, and also to use bulk indexing every x seconds if possible (real-time is not a requirement). I'm going to do some tests with different types for each application and measure how this affects indexing performance.

Concerning the physical layout, I'm going to look a bit deeper into this and probably test different configurations (a specialized node for indexing traffic sounds like a good idea, though).

Regards,
Vincent

Hi Vincent,

It looks like we're doing the same thing, writing/storing 8,000+ logs/sec, and we're running into some problems at the moment. For more info, see http://elasticsearch-users.115913.n3.nabble.com/About-frequently-index-writing-in-elasticsearch-cluster-td4031481.html

Kind regards,
Upton Chen

Hey Vincent,

take a look at Graylog2 (http://www.graylog2.org/) - it's a free and open-source data analytics solution that is perfect for log management. It uses Elasticsearch as its backend, does dynamic batch indexing, and keeps a configurable number of indices in sizes you can define. Searches are optimized so that only the relevant indices are used when you search within time ranges.

Cheers,
Lennart

Hi Lennart,

I've already done some research on Graylog2, but I prefer Logstash. It has more to offer for our situation, and just like Graylog2 it can also create daily indexes with batch processing (plus it has statsd output and the Kibana web interface).

Vincent

We've started down a very similar road. The following links have proven to
be quite helpful. Again, we're just starting but initial benchmarks look
promising. Our throughput requirements are quite similar.
http://edgeofsanity.net/article/2012/12/26/elasticsearch-for-logging.html

http://asquera.de/opensource/2012/11/25/elasticsearch-pre-flight-checklist/

Hi Andrew,

thanks for these really useful articles!

Vincent
