I believe these guys: http://loggly.com/ are doing something similar to
your use case. They use Solr as far as I know, but I don't see why the
same use case wouldn't work for Elasticsearch. With the recent
architecture changes in Lucene and Elasticsearch, a lot of what they did
two years ago on Solr 3.x should be much more straightforward now with ES
and Lucene 4.x.
http://www.loggly.com/blog/2010/08/our-solr-system/
If you google for it, you should be able to find some videos with a more
in-depth discussion of their architecture.
So, it's doable and it's been done already. On the other hand, sustaining
4000 documents indexed per second is going to require some major tuning
and testing.
One important decision for you is whether you need real-time access to the
log entries or whether you can afford some latency of, say, a few minutes.
If the latter, you can bulk index logs and things should scale relatively
easily. If you are going to index each log entry separately, you will need
a multi-master-type setup, i.e. a large cluster, since no single node is
likely to sustain that kind of traffic. In Elasticsearch that's a matter
of having more shards and nodes available.
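To make the bulk option concrete, here's a minimal sketch of building a
payload for the Elasticsearch _bulk endpoint: entries are buffered for a
short interval and sent in one request instead of one request per log
line. The index name, type name, and document fields below are all
hypothetical placeholders, not anything from your setup:

```python
import json

# Hypothetical buffered log entries; in a real pipeline these would
# accumulate for a few seconds before each bulk request is sent.
entries = [
    {"timestamp": "2013-03-11T17:26:40Z", "message": "package 42 dispatched"},
    {"timestamp": "2013-03-11T17:26:41Z", "message": "package 42 delivered"},
]

def build_bulk_body(index, doc_type, docs):
    """Build the NDJSON body for the _bulk endpoint: one action line
    followed by one source line per document."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # _bulk requires a trailing newline

body = build_bulk_body("logs-2013.03.11", "tracking", entries)
# This body would be POSTed to http://<node>:9200/_bulk
print(body)
```

Batching like this amortizes the per-request overhead, which is what makes
the 4000/sec figure much less scary than indexing each entry individually.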
Without knowing too much about your use case, one idea that comes to mind
in terms of a logical architecture is to create a new index every 24 hours
(or whichever time period you settle on) and use types for each
application. You can manage sharding and replication settings per index,
and after creating a new one, your old ones effectively become read-only,
so you can delete or back them up when no longer needed. Alternatively,
you can manage indices per application and achieve some isolation between
different applications.
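A small sketch of that daily-index scheme, assuming a date-stamped naming
convention (the "logs" prefix and the shard/replica numbers are made-up
placeholders you'd tune for your hardware):

```python
import datetime

def daily_index_name(day, prefix="logs"):
    """Name for the index holding one day's logs, e.g. logs-2013.03.11."""
    return "%s-%s" % (prefix, day.strftime("%Y.%m.%d"))

# Settings applied when each day's index is created; the write-heavy
# current day gets its own sharding/replication configuration.
index_settings = {
    "settings": {
        "number_of_shards": 12,   # placeholder - size to your cluster
        "number_of_replicas": 1,  # placeholder
    }
}

name = daily_index_name(datetime.date(2013, 3, 11))
# PUT http://<node>:9200/logs-2013.03.11 with index_settings as the body;
# yesterday's index then stops receiving writes and can be backed up
# or deleted on its own schedule.
print(name)
```

The nice property is that dropping a whole day (or month) of logs is a
cheap index deletion rather than a huge delete-by-query.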
The physical layout of your cluster will very much depend on your querying
needs. It sounds to me like you might have the occasional complex query
across all indices and perhaps some faceting for analytics/reporting.
Given that only the last 24 hours get changes, the analytics queries
should only affect that part of the cluster. So, you probably want some
specialized nodes for indexing traffic and then offload querying to
replicas.
Jilles
On Monday, March 11, 2013 5:26:40 PM UTC+1, Vincent wrote:
Hi there,
I'm currently researching the possibility of using Elasticsearch for
storing log information from a complex back-end. The plan is to track
certain packages through different processes, and we therefore need to
store a vast amount of log data.
We're talking about 4000+ logs/sec > 345+ million logs/day > 11
billion logs/month. For each day an index will be made, and after each
month the data will be archived/deleted (we gather statistics and store
these for an overview of historical performance). All this data is logged
from a dozen different processes running on different machines, but that
is not the issue.
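As a back-of-the-envelope check of those figures, assuming a sustained
4000 logs/sec around the clock (the "+" headroom accounts for the gap up
to the quoted 11 billion):

```python
# Sustained throughput assumed constant over a full day / 31-day month.
per_sec = 4000
per_day = per_sec * 60 * 60 * 24  # 345,600,000 -> the "345+ million/day"
per_month = per_day * 31          # ~10.7 billion for a 31-day month
print(per_day, per_month)
```

So the daily figure matches exactly, and the monthly one lands just under
the quoted 11 billion at a flat 4000/sec.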
I'm finding it quite hard, however, to find solid benchmarks to rely on in
sketching a possible cluster specification where this data could be
stored. I'm very curious about people's opinions on storing this amount of
data in Elasticsearch with commodity hardware.
A good alternative would be Cassandra (perhaps with Solr). I know these
two are totally different database solutions, but both are a possible fit
for our use case (though the flexibility of Elasticsearch has the edge
right now).
For more information don't hesitate to ask, I'm looking forward to any
responses.
Kind regards,
Vincent