I'm really interested in how this will turn out for you because I'm
working on a similar project. I'm expecting similar numbers to yours
pretty soon. Only difference is that we're using rsyslog for now,
instead of Flume. By the way, rsyslog has plugins for both
Elasticsearch and HDFS, in case you're interested.
Here's our basic setup for now:
- one index per day of logs (we'll see if we need finer granularity),
so that we can simply delete old indexes overnight
- source compression (as Shay already suggested)
- the rest of the settings were default (gateway was local file
system, 5 shards per index)
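The nightly cleanup can be as simple as computing which daily index
names have aged out and deleting them. A rough sketch, assuming index
names like logs-2012.02.12 and a local node URL (the naming scheme,
URL, and retention window are all made up for illustration):

```python
from datetime import date, timedelta
import urllib.request

ES_URL = "http://localhost:9200"  # assumed local node
RETENTION_DAYS = 14               # assumed retention window

def index_name(d):
    # assumed naming scheme: one index per day, e.g. logs-2012.02.12
    return "logs-%s" % d.strftime("%Y.%m.%d")

def expired_indexes(today, retention=RETENTION_DAYS, lookback=7):
    # index names older than the retention window (a few days of
    # lookback in case a nightly run was missed)
    return [index_name(today - timedelta(days=retention + i))
            for i in range(1, lookback + 1)]

def delete_index(name):
    # one HTTP DELETE per expired index; a real cron job would
    # tolerate 404s for indexes that are already gone
    req = urllib.request.Request("%s/%s" % (ES_URL, name),
                                 method="DELETE")
    urllib.request.urlopen(req)
```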
First test was on a high-end laptop (i7@2GHz, 8GB RAM, 7200 RPM hdd),
and I could get some 5-7K inserts per second using a python script
that just read log lines from /var/log/messages, parsed them, and did
bulk inserts.
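A minimal sketch of that kind of loader, using the _bulk endpoint's
newline-delimited JSON format (the parser is a stub, and the URL,
index/type names, and batch size are assumptions, not my actual
script):

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumed local node
BATCH = 500                       # assumed batch size

def parse(line):
    # toy parser: real code would pull out timestamp, host, program
    return {"message": line.rstrip("\n")}

def bulk_body(docs, index="logs", doc_type="log"):
    # the _bulk endpoint expects one action line followed by one
    # source line per document, newline-delimited, trailing newline
    out = []
    for doc in docs:
        out.append(json.dumps({"index": {"_index": index,
                                         "_type": doc_type}}))
        out.append(json.dumps(doc))
    return "\n".join(out) + "\n"

def post(body):
    req = urllib.request.Request(ES_URL + "/_bulk",
                                 data=body.encode("utf-8"))
    urllib.request.urlopen(req)

def ship(path="/var/log/messages"):
    batch = []
    with open(path) as f:
        for line in f:
            batch.append(parse(line))
            if len(batch) >= BATCH:
                post(bulk_body(batch))
                batch = []
    if batch:
        post(bulk_body(batch))
```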
As for searching, it was pretty much instant up to about 100M
documents, and still "acceptable" - as in up to 6 seconds - with some
170M documents. I remember the index size was 49GB, so I was impressed.
On a cluster, I would need to account for one replica per shard (for
fault tolerance), and my calculation was that I would need about 1 kB
of index space per log line. I guess that's largely due to source
compression. Maybe your log lines are bigger, but this is still before
disabling _all (as suggested by Shay).
We will also need to span multiple datacenters, and the initial idea
was to have a single Elasticsearch cluster across all datacenters, use
shard allocation filtering to keep each datacenter's logs on its local
nodes, and then search globally from a central interface.
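The per-datacenter pinning could be done with shard allocation
filtering. A sketch, assuming nodes are started with a node attribute
like node.datacenter: dc1 (the attribute name and URL are assumptions
for illustration):

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumed local node

def dc_filter_settings(dc, attr="datacenter"):
    # allocation filter: shards of the index may only be placed on
    # nodes whose node.<attr> value matches dc
    return {"index.routing.allocation.include.%s" % attr: dc}

def pin_index_to_dc(index, dc):
    # apply the filter to an existing index via the settings API
    req = urllib.request.Request(
        "%s/%s/_settings" % (ES_URL, index),
        data=json.dumps(dc_filter_settings(dc)).encode("utf-8"),
        method="PUT")
    urllib.request.urlopen(req)
```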
This is all work in progress, so I'll keep an eye on this thread. Any
feedback is greatly appreciated.
On 12 Feb., 07:01, Raaka <flinks...@gmail.com> wrote:
I'm considering using Flume to collect logs from systems running in
multiple datacenters. Flume would fork the data flow to HDFS and NRT
indexing. The goal is to have each log entry searchable in less than
30 sec from the time it was generated on the remote server. Basically,
I'd like to build something like Splunk with long-term data storage in
HDFS.
I'm new to ES & Lucene and would like to verify that ES is the right
tool for indexing a large volume of logs. Also, I'd like to understand
better how ES scales, how persistence works, and whether it supports
multi-datacenter deployments for DR.
Assume a situation where thousands of servers generate logs at a rate
of, say, 1TB/day. To make the math easy, let's say a single log entry
is 1kB on average. This translates into roughly 11,500 log entries per
second on average. Each log entry should be searchable for 14 days and
then expire from the index.
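The back-of-the-envelope math above works out like this (decimal
units assumed throughout):

```python
BYTES_PER_DAY = 10**12    # 1 TB/day
ENTRY_BYTES = 1000        # 1 kB per entry on average
SECONDS_PER_DAY = 86400
RETENTION_DAYS = 14

entries_per_sec = BYTES_PER_DAY / ENTRY_BYTES / SECONDS_PER_DAY
searchable_bytes = BYTES_PER_DAY * RETENTION_DAYS

print(round(entries_per_sec))      # 11574, i.e. the ~11,500/s above
print(searchable_bytes // 10**12)  # 14 TB of searchable logs
```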
- would each 1kB log entry be considered a "document" in ES?
- 1TB/day x two weeks = 14TB of searchable logs. Where does ES
persist the documents? Local file system?
- is it reasonable to expect an ES cluster to be able to index &
search this rate of logs?
- how does ES scale? E.g. if the log indexing rate goes from 10/sec to
100/sec to 1000/sec, where's the bottleneck?
- the datacenter hosting ES just went down, now what? How would you
recover?
Any comments appreciated,