Well, in my case, it doesn't really matter whether an index is growing due to
normal or abnormal log activity. What matters is making sure that if one
application starts to generate excessive logs, it will not fill all the
available disk space and leave no room for the others. I don't care if I lose
the logs of that application; it's more important to preserve log retention
capabilities for all the other applications that are behaving normally.
If one application has a bug and generates a zillion log entries in 10
minutes, I don't want to find out an hour later that I no longer have logs
from the other nine applications for that period, just because one single
application filled all the available disk space in the ES cluster before
Curator had a chance to run and release some space.
I think an ideal log management solution should behave somewhat like
a collection of FIFO queues, each limited by its own disk space quota. It
surprises me that these space management issues are not a pressing
concern in the ELK user community. As more and more companies migrate
to service-based architectures, which require multiple smaller
applications (services) to run separately in production, it seems to me
that the scenario I described should be very common these days.
Popular logging frameworks, like Log4J, addressed this disk management
problem a long time ago with rotating file appenders.
Every single system in production will, at some point, consume all the disk
space available for logs, and, from that point on, regularly delete older
logs to make room for new entries.
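Just to make the comparison concrete, here is a minimal sketch of that
size-capped rotation idea, using Python's logging.handlers.RotatingFileHandler
as a stand-in for a Log4J rolling file appender (the file name and limits
are arbitrary):

    import logging
    from logging.handlers import RotatingFileHandler

    # Cap the log file at 10 MB and keep at most 5 rotated files, so the
    # application can never use more than ~60 MB of disk for its logs.
    # Once that cap is reached, the oldest entries are simply discarded.
    handler = RotatingFileHandler("app.log",
                                  maxBytes=10 * 1024 * 1024,
                                  backupCount=5)
    logger = logging.getLogger("app")
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    logger.info("older logs are rotated away to make room for this entry")

That is exactly the per-application, fixed-quota, FIFO-like behaviour I would
like to have at the cluster level.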
I understand the complexities of implementing disk space management in a
sharded system, but the ELK stack is 'advertised' as a very robust log
management solution, and yet it doesn't address such a common requirement
of log management systems.
I wonder if there's anything better on the market for these purposes.
On Monday, February 16, 2015 at 5:19:47 PM UTC-2, Aaron Mildenstein
wrote:
Delete by space is extremely hard to do well with a fully distributed
system, like Elasticsearch. You could have 2 or more shards (primary or
replica) from one of the "busy" indices you have indicated residing on one
node, and none on another. How do you determine whether disk space is filling
up as a result of "busy" vs. "normal" log activity? How does a system know
which indices are potentially "problem" indices with too much data vs.
"normal" indices? Curator specifically recommends against using delete
by space because of these shortcomings.
A secondary system becomes necessary to manage delete-by-space in a
distributed system. You wind up having to do something like what
Curator does, summing the space consumed across all shards, but you
would have to do it per index name, or name pattern, and alert
on the results. It would also have to show "per pattern" usage
per node, since data is distributed. Such a system would require constant
monitoring, alerting, and/or acting. Elasticsearch is not (yet, at least)
designed to do this.
Again, delete by disk usage is a very difficult problem to solve with a
sharded, distributed system.
You could write your own monitoring system, based on your own usage or the
suggestions I made above, and make use of the Curator API (
http://curator.readthedocs.org) to do the behind-the-scenes work.
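For illustration, a rough sketch of that secondary system could look like the
following. I am using the low-level elasticsearch-py client here rather than
the Curator API, and the "logs-<app>-*" index patterns and quota numbers are
only assumptions; adapt them to whatever index names your Logstash outputs
actually produce:

    from elasticsearch import Elasticsearch

    # Hypothetical per-application quotas, in bytes. The "logs-<app>-*"
    # patterns assume one index per application per hour, named so that
    # lexical order matches chronological order.
    QUOTAS = {
        "logs-billing-*": 50 * 1024**3,   # 50 GB
        "logs-frontend-*": 20 * 1024**3,  # 20 GB
    }

    es = Elasticsearch(["http://localhost:9200"])

    def enforce_quota(pattern, quota_bytes):
        # Sum the on-disk size (primaries + replicas, across all nodes) of
        # every index matching the pattern, then delete the oldest indices
        # until the total is back under the quota.
        stats = es.indices.stats(index=pattern, metric="store")
        sizes = {name: s["total"]["store"]["size_in_bytes"]
                 for name, s in stats["indices"].items()}
        total = sum(sizes.values())
        for name in sorted(sizes):  # oldest first, by naming convention
            if total <= quota_bytes:
                break
            print("quota exceeded for %s, deleting %s" % (pattern, name))
            es.indices.delete(index=name)
            total -= sizes[name]

    for pattern, quota in QUOTAS.items():
        enforce_quota(pattern, quota)

Run from cron every few minutes, something like that would cap each
application's share of the disk independently of the others, but it is still
an external script you have to monitor; it is not something Elasticsearch
does for you.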
Good luck,
--Aaron
On Sunday, February 15, 2015 at 11:34:37 AM UTC-7, Gabriel Corrêa de
Oliveira wrote:
Dear All,
I am trying to use the ELK stack in the following scenario:
I have about ten applications that send their logs, through Logstash, to
a single Elasticsearch cluster.
Some of these applications naturally generate more logs than others, and,
sometimes, one of them can go 'crazy', because of a bug, for instance, and
thus generate even more log entries than it normally does. As a result, the
disk space available in the cluster can be unfairly 'taken' by the logs of
a single application, leaving not enough room for the others.
I am currently managing the available disk space with Elasticsearch
Curator. It runs periodically from the crontab and deletes older
indices based on a disk usage quota. When the disk space used by all
indices exceeds a certain limit, the oldest indices are deleted, one by
one, until the sum of the disk space used by them all is within the
specified limit again.
The first problem with this approach is that Elasticsearch Curator can
only delete entire indices. Hence, I had to configure Logstash to create
a different index per hour, increasing the granularity so that Curator
deletes smaller chunks of logs at a time (there is a small sketch of the
naming scheme below). Even so, it is very difficult to decide how often
Curator should run: if applications are generating logs at a higher rate,
not even one-hour indices may be fine-grained enough. The second problem
is that there is no way to specify a disk usage quota for each application.
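Just to illustrate that granularity: with one index per application per hour,
the naming amounts to something like the following (the "logs-<app>-" prefix
is only an example, not necessarily what my Logstash configuration uses):

    from datetime import datetime, timezone

    # Example only: one index per application per hour, so that Curator
    # can delete one hour of one application's logs at a time.
    def hourly_index(app, when=None):
        when = when or datetime.now(timezone.utc)
        return "logs-%s-%s" % (app, when.strftime("%Y.%m.%d.%H"))

    # e.g. "logs-billing-2015.02.15.11"; with ten applications this
    # already means 240 new indices per day.
    print(hourly_index("billing"))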
Ideally, Elasticsearch should be able to delete older log entries by
itself whenever the indices reach a certain disk usage limit. This would
eliminate the problem of defining how often Curator should run. However, I
could not find any such feature in the Elasticsearch manual.
Would anybody recommend a different approach to address these issues?