In the last week we've made a few changes and now we're seeing a bit of a
problem. We've seen 3 different occurrences of a single Flume agent server
node backing up its FC1 channel indefinitely until we log in and restart
Flume entirely. The data just stops flowing -- we can't find any errors in
the logs on either the ES or Flume side. A simple restart of Flume fixes
it.
This ONLY happens at midnight, and only on one Flume server. I'm wondering
whether it has to do with the time it takes our ES nodes to create a new
index ... could the first Flume agent that triggers index creation be
getting blocked or stuck?
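If the index-creation theory is right, one cheap way to test it would be to pre-create the next day's index a few minutes before midnight, so the first write of the new day never has to wait on index creation. A minimal sketch, assuming the ElasticSearchSink's default daily pattern (`flume-yyyy-MM-dd`) and a hypothetical host `es-node` -- adjust both to your actual setup:

```shell
# Compute tomorrow's index name (assumes the default "flume-" prefix;
# change it if your sink sets a different indexName).
NEXT_INDEX="flume-$(date -u -d '+1 day' +%Y-%m-%d)"

# Pre-create the index; if it already exists ES just returns an error
# body, which we ignore here. Run this from cron shortly before 00:00.
curl -s -XPUT "http://es-node:9200/${NEXT_INDEX}" || true
echo "pre-created ${NEXT_INDEX}"
```

If the midnight stalls stop once the index already exists when the first event arrives, that would point squarely at index creation as the trigger.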
One additional thing: we actually have two ES sinks pointing to the same
cluster. The config looks more like this:
(inbound avro -> FC1 -> Elasticsearch)
(inbound avro -> FC2 -> S3/HDFS)
(inbound avro_2 -> FC3 -> Elasticsearch)
(inbound avro_2 -> FC4 -> S3/HDFS)
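In Flume properties terms, that fan-out would look roughly like the sketch below. The agent name ("collector"), ports, hosts, and paths are guesses for illustration, not the actual config; the key point is two Avro sources, each replicating into a channel pair, with two ES sinks hitting the same cluster:

```properties
collector.sources = inbound_avro inbound_avro_2
collector.channels = FC1 FC2 FC3 FC4
collector.sinks = es1 s3hdfs1 es2 s3hdfs2

# Each source replicates every event into both of its channels
# (the default replicating channel selector).
collector.sources.inbound_avro.type = avro
collector.sources.inbound_avro.bind = 0.0.0.0
collector.sources.inbound_avro.port = 4545
collector.sources.inbound_avro.channels = FC1 FC2
collector.sources.inbound_avro_2.type = avro
collector.sources.inbound_avro_2.bind = 0.0.0.0
collector.sources.inbound_avro_2.port = 4546
collector.sources.inbound_avro_2.channels = FC3 FC4

# Two ES sinks pointing at the same cluster, plus two HDFS sinks.
collector.sinks.es1.type = org.apache.flume.sink.elasticsearch.ElasticSearchSink
collector.sinks.es1.hostNames = es-node:9300
collector.sinks.es1.channel = FC1
collector.sinks.s3hdfs1.type = hdfs
collector.sinks.s3hdfs1.hdfs.path = s3n://bucket/logs/%Y-%m-%d
collector.sinks.s3hdfs1.channel = FC2
collector.sinks.es2.type = org.apache.flume.sink.elasticsearch.ElasticSearchSink
collector.sinks.es2.hostNames = es-node:9300
collector.sinks.es2.channel = FC3
collector.sinks.s3hdfs2.type = hdfs
collector.sinks.s3hdfs2.hdfs.path = s3n://bucket/logs/%Y-%m-%d
collector.sinks.s3hdfs2.channel = FC4
```

With this shape, both ES sinks race to create the daily index at midnight, which may matter for the stall described above.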
On Thursday, April 10, 2014 9:03:25 AM UTC-7, Matt wrote:
We use Flume 1.4 to pass logs into HDFS as well as Elasticsearch for
storage. The pipeline looks roughly like this:
Client to Server Flow...
(local_app -> local_host_flume_agent) ---- AVRO/SSL ---->
(remote_flume_agent)...
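For reference, the Avro/SSL hop above is just an Avro sink on the local agent paired with an Avro source on the remote one; a minimal Flume 1.4 sketch, with hostnames, ports, and keystore paths made up:

```properties
# Local host agent: avro sink with SSL enabled.
local.sinks.to_collector.type = avro
local.sinks.to_collector.channel = mem
local.sinks.to_collector.hostname = remote-flume.example.com
local.sinks.to_collector.port = 4545
local.sinks.to_collector.ssl = true
local.sinks.to_collector.truststore = /etc/flume/truststore.jks
local.sinks.to_collector.truststore-password = changeit

# Remote collector: matching avro source with its keystore.
collector.sources.inbound_avro.type = avro
collector.sources.inbound_avro.bind = 0.0.0.0
collector.sources.inbound_avro.port = 4545
collector.sources.inbound_avro.ssl = true
collector.sources.inbound_avro.keystore = /etc/flume/keystore.jks
collector.sources.inbound_avro.keystore-password = changeit
```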
I don't know if this helps, but we are seeing similar issues with Flume
using log4j2 (not log4j v1, as used by ES). For Tomcat-hosted servlets,
Flume failover works fine. But for non-Tomcat applications (such as looping
batch-mode applications and Netty-based servers with static main entry
points), we have found that when one of their Flume loggers fails, there is
no failover.
We don't have a solution, but a workaround is to have the non-Tomcat
applications configure their log4j to write to only one Flume agent. If
that agent fails, events are queued until it comes back up. No failover,
but no data loss either.
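For concreteness, the single-agent workaround would look something like this in log4j2 terms, assuming the log4j2 FlumeAppender in `persistent` mode (which spools events to a local data directory while the agent is down and drains it when the agent returns). Host, port, and paths here are hypothetical:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<Configuration>
  <Appenders>
    <!-- One agent only, so no failover list. "persistent" mode writes
         events to a local store until the agent is reachable again. -->
    <Flume name="flume" type="persistent" dataDir="/var/spool/flume-queue">
      <Agent host="flume-agent-1" port="4141"/>
    </Flume>
  </Appenders>
  <Loggers>
    <Root level="info">
      <AppenderRef ref="flume"/>
    </Root>
  </Loggers>
</Configuration>
```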