Flume-NG ElasticSearch Sink Backing up @ Midnight


(Matt Wise) #1

We use Flume 1.4 to pass logs into HDFS as well as ElasticSearch for
storage. The pipeline looks roughly like this:

Client to Server Flow...
(local_app -> local_host_flume_agent) ---- AVRO/SSL ---->
(remote_flume_agent)...

Agent Server Flow ...
(inbound avro -> FC1 -> ElasticSearch)
(inbound avro -> FC2 -> S3/HDFS)

In the last week we've made a few changes and now we're seeing a bit of a
problem. We've seen three separate occurrences of a single Flume agent
server node backing up its FC1 channel indefinitely until we log in and
restart Flume entirely. The data just stops flowing -- we can't find any
errors in the logs on either the ES or Flume side. A simple restart of
Flume fixes it.

Our sink config looks like this:

agent.sinks.elasticsearch.type = org.apache.flume.sink.elasticsearch.ElasticSearchSink
agent.sinks.elasticsearch.hostNames = xxx:9300
agent.sinks.elasticsearch.indexName = flume
agent.sinks.elasticsearch.clusterName = flume-elasticsearch-production-useast1
agent.sinks.elasticsearch.batchSize = 1000
agent.sinks.elasticsearch.ttl = 30
agent.sinks.elasticsearch.serializer = org.apache.flume.sink.elasticsearch.ElasticSearchLogStashEventSerializer
agent.sinks.elasticsearch.channel = fc-unstructured-es

This ONLY happens at midnight, and only on one Flume server. I'm
wondering whether it has to do with the time it takes our ES nodes to
create a new index at the day rollover -- could the first Flume agent
that triggers "index creation" be getting blocked or stuck?
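One way to test that theory would be to pre-create the next day's index
shortly before midnight, so that no Flume agent ever has to trigger index
creation itself. A rough sketch, assuming the sink's default daily naming
scheme (indexName-yyyy-MM-dd) and that the ES nodes also expose HTTP on
port 9200 (both assumptions; adjust for your setup):

```python
# Pre-create tomorrow's daily index so no Flume agent triggers index
# creation at midnight. Intended to run from cron, e.g. at 23:50.
# Assumes the ES sink's default index naming (indexName-yyyy-MM-dd) and
# an ES HTTP endpoint on port 9200 -- both assumptions, adjust as needed.
import urllib.error
import urllib.request
from datetime import date, timedelta

ES_HTTP = "http://xxx:9200"  # placeholder host; match your hostNames


def daily_index_name(prefix: str, day: date) -> str:
    """Build the daily index name the ES sink writes to."""
    return "%s-%s" % (prefix, day.strftime("%Y-%m-%d"))


def precreate_tomorrow(prefix: str = "flume") -> None:
    """PUT tomorrow's index so it already exists when the day rolls over."""
    index = daily_index_name(prefix, date.today() + timedelta(days=1))
    req = urllib.request.Request("%s/%s" % (ES_HTTP, index), method="PUT")
    try:
        urllib.request.urlopen(req, timeout=30)
    except urllib.error.HTTPError as err:
        # Older ES versions return 400 if the index already exists;
        # that's fine for our purposes.
        if err.code != 400:
            raise
```

If the midnight stalls disappear once the index exists ahead of time,
that would point pretty strongly at index creation as the culprit.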

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/e7892490-d2f6-442f-ae25-18b59021e7e4%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Matt Wise) #2

One additional thing: we actually have two ES sinks pointing to the same
cluster. The flow looks more like this:
(inbound avro -> FC1 -> ElasticSearch)
(inbound avro -> FC2 -> S3/HDFS)
(inbound avro_2 -> FC3 -> ElasticSearch)
(inbound avro_2 -> FC4 -> S3/HDFS)

On Thursday, April 10, 2014 9:03:25 AM UTC-7, Matt wrote:

We use Flume 1.4 to pass logs into HDFS as well as ElasticSearch for
storage. The pipeline looks roughly like this:

Client to Server Flow...
(local_app -> local_host_flume_agent) ---- AVRO/SSL ---->
(remote_flume_agent)...

Agent Server Flow ...
(inbound avro -> FC1 -> ElasticSearch)
(inbound avro -> FC2 -> S3/HDFS)

In the last week we've made a few changes and now we're seeing a bit of a
problem. We'e seen 3 different occurrences of a single flume agent server
node beginning to back up its FC1 channel indefinitely until we log in and
restart Flume entirely. The data just stops flowing -- we can't find any
errors in the logs on either the ES or Flume side. A simple restart of
Flume fixes it.

Our sink config looks like this:

agent.sinks.elasticsearch.type =
org.apache.flume.sink.elasticsearch.ElasticSearchSink
agent.sinks.elasticsearch.hostNames = xxx:9300
agent.sinks.elasticsearch.indexName = flume
agent.sinks.elasticsearch.clusterName =
flume-elasticsearch-production-useast1
agent.sinks.elasticsearch.batchSize = 1000
agent.sinks.elasticsearch.ttl = 30
agent.sinks.elasticsearch.serializer =
org.apache.flume.sink.elasticsearch.ElasticSearchLogStashEventSerializer
agent.sinks.elasticsearch.channel = fc-unstructured-es

This ONLY happens at Midnight, and only happens on one flume server. I'm
wondering whether it has to do with the time it takes our ES nodes to
create a new index ... and the first flume agent that triggers "index
creation" could be getting blocked or stuck?

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/3056add8-40e8-4156-b8d6-834f34baf8c0%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


(Brian Yoder) #3

Matt,

I don't know if this helps, but we are seeing similar issues with Flume
using log4j2 (not log4j v1, as used by ES). For Tomcat-hosted servlets,
Flume failover works fine. But for non-Tomcat applications (such as looping
batch-mode applications and Netty-based servers with static main entry
points), we have found that when one of their Flume loggers fails, there is
no failover.

We don't have a solution, but a workaround is to have the non-Tomcat
applications configure their log4j to write to only one Flume agent. If
that agent fails, events are queued until it comes back up. No failover,
but no data loss either.
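For reference, that single-agent workaround in a log4j2 config looks
roughly like this (host, port, and appender name are placeholders, not our
production values -- a sketch of the log4j2 FlumeAppender, not our exact
config):

```xml
<!-- Sketch: log4j2 FlumeAppender pointed at a single local agent.
     With only one Agent listed, there is nothing to fail over to;
     events queue until the agent comes back. -->
<Appenders>
  <Flume name="flumeLogger" compress="true" type="Avro">
    <Agent host="localhost" port="41414"/>
  </Flume>
</Appenders>
```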

Brian



(system) #4