What is the best way to collect YARN application logs from HDFS?

We have a dynamic infrastructure for Hadoop, which means the YARN application logs only exist for a limited period of time.

Currently we are using Splunk HadoopConnect to ingest those logs as a live feed into Splunk. However, this requires installing the full Splunk server on the cluster, which is not only resource-intensive but also inefficient to do every time we spin up a dynamic Hadoop cluster.

Do Elasticsearch, Logstash, etc. have an alternative to HadoopConnect that could be used to collect the YARN application logs out of HDFS and feed them into the ELK stack?

I don't have much experience with Splunk's HadoopConnect, or know much about what it even is, but in terms of collecting log files I would suggest starting a Filebeat alongside your NodeManager instances as they come online and tearing it down as they go offline.

Granted, this means running a data shipper process on every node you want to collect data from (my recollection is that you can get the YARN application logs from the local directories on the NodeManagers/ResourceManagers that launch them, but your setup may differ from what I'm used to). A rough config sketch is below.
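To make that concrete, here is a minimal `filebeat.yml` sketch for tailing the per-container log directories on a NodeManager. The log path and the Elasticsearch endpoint are assumptions on my part (check `yarn.nodemanager.log-dirs` in your `yarn-site.xml` for the real location), so treat it as a starting point rather than a drop-in config:

```yaml
# Sketch only -- the paths and hosts below are assumptions, adjust to your cluster.
filebeat.inputs:
  - type: log
    enabled: true
    # YARN container logs (stdout/stderr/syslog) land under the NodeManager's
    # local log directory before log aggregation moves them to HDFS; the
    # directory below is a common default, but check yarn.nodemanager.log-dirs.
    paths:
      - /var/log/hadoop-yarn/containers/application_*/container_*/*

output.elasticsearch:
  # Hypothetical endpoint -- point this at your own cluster, or swap in
  # output.logstash if you want to parse the lines in Logstash first.
  hosts: ["http://elasticsearch.example.com:9200"]
```

Since the Filebeat lives and dies with the NodeManager, there is nothing extra to clean up when a dynamic cluster is torn down, and whatever was shipped before teardown stays in Elasticsearch.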

Additionally, one of the engineers at Elastic has published a tool called FSCrawler, which has a blurb about indexing data through an HDFS NFS Gateway. Maybe that would be easier to implement than running a sidecar data shipper on dynamic infrastructure? A sketch of what that could look like is below.
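As a rough illustration of that route: mount HDFS through the NFS Gateway (the `/mnt/hdfs` mount point below is an assumption) and point an FSCrawler job at the aggregated-log directory (YARN's `yarn.nodemanager.remote-app-log-dir` defaults to `/tmp/logs`). The job settings below follow the general shape of an FSCrawler `_settings.yaml`, but please check the FSCrawler docs for the exact schema of the release you use:

```yaml
# ~/.fscrawler/yarn_logs/_settings.yaml -- sketch only, all values are assumptions.
name: "yarn_logs"
fs:
  # Aggregated YARN application logs in HDFS, seen through the NFS Gateway mount.
  url: "/mnt/hdfs/tmp/logs"
  update_rate: "5m"
elasticsearch:
  nodes:
    - url: "http://elasticsearch.example.com:9200"
  index: "yarn-application-logs"
```

The trade-off is that this only sees logs after YARN's log aggregation has copied them to HDFS, whereas the Filebeat approach ships them as they are written on the NodeManagers.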
