Disk and file system usage collection using Logstash

Hi All,

Looking for some help implementing disk and file system (files and directories) usage collection with Logstash and storage in Elasticsearch.

Basically, we have data processing workflows for which we want to evaluate server performance alongside data access and data storage. We are using Topbeat with Logstash, Elasticsearch and Kibana, and it would be good to have a comprehensive time series data set that can all be easily analysed. We will have application logs to store along with the disk and file system usage metrics (and/or logs).

Collectd (df and disk plugins) is already storing data in InfluxDB for redundancy.

Have looked at the 'df' and 'du' commands and they seem to produce the needed information, which could be scripted, logged, parsed and collected. Not sure if this is the way to go.

Hope there are plenty of operations people out there that have addressed this use case.

Grateful for assistance.

Why not configure collectd to send data to Logstash as well?
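
Roughly something like this in collectd.conf, assuming the network plugin and a placeholder hostname (25826 is the usual collectd network port, matching the Logstash collectd codec examples):

LoadPlugin network
<Plugin network>
  # Send metrics to the Logstash server listening with the collectd codec
  Server "logstash.example.com" "25826"
</Plugin>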

Good suggestion. Must have missed that for some reason, will do that next.

The other thing is about monitoring all of the data files on the file system. Not sure if collectd does that?

'du /data/ -a --time' seems to give a good output for all files, with size and modification time.

Any suggestions for that?

The other thing is about monitoring all of the data files on the file system.

Logging the size of every file in the file system? No, I don't believe collectd does that. Perhaps you can use Logstash's exec input to run a small script, or write a collectd plugin.
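
A rough sketch of the exec route (the path, interval and field names are just placeholders, not a tested setup):

input {
  exec {
    # Run du periodically; the whole command output arrives as a single event
    command => "du -a --time /data/"
    interval => 900
    type => "du"
  }
}
filter {
  if [type] == "du" {
    # Split the output into one event per line
    split { }
    # Each line looks like: <size in KB><TAB><modification time><TAB><path>
    grok {
      match => { "message" => "^%{NUMBER:size_kb:int}\t%{TIMESTAMP_ISO8601:mtime}\t%{GREEDYDATA:path}$" }
    }
  }
}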

Thanks again, sounds like I will need to do a little bit of work to get the file usage information collected.

Might start with a small script and have a look at writing a collectd plugin.

Looking for a relatively simple and elegant solution that allows the metrics to be captured so we can measure data growth and report on it.

Guessing there is nothing out of the box for Beats?

Okay, so I added the basic configuration settings to /etc/logstash/conf.d/logstash.conf:

https://www.elastic.co/guide/en/logstash/current/plugins-codecs-collectd.html

input {
  beats {
    port => 5044
  }
  udp {
    port => 25826
    buffer_size => 1452
    codec => collectd { }
  }
}
output {
  elasticsearch {
    hosts => "localhost:9200"
    manage_template => false
    index => "%{[@metadata][beat]}-%{+YYYY.MM.dd}"
    document_type => "%{[@metadata][type]}"
  }
}

Stopped and started the Logstash service, but could not see an index in Elasticsearch?

Logstash seems to be running fine.

Do we need to create an index for collectd in Elasticsearch?

Topbeat and Kibana created the indices automatically.

Collectd is running fine and storing data to InfluxDB.

Not sure what to do next?

index => "%{[@metadata][beat]}-%{+YYYY.MM.dd}"

This isn't a very good idea unless you only have Beats-based inputs. The field you reference here won't be set for events from the udp input so the index name will be e.g. %{[@metadata][beat]}-2016.03.09.

document_type => "%{[@metadata][type]}"

Same thing here.

Disable the elasticsearch output for now and use a simple stdout { codec => rubydebug } output. Once things look as you expect, try enabling the ES output again.
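
i.e. something along these lines:

output {
  # Print events to the console for inspection instead of indexing them
  stdout { codec => rubydebug }
}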

Okay, great. Thank you, will test now.

Should the syslog plugin create the index automatically in Elasticsearch?

Should the syslog plugin create the index automatically in Elasticsearch?

Input plugins never create any indexes. It's always the elasticsearch output. But yes, indexes will be created as necessary.

Right, yes of course.

Okay, it now looks like data is being stored in the logstash index.

yellow open logstash-2016.03.09 5 1 3685 0 3.1mb 3.1mb
yellow open logstash-2016.03.09 5 1 7300 0 3.9mb 3.9mb

Will that be the case for most of the Logstash plugins (syslog, collectd, exec)?

I'm not sure what you're asking. The input plugins are agnostic about the outputs. If your elasticsearch output is configured to send all events to logstash-%{+YYYY.MM.dd} then that's where all events will go.

Okay, how can I configure Logstash to send each input plugin's events to its own index?

e.g.

syslog plugin to syslog index
collectd plugin to collectd index
exec plugin to exec index

Is that a sensible thing to do?

It seems that would be a good way to easily monitor index growth and operation for each plugin, and to make it easier to tune each plugin for the amount of data collected and the frequency of collection.

Okay, how can I configure Logstash to send each input plugin's events to its own index?

See elasticsearch - Make logstash add different inputs to different indices - Stack Overflow.
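
The gist of it is to set a type (or a tag) in each input and route on it in the output, roughly like this (the index names are just examples):

output {
  if [type] == "syslog" {
    elasticsearch {
      hosts => "localhost:9200"
      index => "syslog-%{+YYYY.MM.dd}"
    }
  } else if [type] == "collectd" {
    elasticsearch {
      hosts => "localhost:9200"
      index => "collectd-%{+YYYY.MM.dd}"
    }
  } else {
    elasticsearch {
      hosts => "localhost:9200"
    }
  }
}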

Is that a sensible thing to do?

It's recommended by the Elastic folks, but keep in mind that having too many shards per ES node is a bad idea. It's even more important that you review the default shard count of five.
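
For example, an index template along these lines would lower the shard count for indices created after it (a sketch against the template API of that Elasticsearch version; the template name and pattern are placeholders):

PUT /_template/fewer_shards
{
  "template": "collectd-*",
  "settings": {
    "number_of_shards": 1
  }
}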

Excellent. Will follow the Stack Overflow post - that looks reasonably self-explanatory.

Please could you provide a reference for the default shard count and recommended settings?

It's a steep learning curve; appreciate your support (first class).


I think the Designing for Scale chapter of The Definitive Guide will answer most of your questions.

Thanks again, will delve into the chapter and get on top of that topic.

Will update on the plugin index configurations after testing.

Not seeing the syslog index appear, does this look okay?

input {
  beats {
    port => 5044
  }
  tcp {
    port => 5000
    type => syslog
  }
  udp {
    port => 5000
    type => syslog
  }
}
filter {
  if [type] == "syslog" {
    grok {
      match => { "message" => "%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}" }
      add_field => [ "received_at", "%{@timestamp}" ]
      add_field => [ "received_from", "%{host}" ]
    }
    date {
      match => [ "syslog_timestamp", "MMM  d HH:mm:ss", "MMM dd HH:mm:ss" ]
    }
  }
}
output {
  if [type] == "syslog" {
    elasticsearch {
      hosts => "localhost:9200"
      index => "syslog"
    }
  } else {
    elasticsearch {
      hosts => "localhost:9200"
      manage_template => false
    }
  }
}

Looks okay. Check the Logstash logs.
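
If the logs don't show anything, you could also push a test line at the tcp input and check whether a document turns up in the syslog index, e.g.:

# hostname and program name below are just placeholders
echo "Mar  9 12:00:00 myhost myprogram[123]: test message" | nc localhost 5000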