Disk and file system usage collection using Logstash

Hi All,

Looking for some help implementing disk and file system (files and directories) usage collection with Logstash and storage in Elasticsearch.

Basically, we have data processing workflows for which we want to evaluate server performance alongside data access and data storage. We are using Topbeat with Logstash, Elasticsearch and Kibana, and it would be good to have a comprehensive time series data set that can all be easily analysed. We will have application logs to store along with the disk and file system usage metrics (and/or logs).

Collectd (df and disk plugins) is already storing data in InfluxDB for redundancy.

Have looked at the 'df' and 'du' commands and they seem to produce the needed information, which could be scripted, logged, parsed and collected. Not sure if this is the way to go.

Hope there are plenty of operations people out there that have addressed this use case.

Grateful for assistance.

Why not configure collectd to send data to Logstash as well?
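
Roughly something like this in collectd.conf, assuming the network plugin and a placeholder hostname (25826 is the usual collectd network port, matching the Logstash collectd codec examples):

LoadPlugin network
<Plugin network>
  # Send metrics to the Logstash server listening with the collectd codec
  Server "logstash.example.com" "25826"
</Plugin>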

Good suggestion. Must have missed that for some reason, will do that next.

The other thing is about monitoring all of the data files on the file system. Not sure if collectd does that?

'du /data/ -a --time' seems to give a good output for all files, with size and modification time.

Any suggestions for that?

The other thing is about monitoring all of the data files on the file system.

Logging the size of every file in the file system? No, I don't believe collectd does that. Perhaps you can use Logstash's exec input to run a small script, or write a collectd plugin.
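
A rough sketch of the exec route (the path, interval and field names are just placeholders, not a tested setup):

input {
  exec {
    # Run du periodically; the whole command output arrives as a single event
    command => "du -a --time /data/"
    interval => 900
    type => "du"
  }
}
filter {
  if [type] == "du" {
    # Split the output into one event per line
    split { }
    # Each line looks like: <size in KB><TAB><modification time><TAB><path>
    grok {
      match => { "message" => "^%{NUMBER:size_kb:int}\t%{TIMESTAMP_ISO8601:mtime}\t%{GREEDYDATA:path}$" }
    }
  }
}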

Thanks again, sounds like I will need to do a little bit of work to get the file usage information collected.

Might start with a small script and have a look at writing a collectd plugin.

Looking for a relatively simple and elegant solution that allows the metrics to be captured so we can measure data growth and report on it.

Guessing there is nothing out of the box for Beats?

Okay, so I added the basic configuration settings to /etc/logstash/conf.d/logstash.conf:

https://www.elastic.co/guide/en/logstash/current/plugins-codecs-collectd.html

input {
  beats {
    port => 5044
  }
  udp {
    port => 25826
    buffer_size => 1452
    codec => collectd { }
  }
}
output {
  elasticsearch {
    hosts => "localhost:9200"
    manage_template => false
    index => "%{[@metadata][beat]}-%{+YYYY.MM.dd}"
    document_type => "%{[@metadata][type]}"
  }
}

Stopped and started the Logstash service, but could not see an index in Elasticsearch?

Logstash seems to be running fine.

Do we need to create an index for collectd in Elasticsearch?

Topbeat and Kibana created the indices automatically.

Collectd is running fine and storing data to InfluxDB.

Not sure what to do next?

index => "%{[@metadata][beat]}-%{+YYYY.MM.dd}"

This isn't a very good idea unless you only have Beats-based inputs. The field you reference here won't be set for events from the udp input so the index name will be e.g. %{[@metadata][beat]}-2016.03.09.

document_type => "%{[@metadata][type]}"

Same thing here.

Disable the elasticsearch output for now and use a simple stdout { codec => rubydebug } output. Once things look as you expect, try enabling the ES output again.
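
i.e. something along these lines:

output {
  # Print events to the console for inspection instead of indexing them
  stdout { codec => rubydebug }
}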

Okay, great. Thank you, will test now.

Should the syslog plugin create the index automatically in Elasticsearch?

Should the syslog plugin create the index automatically in Elasticsearch?

Input plugins never create any indexes. It's always the elasticsearch output. But yes, indexes will be created as necessary.

Right, yes of course.

Okay, it now looks like data is being stored in the logstash index.

yellow open logstash-2016.03.09 5 1 3685 0 3.1mb 3.1mb
yellow open logstash-2016.03.09 5 1 7300 0 3.9mb 3.9mb

Will that be the case for most of the Logstash plugins (syslog, collectd, exec)?

I'm not sure what you're asking. The input plugins are agnostic about the outputs. If your elasticsearch output is configured to send all events to logstash-%{+YYYY.MM.dd} then that's where all events will go.

Okay, how can I configure Logstash to send each input plugin's events to its own index?

e.g.

syslog plugin to syslog index
collectd plugin to collectd index
exec plugin to exec index

Is that a sensible thing to do?

It seems that would be a good way to easily monitor index growth and operation for each plugin, and to make it easier to tune each plugin for the amount of data collected and the frequency of collection.

Okay, how can I configure Logstash to send each input plugin's events to its own index?

See elasticsearch - Make logstash add different inputs to different indices - Stack Overflow.
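
The gist of it is to set a type (or a tag) in each input and route on it in the output, roughly like this (the index names are just examples):

output {
  if [type] == "syslog" {
    elasticsearch {
      hosts => "localhost:9200"
      index => "syslog-%{+YYYY.MM.dd}"
    }
  } else if [type] == "collectd" {
    elasticsearch {
      hosts => "localhost:9200"
      index => "collectd-%{+YYYY.MM.dd}"
    }
  } else {
    elasticsearch {
      hosts => "localhost:9200"
    }
  }
}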

Is that a sensible thing to do?

It's recommended by the Elastic folks, but keep in mind that having too many shards per ES node is a bad idea. It's even more important that you review the default shard count of five.
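
For example, an index template along these lines would lower the shard count for indices created after it (a sketch against the template API of that Elasticsearch version; the template name and pattern are placeholders):

PUT /_template/fewer_shards
{
  "template": "collectd-*",
  "settings": {
    "number_of_shards": 1
  }
}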

Excellent. Will follow the Stack Overflow post - that looks reasonably self-explanatory.

Please could you provide a reference for the default shard count and recommended settings?

It's a steep learning curve; appreciate your support (first class).


I think the Designing for Scale chapter of The Definitive Guide will answer most of your questions.

Thanks again, will delve into the chapter and get on top of that topic.

Will update on the plugin index configurations after testing.

Not seeing the syslog index appear, does this look okay?

input {
  beats {
    port => 5044
  }
  tcp {
    port => 5000
    type => syslog
  }
  udp {
    port => 5000
    type => syslog
  }
}
filter {
  if [type] == "syslog" {
    grok {
      match => { "message" => "%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:\[%{POSINT:syslog_pid}\])?: %{GREEDYDATA:syslog_message}" }
      add_field => [ "received_at", "%{@timestamp}" ]
      add_field => [ "received_from", "%{host}" ]
    }
    date {
      match => [ "syslog_timestamp", "MMM  d HH:mm:ss", "MMM dd HH:mm:ss" ]
    }
  }
}
output {
  if [type] == "syslog" {
    elasticsearch {
      hosts => "localhost:9200"
      index => "syslog"
    }
  } else {
    elasticsearch {
      hosts => "localhost:9200"
      manage_template => false
    }
  }
}

Looks okay. Check the Logstash logs.
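
If the logs don't show anything, you could also push a test line at the tcp input and check whether a document turns up in the syslog index, e.g.:

# hostname and program name below are just placeholders
echo "Mar  9 12:00:00 myhost myprogram[123]: test message" | nc localhost 5000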