Hi all…
We've been using the ELK stack via a Docker container (sebp/elk) for some time alongside ElastAlert to monitor a number of virtual hosts in our care via collectd, and until very recently it's been working well.
We were running Elasticsearch 2.3.4/Logstash 2.3.4/Kibana 4.5.3… and a week or two back, I updated that container through a number of revisions, migrating everything to Elasticsearch 7.6.1.
The (virtual) host is running Ubuntu 16.04; we just have the one VM running Elasticsearch. collectd was not changed, and things seemed to be working at the time. The latest release for that Ubuntu version (5.5.1-1build2) is installed.
We have a port exposed on the loopback interface from the ELK stack Docker container for collectd to talk to. We have also mounted the directory where collectd keeps its types.db so it can be accessed by logstash.
# docker-compose.yml
services:
  # ELK Stack production configuration
  elk:
    restart: always
    volumes:
      # Mount logstash configuration from the host.
      - /etc/logstash:/etc/logstash
      - /etc/elasticsearch:/etc/elasticsearch
      - /usr/share/collectd:/usr/share/collectd
      - /var/backup/elk:/var/backup
    environment:
      # Heap size should be under 50% of available memory, but should be greater than 1GB
      # https://www.elastic.co/guide/en/elasticsearch/guide/current/_limiting_memory_usage.html
      # https://www.elastic.co/guide/en/elasticsearch/guide/current/heap-sizing.html
      - ES_HEAP_SIZE=1g
    ports:
      - "127.0.0.1:514:5514/udp" # Syslog loopback socket
      - "127.0.0.1:25827:25826/udp" # collectd loopback socket
      # etc … other port mappings here
# collectd.conf
<Plugin network>
  # client setup:
  Server "127.0.0.1" "25826" # ELK stack container
  # server setup:
  <Listen "10.1.1.1" "25826">
    SecurityLevel Encrypt
    AuthFile "/etc/collectd/passwd"
    Interface "eth1"
  </Listen>
  <Listen "10.2.2.2" "25826">
    SecurityLevel Encrypt
    AuthFile "/etc/collectd/passwd"
    Interface "tun0"
  </Listen>
  # proxy setup (client and server as above):
  Forward true
</Plugin>
# logstash/conf.d/04-collectd.conf
input {
  udp {
    host => "0.0.0.0"
    port => 25826
    buffer_size => 1452
    type => "collectd"
    codec => collectd {
      typesdb => "/usr/share/collectd/types.db"
    }
  }
}
We're noticing that whilst most of our instances' collectd stats are making it through to the ELK stack, the host actually running the collectd proxy and the ELK stack itself is not showing up.
When using tcpdump to capture the traffic on lo, all traffic is seen heading into the ELK stack instance, so presumably it is being received by logstash. A frame from one of the nodes which isn't appearing looks like this when analysed by Wireshark:
No. Time Source Destination Protocol Length Info
27 28.899751 127.0.0.1 127.0.0.1 collectd 1373 Host=elkstack.example.com, 27 values for 12 plugins, 0 messages
Frame 27: 1373 bytes on wire (10984 bits), 1373 bytes captured (10984 bits)
…snip…
collectd network data
collectd HOST segment: "elkstack.example.com"
Type: HOST (0x0000)
Length: 32
Host name: elkstack.example.com
collectd TIME_HR segment: May 6, 2020 10:41:03.590266326 EST
Type: TIME_HR (0x0008)
Length: 12
Timestamp: May 6, 2020 10:41:03.590266326 EST
collectd INTERVAL_HR segment: 1 minute
Type: INTERVAL_HR (0x0009)
Length: 12
Interval: 60.000000000 seconds
collectd PLUGIN segment: "df"
Type: PLUGIN (0x0002)
Length: 7
Plugin: df
collectd PLUGIN_INSTANCE segment: "root"
Type: PLUGIN_INSTANCE (0x0003)
Length: 9
Plugin instance: root
collectd TYPE segment: "df_complex"
Type: TYPE (0x0004)
Length: 15
Type: df_complex
collectd TYPE_INSTANCE segment: "used"
Type: TYPE_INSTANCE (0x0005)
Length: 9
Type instance: used
collectd VALUES segment: 1 value
Type: VALUES (0x0006)
Length: 15
Value count: 1
1 value
Gauge: 2.94918e+11
Value type: GAUGE (0x01)
Gauge value: 294917779456
[Assembled metric]
Host name: elkstack.example.com
Plugin: df
Plugin instance: root
Type: df_complex
Type instance: used
Timestamp: May 6, 2020 10:41:03.590266326 EST
Interval: 60.000000000 seconds
collectd TIME_HR segment: May 6, 2020 10:41:03.589932766 EST
Type: TIME_HR (0x0008)
Length: 12
Timestamp: May 6, 2020 10:41:03.589932766 EST
collectd PLUGIN segment: "memory"
Type: PLUGIN (0x0002)
Length: 11
Plugin: memory
collectd PLUGIN_INSTANCE segment: ""
Type: PLUGIN_INSTANCE (0x0003)
Length: 5
Plugin instance:
collectd TYPE segment: "percent"
Type: TYPE (0x0004)
Length: 12
Type: percent
collectd VALUES segment: 1 value
Type: VALUES (0x0006)
Length: 15
Value count: 1
1 value
Gauge: 38.871
Value type: GAUGE (0x01)
Gauge value: 38.8709877262965
[Assembled metric]
Host name: elkstack.example.com
Plugin: memory
Plugin instance:
Type: percent
Type instance: used
Timestamp: May 6, 2020 10:41:03.589932766 EST
Interval: 60.000000000 seconds
collectd TIME_HR segment: May 6, 2020 10:41:03.590267718 EST
Type: TIME_HR (0x0008)
Length: 12
Timestamp: May 6, 2020 10:41:03.590267718 EST
collectd PLUGIN segment: "df"
Type: PLUGIN (0x0002)
Length: 7
Plugin: df
collectd PLUGIN_INSTANCE segment: "root"
Type: PLUGIN_INSTANCE (0x0003)
Length: 9
Plugin instance: root
collectd TYPE segment: "percent_bytes"
Type: TYPE (0x0004)
Length: 18
Type: percent_bytes
collectd TYPE_INSTANCE segment: "reserved"
Type: TYPE_INSTANCE (0x0005)
Length: 13
Type instance: reserved
collectd VALUES segment: 1 value
Type: VALUES (0x0006)
Length: 15
Value count: 1
1 value
Gauge: 5.08306
Value type: GAUGE (0x01)
Gauge value: 5.08306264877319
[Assembled metric]
Host name: elkstack.example.com
Plugin: df
Plugin instance: root
Type: percent_bytes
Type instance: reserved
Timestamp: May 6, 2020 10:41:03.590267718 EST
Interval: 60.000000000 seconds
collectd TIME_HR segment: May 6, 2020 10:41:03.590267107 EST
Type: TIME_HR (0x0008)
Length: 12
Timestamp: May 6, 2020 10:41:03.590267107 EST
collectd TYPE_INSTANCE segment: "free"
Type: TYPE_INSTANCE (0x0005)
Length: 9
Type instance: free
collectd VALUES segment: 1 value
Type: VALUES (0x0006)
Length: 15
Value count: 1
1 value
Gauge: 7.73987
Value type: GAUGE (0x01)
Gauge value: 7.73987340927124
[Assembled metric]
Host name: elkstack.example.com
Plugin: df
Plugin instance: root
Type: percent_bytes
Type instance: free
Timestamp: May 6, 2020 10:41:03.590267107 EST
Interval: 60.000000000 seconds
collectd TIME_HR segment: May 6, 2020 10:41:03.590268296 EST
Type: TIME_HR (0x0008)
Length: 12
Timestamp: May 6, 2020 10:41:03.590268296 EST
collectd TYPE_INSTANCE segment: "used"
Type: TYPE_INSTANCE (0x0005)
Length: 9
Type instance: used
collectd VALUES segment: 1 value
Type: VALUES (0x0006)
Length: 15
Value count: 1
1 value
Gauge: 87.1771
Value type: GAUGE (0x01)
Gauge value: 87.1770553588867
[Assembled metric]
Host name: elkstack.example.com
Plugin: df
Plugin instance: root
Type: percent_bytes
Type instance: used
Timestamp: May 6, 2020 10:41:03.590268296 EST
Interval: 60.000000000 seconds
collectd TIME_HR segment: May 6, 2020 10:41:03.589932766 EST
Type: TIME_HR (0x0008)
Length: 12
Timestamp: May 6, 2020 10:41:03.589932766 EST
collectd PLUGIN segment: "memory"
Type: PLUGIN (0x0002)
Length: 11
Plugin: memory
collectd PLUGIN_INSTANCE segment: ""
Type: PLUGIN_INSTANCE (0x0003)
Length: 5
Plugin instance:
collectd TYPE segment: "percent"
Type: TYPE (0x0004)
Length: 12
Type: percent
collectd TYPE_INSTANCE segment: "slab_unrecl"
Type: TYPE_INSTANCE (0x0005)
Length: 16
Type instance: slab_unrecl
collectd VALUES segment: 1 value
Type: VALUES (0x0006)
Length: 15
Value count: 1
1 value
Gauge: 0.379304
Value type: GAUGE (0x01)
Gauge value: 0.379303957034613
[Assembled metric]
Host name: elkstack.example.com
Plugin: memory
Plugin instance:
Type: percent
Type instance: slab_unrecl
Timestamp: May 6, 2020 10:41:03.589932766 EST
Interval: 60.000000000 seconds
collectd HOST segment: "customer1.example.com"
Type: HOST (0x0000)
Length: 25
Host name: customer1.example.com
collectd TIME_HR segment: May 6, 2020 10:40:05.514224234 EST
… snip more data because forum software couldn't take it …
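As a cross-check that depends on neither Wireshark nor logstash, the part layout visible in the dump above (a 2-byte type, a 2-byte length that includes the 4-byte header, then the payload) can be decoded in a few lines of Python. This is only a minimal sketch of the collectd binary protocol as I understand it, not the parser that collectd or the logstash codec actually uses. The name `parse_collectd` and the lookup tables are made up for illustration, and it assumes the packet is neither signed nor encrypted.

```python
import struct

# Part type codes as shown in the Wireshark dissection above.
STRING_PARTS = {
    0x0000: "host",
    0x0002: "plugin",
    0x0003: "plugin_instance",
    0x0004: "type",
    0x0005: "type_instance",
}
HR_PARTS = {0x0008: "time", 0x0009: "interval"}  # units of 2**-30 seconds
VALUES_PART = 0x0006
# Data source kinds: COUNTER, GAUGE (little-endian double), DERIVE, ABSOLUTE.
VALUE_FORMATS = {0: ">Q", 1: "<d", 2: ">q", 3: ">Q"}


def parse_collectd(packet):
    """Yield one dict per VALUES part (hypothetical helper).

    Host, plugin, type and so on are sent only when they change, so the
    most recently seen value of each field is carried forward.
    """
    state = {}
    offset = 0
    while offset + 4 <= len(packet):
        part_type, length = struct.unpack_from(">HH", packet, offset)
        if length < 4 or offset + length > len(packet):
            break  # truncated or malformed part: stop rather than guess
        body = packet[offset + 4:offset + length]
        if part_type in STRING_PARTS:
            text = body.rstrip(b"\0").decode("ascii", "replace")
            state[STRING_PARTS[part_type]] = text
        elif part_type in HR_PARTS:
            (raw,) = struct.unpack_from(">Q", body, 0)
            state[HR_PARTS[part_type]] = raw / 2**30
        elif part_type == VALUES_PART:
            (count,) = struct.unpack_from(">H", body, 0)
            kinds = body[2:2 + count]
            values = [
                struct.unpack_from(VALUE_FORMATS[kind], body, 2 + count + 8 * i)[0]
                for i, kind in enumerate(kinds)
            ]
            yield dict(state, values=values)
        # Other part types (messages, signatures, ...) are skipped.
        offset += length
```

Fed the UDP payload of a captured frame (the bytes after the UDP header), it should yield one dict per metric, which can be compared against the "[Assembled metric]" entries above and against whatever logstash ends up indexing.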
The data for customer1.example.com does show up, but not elkstack.example.com. I have a suspicion this is issue #13 rearing its ugly head, but I'm not sure.
Is there a way I can get logstash to report what it sees from collectd?