`collectd` codec intermittently dropping some telemetry information

Hi all…

We've been using the ELK stack in a Docker container (sebp/elk) for some time, alongside ElastAlert, to monitor a number of virtual hosts in our care via collectd. Until very recently it has been working well.

We were running Elasticsearch 2.3.4/Logstash 2.3.4/Kibana 4.5.3, and a week or two back I updated that container through a number of revisions, migrating everything to Elasticsearch 7.6.1.

The (virtual) host is running Ubuntu 16.04; we have just the one VM running Elasticsearch. collectd was not changed, and things seemed to be working at the time. The latest collectd release for that Ubuntu version (5.5.1-1build2) is installed.

We have a port exposed on the loopback interface from the ELK stack docker container for collectd to talk to. We have also mounted the directory where collectd keeps its types.db so it can be accessed by logstash.

# docker-compose.yml
services:
  # ELK Stack production configuration
  elk:
    restart: always
    volumes:
      # Mount logstash configuration from the host.
      - /etc/logstash:/etc/logstash
      - /etc/elasticsearch:/etc/elasticsearch
      - /usr/share/collectd:/usr/share/collectd
      - /var/backup/elk:/var/backup
    environment:
    # Heap size should be under 50% of available memory, but should be greater than 1GB
    # https://www.elastic.co/guide/en/elasticsearch/guide/current/_limiting_memory_usage.html
    # https://www.elastic.co/guide/en/elasticsearch/guide/current/heap-sizing.html
    - ES_HEAP_SIZE=1g
    ports:
    - "127.0.0.1:514:5514/udp"  # Syslog loopback socket
    - "127.0.0.1:25827:25826/udp" # collectd loopback socket
    # etc … other port mappings here

# collectd.conf
<Plugin network>
        # client setup:
        Server "127.0.0.1" "25826" # ELK stack container
        # server setup:
        <Listen "10.1.1.1" "25826">
                SecurityLevel Encrypt
                AuthFile "/etc/collectd/passwd"
                Interface "eth1"
        </Listen>
        <Listen "10.2.2.2" "25826">
                SecurityLevel Encrypt
                AuthFile "/etc/collectd/passwd"
                Interface "tun0"
        </Listen>
        # proxy setup (client and server as above):
        Forward true
</Plugin>

# logstash/conf.d/04-collectd.conf
input {
  udp {
    host => "0.0.0.0"
    port => 25826
    buffer_size => 1452
    type => "collectd"
    codec => collectd {
      typesdb => "/usr/share/collectd/types.db"
    }
  }
}

We're noticing that whilst most of our instances' collectd stats are making it through to the ELK stack, the stats for the host actually running the collectd proxy and the ELK stack itself are not showing up.

When using tcpdump to capture the traffic on lo, all traffic is seen heading into the ELK stack instance, so presumably it is being received by Logstash. A frame carrying data from one of the nodes which isn't appearing looks like this when analysed by Wireshark:

No.     Time           Source                Destination           Protocol Length Info
     27 28.899751      127.0.0.1             127.0.0.1             collectd 1373   Host=elkstack.example.com, 27 values for 12 plugins, 0 messages

Frame 27: 1373 bytes on wire (10984 bits), 1373 bytes captured (10984 bits)
…snip…
collectd network data
    collectd HOST segment: "elkstack.example.com"
        Type: HOST (0x0000)
        Length: 32
        Host name: elkstack.example.com
    collectd TIME_HR segment: May  6, 2020 10:41:03.590266326 EST
        Type: TIME_HR (0x0008)
        Length: 12
        Timestamp: May  6, 2020 10:41:03.590266326 EST
    collectd INTERVAL_HR segment: 1 minute
        Type: INTERVAL_HR (0x0009)
        Length: 12
        Interval: 60.000000000 seconds
    collectd PLUGIN segment: "df"
        Type: PLUGIN (0x0002)
        Length: 7
        Plugin: df
    collectd PLUGIN_INSTANCE segment: "root"
        Type: PLUGIN_INSTANCE (0x0003)
        Length: 9
        Plugin instance: root
    collectd TYPE segment: "df_complex"
        Type: TYPE (0x0004)
        Length: 15
        Type: df_complex
    collectd TYPE_INSTANCE segment: "used"
        Type: TYPE_INSTANCE (0x0005)
        Length: 9
        Type instance: used
    collectd VALUES segment: 1 value
        Type: VALUES (0x0006)
        Length: 15
        Value count: 1
        1 value
            Gauge: 2.94918e+11
                Value type: GAUGE (0x01)
                Gauge value: 294917779456
        [Assembled metric]
            Host name: elkstack.example.com
            Plugin: df
            Plugin instance: root
            Type: df_complex
            Type instance: used
            Timestamp: May  6, 2020 10:41:03.590266326 EST
            Interval: 60.000000000 seconds
    collectd TIME_HR segment: May  6, 2020 10:41:03.589932766 EST
        Type: TIME_HR (0x0008)
        Length: 12
        Timestamp: May  6, 2020 10:41:03.589932766 EST
    collectd PLUGIN segment: "memory"
        Type: PLUGIN (0x0002)
        Length: 11
        Plugin: memory
    collectd PLUGIN_INSTANCE segment: ""
        Type: PLUGIN_INSTANCE (0x0003)
        Length: 5
        Plugin instance: 
    collectd TYPE segment: "percent"
        Type: TYPE (0x0004)
        Length: 12
        Type: percent
    collectd VALUES segment: 1 value
        Type: VALUES (0x0006)
        Length: 15
        Value count: 1
        1 value
            Gauge: 38.871
                Value type: GAUGE (0x01)
                Gauge value: 38.8709877262965
        [Assembled metric]
            Host name: elkstack.example.com
            Plugin: memory
            Plugin instance: 
            Type: percent
            Type instance: used
            Timestamp: May  6, 2020 10:41:03.589932766 EST
            Interval: 60.000000000 seconds
    collectd TIME_HR segment: May  6, 2020 10:41:03.590267718 EST
        Type: TIME_HR (0x0008)
        Length: 12
        Timestamp: May  6, 2020 10:41:03.590267718 EST
    collectd PLUGIN segment: "df"
        Type: PLUGIN (0x0002)
        Length: 7
        Plugin: df
    collectd PLUGIN_INSTANCE segment: "root"
        Type: PLUGIN_INSTANCE (0x0003)
        Length: 9
        Plugin instance: root
    collectd TYPE segment: "percent_bytes"
        Type: TYPE (0x0004)
        Length: 18
        Type: percent_bytes
    collectd TYPE_INSTANCE segment: "reserved"
        Type: TYPE_INSTANCE (0x0005)
        Length: 13
        Type instance: reserved
    collectd VALUES segment: 1 value
        Type: VALUES (0x0006)
        Length: 15
        Value count: 1
        1 value
            Gauge: 5.08306
                Value type: GAUGE (0x01)
                Gauge value: 5.08306264877319
        [Assembled metric]
            Host name: elkstack.example.com
            Plugin: df
            Plugin instance: root
            Type: percent_bytes
            Type instance: reserved
            Timestamp: May  6, 2020 10:41:03.590267718 EST
            Interval: 60.000000000 seconds
    collectd TIME_HR segment: May  6, 2020 10:41:03.590267107 EST
        Type: TIME_HR (0x0008)
        Length: 12
        Timestamp: May  6, 2020 10:41:03.590267107 EST
    collectd TYPE_INSTANCE segment: "free"
        Type: TYPE_INSTANCE (0x0005)
        Length: 9
        Type instance: free
    collectd VALUES segment: 1 value
        Type: VALUES (0x0006)
        Length: 15
        Value count: 1
        1 value
            Gauge: 7.73987
                Value type: GAUGE (0x01)
                Gauge value: 7.73987340927124
        [Assembled metric]
            Host name: elkstack.example.com
            Plugin: df
            Plugin instance: root
            Type: percent_bytes
            Type instance: free
            Timestamp: May  6, 2020 10:41:03.590267107 EST
            Interval: 60.000000000 seconds
    collectd TIME_HR segment: May  6, 2020 10:41:03.590268296 EST
        Type: TIME_HR (0x0008)
        Length: 12
        Timestamp: May  6, 2020 10:41:03.590268296 EST
    collectd TYPE_INSTANCE segment: "used"
        Type: TYPE_INSTANCE (0x0005)
        Length: 9
        Type instance: used
    collectd VALUES segment: 1 value
        Type: VALUES (0x0006)
        Length: 15
        Value count: 1
        1 value
            Gauge: 87.1771
                Value type: GAUGE (0x01)
                Gauge value: 87.1770553588867
        [Assembled metric]
            Host name: elkstack.example.com
            Plugin: df
            Plugin instance: root
            Type: percent_bytes
            Type instance: used
            Timestamp: May  6, 2020 10:41:03.590268296 EST
            Interval: 60.000000000 seconds
    collectd TIME_HR segment: May  6, 2020 10:41:03.589932766 EST
        Type: TIME_HR (0x0008)
        Length: 12
        Timestamp: May  6, 2020 10:41:03.589932766 EST
    collectd PLUGIN segment: "memory"
        Type: PLUGIN (0x0002)
        Length: 11
        Plugin: memory
    collectd PLUGIN_INSTANCE segment: ""
        Type: PLUGIN_INSTANCE (0x0003)
        Length: 5
        Plugin instance: 
    collectd TYPE segment: "percent"
        Type: TYPE (0x0004)
        Length: 12
        Type: percent
    collectd TYPE_INSTANCE segment: "slab_unrecl"
        Type: TYPE_INSTANCE (0x0005)
        Length: 16
        Type instance: slab_unrecl
    collectd VALUES segment: 1 value
        Type: VALUES (0x0006)
        Length: 15
        Value count: 1
        1 value
            Gauge: 0.379304
                Value type: GAUGE (0x01)
                Gauge value: 0.379303957034613
        [Assembled metric]
            Host name: elkstack.example.com
            Plugin: memory
            Plugin instance: 
            Type: percent
            Type instance: slab_unrecl
            Timestamp: May  6, 2020 10:41:03.589932766 EST
            Interval: 60.000000000 seconds
    collectd HOST segment: "customer1.example.com"
        Type: HOST (0x0000)
        Length: 25
        Host name: customer1.example.com
    collectd TIME_HR segment: May  6, 2020 10:40:05.514224234 EST
… snip more data because forum software couldn't take it …
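
For anyone wanting to cross-check a capture like the one above without Wireshark, here is a minimal Python sketch of the collectd binary protocol's part framing as I understand it: each part is a 2-byte type, a 2-byte length that includes the 4-byte header, then the payload, and host/plugin/type parts stay in effect until replaced. This is only an illustration, not how the Logstash codec works internally. The names `decode_parts` and `part` are made up for this example, and signed or encrypted packets are not handled.

```python
import struct

# Part type codes as they appear in the Wireshark dissection.
STRING_PARTS = {
    0x0000: "host",
    0x0002: "plugin",
    0x0003: "plugin_instance",
    0x0004: "type",
    0x0005: "type_instance",
}
VALUES, TIME_HR, INTERVAL_HR = 0x0006, 0x0008, 0x0009

# Data source type -> struct format. Gauges are little-endian doubles,
# the integer types are big-endian.
VALUE_FORMATS = {0x00: "!Q", 0x01: "<d", 0x02: "!q", 0x03: "!Q"}


def decode_parts(datagram):
    """Yield one dict per VALUES part, carrying earlier parts forward as state."""
    state = {}
    offset = 0
    while offset + 4 <= len(datagram):
        part_type, length = struct.unpack_from("!HH", datagram, offset)
        if length < 4 or offset + length > len(datagram):
            raise ValueError(f"bad part length {length} at offset {offset}")
        payload = datagram[offset + 4 : offset + length]
        if part_type in STRING_PARTS:
            # String parts are NUL-terminated.
            text = payload.rstrip(b"\0").decode("ascii", "replace")
            state[STRING_PARTS[part_type]] = text
        elif part_type in (TIME_HR, INTERVAL_HR):
            # High-resolution times are 64-bit counts of 2**-30 seconds.
            key = "time" if part_type == TIME_HR else "interval"
            state[key] = struct.unpack("!Q", payload)[0] / 2**30
        elif part_type == VALUES:
            (count,) = struct.unpack_from("!H", payload, 0)
            kinds = payload[2 : 2 + count]
            values = []
            for i, kind in enumerate(kinds):
                start = 2 + count + 8 * i
                raw = payload[start : start + 8]
                fmt = VALUE_FORMATS.get(kind)
                values.append(struct.unpack(fmt, raw)[0] if fmt else raw)
            yield dict(state, values=values)
        # Any other part type is skipped in this sketch.
        offset += length


def part(code, payload):
    """Build one part: type, total length (header included), payload."""
    return struct.pack("!HH", code, 4 + len(payload)) + payload


# Sample datagram built from the first metric in frame 27.
sample = (
    part(0x0000, b"elkstack.example.com\0")
    + part(INTERVAL_HR, struct.pack("!Q", 60 << 30))
    + part(0x0002, b"df\0")
    + part(0x0003, b"root\0")
    + part(0x0004, b"df_complex\0")
    + part(0x0005, b"used\0")
    + part(VALUES, struct.pack("!HB", 1, 0x01) + struct.pack("<d", 294917779456.0))
)

for metric in decode_parts(sample):
    print(metric)
```

Feeding it the UDP payload of a captured frame should list the same host/plugin/type combinations that Wireshark assembles, which makes it easier to compare against what ends up in Elasticsearch.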

The data for customer1.example.com does show up, but the data for elkstack.example.com does not. I have a suspicion this is issue #13 rearing its ugly head, but I'm not sure.

Is there a way I can get Logstash to report what it sees from collectd?