Filebeat not sending most logs across the network

I changed the output to console and verified that Filebeat is actually examining the logs; there are hundreds of lines that should be sent to Logstash.
The only error messages that journalctl shows are about empty files being ignored.
I've verified with tcpdump that over 30 seconds go by without any messages being sent.
I upgraded from 7.17.16 to 8.12.0 but that had no effect.
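For reference, the check I ran was roughly this (a sketch; port 5044 is our Logstash port, adjust the interface and filter for your setup):

tcpdump -ni any 'dst port 5044'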

There was a burst of traffic at 00:00 UTC of nearly 14k lines in 1 minute, which is a normal amount of traffic, but the next minute it was back down to 210 lines.

Where else can I look to diagnose this?
The problem only started on Wednesday, though I can't find any change that might have stopped the flow of logs. The same configs are working fine in another DC, but one DC is missing millions of log lines.

The log from Filebeat would be useful here; it publishes a "Non-zero metrics in the last 30s" message that contains useful info (events in the queue, events in the output, etc.) which might help determine where the gap is coming from. Is the Filebeat log available for review?
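If Filebeat is running under systemd (I'm assuming a unit named filebeat here), something like this should pull those messages out:

journalctl -u filebeat --since "1 hour ago" | grep "Non-zero metrics"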

 "message": "Non-zero metrics in the last 30s",
  "service.name": "filebeat",
  "monitoring": {
    "metrics": {
      "beat": {
        "cgroup": {
          "memory": {
            "mem": {
              "usage": {
                "bytes": 202440704
              }
            }
          }
        },
        "cpu": {
          "system": {
            "ticks": 2000,
            "time": {
              "ms": 260
            }
          },
          "total": {
            "ticks": 7790,
            "time": {
              "ms": 1260
            },
            "value": 7790
          },
          "user": {
            "ticks": 5790,
            "time": {
              "ms": 1000
            }
          }
        },
        "handles": {
          "limit": {
            "hard": 524288,
            "soft": 524288
          },
          "open": 44
        },
        "info": {
          "ephemeral_id": "2974ea17-de14-4f67-a46d-6056c35daa72",
          "uptime": {
            "ms": 150298
          },
          "version": "8.12.0"
        },
        "memstats": {
          "gc_next": 81762784,
          "memory_alloc": 75316784,
          "memory_total": 919995000,
          "rss": 161542144
        },
        "runtime": {
          "goroutines": 149
        }
      },
      "filebeat": {
        "events": {
          "active": 3203,
          "added": 22401,
          "done": 22400
        },
        "harvester": {
          "open_files": 32,
          "running": 32
        }
      },
      "libbeat": {
        "config": {
          "module": {
            "running": 0
          }
        },
        "output": {
          "events": {
            "acked": 22400,
            "active": 3200,
            "batches": 14,
            "total": 22400
          },
          "read": {
            "bytes": 84
          },
          "write": {
            "bytes": 3446852
          }
        },
        "pipeline": {
          "clients": 32,
          "events": {
            "active": 3203,
            "published": 22400,
            "total": 22401
          },
          "queue": {
            "acked": 22400
          }
        }
      },
      "registrar": {
        "states": {
          "current": 0
        }
      },
      "system": {
        "load": {
          "1": 24.5,
          "15": 22.33,
          "5": 23.43,
          "norm": {
            "1": 0.7656,
            "15": 0.6978,
            "5": 0.7322
          }
        }
      }
    },
    "ecs.version": "1.6.0"
  }
}

It looks like Filebeat thinks everything is fine, but tcpdump disagrees.

I'm not too fluent in Filebeat performance metrics, but it does look to me like Filebeat considers there to be 3200 active events in the output, which could imply it's waiting for an ACK from Logstash.

Can you share two consecutive non-zero metrics messages?

Can you share the Filebeat memory queue and output configuration (it looks like those might be set to non-default values)?
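For reference, non-default values would look something like this in filebeat.yml (the numbers here are purely illustrative, not recommendations):

queue.mem:
  events: 3200
  flush.min_events: 1600
  flush.timeout: 10s

output.logstash:
  hosts: ["your-logstash-host:5044"]
  worker: 2
  bulk_max_size: 1600
  loadbalance: true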

Does the filebeat log contain any timeouts, disconnects, warnings or errors? Does Logstash show any warnings or errors?

Filebeat and Logstash are not reporting any errors.
I have not configured any memory settings in Filebeat.
Output config is very simple.

output.logstash:
  # The Logstash hosts
  hosts:
    - "logstash-0000:5044"
    - "logstash-0001:5044"
    - "logstash-0002:5044"

The next Non-zero message:

 "message": "Non-zero metrics in the last 30s",
  "service.name": "filebeat",
  "monitoring": {
    "metrics": {
      "beat": {
        "cgroup": {
          "memory": {
            "mem": {
              "usage": {
                "bytes": 198402048
              }
            }
          }
        },
        "cpu": {
          "system": {
            "ticks": 2250,
            "time": {
              "ms": 250
            }
          },
          "total": {
            "ticks": 9070,
            "time": {
              "ms": 1280
            },
            "value": 9070
          },
          "user": {
            "ticks": 6820,
            "time": {
              "ms": 1030
            }
          }
        },
        "handles": {
          "limit": {
            "hard": 524288,
            "soft": 524288
          },
          "open": 44
        },
        "info": {
          "ephemeral_id": "2974ea17-de14-4f67-a46d-6056c35daa72",
          "uptime": {
            "ms": 180297
          },
          "version": "8.12.0"
        },
        "memstats": {
          "gc_next": 87276472,
          "memory_alloc": 78972312,
          "memory_total": 1072643064,
          "rss": 154451968
        },
        "runtime": {
          "goroutines": 149
        }
      },
      "filebeat": {
        "events": {
          "active": 3206,
          "added": 22403,
          "done": 22400
        },
        "harvester": {
          "open_files": 32,
          "running": 32
        }
      },
      "libbeat": {
        "config": {
          "module": {
            "running": 0
          }
        },
        "output": {
          "events": {
            "acked": 22400,
            "active": 3200,
            "batches": 14,
            "total": 22400
          },
          "read": {
            "bytes": 84
          },
          "write": {
            "bytes": 3418213
          }
        },
        "pipeline": {
          "clients": 32,
          "events": {
            "active": 3206,
            "published": 22400,
            "total": 22403
          },
          "queue": {
            "acked": 22400
          }
        }
      },
      "registrar": {
        "states": {
          "current": 0
        }
      },
      "system": {
        "load": {
          "1": 25.46,
          "15": 22.48,
          "5": 23.78,
          "norm": {
            "1": 0.7956,
            "15": 0.7025,
            "5": 0.7431
          }
        }
      }
    },
    "ecs.version": "1.6.0"
  }
}

I think looking at the debug logs might shed some light on what's going on, but that is probably best done over a support ticket if possible.

If you can sanitize the debug logs you can post them here, but a support ticket might be the next step.
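If it helps with collecting them, you can run Filebeat in the foreground with all debug selectors enabled, for example (the config path shown is the usual default, adjust to your install):

filebeat -e -d "*" -c /etc/filebeat/filebeat.yml

or set logging.level: debug in filebeat.yml and restart the service.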

Unfortunately we don't have a support contract right now.
How helpful are the debug logs if I strip out all of the actual log lines that should be sent? Those are the ones that look like {"log.level":"debug","@timestamp":"2024-01-22T20:32:01.964Z","log.logger":"processors","log.origin":{"function":"github.com/elastic/beats/v7/libbeat/publisher/processing.debugPrintProcessor.func1","file.name":"processing/processors.go","file.line":213}
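I was planning to drop those with something like this (assuming the debug log is NDJSON; file names are just placeholders):

# keep everything except the per-event "processors" debug lines
jq -c 'select(."log.logger" != "processors")' filebeat-debug.ndjson > sanitized.ndjson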

It might be useful; if you're willing to do that, I can take a look at the resulting log.

You could also try to use a sample log that is not sensitive and see if you can reproduce the same issue.
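A minimal sketch of what that could look like, with a filestream input pointed at a throwaway test file (paths and host are placeholders):

filebeat.inputs:
  - type: filestream
    id: repro-test
    paths:
      - /tmp/filebeat-repro/*.log

output.logstash:
  hosts: ["your-logstash-host:5044"]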

Can you share your Filebeat config stripped of any sensitive info as well?

My Filebeat config can be found at https://f000.backblazeb2.com/file/filebeat-missing-logs/filebeat.yml

The logs can be found at https://f000.backblazeb2.com/file/filebeat-missing-logs/sanitized.ndjson

That results in a 401 unauthorized. I think you should be able to attach the files to your comment.

Oops, wrong bucket permissions, they should be fixed now.

This log covers only about 20s; are you saying that it no longer logs anything after this timespan and stops scraping for logs?

No, that was just one file that got rolled over because of the volume of logs being processed. That's the biggest mystery: logs are being recognized by Filebeat but aren't being sent to Logstash.

I do see many references to Beats sending data to Logstash; in that 20s log I see 4800 events being sent to the 0002 node without any errors.

If you run a live tcpdump, do you really not see any traffic going to Logstash? If you run tcpdump on the Logstash server, do you see the incoming traffic?

I see some traffic making it to Logstash, just not as much as I expect. In that timeframe only 2997 logs actually made it to Logstash.

If you look at the tcpdump, do you see all 4800 events in it? There are no errors related to the Logstash output in the debug logs.

Have you determined that it isn't an issue with Logstash dropping messages?

Getting an odd number of logs in Logstash is very suspicious if Filebeat is sending in batches of 1600.

It is unencrypted traffic; is there any network inspection or proxy between Filebeat and Logstash?

We don't have any proxies between the Filebeat hosts and the Logstash hosts.
I don't believe that it's a problem with Logstash dropping messages because some of our hosts are working fine.

We have checked the switches that both the Logstash servers are on and that the API host I've been focusing on is connected to. No errors reported in over a month of uptime on their part.

I think a filtered tcpdump from the Filebeat client and the logstash server would tell you whether or not the events are actually leaving the Filebeat client and making it to the logstash server.

You can even count the events by looking at the body of the message in the tcpdump to see if the correct number of events are actually leaving Filebeat.
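A sketch of what those filtered captures could look like, assuming port 5044 and substituting your real host names:

# on the Filebeat host: capture everything headed to Logstash
tcpdump -ni any -w beats-out.pcap 'dst port 5044'

# on the Logstash host: capture what arrives from this Filebeat client
tcpdump -ni any -w beats-in.pcap 'src host <filebeat-host> and dst port 5044'

Comparing the packet and byte counts per connection in the two captures should show whether the batches leave the client and whether they arrive intact.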

Filebeat sends each published batch to Logstash as a single large request, so the likelihood that Filebeat thinks it's publishing 3200 events but only 2000 are actually in the request is pretty low. The tcpdump would help you figure out where along the line the events go missing.