Google Compute Engine systemd-resolved errors

I've just installed Packetbeat on our Google Compute Engine instance, and I have been watching a day's worth of visualizations.

The visualization for "Errors vs successful transactions [Packetbeat] ECS" shows a much higher error rate than I expected, around 40-60% errors. The majority of them are from process.name: systemd-resolved, with two error.message values:

  • Another query with the same DNS ID from this client was received so this query was closed without receiving a response
  • Response: received without an associated Query

My google-fu is failing me and I can find nothing with either of these two error messages.

Has anyone else noticed similar problems?
Is this something wrong with my configuration of Packetbeat?
Or is this something I need to investigate systemd-resolved for?

Cheers
Barrie

Based on your description of the two errors, I suspect that systemd-resolved is re-sending DNS requests (maybe when it doesn't receive a DNS response within some timeout). Packetbeat's DNS protocol analyzer has a pretty simple state machine: it receives a request and expects a response. In this case it gets a request, then another request (which overwrites the first, based on request ID), then a DNS response, then another response. Packetbeat has no request left to match that second response with, since the previous response closed the state for that ID.
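To make that sequence concrete, here is a toy model of such a matcher. It is only a sketch of the behaviour described above, not Packetbeat's actual code: the function name, the event tuples, and the short error labels are all made up for illustration, and it keys on the DNS ID alone for simplicity.

```python
# Toy request/response matcher keyed by DNS ID.
# Hypothetical illustration only; this is not Packetbeat's implementation.

def match_simple(events):
    """events: iterable of ("query" | "response", dns_id) tuples.

    Returns the list of errors the matcher would report, in order.
    """
    pending = set()  # DNS IDs that currently have an open query
    errors = []
    for kind, dns_id in events:
        if kind == "query":
            if dns_id in pending:
                # A second query with the same ID replaces the first one.
                errors.append("duplicate query")
            pending.add(dns_id)
        else:
            if dns_id in pending:
                # The first response closes the state for this ID.
                pending.remove(dns_id)
            else:
                # Nothing left to match this response against.
                errors.append("orphan response")
    return errors


# A re-sent query followed by two responses yields both kinds of error.
resent = [("query", 7), ("query", 7), ("response", 7), ("response", 7)]
print(match_simple(resent))  # ['duplicate query', 'orphan response']
```

The two labels stand in for the two error messages quoted in the original post.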

If you capture a PCAP trace of the DNS traffic on port 53 (sudo tcpdump -i <interface> -w dns-capture.pcap port 53) you'll be able to confirm this by looking at the request and response traffic in a tool like Wireshark. If the PCAP data does not prove this theory, then it could be some other issue like dropped packets, in which case we can look at optimizing your configuration (like using af_packet as shown here and a few other things).
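If eyeballing the capture in Wireshark gets tedious, the same check can be scripted. The sketch below assumes the DNS packets have already been exported from the capture into time-ordered tuples of (timestamp, client, dns_id, is_response); that record layout and the function name are assumptions for illustration, and reading the pcap file itself is left out.

```python
# Flag queries that were sent again before any response arrived.
# The record layout is hypothetical; adapt it to however you export the capture.

def find_resent_queries(records):
    """records: time-ordered (timestamp, client, dns_id, is_response) tuples.

    `client` should identify the requester, e.g. "address:port".
    Returns the (timestamp, client, dns_id) of every re-sent query.
    """
    outstanding = set()  # (client, dns_id) pairs still waiting for a response
    resent = []
    for ts, client, dns_id, is_response in records:
        key = (client, dns_id)
        if is_response:
            outstanding.discard(key)
        elif key in outstanding:
            resent.append((ts, client, dns_id))
        else:
            outstanding.add(key)
    return resent


records = [
    (0.00, "10.0.0.2:40000", 7, False),
    (0.01, "10.0.0.2:40000", 7, True),
    (1.00, "10.0.0.2:40001", 8, False),
    (1.50, "10.0.0.2:40001", 8, False),  # same ID again, no response yet
    (1.60, "10.0.0.2:40001", 8, True),
]
print(find_resent_queries(records))  # [(1.5, '10.0.0.2:40001', 8)]
```

An empty result would suggest the re-send theory does not hold for that capture.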

I think it would be possible to improve the state machine in Packetbeat's DNS protocol analyzer to avoid these errors by adding a counter to track the number of pending requests.
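As a rough sketch of what I mean by a counter (hypothetical names and event format, not a patch against Packetbeat): each query increments a per-ID count and each response decrements it, so a re-sent query followed by two responses no longer produces an error.

```python
from collections import Counter

# Hypothetical matcher that counts pending queries per DNS ID
# instead of keeping a single open/closed flag.

def match_counted(events):
    """events: iterable of ("query" | "response", dns_id) tuples.

    Returns the list of errors; only truly unmatched responses are reported.
    """
    pending = Counter()  # dns_id -> number of queries still awaiting a response
    errors = []
    for kind, dns_id in events:
        if kind == "query":
            pending[dns_id] += 1
        elif pending[dns_id] > 0:
            pending[dns_id] -= 1
        else:
            errors.append("orphan response")
    return errors


events = [("query", 7), ("query", 7), ("response", 7), ("response", 7)]
print(match_counted(events))  # []
```

A real implementation would also need to expire counts for queries that never get a response, which this sketch leaves out.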

Thanks.

It's been an eon since I've done pcap debugging.

So I've grabbed everything from any interface (sudo tcpdump -i any -w dns-capture.pcap port 53) over a 10-minute window.

Here is a small sample of that file

It doesn't look like it's re-sending the query; I can see a matching response for every request.
It is, however, asking for resolution of the same name several times in quick succession. Why it isn't caching these queries is another question.

Now, I think I have the same information displayed in Kibana.

You can see that Packetbeat isn't able to group these correctly.

I am already using these optimization values

packetbeat.interfaces.type: af_packet
packetbeat.interfaces.buffer_size_mb: 100 

How do I look for dropped packets in Packetbeat?
Or what should I investigate next?

Cheers

What version of Packetbeat? What OS version and kernel version is it?

Can you run a test using pcap instead of af_packet?

$ sudo packetbeat version
packetbeat version 7.1.1 (amd64), libbeat 7.1.1 [3358d9a5a09e3c6709a2d3aaafde628ea34e8419 built 2019-05-23 13:15:09 +0000 UTC]

$ uname -a
Linux my-machine 4.15.0-1034-gcp #36-Ubuntu SMP Thu Jun 6 14:47:38 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

I've switched to packetbeat.interfaces.type: pcap and restarted Packetbeat. I will get back to you about errors.

The change has been in for almost 20 minutes and the visualization [Packetbeat] DNS Overview ECS shows 0 errors.

This would indicate it works with pcap but not with af_packet.

This sounds like https://github.com/elastic/beats/issues/621.

Yes, it looks like it might be the same issue.

Unfortunately I don't have the technical background to do anything about it.

I'm happy to try running debug versions of Packetbeat to check whether they fix it.

Our setup on GCP is relatively new. We are running in zone australia-southeast1-b on an n1-standard-2 (2 vCPUs, 7.5 GB memory), so I expect that spinning up a Compute Engine instance with the right OS will reproduce the problem.

As I've got a workaround, running pcap, I'll live with that for now.

Thanks for your help.
