Metricbeat 7.15.1 Will Not Start

Hello! Over the past few weeks I have noticed that my Metricbeat 7.15.1 monitoring doesn't seem to work, and I finally figured out that the service won't stay started. I was using Fedora 34 and have now rolled out a new server running Fedora 35 to replace the old one. When I run 'sudo metricbeat' I receive the following output. Has anyone else seen anything like this? There is much more output, but I'm at the character limit. Please let me know if I need to provide any more information. Thanks!

sudo metricbeat
runtime/cgo: pthread_create failed: Operation not permitted
SIGABRT: abort
PC=0x7f5516bbf85c m=7 sigcode=18446744073709551610

goroutine 0 [idle]:
runtime: unknown pc 0x7f5516bbf85c
stack: frame={sp:0x7f54ed465790, fp:0x0} stack=[0x7f54ecc661e8,0x7f54ed465de8)
00007f54ed465690:  00005646d9251e88 <runtime.adjustframe+136>  00007f54ed465a10
00007f54ed4656a0:  00007f54ed465b18  00007f54ed465b01
00007f54ed4656b0:  0000000000000000  0000000000000000
00007f54ed4656c0:  0000000000000000  0000000000000000
00007f54ed4656d0:  0000000000000000  0000000000000000
00007f54ed4656e0:  0000000000000000  00005646df7664e0
00007f54ed4656f0:  00005646000233d7  0000000000000000
00007f54ed465700:  00007f54ed465878  00005646d9267b01 <runtime.markroot.func1+385>
00007f54ed465710:  0000000000000000  00005646df7664e0
00007f54ed465720:  00005646dec920a0  00007f5400000000
00007f54ed465730:  0000000000000000  00007f5400000000
00007f54ed465740:  0000000000000000  00007f54ed465a68
00007f54ed465750:  00005646d925fe55 <runtime.gentraceback+4405>  00007f54ed465a10
00007f54ed465760:  00007f54ed465b00  00005646d9273001 <runtime.memhash32+65>
00007f54ed465770:  00007f54ed465878  00007f5516bcce59
00007f54ed465780:  0000000000000000  00007f5516bbf84e
00007f54ed465790: <0000000000000130  0000000100000000
00007f54ed4657a0:  0000000000000120  0000000000000000
00007f54ed4657b0:  0000000000000013  0000000000000000
00007f54ed4657c0:  000000c000071fd0  0000000000000004
00007f54ed4657d0:  0000003400000013  00007f5516c426f1
00007f54ed4657e0:  00007f54cffff640  00007f54ed465aa0
00007f54ed4657f0:  00007f54ed46591e  00007f54ed46591f
00007f54ed465800:  00007f54cffff640  00007f5516bbd785
00007f54ed465810:  00007f54d0000020  66f54dee55c2d700
00007f54ed465820:  00007f54ed466640  0000000000000006
00007f54ed465830:  00000000000000f1  0000000000000000
00007f54ed465840:  00005646dc71df7e  00007f5516b726b6
00007f54ed465850:  00007f5516d2c990  00007f5516b5c7d3
00007f54ed465860:  0000000000000020  0000000000000000
00007f54ed465870:  0000000000000000  00007f5516c31904
00007f54ed465880:  00007f54cf7ff000  000000000000000d

rax    0x0
rbx    0x7f54ed466640
rcx    0x7f5516bbf85c
rdx    0x6
rdi    0xa75
rsi    0xa7b
rbp    0xa7b
rsp    0x7f54ed465790
r8     0x7f54ed465860
r9     0x7f5516ce64e0
r10    0x8
r11    0x246
r12    0x6
r13    0x0
r14    0x5646dc71df7e
r15    0x0
rip    0x7f5516bbf85c
rflags 0x246
cs     0x33
fs     0x0
gs     0x0

Not being able to create a POSIX thread (pthread) on Linux seems to be a severe problem. I would start by asking you for more details about your environment, such as:

  • CPU architecture
  • Operating System
  • Kernel Details
  • Metricbeat distribution

Also, could you try running Metricbeat this way and share whatever output is shown:

metricbeat -e -d "*"

You can paste the content in a Gist entry and provide the link here.
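To keep a copy of that output for the Gist, one option is to merge stderr into stdout and pipe it through `tee`. With `-e`, Metricbeat logs to stderr instead of its log files, and `-d "*"` enables all debug selectors, which is why stderr needs to be captured. The sketch below uses a hypothetical helper, `capture_output` (not part of Metricbeat), and assumes `metricbeat` is on the PATH and is run with sudo as elsewhere in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command, merge its stderr into stdout, and
# keep a copy of everything in a log file while still printing it.
# Usage: capture_output <logfile> <command> [args...]
capture_output() {
  local logfile="$1"
  shift
  "$@" 2>&1 | tee "$logfile"
}

# Example (uncomment to use). Metricbeat keeps running until it crashes
# or you stop it with Ctrl+C:
# capture_output metricbeat-debug.log sudo metricbeat -e -d "*"
```

The resulting `metricbeat-debug.log` can then be pasted into the Gist as-is.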

Thanks,

@riferrei

Thanks!

CPU Architecture: (Virtual Machine) AMD Ryzen 5 2400G
Operating System: Fedora 35
Kernel: 5.14.14-300.fc35.x86_64
Metricbeat distribution: 7.15.1, installed from the Elastic repo

Here is the Gist dump:

How reproducible is this problem? That is, does it fail every single time you run it, or only eventually, after multiple executions and restarts?
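One rough way to answer that is to start Metricbeat in the foreground several times in a row and count how often it dies before a fixed grace period ends. This is a sketch under assumptions: it relies on GNU coreutils `timeout`, which exits with status 124 when it had to stop the command (meaning the command was still alive); the 15-second grace period is arbitrary; and the metricbeat systemd service should be stopped first so the foreground copy is the only instance running. `survival_check` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a long-lived command several times and count how
# often it exits on its own before the grace period expires.
# GNU timeout returns 124 when it had to stop the command, i.e. the command
# was still running, so any other status is treated as an early exit.
# Usage: survival_check <runs> <grace-seconds> <command> [args...]
survival_check() {
  local runs="$1" grace="$2" early=0 i status
  shift 2
  for ((i = 1; i <= runs; i++)); do
    status=0
    timeout "$grace" "$@" >/dev/null 2>&1 || status=$?
    if [ "$status" -ne 124 ]; then
      early=$((early + 1))
    fi
  done
  echo "$early/$runs runs exited within ${grace}s"
}

# Example, from a root shell with the service stopped (assumption):
# survival_check 10 15 metricbeat -e
```

A result like "10/10" would point to a deterministic failure, while an occasional early exit would fit the resource-exhaustion theory better.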

I wonder whether you are running out of resources and are therefore unable to create POSIX threads. Could you please run ulimit -a and share the results here?

I've seen some crazy VM policies applied to guest operating systems that cap their ability to create file descriptors and processes and to open files. It is important to rule that out first.
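Since the failing call is `pthread_create`, it may help to print the limits most directly tied to creating a new thread next to current usage. This is an assumption about what to look at, not a known cause: on Linux the per-user process limit (`ulimit -u`) counts threads, while `kernel.threads-max` and `kernel.pid_max` cap threads and task IDs system-wide. The sketch assumes `/proc` is mounted without restrictions such as `hidepid`; `count_threads` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Print the limits that govern thread creation next to current usage.
# Assumes a Linux host with /proc mounted (Fedora in this thread).

# Hypothetical helper: count every thread currently visible in /proc,
# system-wide (ulimit -u is per user, so this is only an upper bound).
count_threads() {
  local total=0 d
  for d in /proc/[0-9]*/task/[0-9]*; do
    # Skip entries that vanished between glob expansion and the test.
    [ -d "$d" ] && total=$((total + 1))
  done
  echo "$total"
}

echo "max user processes (ulimit -u): $(ulimit -u)"
echo "open files (ulimit -n):         $(ulimit -n)"
echo "kernel.threads-max:             $(cat /proc/sys/kernel/threads-max)"
echo "kernel.pid_max:                 $(cat /proc/sys/kernel/pid_max)"
echo "threads currently running:      $(count_threads)"
```

If the thread count is nowhere near these limits, thread exhaustion becomes a less likely explanation for the crash.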

It's very strange. I increased the total RAM allocated to the machine, and that somewhat fixes it, but not really. It now has 10 GB in total: 4 GB used by Elasticsearch, 1.5 GB used by Kibana, and the rest available to the system. If I reboot the system, Metricbeat starts up properly and just complains (for some reason) that it can't access localhost:9200. If I then restart the service, it crashes every time with the huge, long error. This continues until I reboot the server, and then the cycle starts again.

Here is the output of ulimit -a:

real-time non-blocking time  (microseconds, -R) unlimited
core file size              (blocks, -c) unlimited
data seg size               (kbytes, -d) unlimited
scheduling priority                 (-e) 0
file size                   (blocks, -f) unlimited
pending signals                     (-i) 39385
max locked memory           (kbytes, -l) 64
max memory size             (kbytes, -m) unlimited
open files                          (-n) 1024
pipe size                (512 bytes, -p) 8
POSIX message queues         (bytes, -q) 819200
real-time priority                  (-r) 0
stack size                  (kbytes, -s) 8192
cpu time                   (seconds, -t) unlimited
max user processes                  (-u) 39385
virtual memory              (kbytes, -v) unlimited
file locks                          (-x) unlimited

In a sense, it is "good" to know that everything works after a reboot: it suggests the problem is related to resource contention and/or starvation. Something is keeping processes from giving the resources they use back to the OS. Unfortunately, this also means that identifying the cause is a bit hard.

I would start by isolating the processes. If possible, leave only Metricbeat running in this VM and move Elasticsearch and Kibana to other VMs. See how this affects resource consumption. Then keep an eye on the output of ulimit -a as you experiment with Metricbeat. I would look specifically at the "open files" field, which is currently capped at 1024.
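Note that `ulimit -a` reports the limits of the shell it runs in; it does not show how many descriptors or threads the Metricbeat process is actually using, and a service started by systemd may run with different limits. A minimal sketch for watching the process itself, assuming a Linux `/proc` layout, that `pgrep` is available, and that it runs with sudo (reading another user's `/proc/<pid>/fd` requires root). `proc_usage` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: show how many file descriptors and threads a process
# is using, next to its soft "Max open files" limit.
proc_usage() {
  local pid="$1" fds limit threads
  fds=$(ls /proc/"$pid"/fd | wc -l)
  # /proc/<pid>/limits line: "Max open files  <soft>  <hard>  files",
  # so the soft limit is the 4th whitespace-separated field.
  limit=$(awk '/^Max open files/ {print $4}' /proc/"$pid"/limits)
  threads=$(awk '/^Threads:/ {print $2}' /proc/"$pid"/status)
  echo "pid=$pid open_fds=$fds soft_limit=$limit threads=$threads"
}

# Default to the oldest process named "metricbeat"; pass a PID to override.
pid="${1:-$(pgrep -o metricbeat || true)}"
if [ -n "$pid" ]; then
  proc_usage "$pid"
else
  echo "metricbeat is not running" >&2
fi
```

Running it periodically (for example with `watch`) while restarting the service would show whether descriptors or threads climb toward the limit before the crash.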

See if you can isolate what causes this starvation and we can move from there.

@riferrei

I created a brand new Fedora 35 server, set up the Elastic repo, and installed only Metricbeat. That server has 4 cores and 4 GB of RAM, so it has plenty of resources for Metricbeat alone. It looks a little better, but I caught it a few times with similar errors:

Maybe I am missing something completely, but Fedora is not even a supported OS according to our support matrix here, though I understand it is very similar to CentOS.

I'm curious: have you tried starting it both with and without sudo?

Which modules have you enabled?

Yes, sir - Fedora is very similar to CentOS and RHEL.

Without sudo, I am unable to start the service at all due to permissions; on my system, sudo is required to start or stop services.

No modules are enabled apart from the defaults.
