Hello! I have noticed in the past few weeks my Metricbeat 7.15.1 monitoring doesn't seem to work. I finally figured out the service won't stay started properly. I was using Fedora 34 and have currently rolled out a new server to replace the old using Fedora 35. When I run 'sudo metricbeat' I receive the following output. Has anyone else seen anything like this? There is way more information, but I'm at the character limit. Please let me know if I need to provide any more information. Thanks!
Not being able to create a POSIX thread (pthread) on Linux is a severe problem. I would start by asking you to provide more details about your environment, such as:
CPU architecture
Operating System
Kernel Details
Metricbeat distribution
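The details in the list above can be gathered in one go. This is only a minimal sketch, assuming a Linux host where `uname` and `/etc/os-release` are available. The function name `collect_env` is made up for illustration, and `metricbeat version` is only run if the binary is found on the PATH.

```shell
#!/bin/sh
# collect_env: print the environment details requested above.
# (Hypothetical helper name, not part of Metricbeat.)
collect_env() {
    echo "CPU architecture: $(uname -m)"
    # /etc/os-release exists on Fedora and most modern distributions.
    if [ -r /etc/os-release ]; then
        # Source it in a subshell so its variables do not leak out.
        ( . /etc/os-release; echo "Operating system: ${PRETTY_NAME:-unknown}" )
    else
        echo "Operating system: unknown"
    fi
    echo "Kernel: $(uname -r)"
    if command -v metricbeat >/dev/null 2>&1; then
        echo "Metricbeat: $(metricbeat version)"
    else
        echo "Metricbeat: not found on PATH"
    fi
}

collect_env
```

How the package was installed (RPM, tarball, container) is not covered by this sketch and is still worth stating separately.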
Also, could you try to execute Metricbeat this way and provide whatever is shown:
metricbeat -e -d "*"
You can paste the content in a Gist entry and provide the link here.
How reproducible is this problem? Does it fail every single time you run it, or only eventually, after multiple executions and restarts?
I wonder if you are running out of resources and are therefore unable to create POSIX threads. Could you please run ulimit -a and share the results here?
I've seen some crazy VM policies applied to guest operating systems that cap their ability to create file descriptors, processes, and open files. It is important to rule that out first.
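One caveat: ulimit -a reports the limits of the shell it is run in, which may differ from the limits of a service started by the init system. A rough sketch for reading the limits that apply to an already running process, assuming Linux with `/proc` mounted. The helper name `show_limits` is made up here, and the `pgrep -x metricbeat` example in the comment assumes the process is named `metricbeat`. Reading another user's process may require sudo.

```shell
#!/bin/sh
# show_limits PID: print the process/thread and open-file limits of a process.
# (Hypothetical helper name.)
show_limits() {
    pid="$1"
    # /proc/<pid>/limits lists the soft and hard limit for each resource.
    grep -E '^Max (processes|open files)' "/proc/$pid/limits"
}

# Example: inspect the current shell. For Metricbeat, one could try
#   show_limits "$(pgrep -x metricbeat | head -n 1)"
show_limits "$$"
```

On Linux, threads count against "Max processes", so that row is the one to compare with the number of threads in use when thread creation fails.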
It's very strange - I upped the total amount of RAM allocated to the machine and it somewhat fixed it, but not really. It now has a total of 10GB: 4 used by ES, 1.5 used by Kibana, and the rest available to the system. If I reboot the system, metricbeat starts up properly and just complains (for some reason) that it can't access localhost:9200. If I restart the service, it crashes every time with the huge, long error. This happens until I reboot the server and the cycle starts again.
Here is the output of ulimit -a:
real-time non-blocking time (microseconds, -R) unlimited
core file size (blocks, -c) unlimited
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 39385
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 39385
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
In a sense, it is "good" to know that everything works after you reboot. It means the problem has to do with resource contention and/or starvation. Something is keeping the processes from giving back to the OS the resources they use. Unfortunately, this also means that identifying the cause is a bit hard.
I would start by isolating the processes. If possible, leave only Metricbeat running in this VM and move Elasticsearch and Kibana to other machines. See how this affects resource consumption. Then, keep an eye on the output of ulimit -a as you play with Metricbeat. I would start by looking specifically at the "open files" field, which is currently capped at 1024.
See if you can isolate what causes this starvation and we can move from there.
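To see whether file descriptors or threads pile up across restarts, the counts can be read straight from `/proc`. A small sketch under the same Linux assumption. The helper names `count_fds` and `count_threads` are made up for illustration, and the script falls back to inspecting its own shell when no PID is passed.

```shell
#!/bin/sh
# count_fds PID: number of open file descriptors of a process.
count_fds() {
    ls "/proc/$1/fd" 2>/dev/null | wc -l
}

# count_threads PID: number of threads, read from /proc/<pid>/status.
count_threads() {
    awk '/^Threads:/ {print $2}' "/proc/$1/status"
}

# Use the PID given as the first argument, or this shell as a fallback.
pid="${1:-$$}"
echo "pid=$pid fds=$(count_fds "$pid") threads=$(count_threads "$pid")"
```

Running this periodically against the Metricbeat PID (with sudo if the process belongs to another user) and comparing the numbers with the "open files" and "max user processes" limits should show whether either one is being approached.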
I created a brand new server, Fedora 35, and set up the Elastic repo and installed only Metricbeat. That server has 4 cores, 4GB RAM, so it's got plenty of resources for just Metricbeat. It looks a little better, but I caught it a few times with similar errors:
Maybe I am missing something completely, but Fedora is not even a supported OS according to our support matrix here, although I understand it is very similar to CentOS.
Curious if you have tried starting both with and without sudo?