Missing documents

I posted the Metricbeat configuration in the second post, but here it is again.

Proxy server settings (HAProxy):

frontend logstash_metricbeat
        bind 10.115.128.60:5044
        mode tcp
        timeout client      120s
        default_backend logstash_metricbeat

backend logstash_metricbeat
        mode tcp
        timeout server      100s
        timeout connect     100s
        timeout queue       100s
        balance leastconn
        server log001 10.115.128.3:5044 check weight 1
        server log002 10.115.128.4:5044 check weight 1
        server log003 10.115.128.5:5044 check weight 1
        server log004 10.115.128.6:5044 check weight 1
        server log005 10.115.128.7:5044 check weight 1

Logstash pipeline configuration:

input {
  beats {
    port => "5044"
  }
}

filter {
   if ([system][process][cmdline]) {
      if ( "JOBNO" in [system][process][cmdline]) {
         grok {
               tag_on_failure => ["grok_parse_failed"]
               match => {"[system][process][cmdline]" =>  "%{GREEDYDATA:rm1}:%{WORD:project} %{WORD:rm2}=%{NUMBER:jobno:int}"  }
               match => {"[system][process][cmdline]" =>  "%{GREEDYDATA:rm1} %{GREEDYDATA:rm2}=%{NUMBER:jobno:int} %{GREEDYDATA:rm3}"  }
               match => {"[system][process][cmdline]" =>  "%{GREEDYDATA:rm1}:%{WORD:project} %{GREEDYDATA:rm2}=%{NUMBER:jobno:int}"  }
               match => {"[system][process][cmdline]" =>  "%{GREEDYDATA:rm1}=%{NUMBER:jobno:int}"  }
         }

         if ![project] {
            mutate {  add_field => { "[system][process][project]" => "NA" } }
         }
         else {
            mutate { add_field => { "[system][process][project]" => "%{project}" } }
         }

         # process.name has a 15-character limit (it comes from /proc/<pid>/stat, a kernel limit)
         # we need the full name of the process, hence this
         ruby { code => "event.set('process_fullname', event.get('[system][process][cmdline]').split(' ').first.split('/').last)" }
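          # if the command is an interpreter (bash or python), the real job is the script it runs,
          # so take the basename of the second cmdline token instead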
         if ([process_fullname] == "bash" or [process_fullname] =~ /python/) {
            ruby { code => "event.set('process_fullname', event.get('[system][process][cmdline]').split(' ')[1].split('/').last)" }
         }

         # create jobno and hpid
         mutate { add_field => { "[system][process][jobno]" => "%{jobno}" }
                  add_field => { "[system][process][hpid]" => "%{[host][name]}_%{[process][pid]}" }
                  add_field => { "[system][process][jhpid]" => "%{jobno}_%{[host][name]}_%{[process][pid]}" }
         }
         # convert must run in a separate mutate: within a single mutate, convert is
         # applied before add_field, so [system][process][jobno] would not exist yet
         mutate { convert => { "[system][process][jobno]" => "integer" } }

         mutate { rename => { "[process_fullname]" => "[system][process][fullname]" }
                  rename => { "[process][pid]" => "[system][process][pid]" }
                  rename => { "[process][ppid]" => "[system][process][ppid]" }
                  rename => { "[process][pgid]" => "[system][process][pgid]" }
                  rename => { "[process][state]" => "[system][process][state]" }
         }
# jobno_hostname_pid_@timestamp should make this unique, because the same PID cannot be running twice at the same time on Linux
         mutate { add_field => { "[@metadata][id]" => "%{[system][process][jhpid]}_%{@timestamp}" } }

      } # end of if JOBNO
   } # end of if cmdline
   else if ([metricset][name] == "uptime" or [metricset][name] == "filesystem" or [metricset][name] == "process_summary" or [metricset][name] == "network" or [metricset][name] == "cpu" or [metricset][name] == "memory" or [metricset][name] == "diskio" or [metricset][name] == "load") {
       # Do some other processing
   }

   mutate { add_field => { "[@metadata][target_index]" => "myindexname-%{[@metadata][version]}" } }
}

output {
   elasticsearch {
      hosts => [list of hosts]
      index => "%{[@metadata][target_index]}"
      document_id => "%{[@metadata][id]}"
      action => "create"
      user => "${elastic_admin_user}"
      password => "${elastic_admin_password}"
   }
}

Where? I could not find the output configuration of your Metricbeat. You posted your Logstash pipeline and some other Metricbeat configuration, but not the configuration that was asked for.

Can you share the entire output configuration from your metricbeat.yml? It is the part where you have output.logstash etc.

It is very straightforward:

metricbeat.config.modules:
  path: ${path.config}/modules.d/system.yml
  reload.enabled: false

output.logstash:
  hosts: ["elkproxy:5044"]

logging.to_files: true
logging.files:
  path: /s0/log/metricbeat
  name: metricbeat
  keepfiles: 7
  permissions: 0644

Yeah, you need to add a ttl setting to force the Metricbeat instances to refresh the connection and get a more even distribution, since your Logstash instances are behind HAProxy.

From the documentation:

Time to live for a connection to Logstash after which the connection will be re-established. Useful when Logstash hosts represent load balancers. Since the connections to Logstash hosts are sticky, operating behind load balancers can lead to uneven load distribution between the instances. Specifying a TTL on the connection allows to achieve equal connection distribution between the instances. Specifying a TTL of 0 will disable this feature.

You would need to have something like this in your output:

output.logstash:
  hosts: ["elkproxy:5044"]
  pipelining: 0
  loadbalance: false
  ttl: 2m
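(One note: per the Beats documentation, ttl is not supported on the async Logstash client, i.e. when pipelining is enabled, which is why pipelining: 0 is included above.)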

Perfect, tomorrow I am going to try changing weight 1 to round robin on HAProxy.

Then for this change I have to file a change control with the operations team.

Now I have a few things to test out. I will report my findings one by one:

  1. Try round robin in HAProxy.
  2. Use ttl: Xm and loadbalance: false in metricbeat.yml on the agents.
  3. Redirect a few hundred hosts directly to the Logstash servers, bypassing HAProxy (a sketch is below).
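
For item 3, a minimal metricbeat.yml sketch of what bypassing HAProxy could look like. This assumes the agents can resolve the Logstash nodes directly under the names log001 to log005 used in the HAProxy backend; with loadbalance: true, Metricbeat spreads events across the listed hosts itself:

output.logstash:
  # point at the Logstash nodes directly instead of the elkproxy VIP
  hosts: ["log001:5044", "log002:5044", "log003:5044", "log004:5044", "log005:5044"]
  # let Metricbeat balance across the listed hosts
  loadbalance: true
  # still re-establish connections periodically to keep the distribution even
  ttl: 2m
  pipelining: 0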

Thanks everyone for the good pointers.

Good luck.

Note: if you have a collection issue, i.e. the gaps in process info come from intermittently uncollected process data, then this won't help directly. If there are gaps only in the process data, but not in, say, CPU load or memory usage from the exact same clients, that would be quite suggestive of a collection issue rather than something you can fix via configuration.


Kevin, I hear you, but I have to validate it.

I reset HAProxy but didn't make any changes.

Is there a limit in Metricbeat on how many metrics it can send?

I am getting somewhere now.

I checked on a system and there are 21 jobs running.
I copied all 21 cmdlines and ran them through the Grok Debugger in Kibana Dev Tools, one by one, and they all got parsed.
 date ; ps -ef | grep JOBNO | awk -F"JOBNO=" '{print $NF}' | sort -u
Fri Aug 15 13:35:43 CDT 2025

179438798
179439134
180053850
180053864
180093268
180226715
180241965
180242811
180243010
180243383
180243613
180243840
180244019
180244045
180258257
180259230
180259401
180259424
180261138
180264878

But at the same time Elasticsearch is only showing 6 of them. Where are the others?

This is my full system.yml. Does this mean I am only getting the top 5 processes by CPU and the top 5 by memory?

I was under the impression that this setting was for process_summary only.

cat system.yml
# Module: system
# Docs: https://www.elastic.co/guide/en/beats/metricbeat/7.9/metricbeat-module-system.html
 
- module: system
  period: 1m
  metricsets:
    - cpu
    - load
    - memory
    - network
    - process
    - process_summary
    - users
  process.include_cpu_ticks: true
  process.include_top_n:
    by_cpu: 5      # include top 5 processes by CPU
    by_memory: 5   # include top 5 processes by memory
 
  # drop all events that match any of the following (OR condition)
  processors:
    - drop_event.when:
        or:
          - equals.user.name: postfix
          - equals.user.name: zabbix
          - equals.user.name: _lldpd
          - equals.user.name: Debian-snmp
          - equals.user.name: Debian+
          - equals.user.name: _chrony
          - equals.user.name: mpirun
          - equals.user.name: statd
          - equals.user.name: nslcd
          - equals.user.name: messagebus
          - equals.user.name: ntp
          - equals.user.name: qed
          - equals.user.name: tangovpm
          - equals.user.name: snmp
          - equals.user.name: snmpd
          - equals.user.name: _rpc
          - equals.user.name: statemd+
          - equals.user.name: postgres
          - equals.user.name: sshd
          - equals.user.name: www-data
          - equals.user.name: root
          - equals.user.name: message+
          - equals.process.name: mpirun
          - equals.process.name: csh
          - equals.process.name: sh
          - equals.process.name: ssh
          - equals.process.name: sshd
          - equals.process.name: snmpd
          - equals.process.name: less
          - equals.process.name: dbus-daemon
          - equals.process.name: xterm
          - equals.process.name: lldpd
          - equals.process.name: grep
          - equals.process.name: service-register
          - equals.process.name: dconf-service
          - equals.process.name: at-spi2-registryd
          - regexp.process.name: "^dbus-launch*"
          - equals.system.network.name: lo
          - equals.system.network.name: loop1
 

I removed these three lines on one system, restarted Metricbeat, and I can now see all the jobs. I think this was the issue: it was only sending the top X processes, and sometimes a job was missing because its process might be sleeping and not in the top X. It all makes sense now.

process.include_top_n:
    by_cpu: 5      # include top 5 processes by CPU
    by_memory: 5   # include top 5 processes by memory

I will monitor for a week, as I have thousands of metrics per minute. If all is good, then the architecture is sound and working as expected.

Thank you all for your input, @RainTown @leandrojmp @stephenb @Christian_Dahlqvist. I had been trying to figure this out on and off for a few months, but seriously dug into it this week.


@elasticforme

Glad you found it...

This is often confusing... I have helped many others with this; the functionality is not always clear...

Turning on ALL the processes can significantly increase the number of metrics you collect.

I suggest you look CLOSELY at these settings, as they are easily misunderstood.

Example

process.include_top_n.enabled
Set to false to disable the top N feature and include all processes, regardless of the other options. The default is true, but nothing is filtered unless one of the other options (by_cpu or by_memory) is set to a non-zero value.

You removed the two lines (by_cpu and by_memory) that actually trigger the filtering, but the Top N feature itself is still enabled; it just isn't filtering anything now. That is a minor distinction. I use the enabled setting when I want to keep the filter in the config while being able to toggle the Top N feature on or off.
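
For illustration, a minimal sketch of that pattern, based on the documented process.include_top_n.enabled option (the period and values here are just placeholders):

- module: system
  period: 1m
  metricsets: ['process']
  process.include_top_n:
    # switch the Top N feature off while keeping the filter values in the config;
    # with enabled: false, all processes are included regardless of by_cpu/by_memory
    enabled: false
    by_cpu: 5
    by_memory: 5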

There are whitelist options and other methods for filtering available.

Additionally, you can actually apply more than one metricset definition, allowing you to have both Top N and a specific named set at the same time. Please refer to this thread for more information: Metricbeat doesn't recognize the process - #4 by stephenb.

Example

- module: system
  period: 10s
  metricsets: ['process']
  process.include_top_n:
    by_cpu: 5      # include top 5 processes by CPU
    by_memory: 5   # include top 5 processes by memory

- module: system
  period: 10s
  metricsets: ['process']
  processes: ['^sysmon*']

So, take some time to think about what you really need, and we can likely help you customize it to your requirements.


Glad you sorted it. The key line turned out to be from @stephenb: ”This will depend on your process collection configuration”.

This thread somehow reminds me of an old saying: “A mathematician is a blind man in a dark room looking for a black hat. Which isn’t there”.

(More common version involves a philosopher and a :black_cat:)

I am not going to collect all processes, as I have drop_event and all "root" processes get dropped, including a few others.

But yes, next week I am going to test this new configuration on a few different types of servers and then push it out to all systems. I get 300+ million docs daily, so I absolutely don't want any more.


Last update: I have tested this and it works fine. It is already on its way to all the other hosts.

I added some more drop_event conditions and bumped top_n up to 20. That way, even if something goes wrong, I don't get more than the top 20 processes by CPU and top 20 by memory (at most 40 processes) at any time.
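
A minimal sketch of what that part of the module config could look like, assuming the same 1m period and process metricset as before (the drop_event conditions are omitted here):

- module: system
  period: 1m
  metricsets:
    - process
  process.include_cpu_ticks: true
  # cap the per-collection process count: at most 20 by CPU plus 20 by memory
  process.include_top_n:
    by_cpu: 20
    by_memory: 20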

Next up is the proxy: finding out why I am getting duplicate events on multiple Logstash instances.
