Missing documents

I have the following setup and I am missing documents. It is very hard to pinpoint the cause at this time, which is why I am posting here to get some ideas.

1000 machines running Metricbeat, each sending metrics every minute to HAProxy.

HAProxy then forwards the data to 5 Logstash servers.

The 5 Logstash servers parse the data and send it to the Elasticsearch cluster. Logstash also creates the _id, because without creating the _id I was getting duplicate entries.

I am still missing data. The reason I know it is missing is the process metrics.

For example, a user process is running for 10 minutes: it sends data for a few minutes, stops, and then shows up again. There is no exact timeframe or specific system showing this behavior; it is all random. I am trying to find some pattern, but it is completely random. The Logstash servers are not throwing any errors saying they failed to ingest data. The Elasticsearch cluster is also not showing any errors that it rejected anything.

HAProxy has no errors either. The Elasticsearch output index is a data stream.

Oh shoot, I forgot to add versions:

Elasticsearch 8.16.1, Logstash and Metricbeat 8.5.3.

Which version of Elasticsearch are you using? Are the other components on the same version?

How is this configured?

Exactly how is the document ID created?

It would be useful to see the Logstash config for this part.

Is it possible that you have colliding IDs that cause some data to be dropped?

Elasticsearch 8.16.1, Logstash and Metricbeat 8.5.3.

I am only talking about the process metrics, because the issue is only with them. The other metrics are there all the time.

_id = jobno_hostname_pid_@timestamp (this makes sure there is one ID per process, per system, per minute).

Here is an extract of the Logstash config.

   if ([system][process][cmdline]) {
      if ( "JOBNO" in [system][process][cmdline]) {
         grok {
               tag_on_failure => ["grok_parse_failed"]
               # multiple patterns for the same field go in one array (tried in order)
               match => { "[system][process][cmdline]" => [
                  "%{GREEDYDATA:rm1}:%{WORD:project} %{WORD:rm2}=%{NUMBER:jobno:int}",
                  "%{GREEDYDATA:rm1} %{GREEDYDATA:rm2}=%{NUMBER:jobno:int} %{GREEDYDATA:rm3}",
                  "%{GREEDYDATA:rm1}:%{WORD:project} %{GREEDYDATA:rm2}=%{NUMBER:jobno:int}",
                  "%{GREEDYDATA:rm1}=%{NUMBER:jobno:int}"
               ] }
         }

         if ![project] {
            mutate {  add_field => { "[system][process][project]" => "NA" } }
         }
         else {
            mutate { add_field => { "[system][process][project]" => "%{project}" } }
         }

         # process.name is capped at 15 characters (it comes from /proc/<pid>/stat, a Linux limit),
         # so we need the full name of the process, hence this
         ruby { code => "event.set('process_fullname', event.get('[system][process][cmdline]').split(' ').first.split('/').last)" }
         if ([process_fullname] == "bash" or [process_fullname] =~ /python/) {
            ruby { code => "event.set('process_fullname', event.get('[system][process][cmdline]').split(' ')[1].split('/').last)" }
         }

         # create jobno and hpid
         mutate {
            add_field => {
               "[system][process][jobno]" => "%{jobno}"
               "[system][process][hpid]"  => "%{[host][name]}_%{[process][pid]}"
               "[system][process][jhpid]" => "%{jobno}_%{[host][name]}_%{[process][pid]}"
            }
         }
         # convert in a separate mutate: within one mutate, convert runs before add_field
         mutate {
            convert => { "[system][process][jobno]" => "integer" }
         }

         mutate {
            rename => {
               "[process_fullname]" => "[system][process][fullname]"
               "[process][pid]"     => "[system][process][pid]"
               "[process][ppid]"    => "[system][process][ppid]"
               "[process][pgid]"    => "[system][process][pgid]"
               "[process][state]"   => "[system][process][state]"
            }
         }
         # jobno_hostname_pid_@timestamp should make this unique, because the same PID cannot run twice at the same time on a Linux host
         mutate { add_field => { "[@metadata][id]" => "%{[system][process][jhpid]}_%{@timestamp}" } }

      } # end of if JOBNO

   } # end of if [system][process][cmdline]
   else if ([metricset][name] == "uptime" or [metricset][name] == "filesystem" or [metricset][name] == "process_summary" or [metricset][name] == "network" or [metricset][name] == "cpu" or [metricset][name] == "memory" or [metricset][name] == "diskio" or [metricset][name] == "load") {
       # Do some other processing
   }
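
For completeness, the Elasticsearch output then uses this metadata field as the document ID. The block below is an illustrative sketch (the host is a placeholder), not my exact output:

   output {
      elasticsearch {
         hosts       => ["https://es-node:9200"]   # placeholder host
         data_stream => "true"
         document_id => "%{[@metadata][id]}"       # the ID built above
      }
   }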

There is no way the ID can collide, because a PID can’t be duplicated on the same host at the same time.

The weird thing is that the same process (PID) reports a few times, then stops, and then shows up again in Elasticsearch. This also means my grok pattern is working fine.

HAProxy settings:

frontend logstash_metricbeat
        bind 10.115.128.60:5044
        mode tcp
        timeout client      120s
        default_backend logstash_metricbeat

backend logstash_metricbeat
        mode tcp
        timeout server      100s
        timeout connect     100s
        timeout queue       100s
        balance leastconn
        server log001 10.115.128.3:5044 check weight 1
        server log002 10.115.128.4:5044 check weight 1
        server log003 10.115.128.5:5044 check weight 1
        server log004 10.115.128.6:5044 check weight 1
        server log005 10.115.128.7:5044 check weight 1

This is an interesting thread/problem.

So far I'm not sure you have made a convincing case that there are "missing documents". You seem to have inferred that based on expectations rather than observed it directly. I may well be wrong on this. But if you elaborate on the evidence that a document was produced here, then sent there, and so on, it will help track down what's gone wrong.

e.g. this is a bit opaque to me

You are describing here what is supposed to happen, rather than demonstrating that it is actually happening. E.g. the very first claim: the 1000 machines are sending process data to HAProxy every minute. How did you verify that?

HAProxy TCP mode is difficult to troubleshoot, as you don't have easy access to the actual individual messages. There are a fair few hacky ways: "tee" the incoming traffic using iptables or port mirroring to some other host for analysis, use socat scripts, or just packet capture and try to trace all the way end to end. But Logstash allows multiple outputs, so you can enable additional debugging there (e.g. a second output, as sketched below) and try to pair it with what originated on your "1000 machines".
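
For example, a hypothetical second output, limited to one host you can watch closely, would give you a local file to diff against what actually lands in Elasticsearch (host name and path are placeholders):

   output {
      elasticsearch { ... }   # existing output, unchanged
      # debug copy: everything from the chosen host also goes to a local file
      if [host][name] == "host-under-test" {
         file {
            path  => "/var/log/logstash/debug-%{[host][name]}.json"
            codec => json_lines
         }
      }
   }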

Good luck!


Kevin, I understand, but I can’t post the data as-is because it is proprietary. But I can confirm, 100% sure, that I am missing data.

This is happening while I am getting thousands of records per minute.

The big volume of data is what makes it very hard to debug.

For example, the screenshot shows the exact same process metric being reported for a few minutes, then stopping, then being reported again, and this is random. It does not happen to the same host again for hours or days. That is why I don’t know how to check them.

For comparison, here is a working process which is reporting every minute:

Something like this is very difficult to troubleshoot as there are several steps in the ingestion pipeline where things could go wrong.

How frequently do you see this issue? Are there specific sets of devices that are affected or any pattern around periods of time where documents go missing?

Are there any connection problems logged on the Metricbeat side? Metricbeat typically uses long-running connections and you are using a proxy, which is not standard. Is it possible that timeouts on proxy connections could be causing issues? Would it be possible to set up a new Logstash instance (or two) and have some Metricbeats connect directly to it, and see whether or not that reduces data loss from those sources?

How are you handling parsing failures in Logstash? Are they written to Elasticsearch, or perhaps to a file that can be investigated?
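
For example, something along these lines (the path is illustrative) would route the events your grok filter tags with grok_parse_failed to a local file for inspection:

   output {
      if "grok_parse_failed" in [tags] {
         file {
            path  => "/var/log/logstash/grok-failures.json"   # illustrative path
            codec => json_lines
         }
      }
   }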

Are you able to investigate this on a single host, to track whether this is an ingestion issue or a data collection issue?

In some cases, depending on the load on the hosts (CPU, memory, etc.), Metricbeat may fail to collect data, and this will lead to gaps.


I was writing much the same in parallel. I’d wager it’s a data collection issue — “gaps” caused by incorrect collection, parsing errors, intermittent processes with special characters in cmdline, stuff like that.

With 1,000 hosts, you’re bound to have a salad of the weird and wonderful.

And again: if everything worked as described, there would be no duplicates and no need for the hand-crafted _id. But there were duplicates, meaning there’s an underlying problem the hand-crafted _id is merely working around. My gut instinct tells me that’s related here too.


This ^^^

@elasticforme It is a mistake to think that every metric is reported every time.

Especially process metrics...

Here is an example of the top 5 processes changing, tagged by name: if a process moves into and out of the top 5, you can get data that looks like this...

This is NOT actually missing data... but it does look like it...

This is just one example of how metrics in particular can appear to have gaps...

If you saw this with, say, CPU or memory, then I would think differently.

Processes, devices, etc. can come and go...

This will depend on your process collection configuration


Thank you everyone for your input. Yes, this is definitely a weird one to debug.

It happens all the time, but on random hosts at random intervals. When it happens I still get the CPU/memory/network metrics from that host. And parsing can’t be failing, because the process is still running with the same arguments.

Good idea. Let me work on it.

I can do this as well.

This was the initial issue two years ago when I designed this setup, hence I added the _id afterwards, once I discovered the duplicates. My thinking was that the proxy might be sending the same record to two Logstash servers, and it does, as I can see in the Logstash logs, where it says records were rejected because of a version conflict.

This is not only about the top 5 processes; it is a particular process metric which I parse. Basically, on the client side I drop almost all root processes and many others using a drop filter. The process metrics I am looking for are the jobs that we run on our HPC.

HPC is High Performance Computing in this context? The sort where (some) engineers have tuned their applications over years to try to squeeze every last flop of performance, and the CPU/memory/IO/... can often get significantly utilized?

<full disclosure, I was once one of those HPC engineers, if my CPUs weren't maxed out I'd felt that I'd done a bad job>

Yes, the jobs they run use a lot of resources, hence we try to monitor them. And yes, they want everything to be used. LOL

Not quite clear on this... but can you share your Metricbeat config?

System process IDs come and go, don't they?

Yes, system process IDs come and go when a job completes.

Here is the Metricbeat config.

# Module: system
# Docs: https://www.elastic.co/guide/en/beats/metricbeat/8.5/metricbeat-module-system.html

- module: system
  period: 1m
  metricsets:
    - cpu
    - load
    - memory
    - network
    - process

  # drop all events that match any of the following (OR condition)
  processors:
    - drop_event.when:
        or:
          - equals.user.name: root
          - equals.user.name: postfix
          - equals.user.name: zabbix
          - equals.user.name: _lldpd
          - equals.user.name: Debian-snmp
          - equals.user.name: Debian+
          - equals.user.name: _chrony
          - equals.user.name: mpirun
          - equals.user.name: houins
          - equals.user.name: avahi
          - equals.user.name: statd
          - equals.user.name: nslcd
          - equals.user.name: messagebus
          - equals.user.name: ntp
          - equals.user.name: qed
          - equals.user.name: tangovpm
          - equals.user.name: snmp
          - equals.user.name: snmpd
          - equals.user.name: _rpc
          - equals.user.name: statemd+
          - equals.user.name: postgres
          - equals.user.name: sshd
          - equals.user.name: www-data
          - equals.user.name: nagios
          - equals.user.name: message+
          - equals.process.name: mpirun
          - equals.process.name: sh
          - equals.process.name: csh
          - equals.process.name: ssh
          - equals.process.name: sshd
          - equals.process.name: snmpd
          - equals.process.name: less
          - equals.process.name: konsole
          - equals.process.name: kdevelop
          - equals.process.name: dbus-daemon
          - equals.process.name: xterm
          - equals.process.name: lldpd
          - equals.process.name: grep
          - equals.process.name: node
          - equals.process.name: python
          - equals.process.name: AppRun.wrapped
          - equals.process.name: p4v.bin
          - equals.process.name: akonadi_control
          - equals.process.name: service-register
          - equals.process.name: dconf-service
          - equals.process.name: at-spi2-registryd
          - regexp.system.process.cmdline: "^/bin/csh*"
          - regexp.process.name: "^perl*"
          - regexp.process.name: "^at-spi*"
          - regexp.process.name: "^dbus-launch*"
          - equals.system.network.name: lo
          - equals.system.network.name: loop1

And as posted in the Logstash config, I parse anything where system.process.cmdline exists and contains JOBNO (uppercase). Some jobs run for days, some for hours.

I am going to remove the “weight 1” and switch to roundrobin in HAProxy (see the sketch below) and see if it improves anything. (This is going to take a day or two to validate whether anything is missing.)
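
I.e. the backend would look roughly like this; only the balance line and weights change, the rest stays as posted above:

backend logstash_metricbeat
        mode tcp
        timeout server      100s
        timeout connect     100s
        timeout queue       100s
        balance roundrobin
        server log001 10.115.128.3:5044 check
        server log002 10.115.128.4:5044 check
        server log003 10.115.128.5:5044 check
        server log004 10.115.128.6:5044 check
        server log005 10.115.128.7:5044 check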

Next, I will redirect a few hundred systems directly to Logstash, bypassing HAProxy, and check only that batch of systems.

One thing to keep in mind is that the connection from Beats to Logstash is sticky; once connected, it will not change unless the Logstash endpoint has some issue and goes down.

What is your metricbeat output? Please share it.

But this is going via HAProxy: Beats send data to HAProxy, and HAProxy then forwards it to the Logstash servers.

I don’t have access to the agents to look at logs. And if I do get access, I can only get my hands on a few systems, and those might not show the symptoms for days.

If the agent is not sending metrics due to high load on the system, then I am out of luck testing it. :slight_smile:

I can only test/debug and investigate on the server side, assuming that the agent is sending data.

In my experience this makes no difference; the connection will still be sticky. Metricbeat will open a connection to HAProxy, HAProxy will then open a connection to one of the Logstash servers, and that connection will be sticky.

It is not about the logs, but the output configuration in metricbeat.yml: how is the Logstash output configured there? When using Logstash behind a load balancer, there are some configurations you need for it to work as desired (for example the ttl setting, as sketched below).
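
A minimal sketch of what I mean, assuming the Beats point at the HAProxy frontend from your config (values are illustrative; check the docs for your version). The ttl option forces periodic reconnects so the load balancer can actually re-balance, and it only takes effect when pipelining is disabled:

output.logstash:
  hosts: ["10.115.128.60:5044"]   # the HAProxy frontend
  ttl: 60s                        # re-establish the connection periodically
  pipelining: 0                   # ttl is not supported while pipelining is enabled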

The traffic from metricbeat is just plaintext over TCP, right?

I repeat my suggestion from before: try tapping the network port of the HAProxy (input side), and you'll have a copy of all the traffic arriving at the HAProxy. If you have a machine with plenty of memory, you could capture into a set of tmpfs (RAM) files to cover the last X minutes/hours/... Then wait.
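
A minimal capture along those lines, assuming the frontend address from your posted config (the interface name is a placeholder):

tcpdump -i eth0 -w /dev/shm/haproxy-frontend.pcap 'host 10.115.128.60 and port 5044'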

If this is the case, you should be able to see it fairly quickly, provided the "missing" packets are frequent enough (if that's not an oxymoron!).