I am using Packetbeat to capture MySQL, HTTP, and Redis requests through a port mirror. I see a lot of variation in the reported response times, and many of my packets are getting lost. Can someone guide me to the right settings? About 10K events per second are generated across all the servers, which are mirrored to a dedicated machine on the network.
I am not sure exactly which mechanism is being used, but we are definitely using a port mirror. I will update this after talking to the networking team. I see these logs:
2016/04/24 05:20:13.750355 http.go:421: WARN Response from unknown transaction. Ingoring.
2016/04/24 05:20:13.791735 http.go:421: WARN Response from unknown transaction. Ingoring.
2016/04/24 05:20:13.836167 http.go:421: WARN Response from unknown transaction. Ingoring.
2016/04/24 05:20:14.173077 tcp.go:147: WARN Gap in tcp stream. last_seq: 4268885803, seq: 4268890190, gap: 4387
2016/04/24 05:20:14.173121 mysql.go:644: WARN Response from unknown transaction. Ignoring.
2016/04/24 05:20:14.173252 tcp.go:147: WARN Gap in tcp stream. last_seq: 2783379924, seq: 2783379943, gap: 19
2016/04/24 05:20:14.173912 tcp.go:147: WARN Gap in tcp stream. last_seq: 314925966, seq: 314930725, gap: 4759
2016/04/24 05:20:14.173948 tcp.go:147: WARN Gap in tcp stream. last_seq: 4268890201, seq: 4268898334, gap: 8133
How can I resolve this? What is the best way to sniff 25 application servers?
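One way to put a number on the loss is to tally the "Gap in tcp stream" warnings. This is only a minimal sketch, assuming the log format shown in the lines above; `summarize_gaps` is a hypothetical helper and not part of Packetbeat.

```python
import re

# Matches the "Gap in tcp stream" warnings in the format quoted above.
GAP_RE = re.compile(r"Gap in tcp stream\. last_seq: (\d+), seq: (\d+), gap: (\d+)")

def summarize_gaps(log_text):
    """Return (count, total_missing_bytes, largest_gap) for all gap warnings."""
    gaps = []
    for m in GAP_RE.finditer(log_text):
        last_seq, seq, gap = (int(g) for g in m.groups())
        gaps.append(gap)
    if not gaps:
        return 0, 0, 0
    return len(gaps), sum(gaps), max(gaps)

# The four gap warnings from the log excerpt above.
sample = """\
2016/04/24 05:20:14.173077 tcp.go:147: WARN Gap in tcp stream. last_seq: 4268885803, seq: 4268890190, gap: 4387
2016/04/24 05:20:14.173252 tcp.go:147: WARN Gap in tcp stream. last_seq: 2783379924, seq: 2783379943, gap: 19
2016/04/24 05:20:14.173912 tcp.go:147: WARN Gap in tcp stream. last_seq: 314925966, seq: 314930725, gap: 4759
2016/04/24 05:20:14.173948 tcp.go:147: WARN Gap in tcp stream. last_seq: 4268890201, seq: 4268898334, gap: 8133
"""

count, total, largest = summarize_gaps(sample)
print(f"{count} gaps, {total} bytes missing, largest gap {largest} bytes")
```

Running something like this over a longer log window would show whether the loss is steady or comes in bursts.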
You can debug this step by step to find the actual issue.
1) Are you seeing response times of zero for all transactions in Elasticsearch? I assume you are looking at the response times for HTTP transactions.
2) Packetbeat records a transaction and calculates its response time only when a request has a matching response.
So, since you can see the transactions in Elasticsearch or in the logs, we can rule out Packetbeat being unable to capture transactions. The response time being zero is a separate point.
3) According to the logs you shared, it looks like Packetbeat cannot correlate a response it captured with any request it observed. See here
You might want to check whether some connections are being dropped.
4) Another thing to check is whether you have any VLAN tagging.
I have not sniffed 25 app servers myself, so I cannot say what is best. You might want to measure first, see whether there are any performance problems, and then decide.
1, 2) Response times for transactions, whether HTTP, Redis, or MySQL, are being set to unrealistic values. I see values less than zero, which I attribute to packet loss. I understand that the port mirror and my sniffing mechanism are not the best.
I will work on that. But in some cases I see unrealistic response times that are much higher than in the real world. I can check MySQL against the slow query log, and HTTP against the response times in the access log. How should I go about debugging this scenario?
I will look for the best networking solution to reduce the packet loss and post an update.
Currently I have two setups. One is on a prod test machine where Packetbeat runs on the machine itself. I don't see much of a problem there, although there are a few packet drops in that scenario too.
The second setup is a port mirror that forwards the traffic from two application servers to my dedicated server. (This one has most of the problems I listed above.)
How should I go about debugging point 1, where the response time is unusually high but no trace of it shows up in the slow query log or the access log (Tomcat)?
I would really appreciate it if you could point me in the right direction. My aim is to use Packetbeat and Topbeat, send the data to ES, and then create triggers on that basis.
Sorry for responding late.
I believe the response time is calculated as the difference between the observation times of the first request packet and the first packet of the corresponding response.
So even if there is packet loss, I am not sure how the response time could be less than zero.
It will be -1 for requests that never get a response, like some memcache commands.
It might be zero when the response time is very small, i.e. in the microsecond range.
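To make that concrete, here is a minimal sketch of the calculation as I understand it. It is not Packetbeat's actual code: `response_time_ms` is a hypothetical helper, and it assumes timestamps in integer microseconds with the result reported in whole milliseconds.

```python
def response_time_ms(request_first_ts_us, response_first_ts_us):
    """Response time in whole milliseconds, from first-packet timestamps.

    Timestamps are integer microseconds (an assumption for this sketch).
    Returns -1 when no response was observed.
    """
    if response_first_ts_us is None:
        return -1
    delta_us = response_first_ts_us - request_first_ts_us
    # int() truncates toward zero, so sub-millisecond deltas become 0.
    return int(delta_us / 1000)

print(response_time_ms(1_000_000, 1_012_000))  # ordinary request/response pair
print(response_time_ms(1_000_000, 1_000_400))  # response after 400 microseconds
print(response_time_ms(1_000_000, None))       # request with no response
```

With correct pairing and monotonic timestamps, the only negative value this can produce is the -1 marker, which is why real negative values point at correlation or timestamp problems.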
For cases where the response time is too high, it comes down to trying different options: capturing response times in the application, tracing a request and its response in tcpdump output if possible, and analyzing why there is packet loss (prod servers?).
These issues are mostly about your setup, and I have limited information, so the Packetbeat team may be able to help you more, as they do answer questions.
Without a complete pcap it's pretty hard to tell where the negative response times come from.
As @kirann42 already explained, response times depend on the first packet of the request and the first packet of the response. Even if the response message takes a long time to complete, its first packet is used. The timestamps used to compute response times depend on the OS and are provided by the sniffer.
Possible reasons for negative response times are miscorrelations, packet order getting mixed up, and non-linearities in the system clock. Packet loss is one possible cause of correlation errors.
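The toy correlator below illustrates these failure modes. It is a deliberately simplified sketch, not how Packetbeat's protocol parsers are implemented: `correlate` is hypothetical, it tracks a single pending request per connection, and timestamps are integer microseconds.

```python
def correlate(packets):
    """Pair each response with the pending request on one connection.

    packets: list of (timestamp_us, kind) in the order the sniffer
    delivers them, where kind is "request" or "response".
    Returns response times in ms, or "unknown transaction" for
    responses that arrive with no pending request.
    """
    pending = None
    results = []
    for ts, kind in packets:
        if kind == "request":
            pending = ts
        elif pending is None:
            results.append("unknown transaction")
        else:
            results.append((ts - pending) / 1000.0)
            pending = None
    return results

# Healthy capture: the response is seen 5 ms after the request.
print(correlate([(0, "request"), (5_000, "response")]))

# The request packet was lost on the mirror: nothing to correlate with.
print(correlate([(5_000, "response")]))

# Response A and request B were both lost, so response B is paired
# with request A and the response time is inflated.
print(correlate([(0, "request"), (905_000, "response")]))

# Delivery order and timestamp order disagree: negative response time.
print(correlate([(7_000, "request"), (5_000, "response")]))
```

The third case would also explain unusually high response times that never show up in the slow query log or the access log.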
Regarding performance, consider reducing the traffic to be analyzed (do you really need to collect all network traffic, or just the edge application servers?) and load-balancing, that is, splitting the traffic among multiple devices, each running its own Packetbeat instance.
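When splitting traffic, both directions of a connection have to reach the same instance, otherwise requests and responses cannot be correlated. Below is a minimal sketch of a direction-independent flow hash. `instance_for_flow` and the addresses are hypothetical; in practice this kind of distribution is usually done by the switch, tap, or NIC rather than in Python.

```python
import zlib

def instance_for_flow(src_ip, src_port, dst_ip, dst_port, n_instances):
    """Pick an instance index for a TCP flow, independent of direction."""
    # Sort the two endpoints so A->B and B->A produce the same key.
    a, b = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    key = f"{a[0]}:{a[1]}-{b[0]}:{b[1]}".encode()
    return zlib.crc32(key) % n_instances

# Request and response directions of one MySQL connection.
forward = instance_for_flow("10.0.0.1", 40000, "10.0.0.2", 3306, 4)
reverse = instance_for_flow("10.0.0.2", 3306, "10.0.0.1", 40000, 4)
print(forward == reverse)
```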