HTTP responsetime: unrealistic results

Hi,

I am recording data from a number of Elasticsearch clusters. For some queries, I get huge, unrealistic response times (around 3 minutes).
When I run the same queries on the clusters directly, they return in about 100 ms.

Note: I didn't use the Packetbeat template file.
Can you please explain why this can happen? How does Packetbeat calculate the response time (is it the "took" field from the Elasticsearch response)?

Packetbeat doesn't use the "took" field; instead, it looks at the timestamp of the request and the timestamp of the response. At the moment it doesn't look into the payload at all, so it doesn't know that you have Elasticsearch running, just that there's an application using HTTP.
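
As a rough illustration of the timestamp-based calculation, here is a minimal Python sketch. It is not Packetbeat's actual code: the function name, the millisecond unit, and the sample timestamps are all assumptions made for the example. It shows how pairing a request with the wrong response can yield a very large or even negative value.

```python
from datetime import datetime, timedelta


def response_time_ms(request_ts: datetime, response_ts: datetime) -> int:
    """Elapsed time between two capture timestamps, in whole milliseconds."""
    # Hypothetical helper: integer division of two timedeltas returns an int.
    return (response_ts - request_ts) // timedelta(milliseconds=1)


# Made-up capture timestamps for two requests seen three minutes apart.
req_a = datetime(2016, 1, 1, 12, 0, 0)
resp_a = req_a + timedelta(milliseconds=100)
req_b = req_a + timedelta(minutes=3)
resp_b = req_b + timedelta(milliseconds=100)

correct = response_time_ms(req_a, resp_a)        # request A with response A
mismatched = response_time_ms(req_a, resp_b)     # request A with response B
reversed_pair = response_time_ms(req_b, resp_a)  # request B with response A

print(correct, mismatched, reversed_pair)
```

In this sketch the correctly paired transaction takes 100 ms, while the two wrong pairings come out at roughly plus and minus three minutes.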

If you have the feeling that the values are not realistic, it could be that the request is matched with the response from another request. We call this a correlation problem. Causes could be packet drops or parsing errors.

One way to check for correlation issues is to configure packetbeat to store both the full request and the response (send_request: true, send_response: false and include_body_for: ["application/json"]), then for the transactions that have unrealistic times, check if the two seem to match.

Did you mean send_request: true, send_response: true?
I have this configured already, and include_body_for too.
I do see now that the request and the response don't match (I get a different response when running the query on Elasticsearch directly).

How can I solve this correlation problem?

If this problem has no immediate solution, can I use Logstash or some other tool to parse the response from Elasticsearch and add the "took" field and other fields to the JSON object?

Tudor, can you please help me fix this bug?
I can't use this tool because of the above.

The next step would be to identify why the miscorrelation is happening. The most likely reason is that Packetbeat drops messages when reading from the network interface.

Some questions:

  • How often does the miscorrelation happen? Maybe try putting them on a graph in Kibana (all transactions with responsetime > 30s) and see if they are clustered in some periods or randomly distributed.
  • What sniffer type do you use in Packetbeat? The default pcap or af_packet? It would be good to post the full config.
  • How many requests per second is Packetbeat seeing?
  • If you run ifconfig eth0, where eth0 is the interface it is sniffing on, do you see any drops or errors?
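
If a Kibana graph is not handy, the clustering question in the first bullet can also be checked offline. The following is a minimal Python sketch under a few assumptions: the events have been exported as dictionaries with `@timestamp` and `responsetime` fields, `responsetime` is in milliseconds, and the sample events are made up for illustration. The function name is hypothetical.

```python
from collections import Counter
from datetime import datetime


def slow_transactions_per_minute(events, threshold_ms=30_000):
    """Count transactions slower than threshold_ms, grouped by minute."""
    buckets = Counter()
    for event in events:
        if event["responsetime"] > threshold_ms:
            # Assumes ISO-style UTC timestamps such as 2016-02-01T10:00:05.000Z.
            ts = datetime.strptime(event["@timestamp"], "%Y-%m-%dT%H:%M:%S.%fZ")
            buckets[ts.strftime("%Y-%m-%d %H:%M")] += 1
    return dict(sorted(buckets.items()))


# Made-up sample events; real ones would come from an export of the index.
events = [
    {"@timestamp": "2016-02-01T10:00:05.000Z", "responsetime": 180100},
    {"@timestamp": "2016-02-01T10:00:40.000Z", "responsetime": 95},
    {"@timestamp": "2016-02-01T10:00:59.500Z", "responsetime": 45000},
    {"@timestamp": "2016-02-01T10:07:12.250Z", "responsetime": 61000},
]

per_minute = slow_transactions_per_minute(events)
print(per_minute)
```

Many slow transactions landing in the same few minutes would point to bursts (for example, packet drops under load), while an even spread would suggest something else.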

You could try parsing the response with Logstash, but it probably won't be very easy.

Depending on what you are trying to accomplish, the slow query log and Marvel might also be helpful.

Hi Tudor, Thank you for the answer.

  1. This occurs at random times (I checked the transactions with a negative response time, which I guess come from the same root cause).

  2. the config:

interfaces:
  device: any

protocols:
  http:
    ports: [9200]
    send_request: true
    send_response: true
    include_body_for: ["text/html", "application/json"]

  mysql:
    ports: [3306]

  mongodb:
    ports: [27017, 27019]
    send_request: true
    send_response: true

output:
  redis:
    enabled: true
    host: "redishostname"
    port: 6379

  3. I don't know for sure, because this is still a test environment and I delete yesterday's indices.

  4. I do see drops on some servers, BUT when I filter on responsetime < 0 and visualize the results in Kibana split across the servers, most of the transactions come from servers without any drops.

These drops worry me. Does Packetbeat have anything to do with the drops? Those servers have the same hardware as production and exactly the same traffic, yet I get drops on test and not on prod.

Tudor, what should we do about it?

I used Logstash to parse the response, and use the took data as a new field, but I have a lot of transactions where the request doesn't match the response.
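
For anyone without Logstash at hand, the same idea can be sketched in Python. This is only a rough illustration under stated assumptions: the raw response text is available as a string (as stored with send_response: true), the body is plain, unchunked, uncompressed JSON, and responsetime is in milliseconds. Both function names and the 1000 ms slack are hypothetical choices, not anything Packetbeat or Logstash provides.

```python
import json


def took_from_raw_response(raw_response: str):
    """Return the "took" value from a raw HTTP response, or None if absent."""
    # Headers and body are separated by an empty line in HTTP/1.x.
    _, separator, body = raw_response.partition("\r\n\r\n")
    if not separator:
        return None
    try:
        payload = json.loads(body)
    except ValueError:
        return None
    if not isinstance(payload, dict):
        return None
    return payload.get("took")


def looks_miscorrelated(responsetime_ms, took_ms, slack_ms=1000):
    """Heuristic: flag negative times or times far above the server's took."""
    if took_ms is None:
        return False
    return responsetime_ms < 0 or responsetime_ms > took_ms + slack_ms


# Made-up response for illustration.
raw = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: application/json\r\n"
    "\r\n"
    '{"took": 87, "timed_out": false}'
)
took = took_from_raw_response(raw)
print(took, looks_miscorrelated(180100, took), looks_miscorrelated(95, took))
```

The comparison is only a heuristic: "took" measures time inside Elasticsearch, so a measured response time that is negative or much larger than it hints at a wrong request/response pairing rather than proving one.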

Is there a chance to get a trace with raw network packets? I'd like to have a look and see if we can improve correlation.

I can't get the production traffic out.
Ask me anything, I really need the traffic to correlate correctly.

The issue seems (!) to be resolved after upgrading (Packetbeat's) Elasticsearch to version 2.2.0 (not sure why this is related).
Edit: this did not solve the problem, just mitigated it.

Any updates on this problem? I see a response time of 0 or 1 for most of my traffic.

I also see HTTP responsetimes of either 0 or 1. Why no decimals? Is it rounding off?

I have the same problem. The response time for queries is mostly 0. Is this a bug?

It's been a long time since I tested this, and I couldn't solve it at the time. I didn't dig deep into it, so I'm not sure if it is a bug.

Hi, were you able to figure out a solution to this problem?

@Akashi_Seih @jonatanzafar59 @Rachit_Puri Hi, can you please help me if any of you figured out a solution to this problem?