I am recording data from a number of Elasticsearch clusters. For some queries, I get unrealistically large response times (around 3 minutes).
When I run the same queries directly on the clusters, they return in about 100 ms.
Note: I didn't use the Packetbeat template file.
Can you please explain why this can happen? How does Packetbeat calculate the response time (is it the "took" field from the Elasticsearch response)?
Packetbeat doesn't use the "took" field; instead, it looks at the timestamp of the request and the timestamp of the response. At the moment it doesn't look into the payload at all, so it doesn't know that you have Elasticsearch running, just that there's an application using HTTP.
If you have the feeling that the values are not realistic, it could be that the request is matched with the response from another request. We call this a correlation problem. Causes could be packet drops or parsing errors.
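To make the correlation problem concrete, here is a minimal sketch of how a lost response can inflate a timestamp-based response time. It is a simplified model and not Packetbeat's actual code: the function name correlate, the event tuples, and the assumption that each response is paired with the oldest unanswered request on the connection are all illustrative.

```python
from collections import deque


def correlate(events):
    """Pair each response with the oldest unanswered request (hypothetical model).

    events: iterable of (timestamp_ms, kind, label), where kind is
    "request" or "response" and all events belong to one connection.
    Returns a list of (request_label, response_label, response_time_ms).
    """
    pending = deque()
    transactions = []
    for ts, kind, label in events:
        if kind == "request":
            pending.append((ts, label))
        elif pending:
            req_ts, req_label = pending.popleft()
            # Response time is just the difference between the two timestamps.
            transactions.append((req_label, label, ts - req_ts))
    return transactions


# Everything captured: two requests, each answered after 100 ms.
complete = [
    (0, "request", "A"),
    (100, "response", "A"),
    (180_000, "request", "B"),
    (180_100, "response", "B"),
]

# Same traffic, but the sniffer dropped the packets of response A.
lossy = [
    (0, "request", "A"),
    (180_000, "request", "B"),
    (180_100, "response", "B"),
]

print(correlate(complete))
print(correlate(lossy))
```

In the lossy case, request A ends up paired with the response to B, so the reported time is roughly three minutes even though each query really took 100 ms.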
One way to check for correlation issues is to configure packetbeat to store both the full request and the response (send_request: true, send_response: false and include_body_for: ["application/json"]), then for the transactions that have unrealistic times, check if the two seem to match.
Did you mean send_request: true, send_response: true?
I ask because I already have this configured, along with include_body_for.
I do see now that the request and the response don't match (I get a different response when running the query on Elasticsearch directly).
How can I solve this correlation problem?
If this problem has no immediate solution, can I use Logstash or some other tool to parse the response from Elasticsearch and add the "took" field and other fields to the JSON object?
The next step would be to identify why the mis-correlation is happening. The most likely reason is that Packetbeat drops messages when reading from the network interface.
Some questions:
How often does the mis-correlation happen? Maybe try putting the affected transactions on a graph in Kibana (all transactions with responsetime > 30s) and see whether they are clustered in certain periods or randomly distributed.
What sniffer type do you use in Packetbeat? The default pcap, or af_packet? It would be good to post the full config.
How many requests per second is Packetbeat seeing?
If you run ifconfig eth0, where eth0 is the interface Packetbeat is sniffing on, do you see any drops or errors?
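If the transactions are exported from Elasticsearch, the clustering check from the first question can also be done outside Kibana. The following is only a rough sketch: the function name slow_per_minute is made up, and it assumes each transaction is a dict with an "@timestamp" string in the form 2016-02-10T09:15:02.123Z and a "responsetime" value in milliseconds.

```python
from collections import Counter
from datetime import datetime


def slow_per_minute(transactions, threshold_ms=30_000):
    """Count transactions slower than threshold_ms, grouped by minute.

    Assumes (not verified against any particular Packetbeat version) that
    "responsetime" is in milliseconds and "@timestamp" is UTC with a
    fractional-seconds part and a trailing "Z".
    """
    counts = Counter()
    for tx in transactions:
        if tx["responsetime"] > threshold_ms:
            ts = datetime.strptime(tx["@timestamp"], "%Y-%m-%dT%H:%M:%S.%fZ")
            counts[ts.strftime("%Y-%m-%d %H:%M")] += 1
    # Sort by minute so bursts are easy to spot when printed.
    return dict(sorted(counts.items()))


sample = [
    {"@timestamp": "2016-02-10T09:15:02.123Z", "responsetime": 181000},
    {"@timestamp": "2016-02-10T09:15:40.500Z", "responsetime": 179500},
    {"@timestamp": "2016-02-10T09:16:01.000Z", "responsetime": 95},
    {"@timestamp": "2016-02-10T11:42:10.250Z", "responsetime": 180200},
]

for minute, count in slow_per_minute(sample).items():
    print(minute, count)
```

Many slow transactions within the same few minutes would point towards bursts of packet loss, while an even spread would suggest something more systematic.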
I don't know for sure, because this is still a test environment and I deleted yesterday's indices.
I do see drops on some servers, BUT when I query for responsetime < 0 and visualize the results in Kibana split by server, most of the transactions come from servers without any drops.
These drops worry me. Does Packetbeat have anything to do with them? These servers have the same hardware as production and receive exactly the same traffic, yet I get drops on test and not on prod.
I used Logstash to parse the response and add the took value as a new field, but I have a lot of transactions where the request doesn't match the response.
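For readers who want to try the same extraction without Logstash, here is a small sketch of the idea in Python. The function name extract_took is hypothetical, and it assumes the stored response text may start with the HTTP headers followed by a blank line and then the JSON body; adjust it to whatever your stored field actually contains.

```python
import json


def extract_took(raw_response):
    """Return the "took" value (milliseconds) from a stored response, or None.

    Assumption: raw_response is either a bare JSON body or HTTP headers,
    a blank line, and then the JSON body.
    """
    if not isinstance(raw_response, str):
        return None
    body = raw_response
    if raw_response.startswith("HTTP/") and "\r\n\r\n" in raw_response:
        # Keep only what follows the first blank line (the body).
        body = raw_response.split("\r\n\r\n", 1)[1]
    try:
        doc = json.loads(body)
    except ValueError:
        return None
    if isinstance(doc, dict):
        took = doc.get("took")
        if isinstance(took, int) and not isinstance(took, bool):
            return took
    return None


raw = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: application/json; charset=UTF-8\r\n"
    "\r\n"
    '{"took": 87, "timed_out": false, "hits": {"total": 3}}'
)
print(extract_took(raw))
```

Keep in mind that "took" only covers the time Elasticsearch spent executing the search, so it will not equal the response time measured on the wire, and it is only trustworthy when the stored response really belongs to the stored request.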
The issue seems (!) to be resolved after upgrading (Packetbeat's) Elasticsearch to version 2.2.0 (not sure why this is related).
Edit: this did not solve the problem, it just mitigated it.