We're planning to monitor Redis protocol communication with Packetbeat. What we want from Packetbeat is to decode and record the Redis commands exchanged between the server and its clients.
The test environment is somewhat like this:
Redis server: a VM with 4 vCPUs and 8 GB RAM
Redis client for benchmarking: my notebook ...
Redis version: 4.0.6, with default config
Packetbeat is running on the Redis server, sniffing the interface the Redis server listens on. The output is set to a local file.
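For context, the relevant part of our packetbeat.yml looks roughly like this (interface name, port, and output path are placeholders rather than our exact values, and the protocol section syntax can differ slightly between Packetbeat versions):

    packetbeat.interfaces.device: eth0      # interface the Redis server listens on
    packetbeat.protocols:
    - type: redis
      ports: [6379]                         # default Redis port
    output.file:
      path: "/var/log/packetbeat"
      filename: packetbeat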
We've tested with redis-benchmark and redis-cli.
With redis-benchmark
We only tested the Redis command LPUSH, pushing 100K strings onto the Redis server. All pushes were successful: when redis-benchmark finished, the length of the key "mylist" was exactly 100K. As shown in the redis-benchmark results, the command execution throughput was around 50K to 60K per second.
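The invocation was along these lines (host and options reproduced from memory, so treat it as a sketch rather than the exact command we ran):

    # 100K LPUSH requests; -k 1 keeps connections alive (the default)
    redis-benchmark -h 192.168.1.10 -p 6379 -t lpush -n 100000 -k 1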
Sadly, in the Packetbeat output file we only found around 40K to 50K LPUSH commands, which means around 50% to 60% packet loss. Packetbeat's own log shows "redis.unmatched_response" and "tcp_dropped_because_of_gaps" counters, but those numbers still don't add up to the difference.
We further tested redis-benchmark with keepalive turned off, and the packet loss was even greater.
With redis-cli
After the tests with redis-benchmark, we tried redis-cli, since it can set the interval between commands, letting us manually control the rate of Redis commands.
We've tested with:
interval set from 0.1s to 0.01s, 5K commands (LPUSH) per test run: Packetbeat shows no loss. However, the throughput of Redis commands is "quite" poor.
interval set to 0.0s, 10K commands (LPUSH): Packetbeat shows about 2% loss. The speed is around 1K/s. (A rough sketch of the invocation is below.)
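The redis-cli runs looked roughly like this (host and value are placeholders; the interval and repeat count were varied as described above):

    # repeat LPUSH 5000 times, waiting 0.01s between commands
    redis-cli -h 192.168.1.10 -r 5000 -i 0.01 lpush mylist somevalue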
As of now, I don't have the first-hand results of those tests at hand, but if anyone is interested, I may run them again later and update with exact numbers.
I'm not too surprised that Packetbeat can't keep up with redis-benchmark, but the drop rate seems larger than what I'd expect. I suspect redis-benchmark is making heavy use of pipelining, which might make Packetbeat's job of reconstructing the streams harder.
What was the CPU usage of Packetbeat during the test? Also, did the network interface report any drops in the RX queue?
About redis-benchmark using pipelining: from my understanding of the redis-benchmark man page, "keep alive" means redis-benchmark keeps one connection always alive and sends all the commands through this connection/pipeline. In my previous tests I tested with keepalive off, which I suppose means no persistent connection and every command is sent separately. However, the packet loss was even greater.
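For reference, the two behaviours map to separate redis-benchmark flags (if I read its help output correctly):

    # -k 0 reconnects for every request (keepalive off); -P controls pipelining
    redis-benchmark -t lpush -n 100000 -k 0     # keepalive off, no explicit pipelining
    redis-benchmark -t lpush -n 100000 -P 16    # pipeline 16 requests per round trip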
The CPU usage was not that significant during my previous tests. The average system load was around 4 to 6 on a VM with 4 vCPUs. If I remember right, Packetbeat spawned around 8 or 10 threads, and the total CPU usage of those threads was around 300% to 400%. And I'm quite sure there was no constantly busy thread (100% CPU usage all the time).
RX drops weren't taken into consideration at that time. I'll try to record more stats in my future tests.
I forgot to ask, did you use the afpacket sniffer type? Often most CPU is consumed by the sniffing itself, even before the Packetbeat code runs. Tuning snaplen and buffer_size_mb might help a bit.
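Something along these lines in packetbeat.yml, as a starting point rather than values tuned for your hardware:

    packetbeat.interfaces.device: eth0
    packetbeat.interfaces.type: af_packet
    packetbeat.interfaces.buffer_size_mb: 100   # kernel ring buffer shared with Packetbeat
    packetbeat.interfaces.snaplen: 1514         # with af_packet, a smaller snaplen lets more frames fit in the buffer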
My concern right now is to identify under what kind of conditions Packetbeat starts to "lose" some packets/Redis commands, or under what kind of conditions Packetbeat can guarantee no packet loss. The overall network traffic/Redis communication I need to decode is quite large, but it can be split and load-balanced.
I'll check these configurations and try to schedule another round of tests tomorrow morning. However, my time zone is +0800, so that's some hours away.
Also, about traffic capturing, is it still possible to use pf_ring with Packetbeat? I remember Packetbeat dropped this feature for some reason in some quite early version (maybe 2.X?).
My task is to monitor and audit the transactions that happen on dozens of Redis servers, and they can be really busy during some critical periods. I'm already considering some 10/40Gb network traffic splitting / load-balancing hardware. I hope I can try as many possibilities as I can.
We removed the pfring option because it didn't see usage and it broke when some Cgo changes happened around Go 1.4.
If your goal is auditing, so that no packet can be lost, it might be better to save the raw traffic via a dedicated sniffer (either a commercial offering, pfring, or something else) and then use Packetbeat in offline mode (the -I flag) to index the traffic. This requires a lot of disk space, but it gives you a "trail log", and you can scale the capturing and indexing separately.
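As a rough sketch of that workflow (tcpdump stands in for whatever capture tool you pick, and the paths are just examples):

    # 1) capture raw traffic to disk with a dedicated sniffer
    tcpdump -i eth0 -s 0 -w /data/redis-capture.pcap 'tcp port 6379'

    # 2) later, index the capture with Packetbeat reading from the file instead of the wire
    packetbeat -I /data/redis-capture.pcap -e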
If you still choose to use Packetbeat in online mode (this can still be advantageous), I have a couple of recommendations. Some of these might also apply if you use another tool to capture the traffic.
Use NICs based on newer Intel chipsets.
Load-balance the traffic to multiple NICs and, if needed, multiple servers. The load balancing needs to send the request and its response to the same NIC.
I'd aim for at most 50k PPS (packets per second) per NIC. That should be well within the sniffing power of af_packet.
Start a Packetbeat process for each network device (a sketch of how that might look follows this list).
Use af_packet with a large buffer size.
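Roughly like this, assuming one config file per interface that pins packetbeat.interfaces.device to its own NIC (file names and paths are illustrative):

    # one Packetbeat instance per sniffing interface, with separate config and data directories
    packetbeat -c /etc/packetbeat/packetbeat-eth2.yml --path.data /var/lib/packetbeat-eth2 &
    packetbeat -c /etc/packetbeat/packetbeat-eth3.yml --path.data /var/lib/packetbeat-eth3 &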
HTH. Just to manage expectations: be prepared to spend some time figuring things out; we don't see a ton of deployments like this.
Thank you for your advice.
Some of your recommendations have already been taken into account, like faster NICs and load balancing across several servers. pf_ring vs. af_packet is just something I'm weighing when choosing between "fewer but stronger servers with pf_ring" and "more, but comparatively weaker, servers with af_packet". But I still need a rough estimate of Packetbeat's processing capacity on a given hardware configuration with no packet loss. - -||