We have a problem with the Elasticsearch indexing rate at a customer's installation. Our system is a single-node Redis -> Logstash (5.6.1) -> Elasticsearch (5.6.1) pipeline.
We've run several different tests, both in our lab and in the customer's deployment:
- indexing 500 MB of test data with our test pipeline took 2 min 54 s in our lab, but 9 min 06 s on the customer's system
- another test with a different pipeline (a SIP/MEGACO call with 45 messages per call) managed 113 calls/s on our development server, but only 46 calls/s at the customer's site
To isolate the problem we wrote a test that indexes messages directly into Elasticsearch (one call with 29 messages, about 27 KB of JSON data; a sketch of the benchmark follows the numbers below). The results were:
development: 12667 messages/second
customer: 3427 messages/second
local MacBook Pro 2015: 7966 messages/second
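This is a minimal sketch of that benchmark, assuming the Python client and its bulk helper; the host, index name, and document body are placeholders (our real test sends the actual call messages), but the message count per call matches the test:

import time

from elasticsearch import Elasticsearch, helpers

ES_HOST = "http://localhost:9200"   # assumption: point this at the node under test
INDEX = "indexing-benchmark"        # assumption: throwaway benchmark index
CALLS = 1000                        # simulated calls
MESSAGES_PER_CALL = 29              # as in our test: 29 messages (~27 KB JSON) per call

es = Elasticsearch([ES_HOST])

def actions():
    # One bulk action per message; the payload just pads the document so a
    # whole call is roughly 27 KB of JSON.
    for call in range(CALLS):
        for seq in range(MESSAGES_PER_CALL):
            yield {
                "_index": INDEX,
                "_type": "logs",    # ES 5.x still uses mapping types
                "_source": {"call_id": call, "seq": seq, "payload": "x" * 900},
            }

start = time.time()
helpers.bulk(es, actions(), chunk_size=500, request_timeout=60)
elapsed = time.time() - start
total = CALLS * MESSAGES_PER_CALL
print("indexed %d messages in %.1f s -> %d messages/second" % (total, elapsed, int(total / elapsed)))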
Our customer has an SSD mounted for the Elasticsearch data, and iostat shows around 15 MB/s written to it (a small write/fsync probe to cross-check the disk is sketched after the iostat samples below):
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
vda 1.88 0.53 38.07 32 2284
vdb 506.18 68.20 14209.00 4092 852540
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
vda 2.40 71.13 29.27 4268 1756
vdb 498.27 207.80 13844.40 12468 830664
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
vda 3.50 7.00 35.07 420 2104
vdb 575.10 396.27 17268.47 23776 1036108
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
vda 3.08 3.73 46.53 224 2792
vdb 584.55 326.00 18992.93 19560 1139576
Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
vda 1.73 6.27 18.47 376 1108
vdb 592.80 584.13 19776.93 35048 1186616
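To cross-check these numbers independently of Elasticsearch, a simple write-plus-fsync probe on the data disk could be run. This is only a sketch; the path is an assumption and should point at the mount backing vdb:

import os
import time

PATH = "/var/lib/elasticsearch/fsync-test.bin"  # assumption: a file on the Elasticsearch data mount
BLOCK = b"\0" * (1024 * 1024)                   # 1 MiB per write
BLOCKS = 512                                    # 512 MiB total

start = time.time()
with open(PATH, "wb") as f:
    for _ in range(BLOCKS):
        f.write(BLOCK)
        f.flush()
        os.fsync(f.fileno())                    # force each block to the device, like translog fsyncs do
elapsed = time.time() - start
os.remove(PATH)
print("%.1f MB/s sequential write with fsync after every MiB" % (BLOCKS / elapsed))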
top usually shows a high load, but about a third of the CPU appears idle; the pipeline doesn't keep pace with the logs being sent, so the Redis queue grows until it hits its memory limit (a small polling sketch to watch the queue is shown after the top output below):
top - 15:35:50 up 23:12, 8 users, load average: 13.91, 14.95, 14.63
Tasks: 245 total, 1 running, 244 sleeping, 0 stopped, 0 zombie
%Cpu0 : 20.6 us, 2.8 sy, 35.7 ni, 33.9 id, 1.0 wa, 0.0 hi, 1.4 si, 4.5 st
%Cpu1 : 38.5 us, 2.1 sy, 22.3 ni, 30.0 id, 3.5 wa, 0.0 hi, 0.0 si, 3.5 st
%Cpu2 : 9.0 us, 1.7 sy, 34.1 ni, 50.3 id, 1.4 wa, 0.0 hi, 0.0 si, 3.4 st
%Cpu3 : 14.5 us, 1.4 sy, 24.8 ni, 54.8 id, 1.0 wa, 0.0 hi, 0.0 si, 3.4 st
%Cpu4 : 13.8 us, 1.4 sy, 34.3 ni, 46.7 id, 0.7 wa, 0.0 hi, 0.0 si, 3.1 st
%Cpu5 : 43.9 us, 1.4 sy, 17.2 ni, 31.9 id, 1.4 wa, 0.0 hi, 0.0 si, 4.2 st
%Cpu6 : 20.0 us, 2.1 sy, 42.8 ni, 28.8 id, 2.1 wa, 0.0 hi, 0.0 si, 4.2 st
%Cpu7 : 21.1 us, 2.8 sy, 29.8 ni, 39.1 id, 3.1 wa, 0.0 hi, 0.0 si, 4.2 st
%Cpu8 : 24.6 us, 2.5 sy, 28.4 ni, 37.2 id, 3.2 wa, 0.0 hi, 0.0 si, 4.2 st
%Cpu9 : 15.0 us, 1.4 sy, 34.1 ni, 42.9 id, 2.8 wa, 0.0 hi, 0.3 si, 3.5 st
%Cpu10 : 14.2 us, 1.4 sy, 21.5 ni, 57.8 id, 1.4 wa, 0.0 hi, 0.0 si, 3.8 st
%Cpu11 : 17.4 us, 1.7 sy, 28.1 ni, 46.2 id, 2.8 wa, 0.0 hi, 0.3 si, 3.5 st
%Cpu12 : 16.8 us, 2.1 sy, 34.6 ni, 40.6 id, 1.7 wa, 0.0 hi, 0.3 si, 3.8 st
%Cpu13 : 18.1 us, 3.1 sy, 31.7 ni, 40.4 id, 2.8 wa, 0.0 hi, 0.0 si, 3.8 st
%Cpu14 : 11.8 us, 1.7 sy, 40.3 ni, 40.6 id, 2.1 wa, 0.0 hi, 0.0 si, 3.5 st
%Cpu15 : 31.9 us, 1.8 sy, 26.3 ni, 34.7 id, 1.8 wa, 0.0 hi, 0.0 si, 3.5 st
KiB Mem : 24621160 total, 1793300 free, 15553140 used, 7274720 buff/cache
KiB Swap: 2096124 total, 1945872 free, 150252 used. 8589640 avail Mem
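To watch the Redis backlog mentioned above, we poll the list length. This is a minimal sketch; the list key is an assumption (use whatever key the Logstash redis input is configured with):

import time

import redis

r = redis.StrictRedis(host="localhost", port=6379, db=0)  # assumption: Redis on localhost
KEY = "logstash"                                          # assumption: the list key read by the Logstash redis input

while True:
    print(time.strftime("%H:%M:%S"), "queue length:", r.llen(KEY))
    time.sleep(5)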
Our development system:
VM with 8 CPU assigned, 12 GB RAM
Logstash 5.6.1 with 8 Logstash workers
Elasticsearch 5.6.1, -Xms6g -Xmx6g
Chassis: DELL PowerEdge R630
CPU: Intel Xeon E5-2670 v3 @ 2.30 GHz; 1 socket / 12 cores; HT enabled (24 vCPUs)
RAM: 64 GB
HDD: Seagate ST300MM0006 300GB SAS 10k 6Gb/s spinning disk
RAID: DELL PERC H730P Mini 12Gb/s 2048MB Cache
Volume1: RAID 10 (4 Disk 300GB SAS 10k)
Hypervisor: VMware 6.0U1
Our customer's system:
VM with 16 CPU assigned & pinned, 32 GB RAM
Logstash 5.6.1 with 16 Logstash workers
Elasticsearch 5.6.1 -Xms14405m -Xmx14405m
Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
KVM Hypervisor
SSD (consumer grade)
Index settings for single log entries:
number_of_shards: 2
index.refresh_interval: 5s
Index settings for summary log entries (there's a separate process creating and updating summary records):
number_of_shards: 1
index.refresh_interval: 5s
Indexes are created per day and per level (debug, info, ...).
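For completeness, the settings above could be applied through an index template roughly like this; the template name and index pattern are assumptions about our per-day/per-level naming scheme, not our exact configuration:

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])   # assumption: local node

# Settings for the single-log-entry indexes; template name and pattern are
# assumptions (ES 5.x templates use the "template" field for the pattern).
es.indices.put_template(
    name="single-log-entries",
    body={
        "template": "logs-*",
        "settings": {
            "number_of_shards": 2,
            "index.refresh_interval": "5s",
        },
    },
)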
So what could be the cause of this problem? Thanks a lot!