I am indexing about 520GB of log files into Elasticsearch per day. At this
phase I am only keeping 24 hours of data (eventually the goal is 7 days).
The incoming data is not being processed in real time; it falls 30+ minutes
behind during peak load. A peak hour brings about 69,308,904 documents, 33GB
(66GB with replication). Any suggestions on how I can optimize the cluster,
or do I simply need to add more nodes?
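For a rough sense of scale, here is a back-of-the-envelope calculation of what those peak-hour figures mean per second and per node (assuming 33GB means decimal gigabytes; with one replica every document is written twice, once to the primary shard and once to the replica):

```python
# Back-of-the-envelope indexing rates from the peak-hour numbers above.
# Assumption: 33GB = 33 * 10**9 bytes (decimal, not GiB).

docs_per_hour = 69_308_904
bytes_per_hour = 33 * 10**9
nodes = 4
replicas = 1

docs_per_sec = docs_per_hour / 3600
bytes_per_doc = bytes_per_hour / docs_per_hour

# Each document is indexed (1 + replicas) times across the cluster.
shard_writes_per_sec = docs_per_sec * (1 + replicas)
writes_per_node_per_sec = shard_writes_per_sec / nodes

print(f"{docs_per_sec:,.0f} docs/sec, {bytes_per_doc:.0f} bytes/doc")
print(f"~{writes_per_node_per_sec:,.0f} shard-level writes/sec per node")
```

So each node is absorbing on the order of 9,600 small-document writes per second at peak, which lines up with the CPU-bound picture in the stats below.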
Hardware:
4-node cluster; 24 CPU cores and 24GB of memory per node
Config:
16GB heap size, 4 shards, 1 replica
Here is the template I use:
"template" : "logstash*",
"settings" : {
"number_of_shards" : 4,
"number_of_replicas" : 1,
"index.cache.field.type" : "soft",
"index.refresh_interval" : "5s",
"index.store.compress.stored" : "true",
"index.routing.allocation.total_shards_per_node" : 3
}
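Not an answer in itself, but one knob in that template is usually the first thing relaxed for write-heavy Logstash-style indices: the refresh interval. Refreshing every 5s forces new Lucene segments to be opened continuously; stretching it trades search freshness for indexing throughput. A sketch of the same template with a longer interval (30s is an assumption to experiment with, not a tested value):

```json
{
  "template" : "logstash*",
  "settings" : {
    "number_of_shards" : 4,
    "number_of_replicas" : 1,
    "index.cache.field.type" : "soft",
    "index.refresh_interval" : "30s",
    "index.store.compress.stored" : "true",
    "index.routing.allocation.total_shards_per_node" : 3
  }
}
```

Similarly, if the backlog only builds during catch-up, temporarily setting number_of_replicas to 0 and restoring it afterwards halves the write volume, at the cost of redundancy during that window.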
Here is the resource usage of a node in the cluster.
Very high run queue:
procs -------------------memory------------------ ---swap-- -----io---- --system-- -----cpu-------
 r  b   swpd   free   buff   cache si so   bi    bo    in     cs us sy id wa st
 0  0  29296 857568 115180 4170680  0  0    0 18536  2064  19307  5  0 94  0  0
16  0  29296 981432 115460 4045384  0  0    2 91284 26435 195012 32  6 61  1  0
25  0  29296 940864 115604 4086892  0  0    2  6410 19264 148894 24  4 72  0  0
26  0  29296 937804 115740 4089044  0  0    0  7610 20409 152666 24  4 72  0  0
13  0  29296 921072 115864 4108016  0  0    0 29050 19789 151698 23  4 72  0  0
10  0  29296 899636 116060 4128760  0  0    0  8922 22611 178752 29  5 66  0  0
27  0  29296 803672 116272 4223260  0  0 1300 21616  9254  59491 14  2 84  0  0
12  0  29296 703440 116476 4324696  0  0 1260  8730 21620 164412 34  5 61  0  0
 2  0  29296 723592 116756 4303752  0  0  394 46396 20529 149679 27  5 68  0  0
 1  0  29296 812524 117040 4215268  0  0    6 89006 30665 224822 35  7 57  1  0
23  0  29296 811320 117248 4215140  0  0    0 16118 16144 129557 20  3 76  0  0
 5  3  29296 793556 117440 4230480  0  0   92 13534 17697 130477 21  3 75  0  0
18  0  29296 791652 117664 4234996  0  0    0 25726 15064 105674 16  3 81  0  0
 0  0  29296 767412 117864 4257892  0  0    2  7026 23563 185956 29  5 66  0  0
32  0  29296 698344 118092 4325644  0  0    0 24436 18761 135696 26  4 70  0  0
25  0  29296 688636 118484 4333708  0  0    0 19960 21589 169049 28  5 67  0  0
16  0  29296 641116 118756 4381596  0  0    2 28256 19404 151200 27  4 68  0  0
16  0  29296 598248 118960 4425428  0  0    0 24886 20111 154420 26  4 70  0  0
 0  0  29296 684804 119228 4336856  0  0    2 51210 19501 145059 23  4 72  0  0
 3  0  29296 657820 119436 4351960  0  0   24 29936 21593 160447 27  5 68  0  0
10  0  29296 649772 119680 4368648  0  0    2  8284 20268 149946 23  5 72  0  0
24  0  29296 575948 119888 4443508  0  0    2  8762 19982 156834 31  5 64  0  0
12  0  29296 528372 120108 4490592  0  0    0 29946 15819 104973 19  3 77  1  0
23  0  29296 525860 120308 4495436  0  0    0 15698 21515 163041 27  5 69  0  0
High load and CPU usage by the elasticsearch java process:
top - 07:15:34 up 124 days, 13:04, 1 user, load average: 14.57, 12.50, 9.80
Tasks: 929 total, 1 running, 928 sleeping, 0 stopped, 0 zombie
Cpu0  : 36.3%us,  4.0%sy, 0.0%ni, 59.1%id, 0.3%wa, 0.0%hi,  0.3%si, 0.0%st
Cpu1  : 32.5%us,  7.3%sy, 0.0%ni, 54.3%id, 6.0%wa, 0.0%hi,  0.0%si, 0.0%st
Cpu2  : 33.0%us,  3.6%sy, 0.0%ni, 63.4%id, 0.0%wa, 0.0%hi,  0.0%si, 0.0%st
Cpu3  : 34.8%us,  5.9%sy, 0.0%ni, 55.1%id, 3.9%wa, 0.0%hi,  0.3%si, 0.0%st
Cpu4  : 36.6%us,  4.3%sy, 0.0%ni, 59.1%id, 0.0%wa, 0.0%hi,  0.0%si, 0.0%st
Cpu5  : 32.9%us,  5.6%sy, 0.0%ni, 59.9%id, 1.3%wa, 0.0%hi,  0.3%si, 0.0%st
Cpu6  : 32.7%us,  5.3%sy, 0.0%ni, 60.1%id, 1.7%wa, 0.0%hi,  0.3%si, 0.0%st
Cpu7  : 24.0%us, 22.7%sy, 0.0%ni, 51.6%id, 1.3%wa, 0.0%hi,  0.3%si, 0.0%st
Cpu8  : 33.8%us,  5.6%sy, 0.0%ni, 59.9%id, 0.7%wa, 0.0%hi,  0.0%si, 0.0%st
Cpu9  : 36.2%us, 10.5%sy, 0.0%ni, 45.7%id, 6.2%wa, 0.0%hi,  1.3%si, 0.0%st
Cpu10 : 47.2%us,  5.0%sy, 0.0%ni, 47.5%id, 0.3%wa, 0.0%hi,  0.0%si, 0.0%st
Cpu11 : 33.4%us, 15.6%sy, 0.0%ni, 48.3%id, 2.0%wa, 0.0%hi,  0.7%si, 0.0%st
Cpu12 : 37.4%us,  5.3%sy, 0.0%ni, 57.3%id, 0.0%wa, 0.0%hi,  0.0%si, 0.0%st
Cpu13 : 34.5%us,  7.9%sy, 0.0%ni, 54.3%id, 3.0%wa, 0.0%hi,  0.3%si, 0.0%st
Cpu14 : 64.7%us,  4.6%sy, 0.0%ni, 30.4%id, 0.3%wa, 0.0%hi,  0.0%si, 0.0%st
Cpu15 : 30.1%us, 15.9%sy, 0.0%ni, 50.7%id, 3.3%wa, 0.0%hi,  0.0%si, 0.0%st
Cpu16 : 38.7%us,  5.0%sy, 0.0%ni, 56.0%id, 0.3%wa, 0.0%hi,  0.0%si, 0.0%st
Cpu17 : 54.5%us,  4.6%sy, 0.0%ni, 36.3%id, 4.6%wa, 0.0%hi,  0.0%si, 0.0%st
Cpu18 : 39.8%us,  4.9%sy, 0.0%ni, 54.6%id, 0.3%wa, 0.0%hi,  0.3%si, 0.0%st
Cpu19 : 34.1%us,  7.9%sy, 0.0%ni, 55.0%id, 2.6%wa, 0.0%hi,  0.3%si, 0.0%st
Cpu20 : 35.7%us, 11.8%sy, 0.0%ni, 49.5%id, 2.6%wa, 0.0%hi,  0.3%si, 0.0%st
Cpu21 : 45.4%us, 10.2%sy, 0.0%ni, 34.5%id, 5.3%wa, 1.0%hi,  3.6%si, 0.0%st
Cpu22 : 38.9%us,  5.6%sy, 0.0%ni, 52.8%id, 2.6%wa, 0.0%hi,  0.0%si, 0.0%st
Cpu23 : 28.9%us, 10.5%sy, 0.0%ni, 33.9%id, 0.0%wa, 3.3%hi, 23.4%si, 0.0%st
Mem: 24675936k total, 24601936k used, 74000k free, 19960k buffers
Swap: 4192880k total, 29296k used, 4163584k free, 5052000k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
17248 user 17 0 18.9g 17g 10m S 1089.8 74.3 2850:30 java
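One small thing visible above: the box has dipped into swap (~29MB used) while the java process holds a 17GB resident set. Even light swapping is painful for the JVM, since a single swapped-out heap page can stall garbage collection. Elasticsearch of this vintage can pin its heap in RAM via mlockall; a sketch of the relevant elasticsearch.yml line (assuming the memlock ulimit for the elasticsearch user is raised to match):

```yaml
# elasticsearch.yml -- lock the process address space so the OS
# cannot swap the heap out. Requires "ulimit -l unlimited" (memlock)
# for the user running Elasticsearch.
bootstrap.mlockall: true
```

Alternatively, lowering vm.swappiness achieves a weaker version of the same goal without the ulimit change.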
--