Hi there,
I have an ES instance that maxes out the CPU whenever it runs searches that span a long timeframe (> 12 hours):
top - 12:36:34 up 6 days, 5:00, 1 user, load average: 2.41, 1.73, 0.89
Tasks: 166 total, 1 running, 165 sleeping, 0 stopped, 0 zombie
%Cpu0 : 99.0 us, 0.7 sy, 0.0 ni, 0.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 98.0 us, 1.3 sy, 0.0 ni, 0.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 99.3 us, 0.0 sy, 0.0 ni, 0.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 99.0 us, 0.3 sy, 0.0 ni, 0.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 65982164 total, 9197488 free, 18018484 used, 38766192 buff/cache
KiB Swap: 0 total, 0 free, 0 used. 47095628 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
23514 elastic+ 20 0 77.844g 0.019t 2.772g S 395.7 31.1 3363:49 java
573 root 20 0 0 0 0 S 0.3 0.0 13:24.44 jbd2/sdc1-8
25103 kibana 20 0 1280316 118080 21356 S 0.3 0.2 44:11.49 node
1 root 20 0 57088 6776 5244 S 0.0 0.0 0:03.06 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.08 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 0:24.15 ksoftirqd/0
This is on an ESXi host, where I'm limited to 4 vCPUs per VM, so I think I need to horizontally scale the ES portion of my stack.
I'm fuzzy on how to do that based on my goals. Here's what I'm after:
- ES is able to return results for these long queries. I get that they may take some time to run, and that's fine. I don't need the results instantly.
- I don't care about having one ES instance take over for the other in the event of a failure. I believe failover requires the data to be duplicated across both ES instances, and I don't want to consume 30GB/day * 2.
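
To make the second goal concrete, here is a rough sketch of what I think a "no replicas" setup would look like, written as the body of an index template plus the disk arithmetic. The assumptions are mine: the template name is made up, I'm assuming my indices match the default logstash-* pattern, and the shard count of 2 is just a placeholder for a two-node layout. Nothing here talks to the cluster; it only builds and prints the JSON I would send with PUT _template/<name>.

```python
import json

# Hypothetical template name, chosen only for illustration.
TEMPLATE_NAME = "logstash-no-replicas"


def build_template(primary_shards, replicas=0):
    """Build an index template body in the 6.x format (index_patterns + settings)."""
    return {
        # Assumption: indices follow the default "logstash-*" naming.
        "index_patterns": ["logstash-*"],
        # Assumption: a higher order so these settings win over an order-0 template.
        "order": 1,
        "settings": {
            "index": {
                "number_of_shards": primary_shards,
                "number_of_replicas": replicas,
            }
        },
    }


def daily_disk_gb(ingest_gb_per_day, replicas):
    """Each replica is a full copy of the primary data, so disk use scales with 1 + replicas."""
    return ingest_gb_per_day * (1 + replicas)


if __name__ == "__main__":
    print("PUT _template/" + TEMPLATE_NAME)
    print(json.dumps(build_template(primary_shards=2), indent=2))
    print("GB/day with 0 replicas:", daily_disk_gb(30, 0))
    print("GB/day with 1 replica:", daily_disk_gb(30, 1))
```

If I have this right, a template only applies to indices created after it is installed, so existing indices would keep their current shard and replica settings.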
I currently have one Logstash instance feeding one Elasticsearch instance (each instance is on its own VM). My Kibana instance is on the same VM as the Elasticsearch instance.
I'm not sure if there's a doc that describes the different ways ELK can be scaled depending on what I'm after. If so, I'd love to be pointed to the relevant links. If not, hopefully someone here is willing to provide some insight/guidance.
FWIW, the stack is v6.2.4.
Thanks!