I have a 12-node cluster, each node with 24G of RAM, running only Elasticsearch.
Elasticsearch is configured with an 11G heap, as below.
-Xms11g
-Xmx11g
There is no indexing or querying happening on this cluster; it is a fresh deployment, basically sitting idle.
But after a few days, a machine crashes and I have to reboot it.
There is no heap dump and there are no GC events in the Elasticsearch logs.
This has happened on 3 machines in the cluster now.
Upon checking dmesg, it says:
//Out of memory: Kill process 19590 (java) score 507 or sacrifice child
Below is the relevant dmesg output.
<>
Node 0 Normal free:45816kB min:59172kB low:73964kB high:88756kB active_anon:8kB inactive_anon:0kB active_file:544kB inactive_file:412kB unevictable:12650952kB isolated(anon):0kB isolated(file):128kB present:21719040kB mlocked:207080kB dirty:0kB writeback:0kB mapped:116920kB shmem:0kB slab_reclaimable:9872kB slab_unreclaimable:42008kB kernel_stack:5680kB pagetables:29844kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve: 0 0 0 0
Node 0 DMA: 1*4kB 1*8kB 2*16kB 1*32kB 2*64kB 0*128kB 0*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15564kB
Node 0 DMA32: 12*4kB 9*8kB 7*16kB 29*32kB 16*64kB 4*128kB 6*256kB 9*512kB 8*1024kB 7*2048kB 15*4096kB = 92808kB
Node 0 Normal: 10433*4kB 23*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB = 46012kB
29638 total pagecache pages
109 pages in swap cache
Swap cache stats: add 116875, delete 116766, find 4926/7159
Free swap = 631592kB
Total swap = 1042140kB
6291440 pages RAM
139727 pages reserved
35020 pages shared
6079389 pages non-shared
[ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
[ 638] 0 638 2731 69 3 -17 -1000 udevd
[ 1149] 0 1149 2730 69 1 -17 -1000 udevd
[ 1264] 0 1264 2730 66 7 -17 -1000 udevd
[ 1714] 0 1714 7441 132 5 -17 -1000 auditd
[ 1738] 0 1738 1540 121 0 0 0 portreserve
[ 1748] 0 1748 63919 217 1 0 0 rsyslogd
[ 1763] 0 1763 4586 104 2 0 0 irqbalance
[ 1785] 32 1785 4745 124 5 0 0 rpcbind
[ 1809] 29 1809 5838 180 6 0 0 rpc.statd
[ 1839] 0 1839 1671 92 2 0 0 vnstatd
[ 1850] 81 1850 5359 87 6 0 0 dbus-daemon
[ 1888] 0 1888 1019 131 6 0 0 acpid
[ 1900] 68 1900 9581 200 2 0 0 hald
[ 1901] 0 1901 5099 132 1 0 0 hald-runner
[ 1934] 0 1934 5629 119 2 0 0 hald-addon-inpu
[ 1948] 68 1948 4501 162 6 0 0 hald-addon-acpi
[ 2126] 0 2126 43169 249 1 0 0 vmtoolsd
[ 2179] 0 2179 96539 180 2 0 0 automount
[ 2311] 0 2311 7979 124 2 -17 -1000 sshd
[ 2324] 0 2324 5428 161 6 0 0 xinetd
[ 2335] 38 2335 7685 239 7 0 0 ntpd
[ 2481] 0 2481 20253 238 1 0 0 master
[ 2491] 89 2491 20316 238 7 0 0 qmgr
[ 2495] 0 2495 45773 202 3 0 0 abrtd
[ 2523] 0 2523 29221 148 5 0 0 crond
[ 2538] 0 2538 5276 68 4 0 0 atd
[ 2958] 0 2958 148548 232 0 0 0 salt-minion
[ 2959] 0 2959 123917 59 6 0 0 salt-minion
[ 2987] 0 2987 16119 84 0 0 0 certmonger
[ 3020] 0 3020 1015 115 3 0 0 mingetty
[ 3024] 0 3024 1015 115 3 0 0 mingetty
[ 3027] 0 3027 1015 115 7 0 0 mingetty
[ 3030] 0 3030 1015 115 1 0 0 mingetty
[ 3034] 0 3034 1015 115 4 0 0 mingetty
[ 3038] 0 3038 1015 115 3 0 0 mingetty
[19590] 61947 19590 4583234 3162869 1 0 0 java
[12304] 89 12304 20273 225 1 0 0 pickup
Out of memory: Kill process 19590 (java) score 507 or sacrifice child
Killed process 19590, UID 61947, (java) total-vm:18332936kB, anon-rss:12534556kB, file-rss:116920kB
[0]: VMCI: Updating context from (ID=0xea4a4c50) to (ID=0xea4a4c50) on event (type=0).
</>
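To make the numbers in the dump easier to compare, here is a minimal sketch that converts the page counts from the process table into kB and GiB. It assumes a 4 kB page size, which the dump does not state explicitly; the helper names are only for illustration.

```python
# Values copied from the dmesg output above.
PAGE_KB = 4  # assumption: 4 kB pages; not stated in the dump itself

java_total_vm_pages = 4583234  # total_vm column for pid 19590 (java)
java_rss_pages = 3162869       # rss column for pid 19590 (java)
ram_pages = 6291440            # "6291440 pages RAM"
heap_gib = 11                  # -Xms11g / -Xmx11g


def pages_to_kb(pages, page_kb=PAGE_KB):
    """Convert a page count to kB."""
    return pages * page_kb


def kb_to_gib(kb):
    """Convert kB to GiB."""
    return kb / (1024 * 1024)


java_rss_kb = pages_to_kb(java_rss_pages)
java_total_vm_kb = pages_to_kb(java_total_vm_pages)
java_rss_gib = kb_to_gib(java_rss_kb)
ram_gib = kb_to_gib(pages_to_kb(ram_pages))

print(f"java rss:      {java_rss_kb} kB ({java_rss_gib:.2f} GiB)")
print(f"java total_vm: {java_total_vm_kb} kB")
print(f"machine RAM:   {ram_gib:.2f} GiB")
print(f"rss above the configured heap: {java_rss_gib - heap_gib:.2f} GiB")
```

With 4 kB pages, the java rss works out to 12651476 kB, which equals anon-rss plus file-rss from the "Killed process" line (12534556 + 116920), and total_vm works out to 18332936 kB, matching the reported total-vm. So the java process was holding roughly 12 GiB resident on a 24 GiB machine when it was killed.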