Client nodes killed by kernel OOM killer

For a couple of months I have been running ES 6.4.2, and my client nodes keep getting reaped/OOM-killed by the Linux kernel, so I never get a heap dump. Taking a forced heap dump gives me the dominator tree below.

The heap size was set to 9GB on a machine with 15GB of RAM.

After reducing the heap size to 7GB, I have not seen the OOM killer strike yet.

Can I get help understanding why the killing occurred and why it has stopped? I would like to be better informed, since the usual suggestion when load increases is to raise the heap size.

By "client node" do you mean a coordinating-only node or do you mean a node running some kind of client such as the transport client?

Can you share the entirety of the kernel messages that were emitted when the OOM killer did its thing? They're normally accessible using dmesg as long as it wasn't too far in the past.
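
For instance, a command along these lines usually finds the relevant block (the exact flags and context-line counts are just a suggestion and vary a little between distributions):

dmesg | grep -i -A 80 'invoked oom-killer'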

Do you have any nonstandard kernel settings regarding memory allocation?

One possibility is that there is something else on the same machine consuming memory. The OOM killer picks one process to terminate, and is often going to pick Elasticsearch even when Elasticsearch is behaving properly.

However, in the manual page on setting the heap size it's recommended not to exceed 50% of the physical RAM in the machine. The reason given there is to allow space for the page cache, but another reason is that I think with a 9GB heap some JVMs may allocate up to 9GB of direct (i.e. off-heap) buffers in addition to the heap, and the total would exceed your available physical RAM. Reducing the heap size to 7GB will also reduce the default limit on the direct buffers to 7GB, keeping the total under 15GB.
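
For example, if you did want to keep more heap while still bounding the off-heap usage, I believe you could cap the direct buffers explicitly in jvm.options; the 4g figure here is purely illustrative:

-Xms7g
-Xmx7g
-XX:MaxDirectMemorySize=4g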

Thank you for the quick response. I don't have any modified settings apart from the heap size.
Yes, I mean a coordinating-only node.

[6290050.221231] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[6290050.229744] 6133 total pagecache pages
[6290050.234766] 0 pages in swap cache
[6290050.239386] Swap cache stats: add 0, delete 0, find 0/0
[6290050.245548] Free swap  = 0kB
[6290050.249507] Total swap = 0kB
[6290050.254599] 3932061 pages RAM
[6290050.259687] 0 pages HighMem/MovableOnly
[6290050.265018] 83523 pages reserved
[6290050.270271] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
[6290050.279716] [ 1763]     0  1763     2742      103      10       3        0         -1000 udevd
[6290050.288360] [ 2139]     0  2139    27285       57      20       3        0             0 lvmetad
[6290050.296468] [ 2148]     0  2148     6799       49      16       3        0             0 lvmpolld
[6290050.305497] [ 2363]     0  2363     2353      125       9       3        0             0 dhclient
[6290050.313801] [ 2483]     0  2483     2353      120      10       3        0             0 dhclient
[6290050.322774] [ 2525]     0  2525   118374     1185      39       6        0             0 amazon-ssm-agen
[6290050.332746] [ 2538]     0  2538    13251      105      26       3        0         -1000 auditd
[6290050.342207] [ 2563]     0  2563    61955     1121      24       3        0             0 rsyslogd
[6290050.351581] [ 2579]     0  2579    23225       61      15       4        0             0 irqbalance
[6290050.362345] [ 2589]     0  2589     1630       25       9       3        0             0 rngd
[6290050.372743] [ 2608]    32  2608     8841       98      22       3        0             0 rpcbind
[6290050.384120] [ 2629]    29  2629     9983      202      23       3        0             0 rpc.statd
[6290050.393683] [ 2660]    81  2660     5461       64      15       3        0             0 dbus-daemon
[6290050.403662] [ 2695]     0  2695     1099       35       8       3        0             0 acpid
[6290050.416639] [ 2890]     0  2890     4797       44      14       3        0             0 atd
[6290050.426001] [ 3070]     0  3070   163579    19455     168       4        0             0 salt-minion
[6290050.434937] [ 3402]     0  3402    20134      206      41       3        0         -1000 sshd
[6290050.443084] [ 5048]     0  5048    23576      272      46       3        0             0 master
[6290050.453992] [ 5279]     0  5279    30420      163      18       3        0             0 crond
[6290050.462791] [ 5464]   498  5464    25760       99      22       3        0             0 chronyd
[6290050.472720] [ 6493]    89  6493    23607      281      48       3        0             0 qmgr
[6290050.481595] [ 7218]     0  7218  1091875   165311     524       7        0             0 java
[6290050.489498] [ 7786]     0  7786   220790    10606     102       6        0          -500 dockerd
[6290050.497512] [ 7796]     0  7796   198473     2384      58       6        0          -500 docker-containe
[6290050.507139] [ 7978]     0  7978     1090       25       8       3        0             0 mingetty
[6290050.518380] [ 7981]     0  7981     1090       25       8       3        0             0 mingetty
[6290050.531268] [ 7983]     0  7983     1090       24       7       3        0             0 mingetty
[6290050.542723] [ 7985]     0  7985     1090       24       8       3        0             0 mingetty
[6290050.554387] [ 7987]     0  7987     1090       24       8       3        0             0 mingetty
[6290050.567911] [ 7989]     0  7989     1090       25       7       3        0             0 mingetty
[6290050.579246] [ 7994]     0  7994     1627       31       9       3        0             0 agetty
[6290050.591076] [28983]    89 28983    23647      332      49       4        0             0 tlsmgr
[6290050.601999] [ 5529]     0  5529     1088       22       7       3        0             0 collectdmon
[6290050.613622] [ 5531]     0  5531  1866123    14024     239       8        0             0 collectd
[6290050.625308] [ 3697]     0  3697    47568      114      19       5        0          -500 docker-proxy
[6290050.638519] [ 3710]     0  3710    90165      923      29       5        0          -500 docker-proxy
[6290050.651582] [ 3717]     0  3717     1878      112       9       5        0          -999 docker-containe
[6290050.665068] [ 3738]  9739  3738  4249848  3478285    6895      20        0             0 java
[6290050.676272] [ 5938]    89  5938    23590      266      47       3        0             0 pickup
[6290050.688550] [ 6352]     0  6352     2741      103       9       3        0         -1000 udevd
[6290050.699718] [ 6353]     0  6353     2741      104       9       3        0         -1000 udevd
[6290050.710674] [11926]     0 11926   227979    35016     226       4        0             0 salt-minion
[6290050.718678] [12273]     0 12273   135508    47372     221       3        0             0 yum
[6290050.725271] [12280]     0 12280     2741      104       9       3        0         -1000 udevd
[6290050.732140] [12281]     0 12281     2741      104       9       3        0         -1000 udevd
[6290050.739197] [12282]     0 12282     2741      104       9       3        0         -1000 udevd
[6290050.747372] [12283]     0 12283     2741      104       9       3        0         -1000 udevd
[6290050.755627] Out of memory: Kill process 3738 (java) score 905 or sacrifice child
[6290050.761946] Killed process 3738 (java) total-vm:16999392kB, anon-rss:13891240kB, file-rss:21900kB, shmem-rss:0kB
[6290051.075764] oom_reaper: reaped process 3738 (java), now anon-rss:9906316kB, file-rss:22020kB, shmem-rss:0kB

Thanks; I'd expect a few more messages before this too, but this looks like it implicates the memory usage of direct buffers.

The JVM had over 13GB resident, and there are a handful of other processes consuming memory too, so I think the total got too close to 15GB. Since the machine can't support Elasticsearch taking 13GB of memory, you might even want to consider setting the heap size lower than 7GB.

I have attached the other half of the kernel log as well. Reducing the heap size would affect query speed etc., right? The same configuration worked very well in ES 2, so does something take up more memory in ES 6?

[6290049.896673] Hardware name: Xen HVM domU, BIOS 4.2.amazon 08/24/2006
[6290049.904861] Call Trace:
[6290049.909736]  dump_stack+0x5c/0x82
[6290049.915090]  dump_header+0x94/0x21c
[6290049.920512]  ? _raw_spin_lock+0x10/0x20
[6290049.926480]  oom_kill_process+0x213/0x410
[6290049.932787]  out_of_memory+0x296/0x4c0
[6290049.938449]  __alloc_pages_slowpath+0x9ef/0xdd0
[6290049.944541]  __alloc_pages_nodemask+0x207/0x220
[6290049.950621]  filemap_fault+0x1e3/0x5f0
[6290049.957028]  ? filemap_map_pages+0x1fa/0x300
[6290049.964112]  ext4_filemap_fault+0x2c/0x40 [ext4]
[6290049.970721]  __do_fault+0x20/0x60
[6290049.975804]  __handle_mm_fault+0xdad/0x1460
[6290049.982408]  handle_mm_fault+0xaa/0x1e0
[6290049.988377]  __do_page_fault+0x22e/0x4c0
[6290049.994657]  ? page_fault+0x2f/0x50
[6290050.000040]  page_fault+0x45/0x50
[6290050.005426] RIP: 9eeb75a8:0x7f9e9c056050
[6290050.011697] RSP: 9c0561e8:00007f9e9fd27da0 EFLAGS: 7f9eaff3f080
[6290050.011723] Mem-Info:
[6290050.024735] active_anon:1295208 inactive_anon:17 isolated_anon:0
 active_file:138 inactive_file:211 isolated_file:0
 unevictable:2482084 dirty:1 writeback:0 unstable:0
 slab_reclaimable:6885 slab_unreclaimable:9531
 mapped:5644 shmem:20 pagetables:9153 bounce:0
 free:32185 free_pcp:664 free_cma:0
[6290050.061555] Node 0 active_anon:5180832kB inactive_anon:68kB active_file:376kB inactive_file:696kB unevictable:9928336kB isolated(anon):0kB isolated(file):0kB mapped:22740kB dirty:12kB writeback:0kB shmem:80kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[6290050.089545] Node 0 DMA free:15904kB min:68kB low:84kB high:100kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15988kB managed:15904kB mlocked:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
[6290050.112629] lowmem_reserve[]: 0 3720 14997 14997
[6290050.117340] Node 0 DMA32 free:62828kB min:16748kB low:20932kB high:25116kB active_anon:3629340kB inactive_anon:4kB active_file:108kB inactive_file:336kB unevictable:104816kB writepending:0kB present:3915776kB managed:3830836kB mlocked:104816kB kernel_stack:1408kB pagetables:8644kB bounce:0kB free_pcp:1156kB local_pcp:296kB free_cma:0kB
[6290050.138553] lowmem_reserve[]: 0 0 11276 11276
[6290050.143321] Node 0 Normal free:51464kB min:50764kB low:63452kB high:76140kB active_anon:1551488kB inactive_anon:64kB active_file:300kB inactive_file:644kB unevictable:9823520kB writepending:16kB present:11796480kB managed:11547412kB mlocked:9823520kB kernel_stack:5520kB pagetables:27968kB bounce:0kB free_pcp:912kB local_pcp:116kB free_cma:0kB
[6290050.166481] lowmem_reserve[]: 0 0 0 0
[6290050.170793] Node 0 DMA: 0*4kB 0*8kB 0*16kB 1*32kB (U) 2*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15904kB
[6290050.185621] Node 0 DMA32: 600*4kB (UME) 290*8kB (UME) 427*16kB (UME) 316*32kB (UME) 202*64kB (UME) 117*128kB (UME) 45*256kB (UM) 0*512kB 0*1024kB 1*2048kB (H) 0*4096kB = 63136kB
[6290050.202742] Node 0 Normal: 396*4kB (UEH) 210*8kB (UMEH) 234*16kB (UMEH) 189*32kB (UME) 164*64kB (UME) 98*128kB (UME) 45*256kB (UE) 5*512kB (UME) 1*1024kB (H) 0*2048kB 0*4096kB = 51200kB

Thanks, I think that also points towards Elasticsearch just using more memory than you've got.

It might slow things down but it might also speed things up. You can only really tell by experiment.

Seems so. I count over 25,000 changes between versions 2.4.6 and 6.4.2, some of which are quite dramatic. I wouldn't want to speculate on which of them are a factor here.

So I turned the heap size down to 1GB and tried it out for a week, and the clients crashed today with the following stack trace:

[2019-05-03T16:58:12,238][ERROR][o.e.t.n.Netty4Utils      ] fatal error on the network layer
        at org.elasticsearch.transport.netty4.Netty4Utils.maybeDie(Netty4Utils.java:182)
        at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.exceptionCaught(Netty4MessageChannelHandler.java:73)
        at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285)
        at io.netty.channel.AbstractChannelHandlerContext.notifyHandlerException(AbstractChannelHandlerContext.java:850)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:364)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:297)
        at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:413)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:545)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:499)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
        at java.lang.Thread.run(Thread.java:748)
[2019-05-03T16:58:12,276][ERROR][o.e.t.n.Netty4Utils      ] fatal error on the network layer
        at org.elasticsearch.transport.netty4.Netty4Utils.maybeDie(Netty4Utils.java:182)
        at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.exceptionCaught(Netty4MessageChannelHandler.java:73)
        at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285)
        at io.netty.channel.AbstractChannelHandlerContext.notifyHandlerException(AbstractChannelHandlerContext.java:850)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:364)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:297)
        at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:413)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:545)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:499)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
        at java.lang.Thread.run(Thread.java:748)

and also an OOM:

[2019-05-03T16:58:12,291][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [metabase-es6-client-us1-6] fatal error in thread [Thread-4], exiting
java.lang.OutOfMemoryError: Java heap space
        at org.elasticsearch.common.util.PageCacheRecycler$1.newInstance(PageCacheRecycler.java:99) ~[elasticsearch-6.4.2.jar:6.4.2]
        at org.elasticsearch.common.util.PageCacheRecycler$1.newInstance(PageCacheRecycler.java:96) ~[elasticsearch-6.4.2.jar:6.4.2]
        at org.elasticsearch.common.recycler.DequeRecycler.obtain(DequeRecycler.java:53) ~[elasticsearch-6.4.2.jar:6.4.2]
        at org.elasticsearch.common.recycler.AbstractRecycler.obtain(AbstractRecycler.java:33) ~[elasticsearch-6.4.2.jar:6.4.2]
        at org.elasticsearch.common.recycler.DequeRecycler.obtain(DequeRecycler.java:28) ~[elasticsearch-6.4.2.jar:6.4.2]
        at org.elasticsearch.common.recycler.FilterRecycler.obtain(FilterRecycler.java:39) ~[elasticsearch-6.4.2.jar:6.4.2]
        at org.elasticsearch.common.recycler.Recyclers$3.obtain(Recyclers.java:119) ~[elasticsearch-6.4.2.jar:6.4.2]
        at org.elasticsearch.common.recycler.FilterRecycler.obtain(FilterRecycler.java:39) ~[elasticsearch-6.4.2.jar:6.4.2]
        at org.elasticsearch.common.util.PageCacheRecycler.bytePage(PageCacheRecycler.java:147) ~[elasticsearch-6.4.2.jar:6.4.2]
        at org.elasticsearch.common.util.AbstractBigArray.newBytePage(AbstractBigArray.java:117) ~[elasticsearch-6.4.2.jar:6.4.2]
        at org.elasticsearch.common.util.BigByteArray.resize(BigByteArray.java:143) ~[elasticsearch-6.4.2.jar:6.4.2]
        at org.elasticsearch.common.util.BigArrays.resizeInPlace(BigArrays.java:448) ~[elasticsearch-6.4.2.jar:6.4.2]
        at org.elasticsearch.common.util.BigArrays.resize(BigArrays.java:495) ~[elasticsearch-6.4.2.jar:6.4.2]
        at org.elasticsearch.common.util.BigArrays.grow(BigArrays.java:512) ~[elasticsearch-6.4.2.jar:6.4.2]
        at org.elasticsearch.common.io.stream.BytesStreamOutput.ensureCapacity(BytesStreamOutput.java:157) ~[elasticsearch-6.4.2.jar:6.4.2]
        at org.elasticsearch.common.io.stream.ReleasableBytesStreamOutput.ensureCapacity(ReleasableBytesStreamOutput.java:69) ~[elasticsearch-6.4.2.jar:6.4.2]
        at org.elasticsearch.common.io.stream.BytesStreamOutput.writeBytes(BytesStreamOutput.java:89) ~[elasticsearch-6.4.2.jar:6.4.2]
        at org.elasticsearch.transport.CompressibleBytesOutputStream.writeBytes(CompressibleBytesOutputStream.java:85) ~[elasticsearch-6.4.2.jar:6.4.2]
        at org.elasticsearch.common.io.stream.StreamOutput.write(StreamOutput.java:459) ~[elasticsearch-6.4.2.jar:6.4.2]
        at org.elasticsearch.common.bytes.BytesReference.writeTo(BytesReference.java:86) ~[elasticsearch-6.4.2.jar:6.4.2]
        at org.elasticsearch.common.io.stream.StreamOutput.writeBytesReference(StreamOutput.java:203) ~[elasticsearch-6.4.2.jar:6.4.2]
        at org.elasticsearch.search.SearchHit.writeTo(SearchHit.java:801) ~[elasticsearch-6.4.2.jar:6.4.2]
        at org.elasticsearch.search.SearchHits.writeTo(SearchHits.java:213) ~[elasticsearch-6.4.2.jar:6.4.2]
        at org.elasticsearch.search.internal.InternalSearchResponse.writeTo(InternalSearchResponse.java:63) ~[elasticsearch-6.4.2.jar:6.4.2]
        at org.elasticsearch.action.search.SearchResponse.writeTo(SearchResponse.java:385) ~[elasticsearch-6.4.2.jar:6.4.2]
        at org.elasticsearch.transport.TcpTransport.buildMessage(TcpTransport.java:1280) ~[elasticsearch-6.4.2.jar:6.4.2]
        at org.elasticsearch.transport.TcpTransport.sendResponse(TcpTransport.java:1233) ~[elasticsearch-6.4.2.jar:6.4.2]
        at org.elasticsearch.transport.TcpTransport.sendResponse(TcpTransport.java:1207) ~[elasticsearch-6.4.2.jar:6.4.2]
        at org.elasticsearch.transport.TcpTransportChannel.sendResponse(TcpTransportChannel.java:66) ~[elasticsearch-6.4.2.jar:6.4.2]
        at org.elasticsearch.transport.TcpTransportChannel.sendResponse(TcpTransportChannel.java:60) ~[elasticsearch-6.4.2.jar:6.4.2]
        at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:54) ~[elasticsearch-6.4.2.jar:6.4.2]
        at org.elasticsearch.action.support.HandledTransportAction$TransportHandler$1.onResponse(HandledTransportAction.java:83) ~[elasticsearch-6.4.2.jar:6.4.2]
[2019-05-03T16:58:12,291][WARN ][o.e.x.s.t.n.SecurityNetty4ServerTransport] [metabase-es6-client-us1-6] exception caught on transport layer [NettyTcpChannel{localAddress=/172.18.42.2:45568, remoteAddress=metabase-es-184--hhac.int.us1.signalfx.com/10.7.66.196:9300}], closing connection
org.elasticsearch.ElasticsearchException: java.lang.OutOfMemoryError: Java heap space
        at org.elasticsearch.transport.netty4.Netty4Transport.exceptionCaught(Netty4Transport.java:237) [transport-netty4-6.4.2.jar:6.4.2]
        at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.exceptionCaught(Netty4MessageChannelHandler.java:74) [transport-netty4-6.4.2.jar:6.4.2]
        at io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:285) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.channel.AbstractChannelHandlerContext.notifyHandlerException(AbstractChannelHandlerContext.java:850) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:364) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310) [netty-codec-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:297) [netty-codec-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:413) [netty-codec-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265) [netty-codec-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
            at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
            at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241) [netty-handler-4.1.16.Final.jar:4.1.16.Final]
            at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
            at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
            at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
            at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1359) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
            at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
            at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
            at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:935) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
            at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
            at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:645) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
            at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:545) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
            at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:499) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
            at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459) [netty-transport-4.1.16.Final.jar:4.1.16.Final]
            at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) [netty-common-4.1.16.Final.jar:4.1.16.Final]
            at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
    Caused by: java.lang.OutOfMemoryError: Java heap space
            at org.elasticsearch.common.util.PageCacheRecycler$1.newInstance(PageCacheRecycler.java:99) ~[elasticsearch-6.4.2.jar:6.4.2]
            at org.elasticsearch.common.util.PageCacheRecycler$1.newInstance(PageCacheRecycler.java:96) ~[elasticsearch-6.4.2.jar:6.4.2]
            at org.elasticsearch.common.recycler.DequeRecycler.obtain(DequeRecycler.java:53) ~[elasticsearch-6.4.2.jar:6.4.2]
            at org.elasticsearch.common.recycler.AbstractRecycler.obtain(AbstractRecycler.java:33) ~[elasticsearch-6.4.2.jar:6.4.2]
            at org.elasticsearch.common.recycler.DequeRecycler.obtain(DequeRecycler.java:28) ~[elasticsearch-6.4.2.jar:6.4.2]
            at org.elasticsearch.common.recycler.FilterRecycler.obtain(FilterRecycler.java:39) ~[elasticsearch-6.4.2.jar:6.4.2]
            at org.elasticsearch.common.recycler.Recyclers$3.obtain(Recyclers.java:119) ~[elasticsearch-6.4.2.jar:6.4.2]
            at org.elasticsearch.common.recycler.FilterRecycler.obtain(FilterRecycler.java:39) ~[elasticsearch-6.4.2.jar:6.4.2]
            at org.elasticsearch.common.util.PageCacheRecycler.bytePage(PageCacheRecycler.java:147) ~[elasticsearch-6.4.2.jar:6.4.2]
            at org.elasticsearch.common.util.AbstractBigArray.newBytePage(AbstractBigArray.java:117) ~[elasticsearch-6.4.2.jar:6.4.2]
            at org.elasticsearch.common.util.BigByteArray.resize(BigByteArray.java:143) ~[elasticsearch-6.4.2.jar:6.4.2]
            at org.elasticsearch.common.util.BigArrays.resizeInPlace(BigArrays.java:448) ~[elasticsearch-6.4.2.jar:6.4.2]
            at org.elasticsearch.common.util.BigArrays.resize(BigArrays.java:495) ~[elasticsearch-6.4.2.jar:6.4.2]
            at org.elasticsearch.common.util.BigArrays.grow(BigArrays.java:512) ~[elasticsearch-6.4.2.jar:6.4.2]
            at org.elasticsearch.common.io.stream.BytesStreamOutput.ensureCapacity(BytesStreamOutput.java:157) ~[elasticsearch-6.4.2.jar:6.4.2]
            at org.elasticsearch.common.io.stream.ReleasableBytesStreamOutput.ensureCapacity(ReleasableBytesStreamOutput.java:69) ~[elasticsearch-6.4.2.jar:6.4.2]
            at org.elasticsearch.common.io.stream.BytesStreamOutput.writeBytes(BytesStreamOutput.java:89) ~[elasticsearch-6.4.2.jar:6.4.2]
            at org.elasticsearch.transport.CompressibleBytesOutputStream.writeBytes(CompressibleBytesOutputStream.java:85) ~[elasticsearch-6.4.2.jar:6.4.2]
            at org.elasticsearch.common.io.stream.StreamOutput.write(StreamOutput.java:459) ~[elasticsearch-6.4.2.jar:6.4.2]
            at org.elasticsearch.common.bytes.BytesReference.writeTo(BytesReference.java:86) ~[elasticsearch-6.4.2.jar:6.4.2]
            at org.elasticsearch.common.io.stream.StreamOutput.writeBytesReference(StreamOutput.java:203) ~[elasticsearch-6.4.2.jar:6.4.2]
            at org.elasticsearch.search.SearchHit.writeTo(SearchHit.java:801) ~[elasticsearch-6.4.2.jar:6.4.2]
            at org.elasticsearch.search.SearchHits.writeTo(SearchHits.java:213) ~[elasticsearch-6.4.2.jar:6.4.2]
            at org.elasticsearch.search.internal.InternalSearchResponse.writeTo(InternalSearchResponse.java:63) ~[elasticsearch-6.4.2.jar:6.4.2]
            at org.elasticsearch.action.search.SearchResponse.writeTo(SearchResponse.java:385) ~[elasticsearch-6.4.2.jar:6.4.2]
            at org.elasticsearch.transport.TcpTransport.buildMessage(TcpTransport.java:1280) ~[elasticsearch-6.4.2.jar:6.4.2]
            at org.elasticsearch.transport.TcpTransport.sendResponse(TcpTransport.java:1233) ~[elasticsearch-6.4.2.jar:6.4.2]
            at org.elasticsearch.transport.TcpTransport.sendResponse(TcpTransport.java:1207) ~[elasticsearch-6.4.2.jar:6.4.2]
            at org.elasticsearch.transport.TcpTransportChannel.sendResponse(TcpTransportChannel.java:66) ~[elasticsearch-6.4.2.jar:6.4.2]
            at org.elasticsearch.transport.TcpTransportChannel.sendResponse(TcpTransportChannel.java:60) ~[elasticsearch-6.4.2.jar:6.4.2]
            at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:54)

Also, in the heap dumps we see incoming references as below. Since we do not use any of the X-Pack features, could that also explain the OOM-reaper killing?

No, that looks normal.

I think you are hitting an OutOfMemoryError because you have dramatically reduced the heap space available to this node. At the start of this thread you were comparing the behaviour with 7GB vs 9GB of heap, but you now say you have reduced it all the way to 1GB. It is not clear why you have done this, but I am not surprised that such a small node can no longer keep up with the workload.

Note that this is different behaviour from the original problem. It's no longer the kernel OOM killer that's shutting this node down, which means that you are not running out of physical RAM here.

@DavidTurner thanks for such quick responses. Turning the heap down to 1GB so dramatically is totally my fault; I was trying to understand the workload. The faster GC and the heap trend seemed okay, until they weren't.

I am trying to find the sweet spot for this setting for our workload. A total noob question: is there a way/metric I can use from Elasticsearch to monitor the workload?
Since the client/coordinating node handles both the distribution of bulk indexing and the scatter-gather of searches, what metrics would help me gain more insight quickly?
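
For example, I'm assuming something like the node stats and cat thread pool APIs would expose the relevant numbers; the metric filters below are just my guess at what matters for a coordinating node:

GET _nodes/stats/jvm,breaker,thread_pool
GET _cat/thread_pool?v&h=node_name,name,active,queue,rejected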

Interesting. Was there a sudden shift in workload? When you hit an OutOfMemoryError you should get a heap dump, which might be worth investigating.

It might also be worth upgrading to 7.0 since it's quite a bit more resilient to heap pressure, preferring to reject requests instead of just crashing like this.
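
For what it's worth, my understanding is that this 7.0 behaviour comes largely from the real-memory circuit breaker, which is enabled by default; the settings below are just the relevant knobs with what I believe are their 7.0 defaults:

indices.breaker.total.use_real_memory: true
indices.breaker.total.limit: 95%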

I am still investigating. The only explanation I have so far is that the queries appear to have started asking for more results at that time compared with historical data.

Hence the need for guidance on which metrics to look at, as stated above. Repasting here:

I see x-pack-ml running in my container. Is that normal?

Mem: 7056020K used, 753792K free, 72K shrd, 58716K buff, 467096K cached
CPU:   6% usr   2% sys   0% nic  87% idle   0% io   0% irq   2% sirq
Load average: 0.78 0.76 0.66 3/431 182
  PID  PPID USER     STAT   VSZ %VSZ CPU %CPU COMMAND
    1     0 maestro  S    8152m 104%   1  11% /opt/jdk8/bin/java -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+AlwaysPreTouch -Xs
   74     1 maestro  S     135m   2%   0   0% /opt/elasticsearch-6.4.2/modules/x-pack-ml/platform/linux-x86_64/bin/controller
  173     0 root     S     6576   0%   2   0% bash
  182   173 root     R     1548   0%   0   0% top

It is unless you have disabled ML by setting xpack.ml.enabled: false.
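
i.e. with a line like this in elasticsearch.yml:

xpack.ml.enabled: false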
