Elasticsearch 5.4.2 process periodically dying with high CPU load and kernel message pgtable-generic.c:33: bad pmd

Has anyone faced a similar problem? Our data nodes (the cluster has 4 data nodes) periodically die with the following dmesg messages:

[Fri Jul 7 02:19:34 2017] /home/kernel/COD/linux/mm/pgtable-generic.c:33: bad pmd ffff95e5b0039500(0000001b9ea009e2)
[Fri Jul 7 03:41:30 2017] BUG: Bad rss-counter state mm:ffff95e58d63be00 idx:1 val:512
[Fri Jul 7 03:41:30 2017] BUG: non-zero nr_ptes on freeing mm: 1

The system has 264 GB of RAM, and the Elasticsearch process has 30G of memory assigned. We write about 1.2 TB of index data per day on average. At first I suspected a kernel issue (the default kernel is 4.4), but the issue keeps occurring after upgrading to kernel 4.10.0.

We have a similar cluster built with ES 1.7.5 on Ubuntu 15.04 with kernel 3.19.0, which receives a similar volume of writes and reads. That cluster does not have this problem.

Below is the atop output for the interval when the outage happened. Note the unusually high load average (avg1 430.99, avg5 378.26):

PRC | sys 8.79s | user 9m57s | #proc 492 | #trun 2 | #tslpi 511 | #tslpu 430 | #zombie 0 | clones 429 | #exit 431 |

CPU | sys 1% | user 100% | irq 0% | idle 3899% | wait 0% | steal 0% | guest 0% | curf 1.94GHz | curscal 88% |

CPL | avg1 430.99| avg5 378.26 | avg15 215.31 | csw 112322 | intr 742898 | numcpu 40 |

MEM | tot 251.8G | free 15.6G | cache 196.4G | dirty 0.9M | buff 420.3M | slab 5.2G |

SWP | tot 7.4G | free 7.4G | vmcom 32.8G | vmlim 133.4G |

PID MINFLT MAJFLT VSTEXT VSIZE RSIZE VGROW RGROW UID EUID MEM CMD

2924 0 0 2K 4.2T 92.6G 0K 0K elastics elastics 37% java
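
One possible reading of these numbers (an observation, not a confirmed diagnosis): on Linux the load average counts tasks in uninterruptible sleep (state D) as well as runnable ones. atop reports #tslpu 430, which roughly matches avg1 430.99, while the CPUs are almost entirely idle. A minimal sketch for checking this on a live node, assuming a procps-style `ps`; the helper name `count_dstate` is made up for illustration:

```shell
#!/bin/sh
# Count entries in uninterruptible sleep (state D) from a list of
# process state codes on stdin, one per line.
count_dstate() {
    awk '$1 ~ /^D/ { n++ } END { print n + 0 }'
}

# -L lists every thread, "state=" prints only the state code without a header.
if command -v ps >/dev/null 2>&1; then
    ps -eLo state= 2>/dev/null | count_dstate
fi
```

If the count stays in the hundreds while the load average is high, the load is probably coming from blocked threads rather than from CPU work.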


@Ruslan_Lutsenko I saw that you also commented on https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1644056.
I did a bit of research and found https://lkml.org/lkml/2017/4/10/152, which indicates that this could be related to Transparent Huge Pages (THP).
Would you mind trying to disable THP with:

echo -n never > /sys/kernel/mm/transparent_hugepage/enabled

and see if that helps until the kernel bug is fixed?
We recommend disabling THP anyway.
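
To confirm which THP mode is in effect before and after the change, something like the following may help. It is a rough sketch: it relies on the kernel marking the active value in square brackets (for example `always [madvise] never`), and the helper name `thp_active_mode` is made up for illustration.

```shell
#!/bin/sh
# Print the active mode from a sysfs-style THP file,
# e.g. "always [madvise] never" -> "madvise".
thp_active_mode() {
    # The active setting is the one enclosed in square brackets.
    sed -n 's/.*\[\([a-z+_]*\)\].*/\1/p' "$1"
}

THP_ENABLED=/sys/kernel/mm/transparent_hugepage/enabled

if [ -r "$THP_ENABLED" ]; then
    echo "THP enabled mode: $(thp_active_mode "$THP_ENABLED")"
else
    echo "THP sysfs file not found: $THP_ENABLED"
fi
```

Note that a value written to this sysfs file does not survive a reboot, so the `echo` has to be reapplied at boot (for example from an init script) if it turns out to help.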