Maybe someone has faced a similar problem. Data nodes (a cluster of 4 data nodes) are periodically dying with the following dmesg messages:
[Fri Jul 7 02:19:34 2017] /home/kernel/COD/linux/mm/pgtable-generic.c:33: bad pmd ffff95e5b0039500(0000001b9ea009e2)
[Fri Jul 7 03:41:30 2017] BUG: Bad rss-counter state mm:ffff95e58d63be00 idx:1 val:512
[Fri Jul 7 03:41:30 2017] BUG: non-zero nr_ptes on freeing mm: 1
The system has 264 GB of RAM, and the elasticsearch process has 30 GB of memory assigned. We write on average 1.2 TB of index during the day. At first I suspected a kernel issue (the default kernel is 4.4), but after upgrading to kernel 4.10.0 the issue keeps occurring.
We have a similar cluster built with ES 1.7.5 on Ubuntu 15.04 (kernel 3.19.0) which receives a similar amount of writes/reads; that cluster does not have this problem.
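For reference, the memory assignment mentioned above looks roughly like this (a sketch, assuming the 30 GB refers to the JVM heap and a 5.x-style jvm.options file; on older versions the equivalent would be ES_HEAP_SIZE):

# /etc/elasticsearch/jvm.options (hypothetical excerpt)
-Xms30g
-Xmx30g

The rest of the RAM is left to the OS page cache, which matches the ~196 GB of cache visible in the MEM line of the atop output below.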
Below you can see the output from atop for the interval when the outage happened; note the weird CPU load average (avg1 430.99, avg5 378.26):
PRC | sys 8.79s | user 9m57s | #proc 492 | #trun 2 | #tslpi 511 | #tslpu 430 | #zombie 0 | clones 429 | #exit 431 |
CPU | sys 1% | user 100% | irq 0% | idle 3899% | wait 0% | steal 0% | guest 0% | curf 1.94GHz | curscal 88% |
CPL | avg1 430.99| avg5 378.26 | avg15 215.31 | csw 112322 | intr 742898 | numcpu 40 |
MEM | tot 251.8G | free 15.6G | cache 196.4G | dirty 0.9M | buff 420.3M | slab 5.2G |
SWP | tot 7.4G | free 7.4G | vmcom 32.8G | vmlim 133.4G |
PID MINFLT MAJFLT VSTEXT VSIZE RSIZE VGROW RGROW UID EUID MEM CMD
2924 0 0 2K 4.2T 92.6G 0K 0K elastics elastics 37% java
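If it helps, the same interval can be replayed from atop's raw logs with something like the following (a sketch; the log path and file name depend on the distro and atop's logging configuration, and the times are only an example window around the outage):

atop -r /var/log/atop/atop_20170707 -b 02:15 -e 02:30

Here -r replays a raw atop log file and -b/-e restrict the replay to the given begin/end times.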