Observed kernel bug for Elasticsearch 7.17.5 on Debian 12

Hi all,

We are running an Elasticsearch 7.17.5 cluster, and some of our data nodes already run on Debian 12. Recently one data node failed. Elasticsearch itself did not log anything about the incident. systemd was also in a bad state afterwards: tools like journalctl no longer worked properly. We had to reboot the server completely; after that the ES service started again without any problems.

So my question is: Are there known issues between certain Elasticsearch and kernel versions? We are aware that Debian 12 is not officially supported yet (in the support matrix Debian 12 is not mentioned at all) - what are the reasons for this? Is official support for Debian 12 coming in the near future? Or should we rather go back to Debian 11 at this point? Could it be that the situation will improve with an upgrade to ES 8.X?

UPDATE: This is the kernel version

$ uname -r
6.1.0-12-amd64

Thanks for any feedback or ideas!

Crash log
2023-10-09T00:27:32.279022+02:00 es_data_3 kernel: [569182.173825] BUG: kernel NULL pointer dereference, address: 0000000000000036
2023-10-09T00:27:32.279034+02:00 es_data_3 kernel: [569182.173848] #PF: supervisor read access in kernel mode
2023-10-09T00:27:32.279035+02:00 es_data_3 kernel: [569182.173856] #PF: error_code(0x0000) - not-present page
2023-10-09T00:27:32.279036+02:00 es_data_3 kernel: [569182.173863] PGD 875ddd067 P4D 875ddd067 PUD 0 
2023-10-09T00:27:32.279037+02:00 es_data_3 kernel: [569182.173871] Oops: 0000 [#1] PREEMPT SMP NOPTI
2023-10-09T00:27:32.279040+02:00 es_data_3 kernel: [569182.173879] CPU: 9 PID: 30752 Comm: elasticsearch[e Not tainted 6.1.0-12-amd64 #1  Debian 6.1.52-1
2023-10-09T00:27:32.279041+02:00 es_data_3 kernel: [569182.173891] Hardware name: HPE ProLiant DL360 Gen10/ProLiant DL360 Gen10, BIOS U32 01/23/2021
2023-10-09T00:27:32.279042+02:00 es_data_3 kernel: [569182.173902] RIP: 0010:__filemap_get_folio+0xad/0x340
2023-10-09T00:27:32.279045+02:00 es_data_3 kernel: [569182.173914] Code: 10 e8 37 69 75 00 48 89 c3 48 3d 02 04 00 00 74 e2 48 3d 06 04 00 00 74 da 48 85 c0 0f 84 25 02 00 00 a8 01 0f 85 27 02 00 00 <8b> 40 34 85 c0 74 c2 8d 50 01 f0 0f b1 53 34 75 f2 48 8b 54 24 28
2023-10-09T00:27:32.279060+02:00 es_data_3 kernel: [569182.173935] RSP: 0000:ffffb7c40fbf7c70 EFLAGS: 00010246
2023-10-09T00:27:32.279060+02:00 es_data_3 kernel: [569182.173943] RAX: 0000000000000002 RBX: 0000000000000002 RCX: 0000000000000002
2023-10-09T00:27:32.279061+02:00 es_data_3 kernel: [569182.173952] RDX: 0000000000000008 RSI: ffff96e69051ada0 RDI: ffffb7c40fbf7c80
2023-10-09T00:27:32.279066+02:00 es_data_3 kernel: [569182.173961] RBP: 0000000000000000 R08: 00000000000ee6cf R09: 00000000000ee6d0
2023-10-09T00:27:32.279066+02:00 es_data_3 kernel: [569182.173970] R10: ffffffffffffffc0 R11: 0000000000000000 R12: 0000000000000000
2023-10-09T00:27:32.279067+02:00 es_data_3 kernel: [569182.173980] R13: ffff96e45fb6dab0 R14: 00000000000ee6ca R15: ffff96e44f0e18e0
2023-10-09T00:27:32.279068+02:00 es_data_3 kernel: [569182.173989] FS:  00007fb9720ff6c0(0000) GS:ffff96ec1fac0000(0000) knlGS:0000000000000000
2023-10-09T00:27:32.279068+02:00 es_data_3 kernel: [569182.173999] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2023-10-09T00:27:32.279069+02:00 es_data_3 kernel: [569182.174007] CR2: 0000000000000036 CR3: 0000000e0b280003 CR4: 00000000007706e0
2023-10-09T00:27:32.279070+02:00 es_data_3 kernel: [569182.174017] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2023-10-09T00:27:32.279072+02:00 es_data_3 kernel: [569182.174025] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
2023-10-09T00:27:32.279073+02:00 es_data_3 kernel: [569182.174034] PKRU: 55555554
2023-10-09T00:27:32.279073+02:00 es_data_3 kernel: [569182.174039] Call Trace:
2023-10-09T00:27:32.279074+02:00 es_data_3 kernel: [569182.174044]  <TASK>
2023-10-09T00:27:32.279077+02:00 es_data_3 kernel: [569182.174050]  ? __die_body.cold+0x1a/0x1f
2023-10-09T00:27:32.279078+02:00 es_data_3 kernel: [569182.174059]  ? page_fault_oops+0xd2/0x2b0
2023-10-09T00:27:32.279078+02:00 es_data_3 kernel: [569182.174068]  ? exc_page_fault+0x70/0x170
2023-10-09T00:27:32.279079+02:00 es_data_3 kernel: [569182.174075]  ? asm_exc_page_fault+0x22/0x30
2023-10-09T00:27:32.279080+02:00 es_data_3 kernel: [569182.174084]  ? __filemap_get_folio+0xad/0x340
2023-10-09T00:27:32.279080+02:00 es_data_3 kernel: [569182.174092]  filemap_fault+0x65/0x910
2023-10-09T00:27:32.279082+02:00 es_data_3 kernel: [569182.174099]  ? preempt_count_add+0x47/0xa0
2023-10-09T00:27:32.279083+02:00 es_data_3 kernel: [569182.174107]  __do_fault+0x30/0x110
2023-10-09T00:27:32.279084+02:00 es_data_3 kernel: [569182.174114]  do_fault+0x1b9/0x410
2023-10-09T00:27:32.279084+02:00 es_data_3 kernel: [569182.174121]  __handle_mm_fault+0x660/0xfa0
2023-10-09T00:27:32.279085+02:00 es_data_3 kernel: [569182.174130]  handle_mm_fault+0xdb/0x2d0
2023-10-09T00:27:32.279085+02:00 es_data_3 kernel: [569182.174137]  do_user_addr_fault+0x19c/0x570
2023-10-09T00:27:32.279086+02:00 es_data_3 kernel: [569182.174144]  exc_page_fault+0x70/0x170
2023-10-09T00:27:32.279088+02:00 es_data_3 kernel: [569182.174151]  asm_exc_page_fault+0x22/0x30
2023-10-09T00:27:32.279088+02:00 es_data_3 kernel: [569182.174158] RIP: 0033:0x7fdcbe97a0aa
2023-10-09T00:27:32.279089+02:00 es_data_3 kernel: [569182.174429] Code: 0f 85 32 10 00 00 45 8b 41 18 41 8b 49 1c 44 3b c1 0f 8d 55 10 00 00 49 8b 49 10 41 8b f8 ff c7 41 89 79 18 4c 8b c9 4d 63 c0 <47> 0f be 04 01 45 85 c0 0f 8c 95 01 00 00 44 89 44 24 38 66 66 90
2023-10-09T00:27:32.279090+02:00 es_data_3 kernel: [569182.174970] RSP: 002b:00007fb9720fda40 EFLAGS: 00010207
2023-10-09T00:27:32.279090+02:00 es_data_3 kernel: [569182.175241] RAX: 0000000005f357bc RBX: 00000000e9690431 RCX: 00007f4723f954dc
2023-10-09T00:27:32.279091+02:00 es_data_3 kernel: [569182.175512] RDX: 00000000e9690405 RSI: 000000074b482188 RDI: 0000000005f357bd
2023-10-09T00:27:32.279094+02:00 es_data_3 kernel: [569182.175783] RBP: 000000074b482558 R08: 0000000005f357bc R09: 00007f4723f954dc
2023-10-09T00:27:32.279094+02:00 es_data_3 kernel: [569182.176036] R10: 000000074b482558 R11: 00000000e96904c1 R12: 0000000000000000
2023-10-09T00:27:32.279095+02:00 es_data_3 kernel: [569182.176283] R13: 0000000000000001 R14: 000000074b4831d0 R15: 00007fdb10116380
2023-10-09T00:27:32.279096+02:00 es_data_3 kernel: [569182.176533]  </TASK>
2023-10-09T00:27:32.279096+02:00 es_data_3 kernel: [569182.176781] Modules linked in: tls nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nf_log_syslog nft_log nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink binfmt_misc nls_ascii nls_cp437 vfat fat ext4 intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common crc16 mbcache jbd2 isst_if_common skx_edac nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass ghash_clmulni_intel ipmi_ssif sha512_ssse3 sha512_generic aesni_intel crypto_simd cryptd rapl mgag200 intel_cstate drm_shmem_helper hpwdt evdev intel_uncore pcspkr drm_kms_helper acpi_ipmi hpilo watchdog intel_pch_thermal ipmi_si mei_me ipmi_devintf ipmi_msghandler mei sg acpi_tad ioatdma acpi_power_meter button drm fuse loop configfs efi_pstore efivarfs ip_tables x_tables autofs4 xfs uas usb_storage dm_mod sd_mod t10_pi ses crc64_rocksoft enclosure crc64 crc_t10dif
2023-10-09T00:27:32.279097+02:00 es_data_3 kernel: [569182.176833]  crct10dif_generic xhci_pci xhci_hcd smartpqi bnx2x ehci_pci ehci_hcd scsi_transport_sas igb scsi_mod usbcore crct10dif_pclmul crct10dif_common crc32_pclmul i2c_algo_bit dca scsi_common mdio libcrc32c crc32c_generic lpc_ich crc32c_intel wmi usb_common
2023-10-09T00:27:32.279099+02:00 es_data_3 kernel: [569182.179928] CR2: 0000000000000036
2023-10-09T00:27:32.279099+02:00 es_data_3 kernel: [569182.180211] ---[ end trace 0000000000000000 ]---
2023-10-09T00:27:32.279100+02:00 es_data_3 kernel: [569182.253423] RIP: 0010:__filemap_get_folio+0xad/0x340
2023-10-09T00:27:32.279100+02:00 es_data_3 kernel: [569182.253812] Code: 10 e8 37 69 75 00 48 89 c3 48 3d 02 04 00 00 74 e2 48 3d 06 04 00 00 74 da 48 85 c0 0f 84 25 02 00 00 a8 01 0f 85 27 02 00 00 <8b> 40 34 85 c0 74 c2 8d 50 01 f0 0f b1 53 34 75 f2 48 8b 54 24 28
2023-10-09T00:27:32.279103+02:00 es_data_3 kernel: [569182.254505] RSP: 0000:ffffb7c40fbf7c70 EFLAGS: 00010246
2023-10-09T00:27:32.279103+02:00 es_data_3 kernel: [569182.254854] RAX: 0000000000000002 RBX: 0000000000000002 RCX: 0000000000000002
2023-10-09T00:27:32.279104+02:00 es_data_3 kernel: [569182.255203] RDX: 0000000000000008 RSI: ffff96e69051ada0 RDI: ffffb7c40fbf7c80
2023-10-09T00:27:32.279104+02:00 es_data_3 kernel: [569182.255554] RBP: 0000000000000000 R08: 00000000000ee6cf R09: 00000000000ee6d0
2023-10-09T00:27:32.279105+02:00 es_data_3 kernel: [569182.255909] R10: ffffffffffffffc0 R11: 0000000000000000 R12: 0000000000000000
2023-10-09T00:27:32.279105+02:00 es_data_3 kernel: [569182.256267] R13: ffff96e45fb6dab0 R14: 00000000000ee6ca R15: ffff96e44f0e18e0
2023-10-09T00:27:32.279106+02:00 es_data_3 kernel: [569182.256599] FS:  00007fb9720ff6c0(0000) GS:ffff96ec1fac0000(0000) knlGS:0000000000000000
2023-10-09T00:27:32.279108+02:00 es_data_3 kernel: [569182.256936] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2023-10-09T00:27:32.279108+02:00 es_data_3 kernel: [569182.257275] CR2: 0000000000000036 CR3: 0000000e0b280003 CR4: 00000000007706e0
2023-10-09T00:27:32.279109+02:00 es_data_3 kernel: [569182.257617] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2023-10-09T00:27:32.279109+02:00 es_data_3 kernel: [569182.257960] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
2023-10-09T00:27:32.279110+02:00 es_data_3 kernel: [569182.258306] PKRU: 55555554
2023-10-09T00:27:32.279110+02:00 es_data_3 kernel: [569182.258651] note: elasticsearch[e[30752] exited with irqs disabled
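
For anyone reading the trace, the numbers in it are self-consistent, which a few lines of Python can check. The sketch below takes three lines copied from the log above and hand-decodes only the one instruction marked with `<..>` in the `Code:` line (opcode `8b` with a ModRM byte and an 8-bit displacement, i.e. a 32-bit load from `[rax + disp8]`). It is not a general disassembler, and the variable names are purely illustrative. The result only shows that the kernel used the value in RAX as a pointer inside `__filemap_get_folio`; it says nothing about how that value got there.

```python
import re

# Three lines copied verbatim from the crash log above (syslog prefix dropped).
OOPS_LINES = """\
BUG: kernel NULL pointer dereference, address: 0000000000000036
Code: 10 e8 37 69 75 00 48 89 c3 48 3d 02 04 00 00 74 e2 48 3d 06 04 00 00 74 da 48 85 c0 0f 84 25 02 00 00 a8 01 0f 85 27 02 00 00 <8b> 40 34 85 c0 74 c2 8d 50 01 f0 0f b1 53 34 75 f2 48 8b 54 24 28
RAX: 0000000000000002 RBX: 0000000000000002 RCX: 0000000000000002
"""

# Faulting address reported by the kernel (the same value appears as CR2).
fault_addr = int(re.search(r"address: ([0-9a-f]{16})", OOPS_LINES).group(1), 16)

# Value of RAX at the time of the fault.
rax = int(re.search(r"RAX: ([0-9a-f]{16})", OOPS_LINES).group(1), 16)

# The byte in <..> is where the faulting instruction starts; take it and
# the two bytes that follow.
match = re.search(r"<([0-9a-f]{2})> ([0-9a-f]{2}) ([0-9a-f]{2})", OOPS_LINES)
opcode, modrm, disp8 = (int(byte, 16) for byte in match.groups())

# This sketch handles exactly one encoding: 8b /r with mod=01 and rm=000,
# which is "mov r32, [rax + disp8]". Anything else is rejected.
mod, rm = modrm >> 6, modrm & 0b111
assert opcode == 0x8B and (mod, rm) == (0b01, 0b000)

# disp8 is a signed 8-bit displacement.
if disp8 >= 0x80:
    disp8 -= 0x100

effective = rax + disp8
print(hex(effective), hex(fault_addr))  # prints: 0x36 0x36
```

So the reported address 0x36 is RAX (0x2) plus the displacement 0x34: the read happened in kernel mode on an invalid pointer value, which matches the "supervisor read access in kernel mode" line in the log.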

There is no known issue with this kernel version, but that is more likely due to lack of experience than because we are sure it works. The crash dump you shared looks like a serious kernel bug, but it's not something that Elasticsearch can do anything about.

As for why Debian 12 is not in the support matrix: quite possibly we just haven't got around to adding it to the test suite yet.

Unfortunately I can't give a timeline for official Debian 12 support, especially if it has kernel bugs.

Whether to go back to Debian 11 is ultimately up to you, since it's your system, but I would recommend running one of the configurations in the support matrix. These are the only ones against which we run any tests.

All right, we will probably wait for the moment and observe how often something like this happens. Thanks for the quick reply!

