Elasticsearch crashes after update to 2.0

A short explanation of the kernel error "task jbd2/vdb1-8:307 blocked for more than 120 seconds":

Normally, all file writes are first cached in file system buffers (the page cache). Linux must eventually write all dirty buffers to disk, and how long that takes depends on the size of the buffers. A kernel task that stays blocked on this work for more than 120 seconds (the default of kernel.hung_task_timeout_secs) is reported as hung. Within that time, all dirty memory pages must be visited and prepared for I/O, and the file system driver must write the journal metadata in addition to the page data.

Two situations can cause the 120 seconds to be exceeded:

  • the file system buffers are so large that it takes longer than 120 seconds to write the data and the journal physically to disk

  • a file system kernel I/O thread cannot complete within 120 seconds, possibly because of a deadlock in which kernel I/O threads wait for each other before continuing.
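To get a feeling for the first situation, here is a minimal Python sketch that reads the amount of dirty data from /proc/meminfo and estimates how long a full flush would take. The disk throughput of 50 MB/s is purely an assumed number for illustration, and the function names are hypothetical; measure your own device before drawing conclusions.

```python
from pathlib import Path

HUNG_TASK_TIMEOUT_SECS = 120  # default of kernel.hung_task_timeout_secs


def parse_meminfo(text):
    """Return the numeric /proc/meminfo fields as a dict (most are in kB)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def estimate_flush_seconds(meminfo, throughput_mb_per_s):
    """Rough time to write all dirty and in-flight pages at a given throughput."""
    pending_kb = meminfo.get("Dirty", 0) + meminfo.get("Writeback", 0)
    return pending_kb / 1024 / throughput_mb_per_s


if __name__ == "__main__":
    path = Path("/proc/meminfo")
    if path.exists():  # /proc/meminfo is only available on Linux
        info = parse_meminfo(path.read_text())
        # 50 MB/s is an assumption, not a measured value
        seconds = estimate_flush_seconds(info, throughput_mb_per_s=50.0)
        pending = info.get("Dirty", 0) + info.get("Writeback", 0)
        print(f"Dirty + Writeback: {pending} kB")
        print(f"Estimated flush time: {seconds:.1f} s "
              f"(hung task limit: {HUNG_TASK_TIMEOUT_SECS} s)")
```

The estimate ignores the journal writes and any concurrent read load, so the real flush time is likely longer.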

The journaling block device 2 (jbd2) threads must be synchronized with the page data writes. How the journal and the page data writes are coordinated can be specified at file system mount time. So, by changing the mount options to relax the journal writes a bit, it is possible to work around deadlocks.
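Before changing anything, it helps to see which journal-related options are currently in effect. The following Python sketch parses /proc/mounts and lists the data=, commit= and barrier settings of ext3/ext4 file systems. Which options are meant above is not stated; these are simply the ext4 options I would look at first, and the mount point /data in the comment is a made-up example.

```python
from pathlib import Path

JOURNAL_KEYS = ("data", "commit", "barrier", "nobarrier")


def journal_options(mounts_text, fstypes=("ext3", "ext4")):
    """Map mount point -> journal-related mount options for ext3/ext4."""
    result = {}
    for line in mounts_text.splitlines():
        parts = line.split()
        if len(parts) < 4 or parts[2] not in fstypes:
            continue
        mount_point, options = parts[1], parts[3]
        selected = {}
        for opt in options.split(","):
            key, _, value = opt.partition("=")
            if key in JOURNAL_KEYS:
                selected[key] = value
        result[mount_point] = selected
    return result


if __name__ == "__main__":
    path = Path("/proc/mounts")
    if path.exists():  # only available on Linux
        # Example line: /dev/vdb1 /data ext4 rw,relatime,data=ordered 0 0
        for mount_point, opts in journal_options(path.read_text()).items():
            print(mount_point, opts or "(defaults)")
```

Options that are not listed are at their defaults, so an empty result for a mount point does not mean journaling is off.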

Elasticsearch, running on a Java Virtual Machine, has limited influence on kernel behavior. The admin is in charge: sizing the file system buffers, preparing the file systems for the Elasticsearch data, and running ES in such a way that the system can handle the I/O load it creates. Slow reads/writes and high iowait numbers always have a cause, whether hardware incidents (disk device or network) or a configuration that cannot handle the I/O load.
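To put a number on iowait without extra tools, one can sample the aggregate "cpu" line of /proc/stat twice and compare. This is a minimal Python sketch of that idea; the function names and the one-second interval are my own choices, and tools like iostat or vmstat report the same figure.

```python
import time
from pathlib import Path


def cpu_times(stat_text):
    """Return the counters of the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before, after):
    """Share of CPU time spent in iowait between two samples, in percent."""
    deltas = [b - a for a, b in zip(before, after)]
    # Fields: user nice system idle iowait irq softirq steal [guest guest_nice];
    # guest time is already counted in user/nice, so only the first 8 are summed.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


if __name__ == "__main__":
    path = Path("/proc/stat")
    if path.exists():  # only available on Linux
        first = cpu_times(path.read_text())
        time.sleep(1.0)  # sampling interval, chosen arbitrarily
        second = cpu_times(path.read_text())
        print(f"iowait: {iowait_percent(first, second):.1f} %")
```

A value that stays high while Elasticsearch is indexing points at the storage side rather than at the JVM.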