Elasticsearch crashes after update to 2.0


(Monotek) #1

I have 3 data nodes on Ubuntu 14.04 (KVM), Oracle Java 8, 2 Cores and 8 GB of RAM.
2 Replicas. About 20 million documents. Heap size 4 GB.

After updating from 1.7.3 to 2.0, the load increased from around 0.5 to 3 - 4.

In the beginning it was even worse because of relocation, but by now the cluster should be more or less idle, yet the load is much higher than with 1.7.3.

After some time one of the nodes shows an even higher load (> 20), the other 2 nodes drop to load 0.1, and Elasticsearch stops working. If I try to stop Elasticsearch on the node with the high load, it becomes a zombie process.

Syslog for the node with the high load says:

Oct 30 21:13:03 es1 kernel: [22200.884124] INFO: task jbd2/vdb1-8:307 blocked for more than 120 seconds.
Oct 30 21:13:03 es1 kernel: [22200.892344] Not tainted 3.13.0-66-generic #108-Ubuntu
Oct 30 21:13:03 es1 kernel: [22200.898723] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 30 21:13:03 es1 kernel: [22200.904917] jbd2/vdb1-8 D ffff88023fd13180 0 307 2 0x00000000
Oct 30 21:13:03 es1 kernel: [22200.904922] ffff8802322bdcb0 0000000000000046 ffff8802322ab000 ffff8802322bdfd8
Oct 30 21:13:03 es1 kernel: [22200.904925] 0000000000013180 0000000000013180 ffff8802322ab000 ffff8802322bdd98
Oct 30 21:13:03 es1 kernel: [22200.904931] ffff8800370480b8 ffff8802322ab000 ffff88013eef2800 ffff8802322bdd80
Oct 30 21:13:03 es1 kernel: [22200.904934] Call Trace:
Oct 30 21:13:03 es1 kernel: [22200.904969] [] schedule+0x29/0x70
Oct 30 21:13:03 es1 kernel: [22200.904988] [] jbd2_journal_commit_transaction+0x287/0x1ab0
Oct 30 21:13:03 es1 kernel: [22200.905001] [] ? update_curr+0xe4/0x180
Oct 30 21:13:03 es1 kernel: [22200.905007] [] ? prepare_to_wait_event+0x100/0x100
Oct 30 21:13:03 es1 kernel: [22200.905014] [] ? lock_timer_base.isra.35+0x2b/0x50
Oct 30 21:13:03 es1 kernel: [22200.905017] [] ? try_to_del_timer_sync+0x4f/0x70
Oct 30 21:13:03 es1 kernel: [22200.905020] [] kjournald2+0xbd/0x250
Oct 30 21:13:03 es1 kernel: [22200.905023] [] ? prepare_to_wait_event+0x100/0x100
Oct 30 21:13:03 es1 kernel: [22200.905025] [] ? commit_timeout+0x10/0x10
Oct 30 21:13:03 es1 kernel: [22200.905033] [] kthread+0xd2/0xf0
Oct 30 21:13:03 es1 kernel: [22200.905036] [] ? kthread_create_on_node+0x1c0/0x1c0
Oct 30 21:13:03 es1 kernel: [22200.905041] [] ret_from_fork+0x58/0x90
Oct 30 21:13:03 es1 kernel: [22200.905044] [] ? kthread_create_on_node+0x1c0/0x1c0

Nothing appears in elasticsearch.log for this time period.

The Elasticsearch logs of the other nodes contain entries like:

[2015-10-31 04:20:14,582][INFO ][rest.suppressed ] //otrs-2015.10/tickets/19095646 Params: {index=otrs-2015.10, id=19095646, type=tickets}
RemoteTransportException[[es1][192.168.10.61:9300][indices:data/write/index]]; nested: EsRejectedExecutionException[rejected execution of org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase$1@248ad78f on EsThreadPoolExecutor[index, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@1205638d[Running, pool size = 2, active threads = 2, queued tasks = 209, completed tasks = 42298]]];
Caused by: EsRejectedExecutionException[rejected execution of org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase$1@248ad78f on EsThreadPoolExecutor[index, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@1205638d[Running, pool size = 2, active threads = 2, queued tasks = 209, completed tasks = 42298]]]
at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:50)
...

[2015-10-31 00:11:55,886][DEBUG][action.admin.cluster.node.stats] [es2] failed to execute on node [egOv1fGiRvmxV8BrvXfpnQ]
ReceiveTimeoutTransportException[[es1][192.168.10.61:9300][cluster:monitor/nodes/stats[n]] request_id [246940] timed out after [15000ms]]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:645)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
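For reference, the EsRejectedExecutionException above means the fixed-size index thread pool (2 threads, 200-slot queue on these 2-core nodes) overflowed. A possible stopgap, sketched below under the assumption of a standard package install (the config path and the value 500 are made up, and this only buffers bursts rather than fixing the underlying I/O stall), is to enlarge that queue in elasticsearch.yml:

```shell
# Stopgap sketch: enlarge the index queue so short bursts are buffered
# instead of rejected. Path and value are assumptions; the real fix is
# to remove the I/O bottleneck.
CONF=./elasticsearch.yml   # usually /etc/elasticsearch/elasticsearch.yml
cat >> "$CONF" <<'EOF'
threadpool.index.queue_size: 500
EOF
grep 'queue_size' "$CONF"   # confirm the setting landed in the file
```

A node restart is needed for the setting to take effect in 2.0.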

Any hints?


(Jörg Prante) #2

Upgrade your Linux to a more recent kernel.

Restart KVM. You created heavy I/O load on KVM and it breaks, because of kernel bugs.

The ES messages you see are just a consequence of underlying kernel problems which are not related to ES.


(Monotek) #3

I'm not sure that this is the right solution.
I ran Elasticsearch 0.9 through 1.7 for months on the same VMs without any problems; only Elasticsearch 2.0 shows this behaviour.

We run about 40 VMs which are all very stable.
All KVM hosts and nodes run on Ubuntu 14.04.
Are there any known problems with Elasticsearch and this kernel?


(Jörg Prante) #4

As said, this is unrelated to ES.

It is an indirect issue triggered in your Linux kernel, because heavy I/O load is generated.


(Monotek) #5

So why was it working without problems for months before the update to Elasticsearch 2.0?

Anyway, I updated to kernel 3.19. Let's see if it helps.


Elasticsearch 2.x and Linux kernel versions tested
(David Pilato) #6

Probably because we now fsync each operation instead of doing that every 5 seconds.
If you are not using bulk requests for ingestion, that could cause a lot of I/O, I guess.
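If single-document indexing is indeed the culprit, batching writes into `_bulk` requests amortizes the per-operation fsync. A minimal sketch (the index name and type are taken from the logs above; the document IDs and payload are made up):

```shell
# Build an NDJSON bulk payload: one action line plus one source line
# per document, newline-terminated.
cat > /tmp/bulk.ndjson <<'EOF'
{ "index" : { "_index" : "otrs-2015.10", "_type" : "tickets", "_id" : "1" } }
{ "title" : "first ticket" }
{ "index" : { "_index" : "otrs-2015.10", "_type" : "tickets", "_id" : "2" } }
{ "title" : "second ticket" }
EOF

# Send the whole batch in one request (uncomment against a live node):
# curl -s -XPOST 'http://192.168.10.61:9200/_bulk' --data-binary @/tmp/bulk.ndjson
wc -l < /tmp/bulk.ndjson
```

One bulk request of a few hundred to a few thousand documents turns thousands of fsyncs into one per batch per shard.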


(Jörg Prante) #7

To mitigate the situation, you should create a separate Linux file system for path.data, where the Elasticsearch data files live. Then mount the file system with the options

data=writeback,relatime,nobarrier

These options take some load off ext4 journal writing, effectively reducing I/O load and the chance of deadlocks in the kernel.

For more info, see man mount and man tune2fs.
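Applying this could look like the following sketch (the device and mount point are assumptions based on the vdb1 device in the kernel log; note that nobarrier and data=writeback trade crash durability for lower I/O load):

```shell
# Remount the data partition with relaxed ext4 journaling. The data=
# mode cannot be changed with -o remount, so unmount first (stop
# Elasticsearch before doing this).
umount /var/lib/elasticsearch
mount -t ext4 -o data=writeback,relatime,nobarrier /dev/vdb1 /var/lib/elasticsearch

# To make it permanent, use an /etc/fstab entry like:
# /dev/vdb1  /var/lib/elasticsearch  ext4  data=writeback,relatime,nobarrier  0 0
```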


(Jörg Prante) #8

A short explanation of the kernel error task jbd2/vdb1-8:307 blocked for more than 120 seconds:

Normally, all I/O activity (file writes) is cached in file system buffers. Depending on the size of the buffers, Linux must write all dirty buffers to disk, and it has 120 seconds to do so. In that time, all memory pages must be visited and prepared for I/O, and the file system driver must write journal metadata in addition to the page data.

There are two situations that can exceed the 120 seconds:

  • the file system buffers are so large that it takes longer than 120 seconds to write the data and the journal physically to disk

  • a file system kernel I/O thread cannot complete in 120 seconds, maybe because of a deadlock where kernel I/O threads wait for each other before continuing.

The journaling block device (jbd2) threads have to be synchronized with the page data writes. At file system mount time, it can be specified how the journal and the page data writes are coordinated. So, by changing the mount options to relax the journal writes a bit, it is possible to work around deadlocks.

Elasticsearch, running on a Java Virtual Machine, has limited influence on kernel behavior. The admin is in charge: by sizing the file system buffers, by preparing the file systems for the Elasticsearch data, and by running ES in a way that the created I/O load can be handled by the system. Slow reads/writes and high iowait numbers always have a cause, be it hardware (disk device/network) incidents or configurations that cannot handle the I/O load.
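The size of those dirty buffers can be bounded from userspace. A sketch of inspecting and lowering the write-back thresholds (the values 5 and 10 are illustrative assumptions, not tuned recommendations):

```shell
# Inspect the current write-back thresholds (percent of total RAM;
# Ubuntu 14.04 defaults are background=10, foreground=20).
cat /proc/sys/vm/dirty_background_ratio
cat /proc/sys/vm/dirty_ratio

# Lower them so the kernel starts flushing earlier and each flush is
# smaller (requires root; persist in /etc/sysctl.conf if it helps):
# sysctl -w vm.dirty_background_ratio=5
# sysctl -w vm.dirty_ratio=10
```

Smaller dirty-buffer limits mean more frequent but shorter flushes, which makes it less likely that a single flush blows the 120-second budget.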


Double CPU usage after 1.4.5 to 2.0
(Monotek) #9

thanks guys! :slight_smile:

Indeed, it seems that it was an I/O problem.
My path.data dir was already its own partition, which mounts a GlusterFS volume via libgfapi to avoid the context switches of FUSE mounts. It looks like this was still not fast enough for the new fsync behaviour in 2.0.

The mount options "data=writeback,relatime,nobarrier" took a lot of I/O pressure off the system.
Load seems to be normal again.

Nevertheless, I think I will move to real hardware with SSDs for storage in the near future.

Any recommendations for the file system, aside from ext4 with optimized mount parameters?


(system) #10