I have 3 data nodes on Ubuntu 14.04 (KVM guests), each with Oracle Java 8, 2 cores and 8 GB of RAM.
The indices have 2 replicas and hold about 20 million documents. Heap size is 4 GB.
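For completeness, the relevant bits of the configuration (the cluster name is a placeholder, the rest matches the description above):

# /etc/default/elasticsearch, same on all three nodes
ES_HEAP_SIZE=4g

# elasticsearch.yml (es1 shown; es2/es3 analogous)
cluster.name: my-cluster
node.name: es1
index.number_of_replicas: 2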
After upgrading from 1.7.3 to 2.0 the load average increased from around 0.5 to 3 - 4.
In the beginning it was even worse because of shard relocation, but by now the cluster should be more or less idle, yet the load is still much higher than with 1.7.3.
After some time one of the nodes climbs to an even higher load (> 20), the other 2 nodes drop to a load of 0.1, and Elasticsearch stops working. If I try to stop Elasticsearch on the node with the high load, the process becomes a zombie.
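For reference, the call I would normally use to see what a node is busy with is the hot threads API (default HTTP port 9200 assumed), but es1 no longer answers in that state:

curl -s 'http://192.168.10.61:9200/_nodes/es1/hot_threads'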
Syslog for the node with the high load says:
Oct 30 21:13:03 es1 kernel: [22200.884124] INFO: task jbd2/vdb1-8:307 blocked for more than 120 seconds.
Oct 30 21:13:03 es1 kernel: [22200.892344] Not tainted 3.13.0-66-generic #108-Ubuntu
Oct 30 21:13:03 es1 kernel: [22200.898723] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 30 21:13:03 es1 kernel: [22200.904917] jbd2/vdb1-8 D ffff88023fd13180 0 307 2 0x00000000
Oct 30 21:13:03 es1 kernel: [22200.904922] ffff8802322bdcb0 0000000000000046 ffff8802322ab000 ffff8802322bdfd8
Oct 30 21:13:03 es1 kernel: [22200.904925] 0000000000013180 0000000000013180 ffff8802322ab000 ffff8802322bdd98
Oct 30 21:13:03 es1 kernel: [22200.904931] ffff8800370480b8 ffff8802322ab000 ffff88013eef2800 ffff8802322bdd80
Oct 30 21:13:03 es1 kernel: [22200.904934] Call Trace:
Oct 30 21:13:03 es1 kernel: [22200.904969] [] schedule+0x29/0x70
Oct 30 21:13:03 es1 kernel: [22200.904988] [] jbd2_journal_commit_transaction+0x287/0x1ab0
Oct 30 21:13:03 es1 kernel: [22200.905001] [] ? update_curr+0xe4/0x180
Oct 30 21:13:03 es1 kernel: [22200.905007] [] ? prepare_to_wait_event+0x100/0x100
Oct 30 21:13:03 es1 kernel: [22200.905014] [] ? lock_timer_base.isra.35+0x2b/0x50
Oct 30 21:13:03 es1 kernel: [22200.905017] [] ? try_to_del_timer_sync+0x4f/0x70
Oct 30 21:13:03 es1 kernel: [22200.905020] [] kjournald2+0xbd/0x250
Oct 30 21:13:03 es1 kernel: [22200.905023] [] ? prepare_to_wait_event+0x100/0x100
Oct 30 21:13:03 es1 kernel: [22200.905025] [] ? commit_timeout+0x10/0x10
Oct 30 21:13:03 es1 kernel: [22200.905033] [] kthread+0xd2/0xf0
Oct 30 21:13:03 es1 kernel: [22200.905036] [] ? kthread_create_on_node+0x1c0/0x1c0
Oct 30 21:13:03 es1 kernel: [22200.905041] [] ret_from_fork+0x58/0x90
Oct 30 21:13:03 es1 kernel: [22200.905044] [] ? kthread_create_on_node+0x1c0/0x1c0
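jbd2/vdb1-8 is the ext4 journal thread for the data disk, so this looks like disk I/O on the KVM guest stalling. A simple way to confirm would be to watch the device on es1 while the load spikes, for example:

iostat -x 1    # extended per-device stats every second; vdb is the data disk per the trace above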
There is nothing in elasticsearch.log on that node for this time window.
The Elasticsearch logs of the other nodes say things like:
[2015-10-31 04:20:14,582][INFO ][rest.suppressed ] //otrs-2015.10/tickets/19095646 Params: {index=otrs-2015.10, id=19095646, type=tickets}
RemoteTransportException[[es1][192.168.10.61:9300][indices:data/write/index]]; nested: EsRejectedExecutionException[rejected execution of org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase$1@248ad78f on EsThreadPoolExecutor[index, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@1205638d[Running, pool size = 2, active threads = 2, queued tasks = 209, completed tasks = 42298]]];
Caused by: EsRejectedExecutionException[rejected execution of org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase$1@248ad78f on EsThreadPoolExecutor[index, queue capacity = 200, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@1205638d[Running, pool size = 2, active threads = 2, queued tasks = 209, completed tasks = 42298]]]
at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:50)
...
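The pool size of 2 matches the 2 cores, and the queue of 200 overflows, so indexing requests sent to es1 get rejected. If enlarging the queue were an acceptable workaround, I assume it would be set like this in elasticsearch.yml (value picked arbitrarily, not tried yet):

threadpool.index.queue_size: 1000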
[2015-10-31 00:11:55,886][DEBUG][action.admin.cluster.node.stats] [es2] failed to execute on node [egOv1fGiRvmxV8BrvXfpnQ]
ReceiveTimeoutTransportException[[es1][192.168.10.61:9300][cluster:monitor/nodes/stats[n]] request_id [246940] timed out after [15000ms]]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:645)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
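That is the periodic node stats poll timing out against es1. The equivalent manual request over HTTP (default port 9200 assumed) should show whether the node answers at all:

curl -s 'http://192.168.10.61:9200/_nodes/es1/stats'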
Any hints?