Hi,
Over the last few weeks I've seen ES becoming unresponsive. The HTTP
interface just hangs until the browser times it out. I'm unable to stop the
process via start-stop-daemon. The only effective way I've found to stop
elasticsearch is to kill -9 PID. When I start the service again,
everything works fine. I observed the problem on 0.18.2 after months of
successful operation. ES runs as a single-node service.
I suspected it might be caused by either bad data or a bug, so I upgraded
to 0.19.1 without migrating the data and did a fresh re-import of the data
on 0.19.1, but I'm still seeing this unwanted behavior.
Environment:
- Debian squeeze 2.6.32-5-amd64
- java version "1.6.0_18"
- OpenJDK Runtime Environment (IcedTea6 1.8.7) (6b18-1.8.7-2~squeeze1)
- ES 0.19.1 and 0.18.2
Running strace with -c when the server is unresponsive returns:
% time     seconds  usecs/call     calls    errors syscall
 48.11    4.088616         711      5748       909 futex
 30.17    2.564161        2827       907           epoll_wait
 21.70    1.844104       41911        44         7 restart_syscall
  0.01    0.000634           0      1354           pread
  0.00    0.000067           1        75           write
  0.00    0.000029           1        47        17 epoll_ctl
  0.00    0.000014           0       262           mprotect
  0.00    0.000014           0        62           madvise
  0.00    0.000010           0       100           close
  0.00    0.000009           0        75           setsockopt
  0.00    0.000000           0        76        15 read
  0.00    0.000000           0        56           open
  0.00    0.000000           0       128        32 stat
[ ... snip... ]
100.00    8.497658                  9692      1005 total
The full and formatted strace summary is available
here: https://gist.github.com/2405701#file_strace.txt
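As a rough aid for reading the summary, here is a small Python sketch that parses a few of the rows and adds up the "% time" share of the three blocking calls (futex, epoll_wait, restart_syscall). The function name parse_strace_summary and the BLOCKING set are made up for illustration and are not part of strace. The sample rows are copied from the summary above.

```python
# Sample rows copied from the strace -c summary in this post.
SUMMARY = """\
% time seconds usecs/call calls errors syscall
48.11 4.088616 711 5748 909 futex
30.17 2.564161 2827 907 epoll_wait
21.70 1.844104 41911 44 7 restart_syscall
0.01 0.000634 0 1354 pread
0.00 0.000067 1 75 write
"""


def parse_strace_summary(text):
    """Parse data rows of an strace -c summary into dicts.

    Hypothetical helper: a data row has 5 fields when the errors
    column is empty and 6 fields when it is present.
    """
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Skip the header, snip markers, and the trailing total line.
        if len(parts) not in (5, 6) or parts[-1] == "total":
            continue
        rows.append({
            "pct": float(parts[0]),
            "seconds": float(parts[1]),
            "usecs_per_call": int(parts[2]),
            "calls": int(parts[3]),
            "errors": int(parts[4]) if len(parts) == 6 else 0,
            "syscall": parts[-1],
        })
    return rows


# Assumption for this sketch: treat these three as the blocking calls.
BLOCKING = {"futex", "epoll_wait", "restart_syscall"}

rows = parse_strace_summary(SUMMARY)
blocking_pct = sum(r["pct"] for r in rows if r["syscall"] in BLOCKING)
print(f"{blocking_pct:.2f}% of reported time is in {sorted(BLOCKING)}")
```

This only sums what strace already reports. It says nothing about which threads are waiting or what they are waiting on.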
In a detailed strace I see a lot of entries like this:
15171 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)
Detailed strace is available
here: https://gist.github.com/2405701#file_strace+details.txt
Any idea how to fix this or at least how to debug further?
Thanks,
Mattias