ElasticSearch becomes unresponsive

Hi,

Over the last few weeks I've seen ES becoming unresponsive. The HTTP
interface just hangs until the browser times it out. I'm unable to stop the
process via start-stop-daemon. The only effective way I've found to stop
elasticsearch is to kill -9 PID. When I start the service again,
everything works fine. I observed the problem on 0.18.2 after months of
successful operation. ES runs as a single-node service.

I suspected it might be caused by either bad data or a bug, so I upgraded
to 0.19.1 without migrating the data and did a fresh re-import of the data
on 0.19.1, but I'm still seeing this unwanted behavior.

Environment:

  • Debian squeeze 2.6.32-5-amd64
  • java version "1.6.0_18"
  • OpenJDK Runtime Environment (IcedTea6 1.8.7) (6b18-1.8.7-2~squeeze1)
  • ES 0.19.1 and 0.18.2

Running strace with -c when the server is unresponsive returns:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 48.11    4.088616         711      5748       909 futex
 30.17    2.564161        2827       907           epoll_wait
 21.70    1.844104       41911        44         7 restart_syscall
  0.01    0.000634           0      1354           pread
  0.00    0.000067           1        75           write
  0.00    0.000029           1        47        17 epoll_ctl
  0.00    0.000014           0       262           mprotect
  0.00    0.000014           0        62           madvise
  0.00    0.000010           0       100           close
  0.00    0.000009           0        75           setsockopt
  0.00    0.000000           0        76        15 read
  0.00    0.000000           0        56           open
  0.00    0.000000           0       128        32 stat
[ ... snip... ]
------ ----------- ----------- --------- --------- ----------------
100.00    8.497658                  9692      1005 total

The full and formatted strace summary is available
here: https://gist.github.com/2405701#file_strace.txt
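
For comparing summaries taken at different times, a small parser for `strace -c` output may help. This is only a sketch: the function name `parse_strace_summary` is made up for illustration, and it assumes the whitespace-separated column layout shown above, where the errors column may be empty.

```python
def parse_strace_summary(text):
    """Parse the data rows of an `strace -c` summary into a list of dicts."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        # Data rows have 5 fields (no errors) or 6 fields (with errors).
        # The header, the "snip" marker and the "total" row are skipped.
        if len(parts) not in (5, 6) or parts[-1] == "total":
            continue
        try:
            row = {
                "pct_time": float(parts[0]),
                "seconds": float(parts[1]),
                "usecs_per_call": int(parts[2]),
                "calls": int(parts[3]),
                "errors": int(parts[4]) if len(parts) == 6 else 0,
                "syscall": parts[-1],
            }
        except ValueError:
            # Separator lines made of dashes end up here.
            continue
        rows.append(row)
    return rows


# Three rows copied from the summary in this post.
SAMPLE = """\
48.11 4.088616 711 5748 909 futex
30.17 2.564161 2827 907 epoll_wait
21.70 1.844104 41911 44 7 restart_syscall
"""

for row in parse_strace_summary(SAMPLE):
    error_ratio = row["errors"] / row["calls"]
    print(f'{row["syscall"]:<16} {row["pct_time"]:6.2f}% errors/calls={error_ratio:.2f}')
```

The per-syscall error ratio is one rough way to see how many of the futex calls returned an error, such as the ETIMEDOUT entries in the detailed trace.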

In a detailed strace I see a lot of entries like this:

15171 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out)

Detailed strace is available
here: https://gist.github.com/2405701#file_strace+details.txt

Any idea how to fix this or at least how to debug further?

Thanks,

Mattias

Hi,

First, can you use a newer version of Java? The one you use is very old;
the latest 1.6 version is update 31. If this does not help, see if you can
run jstack against the process when it's in this state and gist the
output. Also, it would help if you could monitor its heap usage and see
whether it's possibly running out of memory (I assume there is nothing in
the logs...).
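
A rough sketch of how the thread dumps and heap figures could be collected while the node is hung. It assumes the JDK tools `jstack` and `jstat` are on the PATH and that the script runs as the same user as the Elasticsearch JVM. The function names, the `es-diag` output directory, and the round and interval defaults are made up for illustration.

```python
import subprocess
import time
from pathlib import Path


def diagnostic_commands(pid, force=False):
    """Build the jstack/jstat command lines for one JVM process id."""
    # "jstack -l" adds lock information; "jstack -F" forces a dump when the
    # normal invocation gets no response from a hung JVM.
    jstack = ["jstack", "-F", str(pid)] if force else ["jstack", "-l", str(pid)]
    # "jstat -gcutil" prints heap generation usage as percentages.
    return {"threads": jstack, "heap": ["jstat", "-gcutil", str(pid)]}


def collect(pid, out_dir="es-diag", rounds=3, interval=5.0, force=False):
    """Run the diagnostics a few times and save each output to a file."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for i in range(rounds):
        for name, cmd in diagnostic_commands(pid, force).items():
            result = subprocess.run(
                cmd,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,
                universal_newlines=True,
            )
            (out / f"{name}-{i}.txt").write_text(result.stdout)
        if i < rounds - 1:
            time.sleep(interval)  # spacing the dumps out shows which threads stay stuck
```

Calling `collect(12345)` with the real PID in place of 12345 would write three thread dumps and three heap snapshots into `es-diag/`. Taking several dumps a few seconds apart makes it easier to tell threads that are permanently blocked from ones that are just briefly waiting.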

-shay.banon

On Tue, Apr 17, 2012 at 3:39 PM, Mattias Pfeiffer mattias@pfeiffer.dk wrote: