EC2 instance hanging after a few hours


(Douglas Muth) #1

Good Morning,

I'm a long-time listener, first-time caller. :)

I'm having some problems trying to get Elastic Search working at
$WORK. Specifically, the machine it's running on becomes
unresponsive after a few hours of indexing. This is on a virgin
system that doesn't run anything else. Here's my configuration:

  • Amazon EC2 "Large" instance

  • Ubuntu 10.10, EBS-backed (Kernel ID aki-427d952b, AMI ID
    ami-548c783d)

  • Java 1.6.0_20

  • Max files set to 65k, and I verified this by watching the log
    messages when Elastic Search starts up

  • Elastic Search is spawned by Daemontools, with the following command
    line:
    export ES_JAVA_OPTS="-server"; exec /usr/local/elasticsearch/bin/elasticsearch -f -Des.max-open-files=true

  • Everything else is at its default setting

The issue is that Elastic Search starts up fine and runs fine, but if
I start indexing documents, the machine hangs after some hours.
It will still respond to ping, but any TCP connections such as SSH
will time out. According to CloudWatch, CPU and network usage
drop to zero, so nothing on the machine is actually doing anything.
Examining the machine after the fact does not yield anything
interesting, as /var/log/messages does not contain anything
unusual. I'm guessing some weird kernel issue is coming up, but I
cannot prove this.

I should also mention that we're subjecting Elastic Search to a VERY
high write load. Specifically, I'm using node.js to read many small
documents out of a database and add them to Elastic Search. We have
about 50 million documents in total, stored in a single index, at
a rate of 2,000 documents per second. In case it's worth
mentioning, those documents have a field that is of the MySQL BINARY
datatype (since JSON usually isn't about binary data).

FWIW, the MySQL database is an RDS instance, and has given us zero
problems.

I'm out of ideas, because I've never run into a problem like this
before. Does anyone have any suggestions for parameters I can check
or tweak, or things I can log to get to the bottom of this? Any help
would be appreciated.

Thanks for your time,

-- Doug


(Weiwei Wang) #2

Same problem for me with a large index rebuild (2,620,000+ documents);
it also uses too much memory (5 GB+).



(Douglas Muth) #3

On Thu, Dec 15, 2011 at 9:32 PM, Weiwei Wang ww.wang.cs@gmail.com wrote:

same problem for me with large index rebuild(2620000+ documents), and
also too much memory is used(5g+)

I can confirm in my case that memory is NOT an issue. The instance in
question has ~7 GB of RAM, and according to my Munin graphs, total
memory usage on that box doesn't go above 1.5 GB.

-- Doug

--
http://www.dmuth.org/
http://twitter.com/dmuth


(Shay Banon) #4

A few points:

  1. I suggest using a newer Java version; 1.6.0_20 is quite old.
  2. The fact that the machine has 7 GB does not mean Elasticsearch will use
    it. I suggest allocating half the machine's memory to ES, in your case
    ~3.5 GB. See more here:
    http://www.elasticsearch.org/guide/reference/setup/installation.html.
  3. If you can, use a larger instance (m1.xlarge); it "suffers" less from
    noisy neighbors on AWS.

The fact that you can't connect to the machine might mean you are running
out of sockets and they are being throttled by the OS. Can you monitor it?
Are you using persistent connections to ES from node? netstat and lsof are
your friends here to check it.
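
As a rough illustration of the netstat check, here is a minimal Node.js sketch that counts TCP sockets by state. It assumes the Linux `netstat -tan` column layout (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State); the function names are made up for this example. A very large TIME_WAIT count would be consistent with many short-lived connections, though it does not prove that sockets are the cause of the hang.

```javascript
const { execFile } = require('child_process');

// Count TCP sockets by state from the text output of `netstat -tan`.
// ASSUMPTION: Linux column layout, where the state is the 6th column.
function countTcpStates(netstatOutput) {
  const counts = {};
  for (const line of netstatOutput.split('\n')) {
    const cols = line.trim().split(/\s+/);
    // Skip headers and anything that is not a tcp/tcp6 row.
    if (cols.length < 6 || !cols[0].startsWith('tcp')) continue;
    const state = cols[5];
    counts[state] = (counts[state] || 0) + 1;
  }
  return counts;
}

// Hypothetical helper: run netstat once and pass the per-state counts on.
function reportTcpStates(callback) {
  execFile('netstat', ['-tan'], (err, stdout) => {
    if (err) return callback(err);
    callback(null, countTcpStates(stdout));
  });
}

// Usage (not run automatically):
// reportTcpStates((err, counts) => console.log(err || counts));
```

Running this from cron or a timer and logging the result would leave a trail to inspect after the next freeze.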

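I also saw this behavior way back with the AWS problems with Ubuntu 10.04;
recently, I started to read that 10.10 has started to exhibit
similar behavior (see the Instagram blog:
http://instagram-engineering.tumblr.com/post/13649370142/what-powers-instagram-hundreds-of-instances-dozens-of).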

The main thing to monitor on ES for this is memory usage; the node stats API
is your friend here (or the bigdesk plugin:
http://www.elasticsearch.org/guide/reference/modules/plugins.html).
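
For anyone who wants to log heap usage over time, a small Node.js sketch of polling the node stats API might look like the following. The stats URL is passed in because the endpoint path differs between Elasticsearch versions, and the response field names used here (`nodes`, `jvm.mem.heap_used_in_bytes`, `jvm.mem.heap_committed_in_bytes`) are assumptions to check against what your own node returns. The function names are hypothetical.

```javascript
const http = require('http');

// Summarize JVM heap usage per node from a parsed node stats response.
// ASSUMPTION: response shape is
//   { nodes: { <id>: { name, jvm: { mem: { heap_used_in_bytes, heap_committed_in_bytes } } } } }
function summarizeHeap(stats) {
  const MB = 1024 * 1024;
  const rows = [];
  for (const [nodeId, node] of Object.entries(stats.nodes || {})) {
    const mem = (node.jvm && node.jvm.mem) || {};
    rows.push({
      nodeId,
      name: node.name,
      heapUsedMb: Math.round((mem.heap_used_in_bytes || 0) / MB),
      heapCommittedMb: Math.round((mem.heap_committed_in_bytes || 0) / MB),
    });
  }
  return rows;
}

// Fetch the stats JSON from the given URL and hand back the summary.
function fetchHeapSummary(statsUrl, callback) {
  http.get(statsUrl, (res) => {
    const chunks = [];
    res.on('data', (chunk) => chunks.push(chunk));
    res.on('end', () => {
      try {
        callback(null, summarizeHeap(JSON.parse(Buffer.concat(chunks).toString())));
      } catch (err) {
        callback(err);
      }
    });
  }).on('error', callback);
}

// Usage (statsUrl is a placeholder; use the node stats path for your version):
// setInterval(() => fetchHeapSummary(statsUrl, (e, rows) => console.log(e || rows)), 10000);
```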



(Douglas Muth) #5

On Fri, Dec 16, 2011 at 12:41 PM, Shay Banon kimchy@gmail.com wrote:

Few points:

  1. I suggest using a newer Java version, 1.6_20 is quite old.
  2. The fact that the machine has 7gb does not mean elasticsearch will use
    it. I suggest to allocate to ES half the machine memory, in your case,
    ~3.5gb. See more
    here: http://www.elasticsearch.org/guide/reference/setup/installation.html.
  3. If you can, use a larger instance (m1.xlarge), it "suffers" from
    noisy neighbors on aws less.

#1 and #2 are easy enough for me to do. #3 might be more of a
challenge, since it will cost us more. :P

The fact that you can't connect to the machine might mean you are running
out of sockets and they are being throttled by the OS. Can you monitor it?
Are you using persistent connections to ES from node? netstat and lsof are
your friends here to check it.

The connections are not persistent, but I use the generic-pool module
for node.js to cap things at 10 slots, i.e. 10 concurrent
connections to Elastic Search. I did check things with netstat and
lsof, and there are no issues there.

I also saw this behavior way back with the AWS problems with ubuntu 10.04,
recently, I started to read that 10.10 has started to exhibit
similar behavior (see the instagram
blog: http://instagram-engineering.tumblr.com/post/13649370142/what-powers-instagram-hundreds-of-instances-dozens-of).

Interesting, as that's the first I've heard of any concerns with 10.10.
We're running our entire infrastructure on 10.10 and have an average
of 1 freeze like this per machine per 6 months (current issues
excepted).

Things to monitor on ES is mainly the memory usage for this, node stats API
is your friend here (or big desk
plugin: http://www.elasticsearch.org/guide/reference/modules/plugins.html).

Big Desk looks very very cool. Many thanks for that, and the other pointers!

All the best,

-- Doug


(Shay Banon) #6

Can you try moving to persistent connections and see if it helps?
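
A minimal sketch of what persistent connections could look like from Node.js, using a shared `http.Agent` with keep-alive so TCP connections are reused instead of opened per document. The host, index name, and type name below are placeholders, and `indexDocument` is a made-up helper; `maxSockets: 10` simply mirrors the 10-slot pool mentioned earlier in the thread. Newer Node.js versions support the `keepAlive` option, so this is not necessarily what was available at the time.

```javascript
const http = require('http');

// One shared agent: keepAlive reuses TCP connections across requests,
// and maxSockets caps concurrent connections per host at 10.
const esAgent = new http.Agent({ keepAlive: true, maxSockets: 10 });

// Index one document over the shared agent.
// ASSUMPTIONS: 'localhost', 'myindex' and 'mytype' are placeholders.
function indexDocument(doc, id, callback) {
  const body = JSON.stringify(doc);
  const req = http.request({
    host: 'localhost',
    port: 9200,
    method: 'PUT',
    path: '/myindex/mytype/' + encodeURIComponent(id),
    agent: esAgent,
    headers: {
      'Content-Type': 'application/json',
      'Content-Length': Buffer.byteLength(body),
    },
  }, (res) => {
    const chunks = [];
    res.on('data', (chunk) => chunks.push(chunk));
    res.on('end', () => callback(null, res.statusCode, Buffer.concat(chunks).toString()));
  });
  req.on('error', callback);
  req.end(body);
}

// Usage (not run automatically):
// indexDocument({ title: 'hello' }, '1', (err, status) => console.log(err || status));
```

With this in place, the socket counts on the ES machine should stay roughly flat during indexing rather than churning through short-lived connections.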


