Bizarre behaviour with ES process

Hi,

I've had my application running in PROD for months, and suddenly yesterday
my server got stuck. Since then I've been noticing very weird behaviour
while monitoring the ES process. I know this may sound strange, but this is
exactly what's happening right now (apologies if anything doesn't make
sense, I'm rather desperate at the moment):

  • I restart the server and start up a Tomcat webapp and the ES server
    (both on the same machine).
  • Tomcat is configured to use 1536 MB, whereas ES uses 900 MB. This is a
    CentOS 5.5 server with 3 GB of physical RAM.
  • If I run a search in the webapp, which internally queries the ES server
    through a Transport client, I can see the CPU (using 'top', as in the
    sketch after this list) for the ES Java process climbing to 99.9%, then
    maybe falling to 67%, then back up to 98.8%, and so on, for around
    10 seconds, until finally my webapp replies and displays the results.
  • After this very first query, monitoring the CPU of the ES Java process
    shows a constant, repetitive pattern of 0% -> 5% -> 15% -> 0% -> 5% ->
    15% -> etc., without any request being made through the web site.
  • From then on, if I execute the queries directly on the server using
    'curl', they respond in milliseconds, but if I run the same query through
    the web I see the ES Java process climb to 99% for several seconds, as
    described above (which rules out the Tomcat Java process eating the CPU).
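
For reference, a minimal sketch of the kind of per-process monitoring I mean
(the pgrep pattern is a guess for a typical install; adjust it to whatever
shows up in 'ps' on your box):

# find the ES JVM and sample its CPU usage every 2 seconds in batch mode
ES_PID=$(pgrep -f org.elasticsearch)
top -b -d 2 -p "$ES_PID"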

Has anyone experienced this or similar issues? Is this an OS issue? How can
it be that it stops working properly suddenly, when no new data has been
added or removed? What can I check?

Thanks.

Indeed, very strange. CPU spikes are hard to solve since one first needs to understand where they are happening. To start, are you sure you are running the same version on both the server and the client?

The fastest way to check is to run (j)visualvm on the relevant server. You can do simple profiling there and maybe send a screenshot of what is taking the most CPU.
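
A rough sketch of what that could look like from the command line (assuming a
full JDK is installed on the server and a single ES process; the pgrep pattern
is a guess):

ES_PID=$(pgrep -f org.elasticsearch)
jvisualvm --openpid "$ES_PID"     # attach the VisualVM GUI to the running JVM

# headless alternative: take a few thread dumps while the CPU spike is happening
for i in 1 2 3; do jstack "$ES_PID" > "es-threads-$i.txt"; sleep 5; done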

Hi Shay,

Thanks for taking care of this. I'm certainly using the same version, as I
haven't changed anything in the configuration (this suddenly started to
happen).

I spent almost all night yesterday trying to pinpoint the issue, and I
noticed that disk I/O was really slow, which led me to realize that it was
actually a memory issue causing the OS to swap to disk continuously,
practically grinding the machine to a halt.
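
For anyone hitting the same thing, a quick sketch of how one might confirm the
box is actively swapping (standard Linux tools, nothing ES-specific):

free -m                  # how much swap is in use overall
vmstat 1 5               # sustained non-zero si/so columns mean the box is paging
top -b -n 1 | head -20   # a high %wa (I/O wait) suggests it's disk-bound, not CPU-bound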

I added an extra 1 GB to the server in less than a minute (thank God the
cloud exists) and all the issues went away.

However, I'm still not sure why it suddenly stopped working due to memory
issues. I'd have expected an incremental degradation of the system that
would have allowed me to foresee the issue, but as explained in my previous
email, it was like "now it works, now it doesn't work".

Regards.

Hi Enrique

> However, I'm still not sure why it suddenly stopped working due to
> memory issues. I'd have expected an incremental degradation of the
> system that would have allowed me to foresee the issue, but as
> explained in my previous email, it was like "now it works, now it
> doesn't work".

I'm guessing that ES's memory usage increased and part of that memory went
into swap. As soon as that happens, GC causes the process to grind to a
halt.

It's important to avoid swap at all costs.

Are you using bootstrap.mlockall?
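
For reference, a rough sketch of the usual knobs (vm.swappiness and swapoff
are general Linux settings rather than anything ES-specific, so treat this as
an assumption about your setup):

sysctl -w vm.swappiness=0   # discourage the kernel from swapping application pages
# swapoff -a                # or drop swap entirely, if there is enough headroom

# with bootstrap.mlockall: true in elasticsearch.yml, the process also needs:
ulimit -l unlimited         # allow ES to lock its heap in RAM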

clint

Yes, I am, and I've also set "ulimit -l unlimited -n 90000".

And I run ES with this shell script:

export ES_MIN_MEM=900m   # initial JVM heap
export ES_MAX_MEM=900m   # maximum JVM heap
elasticsearch -f &       # -f keeps ES in the foreground; '&' backgrounds it in the shell
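
If it helps, one way to sanity-check that those limits actually apply to the
running process (this assumes a kernel recent enough to expose
/proc/<pid>/limits, and the pgrep pattern is a guess):

ES_PID=$(pgrep -f org.elasticsearch)
grep "Max locked memory" /proc/$ES_PID/limits   # should say "unlimited" for mlockall to work
grep "Max open files" /proc/$ES_PID/limits      # should reflect the 90000 from ulimit -n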

Just for the record, I solved this issue.

In the end it was not ES's fault at all, but the MySQL server used by my web
application. There was a query that was eating all of the server's memory,
swap included, and literally blocking the server (we had to restart it from
the control panel).

Lessons learned: don't trust any process on your server, even if it isn't a
Java process, like MySQL. Use the slow query log to identify queries that go
far beyond an acceptable response time (>1s).
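
For anyone wanting to do the same, a rough sketch of enabling it at runtime on
MySQL 5.1+ (older versions need log_slow_queries in my.cnf instead; the log
file path here is just an example):

mysql -u root -p -e "
  SET GLOBAL slow_query_log = 'ON';
  SET GLOBAL slow_query_log_file = '/var/log/mysql/slow.log';
  SET GLOBAL long_query_time = 1;   -- log anything slower than one second
"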

Thanks for bringing this to a resolution! You bring up an important point, which I plan to tackle soon: index-level stats (search + index), so you can know a bit more about what's going on in the index.
