Bizarre behaviour with ES process

Enrique_Medina_Monte · May 31, 2011, 9:46pm

Hi,

I've had my application running in PROD for months, and yesterday my server
suddenly got stuck. Since then I've noticed some very weird behaviour while
monitoring the ES process. I know this may sound strange, but this is exactly
what is happening now (apologies if something I say makes no sense, but I'm
desperate right now):

I restart the server and start up a Tomcat webapp and the ES server (both on
the same machine).

Tomcat is configured to use 1536 MB, whereas ES uses 900 MB. This is a
CentOS 5.5 server with 3 GB of physical RAM.

If I run a search in the webapp, which internally queries the ES server
through a Transport client, I can see the CPU (using 'top') for the ES Java
process climb to 99.9%, then maybe fall to 67%, then climb again to 98.8%,
and so on, for around 10 seconds, until my webapp finally replies with the
results.

After this very first query, I monitor the CPU for the ES Java process and
see a constant, repeating series of 0% -> 5% -> 15% -> 0% -> 5% -> 15% -> etc.,
without any requests being made through the web site.

From then on, if I execute the queries directly on the server using 'curl',
they respond in ms, but if I run the same query through the web, the ES Java
process climbs to 99% for several seconds, as described above (so this rules
out the Tomcat Java process as the one eating the CPU).

Has anyone experienced this or similar issues? Is this an OS issue? How can
it be that it stops working properly suddenly, when no new data has been
added or removed? What can I check?

Thanks.

kimchy · May 31, 2011, 10:58pm

Indeed, very strange. CPU spikes are hard to solve since one first needs to understand where they are happening. To start, are you sure you are running the same version on both the server and the client?

The fastest way to check is to run (j)visualvm on the relevant server, if you can. You can do some simple profiling there and maybe send a screenshot of what takes the most CPU.

Enrique_Medina_Monte · June 1, 2011, 7:52am

Hi Shay,

Thanks for taking care of this. I'm certainly using the same version, as I
haven't changed anything in the configuration (this suddenly started to
happen).

I spent almost all of last night trying to pinpoint the issue, and I
noticed that disk I/O was really slow. That led me to realize it was
actually a memory issue: the OS was continuously swapping to disk, which
left the machine almost stuck.

I added an extra 1 GB to the server in less than a minute (thank God the
cloud exists) and all the issues went away.

However, I'm still not sure why it suddenly stopped working due to memory
issues. I'd have expected an incremental degradation of the system that
would have allowed me to foresee the issue, but as explained in my previous
email, it was like "now it works, now it doesn't work".
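
A rough budget built from the figures in the first post shows how little memory was left outside the two JVM heaps. This is only a sketch: the variable names are illustrative, and a JVM's real footprint is larger than its heap setting, so the true headroom would be smaller still.

```shell
#!/bin/sh
# Figures quoted in the first post, in MB (3 GB taken as 3 * 1024 MB).
physical_mb=$((3 * 1024))
tomcat_heap_mb=1536
es_heap_mb=900

# Memory claimed by the two heaps, and what remains for everything else
# (the OS, the file-system cache, and any other process on the box).
heap_total_mb=$((tomcat_heap_mb + es_heap_mb))
headroom_mb=$((physical_mb - heap_total_mb))

echo "heaps: ${heap_total_mb} MB, headroom: ${headroom_mb} MB"
# prints "heaps: 2436 MB, headroom: 636 MB"
```

With roughly 636 MB of slack, a single process growing by a few hundred MB could be enough to tip the machine into swapping, which may be why the failure looked abrupt rather than gradual.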

Regards.

Clinton_Gormley · June 1, 2011, 9:14am

Hi Enrique

> However, I'm still not sure why it suddenly stopped working due to
> memory issues. I'd have expected an incremental degradation of the
> system that would have allowed me to foresee the issue, but as
> explained in my previous email, it was like "now it works, now it
> doesn't work".

I'm guessing that your memory usage of ES increased, and part of that
memory went into swap. As soon as that happens, GC causes the process
to grind to a halt.

Important to avoid swap at all costs.

Are you using bootstrap.mlockall?
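
One way to check whether the machine is dipping into swap is to compare SwapTotal and SwapFree in /proc/meminfo. A minimal sketch, assuming a Linux host; the function name swap_used_kb is made up for illustration:

```shell
#!/bin/sh
# swap_used_kb [FILE]
# Print swap in use (kB) as SwapTotal - SwapFree, read from a
# meminfo-format file. Defaults to /proc/meminfo (Linux only).
swap_used_kb() {
    awk '/^SwapTotal:/ { total = $2 }
         /^SwapFree:/  { free = $2 }
         END           { print total - free }' "${1:-/proc/meminfo}"
}
```

Calling `swap_used_kb` with no argument reads the live system. A value that keeps growing while the server is under load suggests memory is being pushed out to swap. The si/so columns of `vmstat 1` show the same thing as a rate.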

clint

Enrique_Medina_Monte · June 1, 2011, 9:47am

Yes, I am, along with "ulimit -l unlimited -n 90000".

And I start ES with this shell script:

export ES_MIN_MEM=900m
export ES_MAX_MEM=900m
elasticsearch -f &

Enrique_Medina_Monte · June 3, 2011, 3:59pm

Just for the record, I solved this issue.

In the end it was not ES's fault at all, but that of the MySQL server used by
my web application. There was a query that was literally eating all of the
server's memory, even the swap, and blocking the server (we had to restart it
from the control panel).

Lessons learned: distrust each and every process on your server, even
non-Java ones such as MySQL. Use the slow query log to identify queries that
go far beyond the acceptable response time (>1s).
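
Once the slow query log is being written, a short script can count the offenders. The sketch below assumes the log is enabled and that each entry carries a header line beginning `# Query_time:`, as in MySQL's usual slow-log format; the function name and the log path in the usage line are placeholders.

```shell
#!/bin/sh
# count_slow_queries FILE [SECONDS]
# Count log entries whose Query_time exceeds SECONDS (default 1).
count_slow_queries() {
    awk -v limit="${2:-1}" '
        $1 == "#" && $2 == "Query_time:" && ($3 + 0) > (limit + 0) { n++ }
        END { print n + 0 }
    ' "$1"
}
```

For example, `count_slow_queries /path/to/mysql-slow.log 1` reports how many logged queries took longer than the 1s threshold mentioned above.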

kimchy · June 3, 2011, 8:11pm

Thanks for bringing this to a resolution! You bring up an important point that I plan to tackle soon: index-level stats (search + index), so that one can know a bit more about what's going on in the index.
