Sudden Unexplained CPU Usage

Back with more issues. Periodically, and seemingly inexplicably, the whole
cluster becomes essentially unusable for what's sometimes hours. There's
nothing in the logs (or slow log) to indicate what's up, and hot_threads
(https://gist.github.com/4280439) on these machines has been more than a
little vague. Each of the machines' CPU usage gets pegged at 100% of all
cores, and then gradually the different machines back off. From paramedic:

https://lh6.googleusercontent.com/-Y1G_abxvcEY/UMpQSKkrtJI/AAAAAAAAAAg/NryuOOLXEAM/s1600/Screen+Shot+2012-12-13+at+2.01.26+PM.png

I'm at a loss trying to explain why this is happening.
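
For reference, hot_threads can be asked to sample more threads over a longer
interval than the defaults, which sometimes gives a less vague picture; a
sketch using the standard parameters of the nodes hot_threads API:

curl 'localhost:9200/_nodes/hot_threads?threads=10&interval=1s&type=cpu'
# threads: how many of the hottest threads to report (default 3)
# interval: how long to sample before reporting (default 500ms)
# type: cpu, wait, or block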

--

Ouch, that's a lot of green. How's your JVM/GC doing when this is
happening? Have you tried looking at the thread dump? What about all the
other system/ES metrics?

Otis

ELASTICSEARCH Performance Monitoring - http://sematext.com/spm/index.html

On Thursday, December 13, 2012 5:05:05 PM UTC-5, Dan Lecocq wrote:

--

Thanks for the reply,

So, there is a certain amount of garbage collection going on, though not
enough, it seems, to explain this. The stack traces indicate much the same
as hot_threads did (https://gist.github.com/4280439). Something that's
puzzling to me is that it seems to be doing a fair amount of reading /
writing from disk, even though the partition with our data is a RAID0
across four ephemeral drives. iotop is claiming only a few MBps of
read/write, but I've easily seen that array hit 100MBps.

Here's also a full picture from paramedic:

https://lh3.googleusercontent.com/-i-OXrFSbLH0/UMupZivxgWI/AAAAAAAAAAw/DOL1B2BJLoc/s1600/Paramedic+++freshscape+production+sudden+unexplained.png

On Thursday, December 13, 2012 8:25:06 PM UTC-8, Otis Gospodnetic wrote:

--

Hi,

I can't quite tell what is going on from the screenshot... everything
looks super small and uniform... :(

If you like paramedic you may also like SPM (Elasticsearch Monitoring):
http://sematext.com/spm/index.html
But yes, I looked at those stack traces and it looks like a lot of disk
reading. Maybe iotop is lying. dstat is nice, as is iostat and vmstat.
How big is your index, and how much RAM have you got? What sort of queries
are you serving? Are they very diverse? Can you show a bit of vmstat 2
output or a disk IO graph from SPM?
Have you tried using MMapDirectory?
See the Elasticsearch index store documentation.
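
(For concreteness, the kind of quick check being asked for might look like
this; nothing Elasticsearch-specific about it:)

vmstat 2
# columns worth watching here: si/so (swap activity), bi/bo (blocks in/out),
# and wa (CPU time spent waiting on IO)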

Otis

ELASTICSEARCH Performance Monitoring - http://sematext.com/spm/index.html

On Friday, December 14, 2012 5:34:24 PM UTC-5, Dan Lecocq wrote:

--

Dan,

Which version of java are you using? What you are describing looks
like this java bug: http://bugs.sun.com/view_bug.do?bug_id=6919638

I would recommend upgrading to the latest java 7 release.

Igor

On Friday, December 14, 2012 8:14:41 PM UTC-8, Otis Gospodnetic wrote:

--

Hey guys, thanks for the replies.

The queries are pretty straightforward, and relatively diverse in terms of
the content they're after. That said, most of the queries are just a query
string combined with a limiting date range. I've not tried MMapDirectory,
though from that documentation it sounds like the best implementation is
chosen automatically?

WRT the java version, I can try 7, as we're apparently running 6.
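
For reference, the query shape described above is roughly the following;
the index name and the date field are made up for illustration:

curl -XGET 'localhost:9200/some_index/_search?pretty' -d '{
  "query": {
    "filtered": {
      "query":  { "query_string": { "query": "some search terms" } },
      "filter": { "range": { "published": { "from": "2012-12-01", "to": "2012-12-20" } } }
    }
  }
}'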

On Monday, December 17, 2012 7:45:25 AM UTC-8, Igor Motov wrote:

--

I've now switched to mmapfs and java 7, and still no luck. At least we've
been able to reproduce the problem: it always happens when people query
elasticsearch :-/
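
(For anyone following along: the store type is an index-level setting, so
the switch to mmapfs is done either node-wide in elasticsearch.yml or per
index when the index is created; the index name below is made up for
illustration:)

# node-wide, in elasticsearch.yml:
#   index.store.type: mmapfs
# or per index, at creation time:
curl -XPUT 'localhost:9200/some_index' -d '{
  "settings": { "index.store.type": "mmapfs" }
}'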

We've had limited success with a handful of concurrent users using our API
internally, but after more than just a couple, it becomes almost completely
unresponsive (multi-minute search times).

We have about 440M docs across about a dozen indexes, on 27 m1.xlarge
instances, each configured with a RAID0 across 4 ephemeral drives. It's
both confusing and frustrating to not understand why we're not getting
better query performance.

On Monday, December 17, 2012 8:07:51 AM UTC-8, Dan Lecocq wrote:

--

Are you getting the same stack traces in hot threads?

On Monday, December 17, 2012 6:26:01 PM UTC-5, Dan Lecocq wrote:

--

The stack traces from jstack and hot threads are essentially the same;
jstack's quite a bit more verbose, but it shows the same picture.

The hot threads now (using mmapfs) no longer say anything about nio, but
the profile is very similar in terms of how much CPU is listed as getting
used (gist updated: https://gist.github.com/4280439). We've been watching
iostat, because even after we submit some basic queries and it gets into
this state, it remains in a bad way for quite a while. What we're seeing is
that the IO ops (included in the gist) stay relatively high during these
periods even though the throughput stays low (sounds a little like
thrashing to me). Could this be symptomatic of the number of indexes we
have? The number of shards?
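
(To make the "high ops, low throughput" observation concrete, this is the
sort of iostat output being compared; r/s and w/s climbing while rkB/s and
wkB/s stay small is the many-tiny-reads pattern:)

iostat -xk 2
# r/s, w/s: read/write requests per second (the "io ops")
# rkB/s, wkB/s: actual throughput
# %util: how busy the device is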

On Tuesday, December 18, 2012 11:18:36 AM UTC-8, Igor Motov wrote:

--

Which file is the latest hot thread dump?

Could you also add the output of the following commands to your gist?

curl "localhost:9200/_nodes/_local/jvm?pretty=true"
curl "localhost:9200/_nodes/_local/stats?jvm=true&pretty=true"

On Tuesday, December 18, 2012 2:27:13 PM UTC-5, Dan Lecocq wrote:

--

Sorry for the late reply :-/ We've been able to reproduce this condition by
"flooding" the cluster with a few basic queries at a time. Most of these
time out (and the cluster stays in this spinning-its-wheels state for as
much as an hour), and even the ones that don't time out generally take at
least 10 seconds.

We compiled pylucene on one of the nodes so that we could query the
individual lucene indexes directly (to verify what we found using luke),
and we find that most of the shards on a node can be queried very quickly.
In fact, the average query time there is about 3-5ms.

I updated the gist with the jvm and stats info. Thanks for your continued
interest in this issue :-/ It's quickly becoming extremely disconcerting.

On Tuesday, December 18, 2012 11:46:28 AM UTC-8, Igor Motov wrote:

--

Are you using leading wildcards in these "basic" queries?

On Thursday, December 20, 2012 4:48:59 PM UTC-5, Dan Lecocq wrote:

--

Our initial tests included some wildcards in some queries, but they were
never leading wildcards, and we've since removed them for subsequent tests
and still see this issue.

On Thursday, December 20, 2012 2:27:01 PM UTC-8, Igor Motov wrote:

--

Oh, it's also worth mentioning that at this point, given that free RAM on
these instances is extremely low, we think it might be an issue of
thrashing the fs cache. We're giving it a try on instances with both higher
IO and more memory available.
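
(A quick way to sanity-check that theory; the node stats call is the same
one requested earlier in the thread:)

free -m
# "free" near zero is normal on Linux, but free + cached near zero means
# there's effectively no page cache left for the index files
curl 'localhost:9200/_nodes/_local/stats?jvm=true&pretty=true'
# compare jvm heap committed against total RAM; whatever the heap takes
# isn't available to the fs cache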

On Thursday, December 20, 2012 2:45:11 PM UTC-8, Dan Lecocq wrote:

--

Judging from the hot threads that you posted, you are still running
wildcard queries, and this is where most of the time is spent.
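
(To make that concrete: a trailing wildcard in a query_string clause, like
the hypothetical one below, is rewritten into a multi-term query that has
to enumerate matching terms in each shard's terms dictionary, which fits
the high-IOPS / low-throughput pattern described above. Field names and
terms are invented.)

curl -XGET 'localhost:9200/_search?pretty' -d '{
  "query": { "query_string": { "query": "body:econom* AND published:[2012-12-01 TO 2012-12-20]" } }
}'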

On Thursday, December 20, 2012 5:46:35 PM UTC-5, Dan Lecocq wrote:

Oh, it's also worth mentioning that at this point, given our free RAM on
these instances is extremely low, we think it might be an issue of
thrashing the fs cache. We're giving it a try with instances with both
higher IO and higher memory available.

On Thursday, December 20, 2012 2:45:11 PM UTC-8, Dan Lecocq wrote:

Our initial tests included some wildcards in some queries, but they were
never leading wildcards, and we've since removed them for subsequent tests
and still see this issue.

On Thursday, December 20, 2012 2:27:01 PM UTC-8, Igor Motov wrote:

Are you using leading wildcards in these "basic" queries?

On Thursday, December 20, 2012 4:48:59 PM UTC-5, Dan Lecocq wrote:

Sorry for the late reply :-/ We've been able to reproduce this
condition by "flooding" the cluster with a few basic queries at a time.
Most of these time out (and the cluster stays in this spinning-its-wheels
state for as much as an hour), but they all generally take at least 10
seconds if they don't time out.

We compiled pylucene on one of the nodes so that we could try to query
the individual lucene indexes (to verify what we found using luke), and we
find that most of the shards on a node can be queries very quickly. In
fact, that average query time there is about 3-5ms.

I updated this gist with the jvm and stats info. Thanks for your
continued interest in this issue :-/ It's quickly becoming extremely
disconcerting.

On Tuesday, December 18, 2012 11:46:28 AM UTC-8, Igor Motov wrote:

Which file is the latest hot thread dump?

Could you also add the output of the following commands to your gist?

curl "localhost:9200/_nodes/_local/jvm?pretty=true"
curl "localhost:9200/_nodes/_local/stats?jvm=true&pretty=true"

On Tuesday, December 18, 2012 2:27:13 PM UTC-5, Dan Lecocq wrote:

The stack traces from jstack and hot threads are essentially the same;
jstack's output is quite a bit more verbose, but it shows the same thing.

The hot threads now (using mmapfs) no longer say anything about nio,
but the profile is very similar in terms of how much CPU is listed as
getting used (gist updated: https://gist.github.com/4280439). We've
been watching iostat, because even after we submit some basic queries and
it gets into this state, it remains in a bad way for quite a while. What
we're seeing is that the IO ops (included in the gist) stay relatively high
during these times despite the low throughput (sounds a little like thrashing
to me).
Could this be symptomatic of the number of indexes we have? The number of
shards?
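
For reference, the index and shard counts being asked about here can be read
straight from the cluster health endpoint (assuming a node is reachable on
localhost:9200):

curl "localhost:9200/_cluster/health?level=indices&pretty=true"
# active_primary_shards / active_shards give the totals; level=indices
# breaks them down per index. Many small shards per node means many
# Lucene indexes competing for the same filesystem cache.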

On Tuesday, December 18, 2012 11:18:36 AM UTC-8, Igor Motov wrote:

Are you getting the same stack traces in hot threads?

On Monday, December 17, 2012 6:26:01 PM UTC-5, Dan Lecocq wrote:

I've now switched to mmapfs and Java 7, and still no luck. At least we've
been able to reproduce the problem: it always happens when people query
Elasticsearch :-/

We've had limited success with a handful of concurrent users using
our API internally, but after more than just a couple, it becomes almost
completely unresponsive (multi-minute search times).

We have about 440M docs across about a dozen indexes, on 27
m1.xlarge instances, each configured with a RAID0 across 4 ephemeral
drives. It's both confusing and frustrating to not understand why we're not
getting better query performance.
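
For anyone following along, the mmapfs switch mentioned here is a per-node
setting; a minimal sketch, assuming the default config layout (each node
needs a restart afterwards):

# add the store type to the node's config, then restart the node
echo "index.store.type: mmapfs" >> config/elasticsearch.yml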

On Monday, December 17, 2012 8:07:51 AM UTC-8, Dan Lecocq wrote:

Hey guys, thanks for the replies.

The queries are pretty straightforward, and relatively diverse in
terms of the content they're after. That said, most of the queries are just
a query string plus a limiting date range. I've not tried
MMapDirectory, though from that documentation it sounds like the best
interface is automatically chosen?

WRT the java version, I can try 7, as we're apparently running 6.

On Monday, December 17, 2012 7:45:25 AM UTC-8, Igor Motov wrote:

Dan,

Which version of java are you using? What you are describing
looks like this java bug:
http://bugs.sun.com/view_bug.do?bug_id=6919638

I would recommend upgrading to the latest java 7 release.

Igor

--

D'oh -- I think I deleted the wrong hot_threads in that gist.

At any rate, we just ran a test using pylucene against the Lucene indexes
from one of our nodes, using an instance with more RAM. We suspect that we
were thrashing the page cache, and judging from our results on an m2.4xlarge,
we've largely confirmed our suspicions. Our next step is to spin up a small
ES cluster with that instance type and confirm the same timing results
through ES.
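
A quick way to sanity-check the page-cache theory while one of these floods
is running, using nothing ES-specific:

# how much memory is actually left to the kernel for caching?
free -m
# watch cache size, swap activity and block IO as the queries run
vmstat 2
# many reads per second but little data per read also points at thrashing
iostat -x 2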

--

Hi Dan,

It would be interesting to know what ratio of JVM heap to the rest of the
machine's memory you had before and after tuning, to help users who are
experiencing similar CPU hogs.

Thanks,

Jörg

--
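
For readers landing here later: a common rule of thumb is to give the JVM
roughly half of a node's RAM and leave the rest to the OS page cache for
Lucene. A minimal sketch, assuming the stock elasticsearch.in.sh (which reads
ES_HEAP_SIZE) and an m1.xlarge with 15 GB of RAM:

# ~7 GB heap, ~8 GB left for the page cache; equivalent to -Xms7g -Xmx7g
export ES_HEAP_SIZE=7g
bin/elasticsearch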