Investigating Elasticsearch load issues

I'm trying to investigate the relatively high load we're seeing across every
node in a 15-instance cluster. So far I haven't been able to track down
exactly what's going on. Performance itself doesn't seem to be a problem.
However, I don't think we need a 15-node cluster for the amount of data we
have, and I'd like to bring that number down. I don't feel I can do that
while we're seeing this kind of load across the cluster.

Overall, usage is relatively low right now. Most of our users are college
students, so the previous two weeks saw the highest usage of the year.
Even during that time, however, we had about the same load as we do currently
("es_10_day_cluster_load.png" screenshot attached, pulled from data sent to
Graphite). There is little variation, and a load of 1.5 across the entire
cluster seems relatively high to me.

I spent part of yesterday morning looking into things a bit further. Using
hot_threads reveals the following:

https://gist.github.com/anonymous/15c49c761ad468a367be

Running "top" on the same instance gives us this (note there is little wait
time, so it doesn't seem to be I/O):

https://gist.github.com/anonymous/bf7cc3a308b4e2d7cd58

We also have bigdesk running but haven't seen much there that indicates
something is amiss. I've attached a screenshot from the node corresponding
to the two previous gists ("es18_bigdesk_output_1.png" and
"es18_bigdesk_output_2.png").

We thought that maybe garbage collection was the problem, but if it were, we
would expect to see higher CPU usage. We also aren't seeing any warnings in
the logs about garbage collection. There are a few slow queries in the logs,
but they happen only sporadically and don't seem to explain the ongoing high
load. For an uptime of 245 hours, about 3.33 hours correspond to garbage
collection, as shown in the bigdesk screenshot. The "user total" for CPU
time according to bigdesk does seem to be about 55 hours, but that seems
like a lot given the CPU %. I can't see where the CPU usage is coming from,
though.

The bottom line is that I don't think we need 15 instances in order to
handle our search data. Cutting that by 1/3 would save us a load of money.
But we can't just jump to a smaller cluster without knowing that we won't
see higher load per instance.

Thanks in advance for any insight,

Dale

Other info:
Elasticsearch version: 0.20.6
Java version: 1.7.0_17
AWS instance size: m2.2xlarge, 34.2 GiB of memory
ES heap size: 20GB
Local storage: 8x 60GB EBS volumes in RAID-0
Indices: 7 in total. The larger indices have 15 shards, the smaller have 5,
each with a single replica. The largest index has 133M documents in it with
a size of 145GB (290GB including the replicas).
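
For a rough sense of scale, the shard math works out like this (a sketch;
it assumes two of the seven indices are the 15-shard "large" ones, which is
our split, with the rest at 5 shards):

```python
# Shard totals for the cluster described above. Assumption: 2 "large"
# indices at 15 shards each, 5 "small" ones at 5 shards each, one replica
# per shard, spread over 15 nodes.
large_indices, small_indices = 2, 5
primaries = large_indices * 15 + small_indices * 5  # 55 primary shards
total_shards = primaries * 2                        # 110 including replicas
shards_per_node = total_shards / 15                 # ~7.3 shards per node
print(total_shards, round(shards_per_node, 1))      # 110 7.3
```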

Happy to provide any additional data we can.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

Hello Dale,

I think the CPU usage comes from the searches. From the Bigdesk screenshot,
I see something like 10 queries per second.

20GB to ES seems like too much. The heap usage from BigDesk shows 5GB usage
or so. I'd change the heap size to 10GB, and that would allow more room for
OS caches, which should help the queries and hopefully also lower the load.
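
To make the arithmetic concrete (a back-of-the-envelope sketch using the
instance size from the "Other info" section; it ignores memory used by
other processes):

```python
def free_for_cache(ram_gb: float, heap_gb: float) -> float:
    """RAM left over for the OS page cache after the ES heap is carved out."""
    return ram_gb - heap_gb

# m2.2xlarge: ~34.2 GiB of RAM
print(round(free_for_cache(34.2, 20), 1))  # 14.2 -> cache room with a 20GB heap
print(round(free_for_cache(34.2, 10), 1))  # 24.2 -> cache room with a 10GB heap
```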

Speaking of load, an average of 2 on a 4-CPU machine doesn't sound like
much. I would assume that if you change the heap to 10GB, you can take one
node out of the cluster at a time and see what happens. I would suspect
that 8 nodes or so should handle the load with these "low-risk" changes.

Another thing that might help is upgrading to 0.90. It's faster for most
tasks and it should use less memory. And your indices should get smaller
(btw, are you using store compression on 0.20?).

Also, I would assume you have ~140 shards now in total. Since your load
seems to be generated mostly by searches, I would assume a lower number of
shards would help, especially if you want to come down to 5 nodes or so.

Best regards,
Radu

http://sematext.com/ -- ElasticSearch -- Solr -- Lucene



Thank you for the reply Radu. A few answers inline:

On Tuesday, May 14, 2013 10:15:52 AM UTC-7, Radu Gheorghe wrote:

Hello Dale,

I think the CPU usage comes from the searches. From the Bigdesk
screenshot, I see something like 10 queries per second.

This doesn't seem like a lot to me. Is it normal that there's such a
discrepancy between query and fetch? Is there any way to track down exactly
which searches are causing most of the load?

20GB to ES seems like too much. The heap usage from BigDesk shows 5GB usage
or so. I'd change the heap size to 10GB, and that would allow more room for
OS caches, which should help the queries and hopefully also lower the load.

We can try this, although there should still be 10+GB on those machines for
the OS caches, so I can't imagine this is really the issue.

Speaking of load, an average of 2 on a 4-CPU machine doesn't sound like
much. I would assume that if you change the heap to 10GB, you can take one
node out of the cluster at a time and see what happens. I would suspect
that 8 nodes or so should handle the load with these "low-risk" changes.

As mentioned though, the load doesn't seem to reflect actual usage, which
is a bit odd. Generally one would assume that load is distributed, so I'm
wary of removing nodes on the assumption that the load will simply spread
to the remaining nodes in the cluster.

Another thing that might help is upgrading to 0.90. It's faster for most
tasks and it should use less memory. And your indices should get smaller
(btw, are you using store compression on 0.20?).

We just moved to 0.20 the week 0.90 came out. It's a bit hard for us to
upgrade our production environment, so we may not be able to tackle this
immediately. We are using store compression for the two large indices
(explicitly storing the _source field anyway), though I now see we may not
be compressing the term vectors.

Also, I would assume you have ~140 shards now in total. Since your load
seems to be generated mostly by searches, I would assume a lower number of
shards would help, especially if you want to come down to 5 nodes or so.

This seems counter-intuitive. With more shards and a decent distribution of
them across the cluster, aren't we spreading the load out across more
machines?

Many thanks for the help,

Dale

Best regards,
Radu

http://sematext.com/ -- ElasticSearch -- Solr -- Lucene



Hello Dale,

On Tue, May 14, 2013 at 8:55 PM, Dale Beermann beermann@gmail.com wrote:

Thank you for the reply Radu. A few answers inline:

On Tuesday, May 14, 2013 10:15:52 AM UTC-7, Radu Gheorghe wrote:

Hello Dale,

I think the CPU usage comes from the searches. From the Bigdesk
screenshot, I see something like 10 queries per second.

This doesn't seem like a lot to me. Is it normal that there's such a
discrepancy between query and fetch?

I'm not sure. I assume that if a shard doesn't have any match for a query,
it doesn't have anything to fetch, which produces the discrepancy.

Is there any way to track down exactly which searches are causing most of
the load?

You can set lower thresholds for the slowlog in elasticsearch.yml, on the
reasoning that queries taking more time are the heavier ones. I don't see a
better option.
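
For example, something like this in elasticsearch.yml (setting names as in
the 0.20-era docs; the threshold values here are only illustrative):

```yaml
# Lower the search slowlog thresholds so more queries get logged.
index.search.slowlog.threshold.query.warn: 2s
index.search.slowlog.threshold.query.info: 500ms
index.search.slowlog.threshold.fetch.warn: 1s
index.search.slowlog.threshold.fetch.info: 200ms
```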

20GB to ES seems like too much. The heap usage from BigDesk shows 5GB usage
or so. I'd change the heap size to 10GB, and that would allow more room for
OS caches, which should help the queries and hopefully also lower the load.

We can try this, although there should still be 10+GB on those machines
for the OS caches, so I can't imagine this is really the issue.

Yeah, but your largest index alone is 145GB or so; spread across 15 nodes,
that's about 10GB per node, which almost fills your caches.

I'm not 100% sure it helps, but one can try :)

Speaking of load, an average of 2 on a 4-CPU machine doesn't sound like
much. I would assume that if you change the heap to 10GB, you can take one
node out of the cluster at a time and see what happens. I would suspect
that 8 nodes or so should handle the load with these "low-risk" changes.

As mentioned though, the load doesn't seem to reflect actual usage, which
is a bit odd.

I'm not sure I follow. You mean you have higher usage (number of searches)
and lower load and the other way around? This might happen sporadically due
to merges, even with light indexing, but if it's the "normal" state for you
then it's weird.

Generally one would assume that load is distributed, so I'm wary of
removing nodes on the assumption that the load will simply spread to the
remaining nodes in the cluster.

Are you saying that your load isn't distributed now? As in, some machines
are more loaded than others? That might happen if your data isn't spread
evenly between nodes (e.g. shards from the bigger index on some machines,
shards from the smaller indices on others).

0.90 has a cool new shard balancer which might help. I haven't used it yet,
but I understand it takes shard sizes into account.

Another thing that might help is upgrading to 0.90. It's faster for most
tasks and it should use less memory. And your indices should get smaller
(btw, are you using store compression on 0.20?).

We just moved to 0.20 the week 0.90 came out. It's a bit hard for us to
upgrade our production environment, so we may not be able to tackle this
immediately. We are using store compression for the two large indices
(explicitly storing the _source field anyway), though I now see we may not
be compressing the term vectors.

Yeah, I completely understand that it's hard to upgrade. But maybe it's an
option to keep in mind if other changes fail to give you the performance
you need.

Also, I would assume you have ~140 shards now in total. Since your load
seems to be generated mostly by searches, I would assume a lower number of
shards would help, especially if you want to come down to 5 nodes or so.

This seems counter-intuitive. With more shards and a decent distribution of
them across the cluster, aren't we spreading the load out across more
machines?

I didn't understand that load spreading was an issue. Can you say a bit
more about it, or am I missing something?

Many thanks for the help,

You're welcome :)

Best regards,
Radu



You are kidding. When I run heavy tasks and crunch data with ES, I can
push the load up to 100.00 on a single node, and the system is still
responsive because the I/O subsystem has the capacity.

From your numbers, your system is idling.

Cheers,

Jörg

On 14.05.13 18:02, Dale Beermann wrote:

There is little variation and a load of 1.5 across the entire cluster
seems relatively high to me.


The problem is in understanding whether or not we can reduce the size of
our cluster. As mentioned, I don't feel that 15 nodes is necessary for the
size of our data and our usage patterns. In the simplest scenario, this is
equivalent to asking whether a load of 2 on a two-node cluster would become
a load of 4 on a single-node cluster, which in my experience would be
extremely high.
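
The worry can be sketched with a naive linear model (an assumption, not a
prediction: it supposes total search work stays fixed and spreads evenly
over the remaining nodes):

```python
def projected_load(current_load: float, current_nodes: int,
                   target_nodes: int) -> float:
    """Naive linear model: per-node load scales inversely with node count."""
    return current_load * current_nodes / target_nodes

print(projected_load(2.0, 2, 1))    # 4.0  -> the simplest scenario above
print(projected_load(1.5, 15, 10))  # 2.25 -> cutting our 15 nodes to 10
```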

To clear up some of Radu's confusion:

We currently have good load distribution across the cluster, as shown in
the original graph.

The load does not seem to be directly tied to our general site usage. With
very few users online, we never drop below a load of 1 across the cluster.
With 6x the users online, we rarely go above a load of 2-2.5. Again, this
is reflected in the graph. I'm hoping to track down where the load is
coming from in order to understand what the effects of reducing our cluster
size might be.

The current takeaway is that our options are to:

  1. Lower the slow log thresholds (to capture more of the queries).
  2. Try reducing the number of shards.
    • This requires a full re-index, which is a bit difficult. In
      addition, it's risky, as we are just assuming load will not spike as
      a result of doing this.
  3. Try reducing the memory allocated to ES.
    • It doesn't seem that this will have much of an effect, unless
      garbage collection is the big issue, given that I/O is very low and
      system-level caching does not seem to be a factor.

I'm still hoping to find a better way of tracking down what exactly is
causing the load.

Dale


Dale, try taking one node out and see what happens. It's easier to diagnose
when the high load is actually there in front of you.

If you don't see high load (and I suspect you won't), then take out another
one, and so on.

It's hard to troubleshoot performance issues that aren't happening; I guess
that was Jörg's point. You might also find that there are no performance
issues at all (e.g. a 5-6 node cluster handling everything).

Best regards,
Radu


It's hard to troubleshoot performance issues that aren't happening,

++


Do you have a test system?

One method is to grab logged queries from production, fire them against an
idle ES test cluster, and see how the load increases. The same holds for
indexing data.
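
A minimal sketch of that replay idea (the file name, index name, and test
host below are hypothetical; it assumes one JSON query body per line in a
capture file):

```python
import json
import urllib.request

def load_queries(path: str):
    """Parse one JSON query body per line, skipping blank lines."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def replay(queries, url="http://test-es:9200/my_index/_search"):
    """POST each query body to the test cluster; watch load on its nodes."""
    for q in queries:
        req = urllib.request.Request(url, data=json.dumps(q).encode())
        urllib.request.urlopen(req).read()

# Usage (against the idle test cluster, never production):
# replay(load_queries("queries.jsonl"))
```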

It is incredibly difficult to shrink a running system in order to identify
load causes. I expect you will see either the same load or slightly worse
load.

Jörg

On 16.05.13 00:56, Dale Beermann wrote:

I'm still hoping to find a better way of tracking down what exactly is
causing the load.
