Hello Dale,
On Tue, May 14, 2013 at 8:55 PM, Dale Beermann beermann@gmail.com wrote:
Thank you for the reply Radu. A few answers inline:
On Tuesday, May 14, 2013 10:15:52 AM UTC-7, Radu Gheorghe wrote:
Hello Dale,
I think the CPU usage comes from the searches. From the Bigdesk
screenshot, I see something like 10 queries per second.
This doesn't seem like a lot to me. Is it normal that there's such a
discrepancy between query and fetch?
I'm not sure. I assume that if a shard doesn't have any match for a query,
it doesn't have anything to fetch, which produces the discrepancy.
Is there any way to track down exactly which searches are causing most of
the load?
You can set lower thresholds for the slowlog in elasticsearch.yml, on the
assumption that the queries taking more time are also the heavier ones. I don't
see a better option.
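For example, something like this in elasticsearch.yml (the thresholds below are
just illustrative, pick values that match what "slow" means for you):

    index.search.slowlog.threshold.query.warn: 2s
    index.search.slowlog.threshold.query.info: 1s
    index.search.slowlog.threshold.query.debug: 500ms
    index.search.slowlog.threshold.fetch.warn: 1s
    index.search.slowlog.threshold.fetch.info: 500ms

Anything slower than those thresholds then shows up in the search slowlog,
which at least tells you which queries are the expensive ones.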
20GB to ES seems like too much. The heap usage from BigDesk shows 5GB
usage or so. I'd change the heap size to 10GB, and that would allow more
room for OS caches, which should help the queries and hopefully also lower
the load.
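If you start Elasticsearch through the standard scripts, that's usually just a
matter of setting the heap before restarting each node (a rolling restart, one
node at a time), something like:

    export ES_HEAP_SIZE=10g

or passing -Xms10g -Xmx10g to the JVM directly, depending on how you launch it.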
We can try this, although there should still be 10+GB on those machines
for the OS caches, so I can't imagine this is really the issue.
Yeah, but your largest index alone is 145GB or so; spread across 15 nodes
that's about 10GB/node, which almost fills your caches.
I'm not 100% sure it will help, but it's worth a try.
Speaking of load, an average of 2 on a 4-CPU machine doesn't sound like that
much. I would assume that if you change the heap to 10GB, you can take one
node out of the cluster at a time and see what happens. I would suspect
that 8 nodes or so should handle the load with these "low-risk" changes.
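To take a node out cleanly, if I remember correctly you can tell the cluster
to move shards off it first with allocation filtering, something along these
lines (10.0.0.x is just a placeholder for the node's IP):

    curl -XPUT localhost:9200/_cluster/settings -d '{
      "transient": {
        "cluster.routing.allocation.exclude._ip": "10.0.0.x"
      }
    }'

and then shut the node down once its shards have been relocated.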
As mentioned though, the load doesn't seem to reflect actual usage, which
is a bit odd.
I'm not sure I follow. You mean you have higher usage (number of searches)
and lower load and the other way around? This might happen sporadically due
to merges, even with light indexing, but if it's the "normal" state for you
then it's weird.
Generally one would assume that load is distributed, so I'm wary of reducing
nodes in a way that would, presumably, just spread that load out to the
remaining nodes in the cluster.
Are you saying that your load isn't distributed now? As in, some machines
are more loaded than others? That might happen if your data isn't spread
evenly between nodes (eg: shards from the bigger index on some machines,
shards from the smaller indices on other machines).
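One way to check is to look at the routing table in the cluster state and see
which shards ended up on which nodes, e.g.:

    curl 'localhost:9200/_cluster/state?pretty'

If the big-index shards are concentrated on a few nodes, that would explain
uneven load.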
0.90 has a cool new shard balancer which might help. I haven't used it yet,
but I understand it takes shard sizes into account.
Another thing that might help is upgrading to 0.90. It's faster for most
tasks and it should use less memory. And your indices should get smaller
(btw, are you using store compression on 0.20?).
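(By store compression I mean something like the following index settings,
if I'm remembering the 0.20-era names correctly:

    index.store.compress.stored: true
    index.store.compress.tv: true

plus "compress": true on _source in the mapping.)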
We just moved to 0.20 the week 0.90 came out. It's a bit hard for us to
upgrade our production environment, so we may not be able to tackle this
immediately. We are using store compression for the two large indices as
well (we explicitly store the _source field anyway), although I see now that
we may not be compressing term vectors.
Yeah, I completely understand that it's hard to upgrade. But maybe it's an
option to keep in mind if the other options fail to give you the performance
you need.
Also, I would assume you have ~140 shards now in total. Since your load
seems to be generated mostly by searches, I would assume a lower number of
shards would help, especially if you want to come down to 5 nodes or so.
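Since the shard count is fixed when an index is created, going to fewer shards
means reindexing into a new index created with the lower count, e.g. something
like (the index name is just an example):

    curl -XPUT localhost:9200/myindex_v2 -d '{
      "settings": {
        "number_of_shards": 5,
        "number_of_replicas": 1
      }
    }'

then reindexing your documents into it and switching over (an alias makes the
switch painless).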
This seems counter-intuitive. With a higher shard count and a decent
distribution of shards across the cluster, why wouldn't we be spreading the
load out across more machines?
I didn't understand that load spreading was an issue. Can you say a bit
more about it, or am I missing something?
Many thanks for the help,
You're welcome
Best regards,
Radu
Dale
Best regards,
Radu
http://sematext.com/ -- Elasticsearch -- Solr -- Lucene
On Tue, May 14, 2013 at 7:02 PM, Dale Beermann beer...@gmail.com wrote:
I'm trying to investigate relatively high load we're seeing across every
node in a 15 instance cluster. So far I haven't been able to track down
exactly what's going on. Performance itself doesn't seem to be a problem.
However, I don't think that we need a 15 node cluster for the size data we
have and I'd like to get that down. I don't feel like I can do that while
we are seeing this type of load across the cluster.
Overall, usage is relatively low right now. Most of our users are
college students so the previous two weeks saw the highest usage of the
year. During that time, however, we had about the same load as we do
currently ("es_10_day_cluster_load.png" screenshot attached, pulled from
data sent to graphite). There is little variation and a load of 1.5 across
the entire cluster seems relatively high to me.
I spent part of yesterday morning looking into things a bit further.
Using hot_threads reveals the following:
https://gist.github.com/anonymous/15c49c761ad468a367be
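(For reference, that output came from the nodes hot_threads API, something
like: curl 'localhost:9200/_nodes/hot_threads')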
Running "top" on the same instance gives us this (Note there is little
wait time so it doesn't seem to be I/O):
https://gist.github.com/anonymous/bf7cc3a308b4e2d7cd58
We also have bigdesk running but haven't seen much there that indicates
something is amiss. I've attached a screenshot from the node corresponding
to the two previous gists ("es18_bigdesk_output_1.png" and
"es18_bigdesk_output_2.png").
We thought that maybe garbage collection was the problem, but in that case we
would expect to see higher CPU usage. We also aren't seeing any warnings in
the logs about garbage collection. There are a few slow queries in the logs,
but they happen sporadically and don't seem to be responsible for the ongoing
high load. For an uptime of 245 hours, there are about 3.33 hours
corresponding to garbage collection, as shown in the BigDesk screenshot. The
"user total" for CPU time according to BigDesk does seem to be about 55 hours,
but that seems like a lot given the CPU %. I can't see where the CPU usage is
coming from though.
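(If my math is right, 55 hours of user CPU over 245 hours of uptime is roughly
55/245 ≈ 0.22 of a single core on average, or around 5-6% of the 4 cores.)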
The bottom line is that I don't think we need 15 instances in order to
handle our search data. Cutting that by 1/3 would save us a load of money.
But we can't just jump to a smaller cluster without knowing that we won't
see higher load per instance.
Thanks in advance for any insight,
Dale
Other info:
Elasticsearch version: 0.20.6
Java version: 1.7.0_17
AWS Instance size: m2.2xlarge (34.2 GiB of memory)
ES heap size: 20GB
Local storage: 8x 60GB EBS volumes in RAID-0
Indices: 7 in total. The larger indices have 15 shards, the smaller have
5, each with a single replica. The largest index has 133M documents in it
with a size of 145GB (290GB including the replicas).
Happy to provide any additional data we can.
--
http://sematext.com/ -- Elasticsearch -- Solr -- Lucene
--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.