Elasticsearch CPU usage

I've been noticing that our Elasticsearch servers have been using more CPU than our database and web servers lately. I attached a screenshot of our CPU graph: the green and blue lines are our two Elasticsearch servers, the yellow and pink lines are the web servers, and the other two are the database servers. The Elasticsearch servers run 0.19.1 on EC2 large instances, and each has three indexes with four shards per index. Any thoughts on getting the CPU usage to come down, or on tracking down why it is so high?

One of the search servers finally dropped down; the other is still going.

More information is needed even to speculate on what the problem may be. A search server using more CPU does not, by itself, indicate a problem; maybe more data is being indexed, maybe there are more queries, facets, etc. You may want to use the BigDesk plugin to see what ES is doing, what the memory consumption looks like, and so on.

Regards,
Berkay Mollamustafaoglu
mberkay on yahoo, google and skype

BigDesk is great; I hadn't heard of it before. It looks like the server that is still riding high is doing a good bit of garbage collection. I attached some screenshots from BigDesk for the server with the high CPU usage. I'm looking for ways to track down the problem, and BigDesk is a great start.

Can you issue a jstack command on the relevant node? It seems to be busy (executing search and indexing requests...), so maybe the CPU usage is justified.
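
For anyone who hasn't captured one before, here is a minimal sketch of grabbing a thread dump from the node using the JDK's jps and jstack tools from Python's standard library. The process-name match below is an assumption about how the node shows up in jps -l output; check it on your own box and adjust.

# capture_jstack.py -- save a thread dump of the local Elasticsearch JVM.
# Assumes the JDK tools "jps" and "jstack" are on the PATH; the name match
# below ("elasticsearch") is a guess at how the process appears in jps -l.
import subprocess
import sys

def find_es_pid():
    out = subprocess.check_output(["jps", "-l"]).decode()
    for line in out.splitlines():
        pid, _, name = line.partition(" ")
        if "elasticsearch" in name.lower():
            return pid
    return None

pid = find_es_pid()
if pid is None:
    sys.exit("could not find an Elasticsearch JVM via jps")

# jstack <pid> prints the stack of every live thread; save it to a file so it
# can be shared in a reply.
with open("es-jstack-%s.txt" % pid, "w") as out_file:
    out_file.write(subprocess.check_output(["jstack", pid]).decode())
print("wrote es-jstack-%s.txt" % pid)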

Here is the jstack output for both search nodes. I also included an updated CPU usage chart.

I found this in the log for search node 1:

[2012-04-10 18:02:36,748][WARN ][monitor.jvm ] [Havok]
[gc][ParNew][1004813][1418891] duration [1.8s], collections [1]/[2.1s],
total [1.8s]/[14.6h], memory [2.9gb]->[2.9gb]/[3.9gb]

Which OS are you using on the m1.large instances? By the way, I typically recommend using xlarge instances on AWS.

We are using Ubuntu 11.10 64-bit AMIs. Is there a better OS? Thanks for the heads-up on the xlarge instances; we will have to give them a try.

I was concerned about a bug in Xen (AWS) with Ubuntu, but 11.10 should be good. Based on the BigDesk stats, it seems like the CPU usage is valid, because there is real activity on the machine. I would recommend using larger instances to reduce the noisy-neighbor effect (and get higher IO) on AWS. Also, make sure to allocate roughly half the machine's memory to the JVM.
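
As a rough illustration of the "about half the memory" guideline, here is a small Python sketch that reads the machine's RAM from /proc/meminfo (Linux-only) and prints a suggested heap size. The -Xms/-Xmx hint it prints is the plain JVM form; wire the value into whichever startup script or heap environment variable your install uses.

# heap_suggest.py -- rough sketch of the "give the JVM about half the RAM"
# rule of thumb. Linux-only: reads /proc/meminfo.

def total_ram_gb():
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith("MemTotal:"):
                kb = int(line.split()[1])      # value is reported in kB
                return kb / (1024.0 * 1024.0)
    raise RuntimeError("MemTotal not found in /proc/meminfo")

ram = total_ram_gb()
heap_gb = int(ram / 2)  # roughly half the box; the rest is left for the OS cache
print("total RAM: %.1f GB -> suggested JVM heap: -Xms%dg -Xmx%dg" % (ram, heap_gb, heap_gb))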

Thanks, we will give that a try. Each search node stores its data on a single 200GB EBS volume. Do you know of any performance issues related to EBS volumes? We have been trying to find the time to set up our Mongo EBS volumes in a RAID configuration, based on recommendations in the 10gen white papers, and I'm curious whether the same applies to the Elasticsearch data. I'm not sure that 200GB of data is even enough to warrant a RAID configuration, though.

From what I know, RAID helps a lot with EBS. Note also that in ES you can configure multiple data locations (mounts) for the data directory.
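
For reference, this is roughly what that looks like in elasticsearch.yml. The two mount points below are placeholders for separate EBS volumes, and ES will spread shard data across the listed paths (the exact list syntax can vary slightly by version; a comma-separated string also works):

# elasticsearch.yml (excerpt) -- placeholder mount points for two EBS volumes
path.data: ["/mnt/ebs0/es-data", "/mnt/ebs1/es-data"]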

Today our search servers stopped responding to the web servers. The search nodes are the yellow and purple lines in the screenshot; you can see where they dropped out because all of the other lines in the chart dropped too. The web servers stayed running, and the search clients were able to reconnect once we restarted both search nodes. I realize the nodes are busy, but I would like to understand why they are using double the CPU of the web and database nodes. We are going to upgrade to extra-large instances this weekend, but I'm not confident that will fix the problem. Thoughts?

Here is the log from the first search node:

[2012-04-13 16:59:52,843][INFO ][monitor.jvm ] [Havok]
[gc][ConcurrentMarkSweep][1259230][7831] duration [11s], collections
[2]/[34.3s], total [11s]/[12.8m], memory [3.9gb]->[3.9gb]/[3.9gb]

Here is the log from the second search node:

[2012-04-13 16:56:39,436][WARN ][monitor.jvm ] [Buzz]
[gc][ParNew][1260165][1153416] duration [4.1s], collections [1]/[28.1s],
total [4.1s]/[12.4h], memory [3.9gb]->[3.9gb]/[3.9gb]
[2012-04-13 16:57:13,058][INFO ][monitor.jvm ] [Buzz]
[gc][ConcurrentMarkSweep][1260167][8995] duration [10s], collections
[2]/[32.6s], total [10s]/[14.2m], memory [3.9gb]->[3.9gb]/[3.9gb]
[2012-04-13 16:58:06,178][INFO ][monitor.jvm ] [Buzz]
[gc][ConcurrentMarkSweep][1260169][8999] duration [10.2s], collections
[2]/[31.7s], total [10.2s]/[14.3m], memory [3.9gb]->[3.9gb]/[3.9gb]
[2012-04-13 16:58:59,574][INFO ][monitor.jvm ] [Buzz]
[gc][ConcurrentMarkSweep][1260171][9003] duration [10.2s], collections
[2]/[31.8s], total [10.2s]/[14.5m], memory [3.9gb]->[3.9gb]/[3.9gb]
[2012-04-13 16:59:21,233][WARN ][transport ] [Buzz] Received
response for a request that has timed out, sent [106786ms] ago, timed out
[53452ms] ago, action [discovery/zen/fd/ping], node
[[Havok][Jl_d665RStKSr5DwkOYt3w][inet[/10.99.21.151:9300]]], id [7919030]
[2012-04-13 16:59:21,233][WARN ][transport ] [Buzz] Received
response for a request that has timed out, sent [53452ms] ago, timed out
[0ms] ago, action [discovery/zen/fd/ping], node
[[Havok][Jl_d665RStKSr5DwkOYt3w][inet[/10.99.21.151:9300]]], id [7919127]
[2012-04-13 17:00:46,096][INFO ][monitor.jvm ] [Buzz]
[gc][ConcurrentMarkSweep][1260174][9011] duration [10.2s], collections
[2]/[31.7s], total [10.2s]/[14.8m], memory [3.9gb]->[3.9gb]/[3.9gb]
[2012-04-13 17:01:07,763][WARN ][transport ] [Buzz] Received
response for a request that has timed out, sent [53413ms] ago, timed out
[21662ms] ago, action [discovery/zen/fd/ping], node
[[Songbird][3jSTLhuDQ--nX6ncH52iog][inet[/10.123.33.195:9300]]{client=true,
data=false}], id [7919277]
[2012-04-13 17:01:07,772][WARN ][transport ] [Buzz] Received
response for a request that has timed out, sent [53423ms] ago, timed out
[21671ms] ago, action [discovery/zen/fd/ping], node
[[Amazon][VYGlv1moSsWTsvHiNX2ZQg][inet[/10.195.83.46:9300]]{client=true,
data=false}], id [7919275]
[2012-04-13 17:01:39,638][INFO ][monitor.jvm ] [Buzz]
[gc][ConcurrentMarkSweep][1260176][9015] duration [10.2s], collections
[2]/[31.8s], total [10.2s]/[15m], memory [3.9gb]->[3.9gb]/[3.9gb]
[2012-04-13 17:02:18,353][WARN ][transport ] [Buzz] Received
response for a request that has timed out, sent [124003ms] ago, timed out
[92253ms] ago, action [discovery/zen/fd/ping], node
[[Havok][Jl_d665RStKSr5DwkOYt3w][inet[/10.99.21.151:9300]]], id [7919276]
[2012-04-13 17:07:20,416][WARN ][discovery.ec2 ] [Buzz] received
a join request for an existing node
[[Amazon][VYGlv1moSsWTsvHiNX2ZQg][inet[/10.195.83.46:9300]]{client=true,
data=false}]

The search nodes went down again, but this time the charts didn't spike to 100%; they flatlined at about 50% and then recovered on their own.

Logs from search server 1:

[2012-04-13 17:12:19,081][INFO ][bootstrap ] max_open_files
[63977]
[2012-04-13 17:12:19,113][INFO ][node ] [Turner D.
Century] {0.19.1}[793]: initializing ...
[2012-04-13 17:12:19,323][INFO ][plugins ] [Turner D.
Century] loaded [cloud-aws], sites []
[2012-04-13 17:12:23,465][INFO ][node ] [Turner D.
Century] {0.19.1}[793]: initialized
[2012-04-13 17:12:23,465][INFO ][node ] [Turner D.
Century] {0.19.1}[793]: starting ...
[2012-04-13 17:12:23,678][INFO ][transport ] [Turner D.
Century] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address
{inet[/10.99.21.151:9300]}
[2012-04-13 17:12:31,518][INFO ][discovery.ec2 ] [Turner D.
Century] failed to send join request to master
[[Brutacus][K9qvypyTSdGBE743MAlERQ][inet[/10.88.226.198:9300]]], reason
[org.elasticsearch.ElasticSearchTimeoutException: Timeout waiting for task.]
[2012-04-13 17:12:31,682][INFO ][cluster.service ] [Turner D.
Century] detected_master
[Brutacus][K9qvypyTSdGBE743MAlERQ][inet[/10.88.226.198:9300]], added
{[Songbird][3jSTLhuDQ--nX6ncH52iog][inet[/10.123.33.191:9300]]{client=true,
data=false},[Amazon][VYGlv1moSsWTsvHiNX2ZQg][inet[/10.195.83.46:9300]]{client=true,
data=false},[Brutacus][K9qvypyTSdGBE743MAlERQ][inet[/10.88.226.198:9300]],},
reason: zen-disco-receive(from master
[[Brutacus][K9qvypyTSdGBE743MAlERQ][inet[/10.88.226.198:9300]]])
[2012-04-13 17:12:31,731][INFO ][discovery ] [Turner D.
Century] iob_prod/DHPrtSVnSYmA-RWFxO-_bQ
[2012-04-13 17:12:31,734][INFO ][http ] [Turner D.
Century] bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address
{inet[/10.99.21.151:9200]}
[2012-04-13 17:12:31,735][INFO ][node ] [Turner D.
Century] {0.19.1}[793]: started
[2012-04-13 19:26:29,861][WARN ][transport ] [Turner D.
Century] Received response for a request that has timed out, sent [32183ms]
ago, timed out [2183ms] ago, action [discovery/zen/fd/masterPing], node
[[Brutacus][K9qvypyTSdGBE743MAlERQ][inet[/10.88.226.198:9300]]], id [7948]
[2012-04-13 19:28:00,964][INFO ][discovery.ec2 ] [Turner D.
Century] master_left
[[Brutacus][K9qvypyTSdGBE743MAlERQ][inet[/10.88.226.198:9300]]], reason
[failed to ping, tried [3] times, each with maximum [30s] timeout]
[2012-04-13 19:28:00,971][INFO ][cluster.service ] [Turner D.
Century] master {new [Turner D.
Century][DHPrtSVnSYmA-RWFxO-_bQ][inet[/10.99.21.151:9300]], previous
[Brutacus][K9qvypyTSdGBE743MAlERQ][inet[/10.88.226.198:9300]]}, removed
{[Brutacus][K9qvypyTSdGBE743MAlERQ][inet[/10.88.226.198:9300]],}, reason:
zen-disco-master_failed
([Brutacus][K9qvypyTSdGBE743MAlERQ][inet[/10.88.226.198:9300]])
[2012-04-13 19:29:04,792][INFO ][monitor.jvm ] [Turner D.
Century] [gc][ConcurrentMarkSweep][8115][139] duration [10s], collections
[2]/[32.2s], total [10s]/[20.8s], memory [3.9gb]->[3.9gb]/[3.9gb]
[2012-04-13 19:29:36,846][INFO ][monitor.jvm ] [Turner D.
Century] [gc][ConcurrentMarkSweep][8116][141] duration [10.1s], collections
[2]/[32s], total [10.1s]/[30.9s], memory [3.9gb]->[3.9gb]/[3.9gb]
[2012-04-13 19:31:45,642][INFO ][monitor.jvm ] [Turner D.
Century] [gc][ConcurrentMarkSweep][8121][151] duration [10s], collections
[2]/[31.7s], total [10s]/[51.1s], memory [3.9gb]->[3.9gb]/[3.9gb]
[2012-04-13 19:31:45,783][WARN ][discovery.ec2 ] [Turner D.
Century] master should not receive new cluster state from
[[Brutacus][K9qvypyTSdGBE743MAlERQ][inet[/10.88.226.198:9300]]]
[2012-04-13 19:32:07,355][WARN ][discovery.ec2 ] [Turner D.
Century] received a join request for an existing node
[[Songbird][3jSTLhuDQ--nX6ncH52iog][inet[/10.123.33.191:9300]]{client=true,
data=false}]
[2012-04-13 19:33:10,922][WARN ][discovery.ec2 ] [Turner D.
Century] received a join request for an existing node
[[Songbird][3jSTLhuDQ--nX6ncH52iog][inet[/10.123.33.191:9300]]{client=true,
data=false}]
[2012-04-13 19:33:11,205][WARN ][discovery.ec2 ] [Turner D.
Century] master should not receive new cluster state from
[[Brutacus][K9qvypyTSdGBE743MAlERQ][inet[/10.88.226.198:9300]]]
[2012-04-13 19:33:11,222][WARN ][discovery.ec2 ] [Turner D.
Century] master should not receive new cluster state from
[[Brutacus][K9qvypyTSdGBE743MAlERQ][inet[/10.88.226.198:9300]]]
[2012-04-13 19:33:11,635][WARN ][discovery.ec2 ] [Turner D.
Century] received a join request for an existing node
[[Songbird][3jSTLhuDQ--nX6ncH52iog][inet[/10.123.33.191:9300]]{client=true,
data=false}]
[2012-04-13 19:35:08,538][WARN ][discovery.ec2 ] [Turner D.
Century] master should not receive new cluster state from
[[Brutacus][K9qvypyTSdGBE743MAlERQ][inet[/10.88.226.198:9300]]]
[2012-04-13 19:35:08,541][WARN ][discovery.ec2 ] [Turner D.
Century] master should not receive new cluster state from
[[Brutacus][K9qvypyTSdGBE743MAlERQ][inet[/10.88.226.198:9300]]]
[2012-04-13 19:35:40,438][WARN ][discovery.ec2 ] [Turner D.
Century] received a join request for an existing node
[[Amazon][VYGlv1moSsWTsvHiNX2ZQg][inet[/10.195.83.46:9300]]{client=true,
data=false}]
[2012-04-13 19:45:17,217][INFO ][bootstrap ] max_open_files
[63978]
[2012-04-13 19:45:17,246][INFO ][node ] [Rintrah]
{0.19.1}[6053]: initializing ...
[2012-04-13 19:45:17,417][INFO ][plugins ] [Rintrah]
loaded [cloud-aws], sites []
[2012-04-13 19:45:20,532][INFO ][node ] [Rintrah]
{0.19.1}[6053]: initialized
[2012-04-13 19:45:20,532][INFO ][node ] [Rintrah]
{0.19.1}[6053]: starting ...
[2012-04-13 19:45:20,674][INFO ][transport ] [Rintrah]
bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address
{inet[/10.99.21.151:9300]}
[2012-04-13 19:45:26,187][INFO ][cluster.service ] [Rintrah]
detected_master [Brother
Tode][NGrwag9qRcuV2iWNjvZ2Sg][inet[/10.88.226.198:9300]], added
{[Songbird][3jSTLhuDQ--nX6ncH52iog][inet[/10.123.33.191:9300]]{client=true,
data=false},[Amazon][VYGlv1moSsWTsvHiNX2ZQg][inet[/10.195.83.46:9300]]{client=true,
data=false},[Brother
Tode][NGrwag9qRcuV2iWNjvZ2Sg][inet[/10.88.226.198:9300]],}, reason:
zen-disco-receive(from master [[Brother
Tode][NGrwag9qRcuV2iWNjvZ2Sg][inet[/10.88.226.198:9300]]])
[2012-04-13 19:45:26,228][INFO ][discovery ] [Rintrah]
iob_prod/Ex0dTVmKSi2YoFL6fEsmbw
[2012-04-13 19:45:26,238][INFO ][http ] [Rintrah]
bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address
{inet[/10.99.21.151:9200]}
[2012-04-13 19:45:26,239][INFO ][node ] [Rintrah]
{0.19.1}[6053]: started

Logs from search server 2:

[2012-04-13 17:10:48,151][INFO ][bootstrap ] max_open_files
[63977]
[2012-04-13 17:10:48,171][INFO ][node ] [Brutacus]
{0.19.1}[800]: initializing ...
[2012-04-13 17:10:48,281][INFO ][plugins ] [Brutacus]
loaded [cloud-aws], sites []
[2012-04-13 17:10:52,721][INFO ][node ] [Brutacus]
{0.19.1}[800]: initialized
[2012-04-13 17:10:52,721][INFO ][node ] [Brutacus]
{0.19.1}[800]: starting ...
[2012-04-13 17:10:52,845][INFO ][transport ] [Brutacus]
bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address
{inet[/10.88.226.198:9300]}
[2012-04-13 17:10:57,625][INFO ][cluster.service ] [Brutacus]
new_master [Brutacus][K9qvypyTSdGBE743MAlERQ][inet[/10.88.226.198:9300]],
reason: zen-disco-join (elected_as_master)
[2012-04-13 17:10:57,762][INFO ][discovery ] [Brutacus]
iob_prod/K9qvypyTSdGBE743MAlERQ
[2012-04-13 17:10:57,836][INFO ][http ] [Brutacus]
bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address
{inet[/10.88.226.198:9200]}
[2012-04-13 17:10:57,836][INFO ][node ] [Brutacus]
{0.19.1}[800]: started
[2012-04-13 17:11:06,049][INFO ][gateway ] [Brutacus]
recovered [3] indices into cluster_state
[2012-04-13 17:11:10,593][INFO ][cluster.service ] [Brutacus]
added
{[Songbird][3jSTLhuDQ--nX6ncH52iog][inet[/10.123.33.191:9300]]{client=true,
data=false},}, reason: zen-disco-receive(join from
node[[Songbird][3jSTLhuDQ--nX6ncH52iog][inet[/10.123.33.191:9300]]{client=true,
data=false}])
[2012-04-13 17:11:10,612][INFO ][cluster.service ] [Brutacus]
added
{[Amazon][VYGlv1moSsWTsvHiNX2ZQg][inet[/10.195.83.46:9300]]{client=true,
data=false},}, reason: zen-disco-receive(join from
node[[Amazon][VYGlv1moSsWTsvHiNX2ZQg][inet[/10.195.83.46:9300]]{client=true,
data=false}])
[2012-04-13 17:12:31,648][INFO ][cluster.service ] [Brutacus]
added {[Turner D.
Century][DHPrtSVnSYmA-RWFxO-_bQ][inet[/10.99.21.151:9300]],}, reason:
zen-disco-receive(join from node[[Turner D.
Century][DHPrtSVnSYmA-RWFxO-_bQ][inet[/10.99.21.151:9300]]])
[2012-04-13 17:12:35,072][WARN ][discovery.ec2 ] [Brutacus]
received a join request for an existing node [[Turner D.
Century][DHPrtSVnSYmA-RWFxO-_bQ][inet[/10.99.21.151:9300]]]
[2012-04-13 19:26:29,847][INFO ][monitor.jvm ] [Brutacus]
[gc][ConcurrentMarkSweep][8023][190] duration [10.2s], collections
[2]/[33.4s], total [10.2s]/[23.7s], memory [3.9gb]->[3.9gb]/[3.9gb]
[2012-04-13 19:27:24,297][INFO ][monitor.jvm ] [Brutacus]
[gc][ConcurrentMarkSweep][8025][194] duration [10.3s], collections
[2]/[32.4s], total [10.3s]/[34s], memory [3.9gb]->[3.9gb]/[3.9gb]
[2012-04-13 19:28:19,044][INFO ][monitor.jvm ] [Brutacus]
[gc][ConcurrentMarkSweep][8027][198] duration [10.4s], collections
[2]/[32.7s], total [10.4s]/[44.5s], memory [3.9gb]->[3.9gb]/[3.9gb]
[2012-04-13 19:29:46,028][INFO ][monitor.jvm ] [Brutacus]
[gc][ConcurrentMarkSweep][8029][204] duration [10.4s], collections
[2]/[32.4s], total [10.4s]/[1m], memory [3.9gb]->[3.9gb]/[3.9gb]
[2012-04-13 19:30:40,735][INFO ][monitor.jvm ] [Brutacus]
[gc][ConcurrentMarkSweep][8031][208] duration [10.3s], collections
[2]/[32.5s], total [10.3s]/[1.2m], memory [3.9gb]->[3.9gb]/[3.9gb]
[2012-04-13 19:30:40,826][WARN ][transport ] [Brutacus]
Received response for a request that has timed out, sent [119656ms] ago,
timed out [87010ms] ago, action [discovery/zen/fd/ping], node [[Turner D.
Century][DHPrtSVnSYmA-RWFxO-_bQ][inet[/10.99.21.151:9300]]], id [214145]
[2012-04-13 19:31:35,891][WARN ][discovery.ec2 ] [Brutacus]
received a join request for an existing node
[[Songbird][3jSTLhuDQ--nX6ncH52iog][inet[/10.123.33.191:9300]]{client=true,
data=false}]
[2012-04-13 19:32:08,700][WARN ][discovery.ec2 ] [Brutacus]
received a join request for an existing node
[[Songbird][3jSTLhuDQ--nX6ncH52iog][inet[/10.123.33.191:9300]]{client=true,
data=false}]
[2012-04-13 19:32:30,874][WARN ][discovery.ec2 ] [Brutacus]
received a join request for an existing node
[[Amazon][VYGlv1moSsWTsvHiNX2ZQg][inet[/10.195.83.46:9300]]{client=true,
data=false}]
[2012-04-13 19:33:25,779][WARN ][discovery.ec2 ] [Brutacus]
received a join request for an existing node
[[Songbird][3jSTLhuDQ--nX6ncH52iog][inet[/10.123.33.191:9300]]{client=true,
data=false}]
[2012-04-13 19:33:25,824][WARN ][discovery.ec2 ] [Brutacus]
received a join request for an existing node
[[Amazon][VYGlv1moSsWTsvHiNX2ZQg][inet[/10.195.83.46:9300]]{client=true,
data=false}]
[2012-04-13 19:33:25,828][WARN ][transport.netty ] [Brutacus]
Message read past expected size (request) for [2275160] and action
[discovery/zen/unicast], resetting
[2012-04-13 19:41:58,026][INFO ][bootstrap ] max_open_files
[63978]
[2012-04-13 19:41:58,057][INFO ][node ] [Brother Tode]
{0.19.1}[6388]: initializing ...
[2012-04-13 19:41:58,110][INFO ][plugins ] [Brother Tode]
loaded [cloud-aws], sites []
[2012-04-13 19:42:01,627][INFO ][node ] [Brother Tode]
{0.19.1}[6388]: initialized
[2012-04-13 19:42:01,628][INFO ][node ] [Brother Tode]
{0.19.1}[6388]: starting ...
[2012-04-13 19:42:01,877][INFO ][transport ] [Brother Tode]
bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address
{inet[/10.88.226.198:9300]}
[2012-04-13 19:42:06,783][WARN ][discovery.zen.ping.unicast] [Brother Tode]
failed to send ping to [[#cloud-i-4a6b6028-0][inet[/10.99.21.151:9300]]]
org.elasticsearch.transport.ReceiveTimeoutTransportException:
[][inet[/10.99.21.151:9300]][discovery/zen/unicast] request_id [1] timed
out after [3750ms]
at
org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:347)
at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
[2012-04-13 19:42:07,801][INFO ][cluster.service ] [Brother Tode]
new_master [Brother
Tode][NGrwag9qRcuV2iWNjvZ2Sg][inet[/10.88.226.198:9300]], reason:
zen-disco-join (elected_as_master)
[2012-04-13 19:42:07,808][INFO ][discovery ] [Brother Tode]
iob_prod/NGrwag9qRcuV2iWNjvZ2Sg
[2012-04-13 19:42:07,828][INFO ][http ] [Brother Tode]
bound_address {inet[/0:0:0:0:0:0:0:0:9200]}, publish_address
{inet[/10.88.226.198:9200]}
[2012-04-13 19:42:07,828][INFO ][node ] [Brother Tode]
{0.19.1}[6388]: started
[2012-04-13 19:42:10,674][INFO ][gateway ] [Brother Tode]
recovered [3] indices into cluster_state
[2012-04-13 19:42:15,543][INFO ][cluster.service ] [Brother Tode]
added
{[Amazon][VYGlv1moSsWTsvHiNX2ZQg][inet[/10.195.83.46:9300]]{client=true,
data=false},}, reason: zen-disco-receive(join from
node[[Amazon][VYGlv1moSsWTsvHiNX2ZQg][inet[/10.195.83.46:9300]]{client=true,
data=false}])
[2012-04-13 19:42:15,549][INFO ][cluster.service ] [Brother Tode]
added
{[Songbird][3jSTLhuDQ--nX6ncH52iog][inet[/10.123.33.191:9300]]{client=true,
data=false},}, reason: zen-disco-receive(join from
node[[Songbird][3jSTLhuDQ--nX6ncH52iog][inet[/10.123.33.191:9300]]{client=true,
data=false}])

We have one index that tracks the requests made by each user throughout the day. We suspect this may be causing our performance issues, given the number of requests we are seeing. Currently that index is at 34GB, with about 20 million documents for one type. We are going to increase its refresh interval from the default of 1s to 120s, because we only use it to troubleshoot and track things down; basically, we just don't query it that often. My question is: what kind of hit will we take if the index queues up thousands of documents before the refresh occurs?

We set the refresh interval to 120s for our request-tracker index and set the replica count to 0, since it's okay to lose that data, and our web application is blazing fast! Granted, our users have died down a little for the day, but the site has never been this fast. The green and blue spikes in the screenshot show where we applied these settings and restarted the search nodes. We will see how things go over the weekend and into next week.

Hi,

Based on the logs you pasted, it seems like the nodes do not have enough memory (heap), which causes problems since the garbage collector keeps running but can't clear anything. You probably need to allocate more memory to the nodes (if the machines have enough for it) or use boxes with more memory. Can you verify it with BigDesk? It should show you the memory usage (JVM heap).
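
If it is easier than reading the BigDesk charts, here is a small sketch that pulls the same JVM heap numbers from the node stats API. The URL below is the newer /_nodes/stats/jvm form; on a 0.19-era cluster the equivalent data lives under /_cluster/nodes/stats, and the exact field names under "jvm.mem" differ a little between versions, so the script just prints whatever heap fields it finds.

# heap_check.py -- print per-node JVM heap usage as a cross-check against
# the BigDesk charts. The URL and host are placeholders; see the note above
# about the older /_cluster/nodes/stats path on 0.19.
import json
import urllib.request

URL = "http://localhost:9200/_nodes/stats/jvm"

with urllib.request.urlopen(URL) as resp:
    stats = json.load(resp)

for node_id, node in stats["nodes"].items():
    print(node.get("name", node_id))
    for key, value in sorted(node["jvm"]["mem"].items()):
        if "heap" in key:   # e.g. heap_used_in_bytes, heap_max_in_bytes
            print("  %s: %s" % (key, value))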

We took your advice and upgraded to the extra-large instances yesterday evening. We have allocated 10GB of the available 15GB to Elasticsearch, and so far this morning the CPU usage is under 10%. I attached BigDesk charts from our new master; at this point, the master is using 7GB of the 10GB. Also, can you explain how the GC Delta works? What should I be looking for there?

I think GC delta means the time spent GC'ing between one poll and the next. Currently it's not really that interesting in BigDesk; the more interesting part will be when we have separate graphs for each heap generation.
