ES 1.0.3 CPU usage drastically increased

Hi,

We're running a two-node ES 1.0.3 cluster with the following setup:

VM on host A:
4 vCore CPU
32GB RAM
ES master (only node being queried)
MySQL slave (used as a backup, never queried)
JVM settings:

/usr/lib/jvm/java-7-openjdk-amd64//bin/java -Xms2g -Xmx2g -Xss256k
-Djava.awt.headless=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
-XX:+HeapDumpOnOutOfMemoryError -Delasticsearch
-Des.pidfile=/var/run/elasticsearch.pid
-Des.path.home=/usr/share/elasticsearch -cp
:/usr/share/elasticsearch/lib/elasticsearch-1.0.3.jar:/usr/share/elasticsearch/lib/:/usr/share/elasticsearch/lib/sigar/
-Des.default.config=/etc/elasticsearch/elasticsearch.yml
-Des.default.path.home=/usr/share/elasticsearch
-Des.default.path.logs=/home/log/elasticsearch
-Des.default.path.data=/home/elasticsearch
-Des.default.path.work=/tmp/elasticsearch
-Des.default.path.conf=/etc/elasticsearch
org.elasticsearch.bootstrap.Elasticsearch

VM on host B:
2 vCore CPU
16GB RAM
ES data node (searches are dispatched to it, no indexing)
MySQL master
JVM settings:

/usr/lib/jvm/java-7-openjdk-amd64//bin/java -Xms2g -Xmx2g -Xss256k
-Djava.awt.headless=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
-XX:+HeapDumpOnOutOfMemoryError -Delasticsearch
-Des.pidfile=/var/run/elasticsearch.pid
-Des.path.home=/usr/share/elasticsearch -cp
:/usr/share/elasticsearch/lib/elasticsearch-1.0.3.jar:/usr/share/elasticsearch/lib/:/usr/share/elasticsearch/lib/sigar/
-Des.default.config=/etc/elasticsearch/elasticsearch.yml
-Des.default.path.home=/usr/share/elasticsearch
-Des.default.path.logs=/home/log/elasticsearch
-Des.default.path.data=/home/elasticsearch
-Des.default.path.work=/tmp/elasticsearch
-Des.default.path.conf=/etc/elasticsearch
org.elasticsearch.bootstrap.Elasticsearch


Before we got the 4 vCore CPU / 32GB VM, the master node had the same specs as the
secondary node.
On this cluster, we have a 5-shard (+5 replicas) index, which we'll call main, with
~130k documents at the moment for a size of 120MB. Documents that are updated by our
customers in our application are re-indexed by a cron job that runs every 5 minutes
and updates at most 2k docs per run; we can have a few thousand docs in the queue.
We are also using Logstash to log some user actions that our application relies on
into monthly indices. Those indices have 2 shards (+2 replicas) with 1~6M docs each,
for a size ranging from 380MB to 1.5GB. At the moment we have 11 log indices.
We run search queries on both the main index and the latest log indices.
Occasionally, some queries hit older log indices.
Looking at our stats, I'd say we have a 2:1 indexing / searching ratio, but it can
vary with seasonality.
We also have a dedicated 1-shard (+1 replica) percolator index against which we run
percolation queries before each log event that is indexed into ES through Logstash.
We have never optimized any index.
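
For reference, the percolation step before indexing a log event looks roughly like the
following sketch (index, type, and field names are made up for the example, not our
actual mapping):

# Register a query to percolate against (a document in the special .percolator type):
curl -XPUT 'localhost:9200/percolator-index/.percolator/alert-1' -d '{
  "query" : { "match" : { "action" : "login_failed" } }
}'

# Before indexing a log event, ask which registered queries match it:
curl -XGET 'localhost:9200/percolator-index/log/_percolate?pretty' -d '{
  "doc" : { "action" : "login_failed", "user_id" : 42 }
}'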

Our issue:

Since we updated ES to v1.0.3 to deal with a field data circuit breaker bug,
everything was running fine until we experienced a drastic CPU usage increase
(from near 100% to 200%) without any apparent reason (no change in our application
nor in the traffic we got). No ES restart has been able to bring CPU usage back to
normal. As an emergency measure, we moved our main node from 2 vCore CPU / 16GB to
4 vCore CPU / 32GB, and the CPU usage of the new node never went beyond 30% for
almost 10 days. Then the issue happened again: CPU usage rose to 400% without any
apparent reason.

It is worth noting that the secondary node is not subject to this issue.

Our outsourcer told us this CPU increase was due to deadlocks caused by malformed
queries, but those malformed queries had already happened before, and restarting ES
didn't solve the high CPU usage.
He also told us our servers didn't have enough resources and that it would be better
to have 2 servers for the MySQL master / slave and 2 to 3 distinct servers for the
ES cluster, which seems odd given that the main ES server stayed below 30% CPU usage
for days.

We plan to upgrade ES to see if this issue is a bug that has already been fixed, but
are there other things we could try? I wonder whether our JVM heap is large enough,
since we have a lot of data, we use many filters in our search queries, and we have
more than 8GB of unused memory on our main node. Does the fact that our secondary
node is not subject to this issue suggest it's an indexing issue?
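
Concretely, would checking fielddata / filter cache sizes and the hot threads be the
right way to narrow this down? If I read the docs right, the calls would be something
like:

# Per-node fielddata usage, broken down by field:
curl 'localhost:9200/_nodes/stats/indices/fielddata?fields=*&pretty'

# Filter cache size per index:
curl 'localhost:9200/_stats/filter_cache?pretty'

# Where the CPU is actually being spent when the spike happens:
curl 'localhost:9200/_nodes/hot_threads'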


Why so many replicas when you only have one data node? You won't even be
able to allocate them!
Your heap is also pretty small; 2GB is something you'd generally run on a dev
instance. I'd suggest going to 4GB if you can.
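
If you're on the deb/rpm packages, the usual place to change that is the ES_HEAP_SIZE
variable rather than hard-coding -Xms/-Xmx (a sketch assuming the stock Debian layout):

# /etc/default/elasticsearch (or /etc/sysconfig/elasticsearch on RPM systems)
ES_HEAP_SIZE=4g

# then restart the service so the new heap is picked up
sudo service elasticsearch restart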

You need some monitoring around this to really put things into perspective.
Try installing a plugin like ElasticHQ and definitely look at Marvel as well. You
should also be monitoring the VMs at the system level to get a better idea of
general resource usage.
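
From memory the install commands are along these lines (path based on your
-Des.path.home; double check against the current docs):

cd /usr/share/elasticsearch
sudo bin/plugin -install royrusso/elasticsearch-HQ
sudo bin/plugin -install elasticsearch/marvel/latest
sudo service elasticsearch restart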

Try updating to a later version of ES; there are always good improvements as new
releases come out, so you may see a gain there. But without understanding your
resource use better, it's hard to say.


Actually, the master node is also a data node (so we have two data nodes); it's just the only one our application is aware of. We have several metrics on the VM, and our outsourcer may have metrics on the physical host. What's strange is that this ES setup ran without trouble for many months before this issue occurred. From what I can see in our graphs, the only metric that might be significant is JVM heap usage: it often reaches 80% of the assigned heap size and then seems to be collected, which can happen as often as every 1 or 2 hours.
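
I guess we can confirm whether those drops are just old-generation garbage collections
by looking at the JVM stats, with something like:

# GC counts/timings and heap pools per node (if I'm reading the docs right):
curl 'localhost:9200/_nodes/stats/jvm?pretty'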


You are still overloaded with replicas; it's pointless having them there and it
keeps your cluster out of a green state.


Actually, the cluster itself is in a green state:

{
  "cluster_name" : "quaelead",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 2,
  "number_of_data_nodes" : 2,
  "active_primary_shards" : 27,
  "active_shards" : 54,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0
}

Our main index may be oversized, but it was designed a long time ago. Having two
nodes with two replicas for the log indices just gives us failover, as described
at http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/_add_failover.html
I may have misstated the number of replicas: the replica setting for each index is
number_of_replicas: 1.
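
That's what the index settings report when we check them, e.g. with a call like this
(the index name is just an example):

# Shard and replica settings for a given index:
curl -XGET 'localhost:9200/main/_settings?pretty'
# "number_of_replicas" : "1" for each index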


Our outsourcer upgraded the master node: the heap is now 8GB and the VM now has
8 vCores. Restarting the VM brought CPU usage back below 30%, so it seems a full VM
restart is the only way to stop this CPU usage issue, whereas an ES service restart
doesn't solve anything.
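
Next time it happens, before restarting the VM, we'll try to capture where the CPU is
going at the thread level (commands based on the pidfile from our startup line):

# per-thread CPU usage of the ES process
top -H -p $(cat /var/run/elasticsearch.pid)

# full JVM thread dump for later analysis
jstack $(cat /var/run/elasticsearch.pid) > /tmp/es-threads-$(date +%s).txt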
