A simple cluster: 2 nodes, 1 replica. Each node has 1.5 GB of RAM, 2 cores, and SAS
disks.
With Elasticsearch 1.1 we saw occasional disconnections and some CPU load spikes.
Logstash was being used badly (lots of tiny bulk imports, with monthly indices).
The Logstash usage was fixed, and Elasticsearch was upgraded to 1.3: 1.3.3 first,
then 1.3.4 ten minutes later.
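
To illustrate what I mean by "tiny bulk imports", here is a rough Python sketch of
the opposite, one big batched _bulk call (host, index name and type are placeholders,
not our actual setup; the real change was in the Logstash configuration):

import json
import urllib.request

ES_URL = "http://localhost:9200"  # placeholder host/port

def bulk_index(index, doc_type, docs):
    # Build one newline-delimited _bulk body instead of one request per document.
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc))
    body = "\n".join(lines) + "\n"  # _bulk requires a trailing newline
    req = urllib.request.Request(ES_URL + "/_bulk", data=body.encode("utf-8"))
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

# e.g. a few thousand events per call instead of one event per call:
# bulk_index("logstash-2014.09", "logs", events)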
CPU usage is now at 100% (so one full core), LOTS of file descriptors are open,
and memory usage keeps growing. RAM was upgraded to 2 GB.
strace shows that 5 threads use a lot of CPU and one thread makes 7000 stat() calls
per second. The Elasticsearch hot threads API shows lots of FSDirectory.listAll.
Disk usage is low; it is just a lot of stat() calls.
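
For reference, this is roughly how I pull the hot threads report (host is a
placeholder):

import urllib.request

ES_URL = "http://localhost:9200"  # placeholder host/port

# Hot threads report for every node; FSDirectory.listAll shows up near the
# top of the busiest stacks.
with urllib.request.urlopen(ES_URL + "/_nodes/hot_threads?threads=5") as resp:
    print(resp.read().decode("utf-8"))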
Each index is set to 9 shards, and Logstash creates lots of indices: 2286 shards
for 7 GB of data, and 37487 files in the indices folder.
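
This is how I count the shards, and a hypothetical template that would give new
logstash-* indices 1 shard instead of 9 (the template name is made up, and existing
indices would keep their 9 shards):

import json
import urllib.request

ES_URL = "http://localhost:9200"  # placeholder host/port

# Count the shards across all indices (this is where the 2286 comes from).
with urllib.request.urlopen(ES_URL + "/_cat/shards?h=index") as resp:
    print("total shards:", len(resp.read().decode("utf-8").splitlines()))

# Hypothetical template: future logstash-* indices get 1 shard instead of 9.
template = {
    "template": "logstash-*",
    "settings": {"index.number_of_shards": 1, "index.number_of_replicas": 1},
}
req = urllib.request.Request(
    ES_URL + "/_template/logstash_one_shard",
    data=json.dumps(template).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="PUT",
)
with urllib.request.urlopen(req) as resp:
    print(resp.read().decode("utf-8"))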
In the recovery API everything is "done", but with strange percentage values, and
all shards are in a "replica" state.
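
This is roughly how I look at the recovery state, one line per shard with its stage
and percentages (host is a placeholder):

import urllib.request

ES_URL = "http://localhost:9200"  # placeholder host/port

with urllib.request.urlopen(ES_URL + "/_cat/recovery?v") as resp:
    print(resp.read().decode("utf-8"))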
Now the load comes in heavy waves, slowing down the service.
Is this just a long migration between Lucene versions (from ES 1.1 to 1.3), a
misconfiguration, a real bug, or am I just doomed?