High CPU and load since 1.4.1 -> 1.4.2 update

Hi all,

We have a 4-node VM dev cluster with 3 data nodes and 1 master node. The 3
data nodes are using a lot of CPU and showing a high load average even when
the servers are sitting idle with no searches or indexing.

Everything seemed fine before we updated to 1.4.2 just before the new year,
but I can't confirm the update is definitely the cause.

The hot threads output from each node seems to point at filesystem activity
most of the time, e.g.:

87.7% (438.6ms out of 500ms) cpu usage by thread 'elasticsearch[potassium][management][T#4]'
2/10 snapshots sharing following 21 elements
org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:223)
org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:242)
org.apache.lucene.store.FileSwitchDirectory.listAll(FileSwitchDirectory.java:87)
org.apache.lucene.store.FilterDirectory.listAll(FilterDirectory.java:48)
org.elasticsearch.index.store.DistributorDirectory.listAll(DistributorDirectory.java:88)
org.apache.lucene.store.FilterDirectory.listAll(FilterDirectory.java:48)
org.elasticsearch.common.lucene.Directories.estimateSize(Directories.java:40)
org.elasticsearch.index.store.Store.stats(Store.java:216)
org.elasticsearch.index.shard.service.InternalIndexShard.storeStats(InternalIndexShard.java:540)
org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:134)
org.elasticsearch.action.admin.indices.stats.ShardStats.<init>(ShardStats.java:49)
org.elasticsearch.indices.InternalIndicesService.stats(InternalIndicesService.java:212)
org.elasticsearch.node.service.NodeService.stats(NodeService.java:156)
org.elasticsearch.action.admin.cluster.node.stats.TransportNodesStatsAction.nodeOperation(TransportNodesStatsAction.java:96)
org.elasticsearch.action.admin.cluster.node.stats.TransportNodesStatsAction.nodeOperation(TransportNodesStatsAction.java:44)
org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:278)
org.elasticsearch.action.support.nodes.TransportNodesOperationAction$NodeTransportHandler.messageReceived(TransportNodesOperationAction.java:269)
org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
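
For reference, the hot threads output above comes from the _nodes/hot_threads API; roughly this kind of call produces it (host and parameters here are assumptions):

[root@potassium ~]# curl -s 'http://localhost:9200/_nodes/hot_threads?interval=500ms&threads=3'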

Please also see the image attachment for the high CPU and load.

The VMs run on ESXi 5.5 on 2 x 4-core HT Xeon E5530 CPUs, so the server hardware is fairly powerful. The nodes are set up with 16 GB RAM and 2 vCPUs, with all ES defaults apart from:

[root@potassium ~]# grep -v ^# /etc/elasticsearch/elasticsearch.yml | grep -v ^$
cluster.name: es-dev
node.name: "potassium"
node.master: false
node.data: true
path.data: /data/elasticsearch
marvel.agent.exporter.es.hosts: ["hydrogen:9200"]

and

[root@potassium ~]# grep -v ^# /etc/sysconfig/elasticsearch | grep -v ^$
ES_HOME=/usr/share/elasticsearch
ES_HEAP_SIZE=6g
MAX_OPEN_FILES=65535
MAX_MAP_COUNT=262144
LOG_DIR=/var/log/elasticsearch
DATA_DIR=/var/lib/elasticsearch
WORK_DIR=/tmp/elasticsearch
CONF_DIR=/etc/elasticsearch
CONF_FILE=/etc/elasticsearch/elasticsearch.yml
ES_USER=elasticsearch

So far I've tried:

  • Dropping all data and loading it again using logstash
  • Deleting the XFS filesystem and recreating it as ext4
  • Removing all plugins
  • Leaving it for about 2 weeks in case it was doing background optimisation
  • Various other things

Does anyone have any suggestions about where I should look next, or any thoughts about what could be happening? Please let me know if I can pull any other info off the nodes or cluster to help diagnose.
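
For example, I could easily grab and post node stats or thread pool stats along these lines (hostname assumed) if that would be useful:

[root@potassium ~]# curl -s 'http://localhost:9200/_nodes/stats?pretty'
[root@potassium ~]# curl -s 'http://localhost:9200/_cat/thread_pool?v'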

Many thanks,
Darren


Hi,

The only thing I see is ES_HEAP_SIZE: with the usual 50%-of-RAM guideline it could be set to 8g.
I'm also seeing some swap usage; you could disable swap completely.
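
A minimal sketch of what I mean, assuming the standard RPM install paths and that locking the heap with mlockall is acceptable in your environment:

# /etc/sysconfig/elasticsearch
ES_HEAP_SIZE=8g
MAX_LOCKED_MEMORY=unlimited

# /etc/elasticsearch/elasticsearch.yml -- keep the heap out of swap
bootstrap.mlockall: true

# on the OS side, disable swap (remove the swap entry from /etc/fstab to make it permanent)
swapoff -a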

HTH,
A.


To update this: I've just found I can resolve the problem by stopping just
the master node. The other data nodes instantly settle down to single-figure
CPU usage and a load of less than 1.

The layout is:

caesium:
node.master: true
node.data: false

sodium, potassium, francium:
node.master: false
node.data: true

If I allow any of the data nodes to become master and then stop caesium,
everything is fine. Is there a problem with ES when it's run this way, with a
dedicated master node? This was the recommended setup on the ES training
course I attended, so I implemented it as soon as I got back.
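
In case it helps, this is roughly how I've been checking which node currently holds the master role (hostname assumed):

[root@potassium ~]# curl -s 'http://localhost:9200/_cat/master?v'
[root@potassium ~]# curl -s 'http://localhost:9200/_cat/nodes?v'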

Regards,
Darren
