High CPU usage while bulk indexing

We have a 9-box setup (details provided in the Environment section further down).

We have a job that uses the bulk write interface to write 100 documents at a time to Elasticsearch. Each document may be up to 1 MB in size but is generally around 100-200 KB. An instance of this job runs on each server.
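
For reference, each batch the job sends looks roughly like the sketch below; the index name, type and document bodies here are illustrative placeholders, not our real schema:

cat > bulk.ndjson <<'EOF'
{ "index": { "_index": "docs-2018.06.07", "_type": "doc" } }
{ "title": "example document 1", "body": "...roughly 100-200 KB of content..." }
{ "index": { "_index": "docs-2018.06.07", "_type": "doc" } }
{ "title": "example document 2", "body": "..." }
EOF

# One bulk call per batch of 100 documents (only 2 shown above).
curl -s -H 'Content-Type: application/x-ndjson' \
     -XPOST 'http://localhost:9200/_bulk' --data-binary @bulk.ndjson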

During heavy indexing we see the CPU hit (and remain around) 500% and our throughput fall to approximately 3 documents indexed per second. I have run hot threads but I can't upload the full text as it makes the post too long and I can only upload images.
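
The output was captured with something along these lines (parameters are from memory, so treat this as a sketch):

curl 'http://localhost:9200/_nodes/hot_threads?threads=20'

Here's a snippet: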

103.2% (516.2ms out of 500ms) cpu usage by thread 'elasticsearch[server3-3][management][T#5]'
     3/10 snapshots sharing following 31 elements
       sun.nio.fs.UnixNativeDispatcher.stat0(Native Method)
       sun.nio.fs.UnixNativeDispatcher.stat(UnixNativeDispatcher.java:286)
       sun.nio.fs.UnixFileAttributes.get(UnixFileAttributes.java:70)

103.2% (516.2ms out of 500ms) cpu usage by thread 'elasticsearch[server3-3][management][T#4]'
     3/10 snapshots sharing following 15 elements
       org.elasticsearch.index.shard.IndexShard.segmentStats(IndexShard.java:723)
       org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:210)
       org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:163)

103.2% (516.1ms out of 500ms) cpu usage by thread 'elasticsearch[server3-3][management][T#2]'
     2/10 snapshots sharing following 31 elements
       java.lang.SecurityManager.checkPermission(SecurityManager.java:549)
       java.lang.SecurityManager.checkRead(SecurityManager.java:888)
       sun.nio.fs.UnixPath.checkRead(UnixPath.java:795)

103.2% (516ms out of 500ms) cpu usage by thread 'elasticsearch[server3-3][management][T#3]'
     4/10 snapshots sharing following 38 elements
       java.io.UnixFileSystem.canonicalize0(Native Method)
       java.io.UnixFileSystem.canonicalize(UnixFileSystem.java:172)
       java.io.File.getCanonicalPath(File.java:618)

   101.7% (508.3ms out of 500ms) cpu usage by thread 'elasticsearch[server3-3][management][T#1]'
     3/10 snapshots sharing following 16 elements
       org.elasticsearch.index.engine.Engine.segmentsStats(Engine.java:539)
       org.elasticsearch.index.shard.IndexShard.segmentStats(IndexShard.java:723)
       org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:210)

I have seen posts on here with similar hot thread traces but I didn't see an answer explaining the behaviour. As mentioned, we are only performing indexing (no searches). We are not using the completion suggester.

While this is happening we see request timeout errors between nodes:

[2018-06-07T13:40:42,983][WARN ][o.e.t.TransportService   ] [server3-1] Received response for a request that has timed out, sent [16113ms] ago, timed out [1113ms] ago, action [cluster:monitor/nodes/stats[n]], node [{server2-1}{fEEZg21jSi63Q7Ki_4A9zA}{mjKn7-fVQgyB5MPqkunRNQ}{server2-1}{IPADDRESS:9300}], id [32784096]

We also see excessive GC times (although the memory is generally reclaimed):

[2018-06-07T14:41:51,825][WARN ][o.e.m.j.JvmGcMonitorService] [server3-1] [gc][old][496267][101] duration [14.3s], collections [1]/[15.1s], total [14.3s]/[1m], memory [22gb]->[8.9gb]/[29.8gb], all_pools {[young] [374.4mb]->[5.6mb]/[865.3mb]}{[survivor] [108.1mb]->[0b]/[108.1mb]}{[old] [21.5gb]->[8.9gb]/[28.9gb]}
[2018-06-07T14:41:51,825][WARN ][o.e.m.j.JvmGcMonitorService] [server3-1] [gc][496267] overhead, spent [14.4s] collecting in the last [15.1s]

After about 4 weeks of running we get an OOM on a node (shortly followed by OOMs on other nodes).

Our refresh interval is set to 60 seconds.
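
For reference, that is applied per index with something like the following (the index name here is just a placeholder):

curl -XPUT 'http://localhost:9200/docs-2018.06.07/_settings' \
     -H 'Content-Type: application/json' \
     -d '{ "index": { "refresh_interval": "60s" } }'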

Using top I can see the CPU is consumed in the user column (not sys or iowait, for example).

Schema:
The schema is available on request, but due to size limits I could not upload it here.

Environment details:
Elasticsearch version (bin/elasticsearch --version):
Version: 5.5.1, Build: 19c13d0/2017-07-18T20:44:24.823Z, JVM: 1.8.0_141

JVM version (java -version):
openjdk version "1.8.0_141"
OpenJDK Runtime Environment (build 1.8.0_141-b16)
OpenJDK 64-Bit Server VM (build 25.141-b16, mixed mode)

OS version (uname -a if on a Unix-like system):
Linux kyc-demo 2.6.32-696.10.1.el6.x86_64 #1 SMP Tue Aug 22 18:51:35 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

We have 9 servers, each with 132 GB RAM. We allocate 30 GB to the Elasticsearch JVM. There are other processes on these boxes as well. Each box has 16 CPUs, which are generally under-utilised.

We use spinning disks.

We have approximately 300 date-based indices with a total of 1,550 primary shards, each with 1 replica (so roughly 3,100 shards in total).
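
(These counts can be cross-checked with the _cat APIs, e.g. something like:)

curl -s 'http://localhost:9200/_cat/indices' | wc -l
curl -s 'http://localhost:9200/_cat/shards'  | wc -l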

Any help will be much appreciated.

This is currently happening again for us. All 9 of our servers are running at between 500 and 600% CPU. All of this usage is attributable to Elasticsearch. IO wait is sitting at low percentages.

It's also worth mentioning that we are not currently performing lots of bulk updates (only 1 or 2 documents a second).

Hot threads shows similar stack traces to before (mentions of stats being run). While this is happening we see lots of timeouts on the node stats requests:

[2018-06-14T09:03:26,051][WARN ][o.e.t.TransportService   ] [server3-1] Received response for a request that has timed out, sent [100319ms] ago, timed out [85319ms] ago, action [cluster:monitor/nodes/stats[n]], node [{server2-3}{bU_fxjk_SliE0DTSerCdsA}{cURIg1nxRa6rwCaSj4VrZg}{server2-3}{IPADDRESS:9300}], id [44075373]
[2018-06-14T09:03:42,109][WARN ][o.e.t.TransportService   ] [server3-1] Received response for a request that has timed out, sent [116377ms] ago, timed out [101377ms] ago, action [cluster:monitor/nodes/stats[n]], node [{server3-3}{-_KLxepPTSWUyXV7K0Phcg}{IJOuYjegS3agAHj8oHaQEQ}{server3-3}{IPADDRESS:9300}], id [44075367]
[2018-06-14T09:03:42,288][WARN ][o.e.t.TransportService   ] [server3-1] Received response for a request that has timed out, sent [116556ms] ago, timed out [101556ms] ago, action [cluster:monitor/nodes/stats[n]], node [{server3-2}{uToW7PcsTLe6TtmP639JDw}{-w6Qc-tWQaStJhODASBfnA}{server3-2}{IPADDRESS:9300}], id [44075372]
[2018-06-14T09:03:51,646][WARN ][o.e.t.TransportService   ] [server3-1] Received response for a request that has timed out, sent [125914ms] ago, timed out [110914ms] ago, action [cluster:monitor/nodes/stats[n]], node [{server1-3}{kCkNOqdFTeuWmMExF_P1Pg}{sWNjOCMOT3eKIl3_YrlK3A}{server1-3}{IPADDRESS:9300}], id [44075368]
[2018-06-14T09:03:52,598][WARN ][o.e.t.TransportService   ] [server3-1] Received response for a request that has timed out, sent [126866ms] ago, timed out [111866ms] ago, action [cluster:monitor/nodes/stats[n]], node [{server2-2}{urTijqtcTG-5_Fp4WlcJBQ}{53ZgpuNSSaemcvlAfEZcgg}{server2-2}{IPADDRESS:9300}], id [44075374]
[2018-06-14T09:03:53,310][WARN ][o.e.t.TransportService   ] [server3-1] Received response for a request that has timed out, sent [127578ms] ago, timed out [112578ms] ago, action [cluster:monitor/nodes/stats[n]], node [{server1-2}{rTxbE6QrTTGoiEv1pFr2RQ}{LpquMKBMQqOD9vmLuJDB4Q}{server1-2}{IPADDRESS:9300}], id [44075366]
[2018-06-14T09:03:55,036][WARN ][o.e.t.TransportService   ] [server3-1] Received response for a request that has timed out, sent [129304ms] ago, timed out [114304ms] ago, action [cluster:monitor/nodes/stats[n]], node [{server1-1}{_ZO-VqQVRhesENAmpEVJdQ}{0lYAAegQR526jEcHe0MGcg}{server1-1}{IPADDRESS:9300}], id [44075370]
[2018-06-14T09:04:33,965][WARN ][o.e.t.TransportService   ] [server3-1] Received response for a request that has timed out, sent [168233ms] ago, timed out [153233ms] ago, action [cluster:monitor/nodes/stats[n]], node [{server3-1}{NQkt0_xpRRu-qKTf_VpbpA}{V078-EMtQz2w89ihAnMz0w}{server3-1}{IPADDRESS:9300}], id [44075369]

Since my previous investigations we have re-enabled Datadog (in case this is significant).

Any thoughts on what is going on? Is this a bug?

This happened again yesterday and took the cluster out.

This now feels like a serious issue. Should I raise it over on GitHub?

GitHub issue raised at https://github.com/elastic/elasticsearch/issues/31830

Are you using static or dynamic mappings? How many fields do you have in your mappings? Are these constantly being added to, either through new fields and/or types?

Christian,

Thanks a lot for getting back to me. We use static mappings. I've added my mapping file to the GitHub issue (https://github.com/elastic/elasticsearch/issues/31830) as I can't add it here due to its length.

Please let me know if you need anything else.

Are you (or an app/script) calling the stats endpoints very frequently? The various stats endpoints (index stats, node stats, etc.) are fairly heavy API calls since they have to talk to all the nodes, collect OS and ES stats, and then compile the results for the user.

Looking at the timestamps on those monitor/nodes/stats timeouts, they are showing up every second or so, making me think there is a process polling stats at least once per second.

I've seen clusters brought to their knees by errant scripts calling stats too frequently (several times per second, once per second, etc). It's made worse by clusters with more nodes (more machines to talk to) or more indices (more shards to compile stats for).

Can you verify there isn't a script or service that's repeatedly hitting the stats endpoints? I'm not sure what frequency Datadog polls at, but that may be related.
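
If it's useful, the tasks API (filtered by action) should show any stats calls currently in flight; roughly:

curl 'http://localhost:9200/_tasks?actions=*stats*&detailed'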

Another potentially useful datapoint is to look at the Pending Tasks API to see if there is a large backlog of tasks for the master.

Re: the GC, I'm not sure what's going on there; I would be surprised if it's caused by the stats APIs. It might be a symptom of something else, potentially made worse by the stats API situation (if that turns out to be true).

Thanks again for the responses.

I can confirm Datadog is polling once every second. We did manage to see these issues without Datadog running, though, so I suspect it's a red herring. Anyway, I have reduced it to poll once every 30 seconds.

When it happens again I'll capture the hot threads and look at the task backlogs on the nodes.

We had also tried calling the _cluster/health endpoint with local=true to reduce the load, but again this didn't seem to help.
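
(For clarity, that call looks something like this:)

curl 'http://localhost:9200/_cluster/health?local=true'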

OK, having reduced Datadog to polling once every 30 seconds, we are still seeing this issue (it happened again both last night and today).

running:

curl http://localhost:9200/_cluster/pending_tasks 

returns:
{"tasks":[]}

running:

curl http://localhost:9200/_tasks 

hangs (no response is seen after 10 minutes).

A truncated hot threads output is shown below (this heavy usage happens across our servers, although only one server is shown in the snippet).

Hot threads at 2018-07-24T07:41:05.178Z, interval=500ms, busiestThreads=20, ignoreIdleThreads=true:
   
   100.4% (502.1ms out of 500ms) cpu usage by thread 'elasticsearch[server1-1][management][T#1]'
     3/10 snapshots sharing following 19 elements
       java.util.TreeMap.getEntry(TreeMap.java:359)
       java.util.TreeMap.get(TreeMap.java:278)
       org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsReader.terms(PerFieldPostingsFormat.java:315)
       org.elasticsearch.search.suggest.completion.CompletionFieldStats.completionStats(CompletionFieldStats.java:54)
       org.elasticsearch.index.shard.IndexShard.completionStats(IndexShard.java:743)
       org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:207)
       org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:163)
       org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:47)
       ...
     2/10 snapshots sharing following 38 elements
       java.io.UnixFileSystem.canonicalize0(Native Method)
       java.io.UnixFileSystem.canonicalize(UnixFileSystem.java:172)
       java.io.File.getCanonicalPath(File.java:618)
       java.io.FilePermission$1.run(FilePermission.java:215)
       java.io.FilePermission$1.run(FilePermission.java:203)
       java.security.AccessController.doPrivileged(Native Method)
       java.io.FilePermission.init(FilePermission.java:203)
       java.io.FilePermission.<init>(FilePermission.java:277)
       java.lang.SecurityManager.checkRead(SecurityManager.java:888)
       sun.nio.fs.UnixPath.checkRead(UnixPath.java:795)
       sun.nio.fs.UnixFileAttributeViews$Basic.readAttributes(UnixFileAttributeViews.java:49)
       sun.nio.fs.UnixFileSystemProvider.readAttributes(UnixFileSystemProvider.java:144)
       sun.nio.fs.LinuxFileSystemProvider.readAttributes(LinuxFileSystemProvider.java:99)
       java.nio.file.Files.readAttributes(Files.java:1737)
       java.nio.file.Files.size(Files.java:2332)
       org.apache.lucene.store.FSDirectory.fileLength(FSDirectory.java:243)
       org.apache.lucene.store.FilterDirectory.fileLength(FilterDirectory.java:67)
       org.apache.lucene.store.FilterDirectory.fileLength(FilterDirectory.java:67)
       ....
     5/10 snapshots sharing following 13 elements
       org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:163)
       org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:47)
       org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:433)
       org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:412)
       org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:399)
       org.elasticsearch.transport.TransportRequestHandler.messageReceived(TransportRequestHandler.java:33)
       org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:69)
       org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1544)
       org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638)
       org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
       java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
       java.lang.Thread.run(Thread.java:748)

OK, the tasks call finally came back. A truncated snippet is shown below:

{"nodes":{"_ZO-VqQVRhesENAmpEVJdQ":{"name":"server1-1","transport_address":"IPADDRESS:9300","host":"server1-1","ip":"IPADDRESS:9300","roles":["master","data","ingest"],"tasks":{"_ZO-VqQVRhesENAmpEVJdQ:37443641":{"node":"_ZO-VqQVRhesENAmpEVJdQ","id":37443641,"type":"transport","action":"indices:monitor/stats","start_time_in_millis":1532418012199,"running_time_in_nanos":1084358946611,"cancellable":false},"_ZO-VqQVRhesENAmpEVJdQ:37434936":{"node":"_ZO-VqQVRhesENAmpEVJdQ","id":37434936,"type":"transport","action":"indices:monitor/stats","start_time_in_millis":1532417889928,"running_time_in_nanos":1206629758069,"cancellable":false},"_ZO-VqQVRhesENAmpEVJdQ:37439039":{"node":"_ZO-VqQVRhesENAmpEVJdQ","id":37439039,"type":"transport","action":"indices:monitor/stats","start_time_in_millis":1532417935976,"running_time_in_nanos":1160582011231,"cancellable":false},"_ZO-VqQVRhesENAmpEVJdQ:37439037":{"node":"_ZO-VqQVRhesENAmpEVJdQ","id":37439037,"type":"transport","action":"indices:monitor/stats","start_time_in_millis":1532417935940,"running_time_in_nanos":1160617393188,"cancellable":false},"_ZO-VqQVRhesENAmpEVJdQ:37441084":{"node":"_ZO-VqQVRhesENAmpEVJdQ","id":37441084,"type":"transport","action":"indices:monitor/stats","start_time_in_millis":1532417966623,"running_time_in_nanos":1129934208458,"cancellable":false},"_ZO-VqQVRhesENAmpEVJdQ:37433395":{"node":"_ZO-VqQVRhesENAmpEVJdQ","id":37433395,"type":"transport","action":"indices:monitor/stats","start_time_in_millis":1532417868925,"running_time_in_nanos":1227632425794,"cancellable":false},"_ZO-VqQVRhesENAmpEVJdQ:37443123":{"node":"_ZO-VqQVRhesENAmpEVJdQ","id":37443123,"type":"transport","action":"indices:monitor/stats","start_time_in_millis":1532418002166,"running_time_in_nanos":1094391500759,"cancellable":false},"_ZO-VqQVRhesENAmpEVJdQ:37442102":{"node":"_ZO-VqQVRhesENAmpEVJdQ","id":37442102,"type":"transport","action":"indices:monitor/stats","start_time_in_millis":1532417979659,"running_time_in_nanos":1116898817353,"cancellable":false},"_ZO-VqQVRhesENAmpEVJdQ:37432886":{"node":"_ZO-VqQVRhesENAmpEVJdQ","id":37432886,"type":"transport","action":"indices:monitor/stats","start_time_in_millis":1532417861938,"running_time_in_nanos":1234619336998,"cancellable":false},"_ZO-VqQVRhesENAmpEVJdQ:37443124":{"node":"_ZO-VqQVRhesENAmpEVJdQ","id":37443124,"type":"transport","action":"indices:monitor/stats","start_time_in_millis":1532418002170,"running_time_in_nanos":1094387987182,"cancellable":false},"_ZO-VqQVRhesENAmpEVJdQ:37441070":{"node":"_ZO-VqQVRhesENAmpEVJdQ","id":37441070,"type":"transport","action":"indices:monitor/stats","start_time_in_millis":1532417966401,"running_time_in_nanos":1130156245887,"cancellable":false},"_ZO-VqQVRhesENAmpEVJdQ:37436462":{"node":"_ZO-VqQVRhesENAmpEVJdQ","id":37436462,"type":"transport","action":"indices:monitor/stats","start_time_in_millis":1532417907356,"running_time_in_nanos":1189201824605,"cancellable":false},"_ZO-VqQVRhesENAmpEVJdQ:37433900":{"node":"_ZO-

Any thoughts on this? It keeps taking our Elasticsearch cluster down.

Ta muchly

This is happening again.

Is this just something we have to live with?

The lack of response is very frustrating on here. I raised the discussion and waited 16 days. I raised a GitHub issue and received an instant response.

Since that initial response I have provided the information requested, and now another month has passed without any feedback. It seems we only get responses to GitHub tickets.

We aren't doing anything unusual here. This just feels broken.
