High disk read on one node out of 3

Pavel_Prokopenko · September 14, 2018, 11:39am

Hi!
I observe very high disk read rate on one of our elaticsearch 5.5.2 nodes. Other 2 nodes do not have such problem.
iotop tool confirms that the disk reads are actually done by the elasticsearch threads:

  TID  PRIO  USER    DISK READ>  DISK WRITE  SWAPIN      IO    COMMAND                                                                                   
 8095 be/4 elastics     40.73 G   1732.88 M  0.00 % 12.71 % java -Xms4g -Xmx4g -XX:+UseConcMarkSweepGC -XX~csearch -Edefault.path.conf=/etc/elasticsearch
 8061 be/4 elastics     40.51 G   1722.35 M  0.00 % 12.72 % java -Xms4g -Xmx4g -XX:+UseConcMarkSweepGC -XX~csearch -Edefault.path.conf=/etc/elasticsearch
 8333 be/4 elastics      2.65 G      0.00 B  0.00 %  0.54 % java -Xms4g -Xmx4g -XX:+UseConcMarkSweepGC -XX~csearch -Edefault.path.conf=/etc/elasticsearch
 8026 be/4 elastics      2.45 G      8.00 K  0.00 %  0.53 % java -Xms4g -Xmx4g -XX:+UseConcMarkSweepGC -XX~csearch -Edefault.path.conf=/etc/elasticsearch
 8236 be/4 elastics      2.38 G      8.00 K  0.00 %  0.49 % java -Xms4g -Xmx4g -XX:+UseConcMarkSweepGC -XX~csearch -Edefault.path.conf=/etc/elasticsearch
 8112 be/4 elastics      2.25 G      0.00 B  0.00 %  0.50 % java -Xms4g -Xmx4g -XX:+UseConcMarkSweepGC -XX~csearch -Edefault.path.conf=/etc/elasticsearch
 1037 be/4 root          2.18 G     66.57 M  0.00 %  1.52 % snmpd -LS0-6d -f
 8316 be/4 elastics   1224.79 M    764.70 M  0.00 %  0.22 % java -Xms4g -Xmx4g -XX:+UseConcMarkSweepGC -XX~csearch -Edefault.path.conf=/etc/elasticsearch
 3129 be/4 mysql       919.16 M      3.42 M  0.00 %  0.06 % mysqld --wsrep_start_position=36f88a88-8d60-11e7-8399-ea05333056b4:3260959

Searching the forum for similar problem, I have found the following possible reasons:

Refresh
Merge
ID lookup

I have tried tried to check these areas, but still can't say I have found out anything useful.
Is there any method to find what is being so intensively read by elasticsearch?

Tried to look at the hot_threads output for this particular node.
::: {esmaas02}{z9_QEOfUTFCpsHwyagYkww}{1hOUnwP7Q1etdmcSqc9N2g}{10.212.43.47}{10.212.43.47:9300}{box_type=hot}
Hot threads at 2018-09-14T11:28:00.099Z, interval=1s, busiestThreads=3, ignoreIdleThreads=true:

    2.2% (22.4ms out of 1s) cpu usage by thread 'elasticsearch[esmaas02][bulk][T#1]'
     2/10 snapshots sharing following 2 elements
       java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
       java.lang.Thread.run(Thread.java:748)

One more observation: the reads drop down at 03:00 (the time, when the new indexes are being created). So, the problem possibly has some relation to the index size..

I have also tried to track particular indexes with primary shard located at the problem node, but no results.
So, the questions are:

How to find what files are being read from disk?
Why the disk reads are so high only on this one node?

Thank you!

Addition: ran more times the hot_threads and it looks like the reads are most probably caused by the id lookups:

10.0% (100.4ms out of 1s) cpu usage by thread 'elasticsearch[esmaas02][bulk][T#2]'
     10/10 snapshots sharing following 42 elements
       java.nio.Bits.copyToArray(Bits.java:834)
       java.nio.DirectByteBuffer.get(DirectByteBuffer.java:280)
       org.apache.lucene.store.ByteBufferGuard.getBytes(ByteBufferGuard.java:93)
       org.apache.lucene.store.ByteBufferIndexInput.readBytes(ByteBufferIndexInput.java:89)
       org.apache.lucene.codecs.blocktree.SegmentTermsEnumFrame.loadBlock(SegmentTermsEnumFrame.java:176)
       org.apache.lucene.codecs.blocktree.SegmentTermsEnum.seekExact(SegmentTermsEnum.java:507)
       org.elasticsearch.common.lucene.uid.PerThreadIDAndVersionLookup.getDocID(PerThreadIDAndVersionLookup.java:102)
       org.elasticsearch.common.lucene.uid.PerThreadIDAndVersionLookup.lookupVersion(PerThreadIDAndVersionLookup.java:88)
       org.elasticsearch.common.lucene.uid.VersionsResolver.loadDocIdAndVersion(VersionsResolver.java:122)
       org.elasticsearch.common.lucene.uid.VersionsResolver.loadVersion(VersionsResolver.java:137)
       org.elasticsearch.index.engine.InternalEngine.loadCurrentVersionFromIndex(InternalEngine.java:1374)
       org.elasticsearch.index.engine.InternalEngine.resolveDocVersion(InternalEngine.java:394)
       org.elasticsearch.index.engine.InternalEngine.planIndexingAsPrimary(InternalEngine.java:571)
       org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:494)
       org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:556)
       org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:545)
       org.elasticsearch.action.bulk.TransportShardBulkAction.executeIndexRequestOnPrimary(TransportShardBulkAction.java:484)
       org.elasticsearch.action.bulk.TransportShardBulkAction.executeBulkItemRequest(TransportShardBulkAction.java:143)

And here is one more graph (more detailed):

Someone said that the problem might be caused by memory fragmentation. So, here is some memory related info

[root@esmaas01]# free -h
              total        used        free      shared  buff/cache   available
Mem:            11G        8.5G        495M        583M        2.6G        2.0G
Swap:            0B          0B          0B
[root@esmaas01]# 

[root@esmaas02]# free -h
              total        used        free      shared  buff/cache   available
Mem:            11G         10G        143M        600M        1.3G        419M
Swap:            0B          0B          0B
[root@esmaas02]# 

[root@esmaas03]# free -h
              total        used        free      shared  buff/cache   available
Mem:            11G        8.4G        159M        608M        3.0G        2.1G
Swap:            0B          0B          0B
[root@esmaas03]#

Pavel_Prokopenko · September 18, 2018, 11:47am

More observations: If I reboot the machine (esmaas02) - I do not see the problem for some time (2-4 days)..
Also, as you can see from the above post, the problem machine (esmaas02) has more memory used (and less left for the cache).
Could it be that the elasticsearch process starts massive disk reads when the cache is getting low?

system · October 16, 2018, 11:47am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Elasticsearch 7.1 have rediculously high disk read io (100%) Elasticsearch	3	1722	April 27, 2020
High Disk Read I/O in Elasticsearch Elasticsearch	1	2412	November 14, 2017
Only one node in cluster(6 nodes) slow query and high read I/O when querying Elasticsearch	7	1069	August 27, 2020
High read I/O and Load Average after upgrading to Elasticsearch 5.3.2 Elasticsearch	2	2054	August 25, 2017
Elasticsearch read operations very high Elasticsearch	1	396	January 11, 2019

High disk read on one node out of 3

Related topics