Segment merging takes infinite time

In my cluster I sometimes see dramatic performance drops.
I found these interesting lines in the log file:

[2018-05-10T12:19:32,563][DEBUG][o.e.i.e.InternalEngine$EngineMergeScheduler] [server-v1] [logs-space-2018w19][1] merge segment [_2zob] done: took [2h], [4.8 MB], [17,485 docs], [0s stopped], [0s throttled], [5.0 MB written], [Infinity MB/sec throttle]
[2018-05-10T12:19:40,122][DEBUG][o.e.i.e.InternalEngine$EngineMergeScheduler] [server-v1] [logs-async-2018w19][2] merge segment [_28cq] done: took [1.8h], [4.6 MB], [18,662 docs], [0s stopped], [0s throttled], [5.0 MB written], [Infinity MB/sec throttle]
[2018-05-10T12:19:51,276][DEBUG][o.e.i.e.InternalEngine$EngineMergeScheduler] [server-v1] [logs-space-2018w19][2] merge segment [_2z85] done: took [2h], [4.7 MB], [17,218 docs], [0s stopped], [0s throttled], [4.0 MB written], [Infinity MB/sec throttle]

The output of _nodes/hot_threads also confirms my suspicion:

    org.apache.lucene.store.DataInput.readVInt(DataInput.java:125)
    org.apache.lucene.codecs.lucene50.Lucene50PostingsReader$BlockPostingsEnum.refillPositions(Lucene50PostingsReader.java:618)
    org.apache.lucene.codecs.lucene50.Lucene50PostingsReader$BlockPostingsEnum.nextPosition(Lucene50PostingsReader.java:773)
    org.apache.lucene.index.MappingMultiPostingsEnum.nextPosition(MappingMultiPostingsEnum.java:113)
    org.apache.lucene.codecs.PushPostingsWriterBase.writeTerm(PushPostingsWriterBase.java:145)
    org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter$TermsWriter.write(BlockTreeTermsWriter.java:864)
    org.apache.lucene.codecs.blocktree.BlockTreeTermsWriter.write(BlockTreeTermsWriter.java:343)
    org.apache.lucene.codecs.FieldsConsumer.merge(FieldsConsumer.java:105)
    org.apache.lucene.codecs.perfield.PerFieldPostingsFormat$FieldsWriter.merge(PerFieldPostingsFormat.java:164)
    org.apache.lucene.index.SegmentMerger.mergeTerms(SegmentMerger.java:230)
    org.apache.lucene.index.SegmentMerger.merge(SegmentMerger.java:115)
    org.apache.lucene.index.IndexWriter.mergeMiddle(IndexWriter.java:4443)
    org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:4083)
    org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:624)
    org.elasticsearch.index.engine.ElasticsearchConcurrentMergeScheduler.doMerge(ElasticsearchConcurrentMergeScheduler.java:99)
    org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:661)
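
As far as I understand, the [Infinity MB/sec throttle] by itself is expected: Lucene does not IO-throttle small merges, and these are only ~5 MB. The strange part is that such tiny merges take around 2 hours. Merge progress and throttling can also be watched through the index stats API, e.g. for one of the affected indices:

    GET logs-space-2018w19/_stats/merge

The total_time_in_millis and total_throttled_time_in_millis counters there should line up with the numbers in the merge log lines.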

I use Elasticsearch 6.2.4.
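
The merge scheduler settings for the affected indices may be worth double-checking as well; a quick diagnostic sketch, reusing an index name from the log above:

    GET logs-space-2018w19/_settings?include_defaults=true&filter_path=**.merge

index.merge.scheduler.max_thread_count and index.merge.scheduler.auto_throttle are the relevant ones; both can be set per index.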

How many indices and shards do you have? What type of storage? Is the cluster under heavy load?
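
Here is the index and shard overview. The same listing with a header row can be pulled with:

    GET _cat/indices?v

The columns below are health, status, index, uuid, pri, rep, docs.count, docs.deleted, store.size, pri.store.size.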

green open logs-async-2018w19 62QYjxxgTWehdL31jHBdaw 3 1  61895805      0  26.9gb 13.1gb
green open logs-space-2018w19 YlUy6bIqTyuYyXoODazH7w 3 1 154916183      0  64.8gb 32.5gb
green open logs-async-2018w17 fjHgMyOJTnaCEooFSXsFVQ 3 1  15162224      0  10.4gb  5.2gb
green open logs-space-2018w17 4cBSd8nQTxedCkqawt6quw 3 1 293250335      0 168.3gb 84.1gb
green open crm-elapsed        jbeb_s-oTOuefCgvgm_EDQ 1 0  71571025 170331  50.7gb 50.7gb
green open logs-space-2018w18 LmwdkSxFQhqTAOucnSU2Qw 3 1 144382941      0    60gb   30gb
green open logs-cpu-2018w18   MBuQP8E2T2mSlRGjN5MzxA 3 1  26433694      0     8gb    4gb
green open logs-cpu-2018w19   wk7jSSTZQ0GGYsRGGm3KrA 3 1  17556130      0   5.6gb  2.8gb
green open space-elapsed      V_uYe6rkQuidUDT9n4xdIw 3 1  85823832 138270  95.7gb 47.8gb
green open logs-crm-2018w14   ufBMxFBfRkW4n-XD1Dx-Xw 1 0 265736760      0  76.4gb 76.4gb
green open .kibana            w31EoElpSUiRFl0U-ErLng 1 1        11      4  98.6kb 49.3kb
green open logs-async-2018w18 mhPeEDUOTFyoLR0Y1isobg 3 1  24481814      0  12.1gb    6gb
green open logs-cpu-2018w17   -7Hj5xOiS4atsbLXNSLyxg 3 1  19516524      0   7.9gb  3.9gb
green open logs-space-2018w15 oBgP_B9AQZSSbMBvFTIiIQ 5 0 377926709      0  85.1gb 85.1gb
green open logs-space-2018w16 w7UVXIkTRlqC0D_svMSd5g 3 1 343578741      0 149.7gb 74.8gb

In the current situation, I've just started a reindex operation:

POST /_reindex?slices=3&refresh
{
  "source": {
    "index": "logs-space-2018w17",
    "size": 10000,
    "query":{  
      "bool":{  
         "must":[  
            {  
               "query_string":{  
                  "query":"Sender:AppLogger",
                  "analyze_wildcard":true
               }
            }
        ]
      }
    }
  },
  "dest": {
    "index": "space-elapsed"
  }
}
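
To keep an eye on it, the reindex task can be monitored and, if it turns out to add too much IO pressure, rethrottled on the fly (sketch; {task_id} is a placeholder and 500 requests per second is just an example value):

    GET _tasks?detailed=true&actions=*reindex

    POST _reindex/{task_id}/_rethrottle?requests_per_second=500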

Current cluster installation:

  • master+data node, 14 GB + SSD - two nodes
  • master-only node, 14 GB - one node

On one node I see extremely high read IO load.
Previously I had a similar problem (same stack trace in /hot_threads) and fixed it by setting indices.memory.index_buffer_size: 50%.
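
For reference, that one is a static node-level setting, so it lives in elasticsearch.yml and only takes effect after a node restart; minimal sketch:

    # elasticsearch.yml - size of the shared indexing buffer (default is 10% of heap)
    indices.memory.index_buffer_size: 50%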

After restarting the problem node, everything is OK:

[2018-05-11T06:40:43,269][DEBUG][o.e.i.e.InternalEngine$EngineMergeScheduler] [crmlogs-es-iva-1-common-storage-crm-v2] [logs-async-2018w19][0] merge segment [_2uen] done: took [37.8s], [334.1 MB], [1,561,617 docs], [0s stopped], [20.7s throttled], [640.7 MB written], [21.8 MB/sec throttle]
[2018-05-11T06:40:44,757][DEBUG][o.e.i.e.InternalEngine$EngineMergeScheduler] [crmlogs-es-iva-1-common-storage-crm-v2] [logs-async-2018w19][2] merge segment [_2t15] done: took [59.8s], [390.0 MB], [1,666,121 docs], [0s stopped], [40.4s throttled], [753.3 MB written], [13.7 MB/sec throttle]
