The log snippet below looked strange to me. It occurs after indexing has
stopped for a while. It looks like elasticsearch closes the index for
writing (?) and that triggers some IndexWriter operation. The getReader
message reports that it took 13 seconds, which seems like a long time.
Nothing else is really going on at this point: no indexing, no heavy GC,
etc. The same getReader call for this index's replica, on a healthy node,
took 2 seconds around the same time.
I am trying to follow the code; since packetCount=0 here, it seems like
the prune() method doesn't do anything. The only other thing of
significance between that message and the "return reader ..." message is
the call to StandardDirectoryReader.open(this, segmentInfos,
applyAllDeletes), which unfortunately doesn't produce any log messages.
Does this provide any clues? Like everything else, the getReader calls
seem to take much longer on the loaded node.
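In case it helps, here is a rough standalone sketch (the index path and
writer setup are made up, not our actual code) of timing the same kind of
NRT open with the plain Lucene 4.x API; as far as I can tell,
DirectoryReader.open(writer, applyAllDeletes) goes through the same
StandardDirectoryReader.open path that IW.getReader uses:

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class NrtOpenTimer {
    public static void main(String[] args) throws Exception {
        // Hypothetical path: a copy of the shard's Lucene index directory.
        FSDirectory dir = FSDirectory.open(new File("/tmp/entry_20140702_shard0_copy"));
        IndexWriter writer = new IndexWriter(dir,
                new IndexWriterConfig(Version.LUCENE_48,
                        new StandardAnalyzer(Version.LUCENE_48)));

        long start = System.nanoTime();
        // Flushes, applies deletes, and opens an NRT reader, like IW.getReader does.
        DirectoryReader reader = DirectoryReader.open(writer, true);
        long tookMs = (System.nanoTime() - start) / 1000000L;
        System.out.println("NRT open took " + tookMs + " msec, maxDoc=" + reader.maxDoc());

        reader.close();
        writer.close();
        dir.close();
    }
}

Running that against shard copies from the loaded node vs. the healthy node
might at least show whether the slowness is in the Lucene open itself.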
[2014-07-10 08:17:58,281][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
BD: prune sis=org.apache.lucene.index.SegmentInfos@39c19aec minGen=0
packetCount=0
[2014-07-10 08:18:11,359][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IW: return reader version=155332
reader=StandardDirectoryReader(segments_la:155332:nrt
_thy(4.8):C7093414/62450:delGen=393 _cfi(4.7):C4517193/37932:delGen=622
_mrx(4.8):C5257709/46074:delGen=509 _zqo(4.8):C4946049/46139:delGen=293
_zte(4.8):C428210/13793:delGen=267 _10jx(4.8):C562710/16328:delGen=266
_11aw(4.8):C554410/16793:delGen=266 _1245(4.8):C670233/18142:delGen=247
_131a(4.8):C742879/18107:delGen=224 _13tl(4.8):C595917/15685:delGen=215
_1aw7(4.8):C517272/3261:delGen=87 _167b(4.8):C97197/12459:delGen=171
_1amh(4.8):C15328/1829:delGen=87 _1a0t(4.8):C9524/1392:delGen=92
_1acr(4.8):C8324/1421:delGen=87 _1atp(4.8):C5689/566:delGen=69
_1avx(4.8):C5077/244:delGen=60 _1avn(4.8):C5561/304:delGen=48
_1emy(4.8):C10448/482:delGen=14 _1csj(4.8):C8313/907:delGen=48
_1bu3(4.8):C3357/896:delGen=61 _1d3d(4.8):C3442/643:delGen=42
_1e5f(4.8):C5228/494:delGen=22 _1ex8(4.8):C2400/220:delGen=7
_1f1e(4.8):C957/50:delGen=2 _1f29(4.8):C187/1:delGen=1 _1f2i(4.8):C147)
[2014-07-10 08:18:11,362][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
DW: elasticsearch[se5int-sv2][scheduler][T#1] finishFullFlush success=true
[2014-07-10 08:18:11,363][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IW: getReader took 13085 msec
Full log for that operation:
[2014-07-10 08:17:57,928][DEBUG][indices.memory ] [se5int-sv2]
marking shard [entry_20140702][0] as inactive (inactive_time[30m]) indexing
wise, setting size to [500kb]
[2014-07-10 08:17:57,929][DEBUG][index.engine.internal ] [se5int-sv2]
[entry_20140702][0] updating index_buffer_size from [512mb] to (inactive)
[500kb]
[2014-07-10 08:17:57,930][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IW: now flush at close waitForMerges=false
[2014-07-10 08:17:57,931][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IW: start flush: applyAllDeletes=true
[2014-07-10 08:17:57,933][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IW: index before flush _thy(4.8):C7093414/62450:delGen=393
_cfi(4.7):C4517193/37932:delGen=622 _mrx(4.8):C5257709/46074:delGen=509
_zqo(4.8):C4946049/46139:delGen=293 _zte(4.8):C428210/13793:delGen=267
_10jx(4.8):C562710/16328:delGen=266 _11aw(4.8):C554410/16793:delGen=266
_1245(4.8):C670233/18142:delGen=247 _131a(4.8):C742879/18107:delGen=224
_13tl(4.8):C595917/15685:delGen=215 _1aw7(4.8):C517272/3261:delGen=87
_167b(4.8):C97197/12459:delGen=171 _1amh(4.8):C15328/1829:delGen=87
_1a0t(4.8):C9524/1392:delGen=92 _1acr(4.8):C8324/1421:delGen=87
_1atp(4.8):C5689/566:delGen=69 _1avx(4.8):C5077/244:delGen=60
_1avn(4.8):C5561/304:delGen=48 _1emy(4.8):C10448/482:delGen=14
_1csj(4.8):C8313/907:delGen=48 _1bu3(4.8):C3357/896:delGen=61
_1d3d(4.8):C3442/643:delGen=42 _1e5f(4.8):C5228/494:delGen=22
_1ex8(4.8):C2400/220:delGen=7 _1f1e(4.8):C957/50:delGen=2
_1f29(4.8):C187/1:delGen=1 _1f2i(4.8):C147
[2014-07-10 08:17:57,934][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
DW: startFullFlush
[2014-07-10 08:17:57,934][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
DW: anyChanges? numDocsInRam=0 deletes=false hasTickets:false
pendingChangesInFullFlush: false
[2014-07-10 08:17:57,936][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
DW: elasticsearch[se5int-sv2][scheduler][T#1] finishFullFlush success=true
[2014-07-10 08:17:57,936][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IW: apply all deletes during flush
[2014-07-10 08:17:57,938][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
BD: applyDeletes: no deletes; skipping
[2014-07-10 08:17:57,938][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
BD: prune sis=org.apache.lucene.index.SegmentInfos@1b8d1c15 minGen=76876
packetCount=0
[2014-07-10 08:17:57,939][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IW: all running merges have aborted
[2014-07-10 08:17:57,940][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IW: now call final commit()
[2014-07-10 08:17:57,941][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IW: commit: start
[2014-07-10 08:17:57,941][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IW: commit: enter lock
[2014-07-10 08:17:57,942][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IW: commit: now prepare
[2014-07-10 08:17:57,942][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IW: prepareCommit: flush
[2014-07-10 08:17:57,943][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IW: index before flush _thy(4.8):C7093414/62450:delGen=393
_cfi(4.7):C4517193/37932:delGen=622 _mrx(4.8):C5257709/46074:delGen=509
_zqo(4.8):C4946049/46139:delGen=293 _zte(4.8):C428210/13793:delGen=267
_10jx(4.8):C562710/16328:delGen=266 _11aw(4.8):C554410/16793:delGen=266
_1245(4.8):C670233/18142:delGen=247 _131a(4.8):C742879/18107:delGen=224
_13tl(4.8):C595917/15685:delGen=215 _1aw7(4.8):C517272/3261:delGen=87
_167b(4.8):C97197/12459:delGen=171 _1amh(4.8):C15328/1829:delGen=87
_1a0t(4.8):C9524/1392:delGen=92 _1acr(4.8):C8324/1421:delGen=87
_1atp(4.8):C5689/566:delGen=69 _1avx(4.8):C5077/244:delGen=60
_1avn(4.8):C5561/304:delGen=48 _1emy(4.8):C10448/482:delGen=14
_1csj(4.8):C8313/907:delGen=48 _1bu3(4.8):C3357/896:delGen=61
_1d3d(4.8):C3442/643:delGen=42 _1e5f(4.8):C5228/494:delGen=22
_1ex8(4.8):C2400/220:delGen=7 _1f1e(4.8):C957/50:delGen=2
_1f29(4.8):C187/1:delGen=1 _1f2i(4.8):C147
[2014-07-10 08:17:57,943][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
DW: startFullFlush
[2014-07-10 08:17:57,944][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
DW: anyChanges? numDocsInRam=0 deletes=false hasTickets:false
pendingChangesInFullFlush: false
[2014-07-10 08:17:57,945][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IW: apply all deletes during flush
[2014-07-10 08:17:57,945][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
BD: applyDeletes: no deletes; skipping
[2014-07-10 08:17:57,946][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
BD: prune sis=org.apache.lucene.index.SegmentInfos@1b8d1c15 minGen=76876
packetCount=0
[2014-07-10 08:17:57,953][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
DW: elasticsearch[se5int-sv2][scheduler][T#1] finishFullFlush success=true
[2014-07-10 08:17:57,953][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IW: startCommit(): start
[2014-07-10 08:17:57,954][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IW: skip startCommit(): no changes pending
[2014-07-10 08:17:57,955][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IW: commit: pendingCommit == null; skip
[2014-07-10 08:17:57,956][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IW: commit: done
[2014-07-10 08:17:57,963][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IW: at close: _thy(4.8):C7093414/62450:delGen=393
_cfi(4.7):C4517193/37932:delGen=622 _mrx(4.8):C5257709/46074:delGen=509
_zqo(4.8):C4946049/46139:delGen=293 _zte(4.8):C428210/13793:delGen=267
_10jx(4.8):C562710/16328:delGen=266 _11aw(4.8):C554410/16793:delGen=266
_1245(4.8):C670233/18142:delGen=247 _131a(4.8):C742879/18107:delGen=224
_13tl(4.8):C595917/15685:delGen=215 _1aw7(4.8):C517272/3261:delGen=87
_167b(4.8):C97197/12459:delGen=171 _1amh(4.8):C15328/1829:delGen=87
_1a0t(4.8):C9524/1392:delGen=92 _1acr(4.8):C8324/1421:delGen=87
_1atp(4.8):C5689/566:delGen=69 _1avx(4.8):C5077/244:delGen=60
_1avn(4.8):C5561/304:delGen=48 _1emy(4.8):C10448/482:delGen=14
_1csj(4.8):C8313/907:delGen=48 _1bu3(4.8):C3357/896:delGen=61
_1d3d(4.8):C3442/643:delGen=42 _1e5f(4.8):C5228/494:delGen=22
_1ex8(4.8):C2400/220:delGen=7 _1f1e(4.8):C957/50:delGen=2
_1f29(4.8):C187/1:delGen=1 _1f2i(4.8):C147
[2014-07-10 08:17:58,157][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IFD: init: current segments file is "segments_la";
deletionPolicy=org.elasticsearch.index.deletionpolicy.SnapshotDeletionPolicy@667b43fc
[2014-07-10 08:17:58,161][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IFD: init: load commit "segments_la"
[2014-07-10 08:17:58,215][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IFD: init: removing unreferenced file "_1ex8_6.del"
[2014-07-10 08:17:58,216][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IFD: delete "_1ex8_6.del"
[2014-07-10 08:17:58,219][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IFD: init: removing unreferenced file "_thy_aw.del"
[2014-07-10 08:17:58,220][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IFD: delete "_thy_aw.del"
[2014-07-10 08:17:58,236][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IFD: init: removing unreferenced file "_1aw7_2e.del"
[2014-07-10 08:17:58,237][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IFD: delete "_1aw7_2e.del"
[2014-07-10 08:17:58,240][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IFD: init: removing unreferenced file "_1emy_d.del"
[2014-07-10 08:17:58,241][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IFD: delete "_1emy_d.del"
[2014-07-10 08:17:58,242][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IFD: init: removing unreferenced file "_cfi_h9.del"
[2014-07-10 08:17:58,243][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IFD: delete "_cfi_h9.del"
[2014-07-10 08:17:58,251][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IFD: init: removing unreferenced file "_mrx_e4.del"
[2014-07-10 08:17:58,252][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IFD: delete "_mrx_e4.del"
[2014-07-10 08:17:58,262][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IFD: init: removing unreferenced file "_1f1e_1.del"
[2014-07-10 08:17:58,263][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IFD: delete "_1f1e_1.del"
[2014-07-10 08:17:58,266][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IFD: now checkpoint "_thy(4.8):C7093414/62450:delGen=393
_cfi(4.7):C4517193/37932:delGen=622 _mrx(4.8):C5257709/46074:delGen=509
_zqo(4.8):C4946049/46139:delGen=293 _zte(4.8):C428210/13793:delGen=267
_10jx(4.8):C562710/16328:delGen=266 _11aw(4.8):C554410/16793:delGen=266
_1245(4.8):C670233/18142:delGen=247 _131a(4.8):C742879/18107:delGen=224
_13tl(4.8):C595917/15685:delGen=215 _1aw7(4.8):C517272/3261:delGen=87
_167b(4.8):C97197/12459:delGen=171 _1amh(4.8):C15328/1829:delGen=87
_1a0t(4.8):C9524/1392:delGen=92 _1acr(4.8):C8324/1421:delGen=87
_1atp(4.8):C5689/566:delGen=69 _1avx(4.8):C5077/244:delGen=60
_1avn(4.8):C5561/304:delGen=48 _1emy(4.8):C10448/482:delGen=14
_1csj(4.8):C8313/907:delGen=48 _1bu3(4.8):C3357/896:delGen=61
_1d3d(4.8):C3442/643:delGen=42 _1e5f(4.8):C5228/494:delGen=22
_1ex8(4.8):C2400/220:delGen=7 _1f1e(4.8):C957/50:delGen=2
_1f29(4.8):C187/1:delGen=1 _1f2i(4.8):C147" [27 segments ; isCommit = false]
[2014-07-10 08:17:58,272][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IFD: 7 msec to checkpoint
[2014-07-10 08:17:58,273][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IW: init: create=false
[2014-07-10 08:17:58,277][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IW:
[2014-07-10 08:17:58,278][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IW: flush at getReader
[2014-07-10 08:17:58,279][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
DW: startFullFlush
[2014-07-10 08:17:58,279][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
DW: anyChanges? numDocsInRam=0 deletes=false hasTickets:false
pendingChangesInFullFlush: false
[2014-07-10 08:17:58,280][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IW: apply all deletes during flush
[2014-07-10 08:17:58,280][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
BD: applyDeletes: no deletes; skipping
[2014-07-10 08:17:58,281][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
BD: prune sis=org.apache.lucene.index.SegmentInfos@39c19aec minGen=0
packetCount=0
[2014-07-10 08:18:11,359][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IW: return reader version=155332
reader=StandardDirectoryReader(segments_la:155332:nrt
_thy(4.8):C7093414/62450:delGen=393 _cfi(4.7):C4517193/37932:delGen=622
_mrx(4.8):C5257709/46074:delGen=509 _zqo(4.8):C4946049/46139:delGen=293
_zte(4.8):C428210/13793:delGen=267 _10jx(4.8):C562710/16328:delGen=266
_11aw(4.8):C554410/16793:delGen=266 _1245(4.8):C670233/18142:delGen=247
_131a(4.8):C742879/18107:delGen=224 _13tl(4.8):C595917/15685:delGen=215
_1aw7(4.8):C517272/3261:delGen=87 _167b(4.8):C97197/12459:delGen=171
_1amh(4.8):C15328/1829:delGen=87 _1a0t(4.8):C9524/1392:delGen=92
_1acr(4.8):C8324/1421:delGen=87 _1atp(4.8):C5689/566:delGen=69
_1avx(4.8):C5077/244:delGen=60 _1avn(4.8):C5561/304:delGen=48
_1emy(4.8):C10448/482:delGen=14 _1csj(4.8):C8313/907:delGen=48
_1bu3(4.8):C3357/896:delGen=61 _1d3d(4.8):C3442/643:delGen=42
_1e5f(4.8):C5228/494:delGen=22 _1ex8(4.8):C2400/220:delGen=7
_1f1e(4.8):C957/50:delGen=2 _1f29(4.8):C187/1:delGen=1 _1f2i(4.8):C147)
[2014-07-10 08:18:11,362][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
DW: elasticsearch[se5int-sv2][scheduler][T#1] finishFullFlush success=true
[2014-07-10 08:18:11,363][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IW: getReader took 13085 msec
On Thursday, July 10, 2014 12:28:32 PM UTC-7, Kireet Reddy wrote:
We had an unrelated issue this morning where we had to stop indexing for a
while. The problematic node remained busy for 1 hour afterwards, but then
became idle. We then restarted indexing and the node was fine. We didn't
restart the JVM; it seemed like something recovered and the node was able
to operate normally afterwards. It's very strange. I have copied the logs
for the time period here
https://www.dropbox.com/s/gb6xfxldryq80il/nodelogs.zip (it has logs for the
problematic node5 and the "normal" node6). Indexing stopped around 7:22,
but the node didn't seem to recover until 8:25. Unfortunately I don't have
any hot threads output during that time. I went through all our metrics;
pretty much all metric activity stops at 7:25 (see graphs). The only one
that didn't was the merge total docs counter. I don't know why this counter
would ever decrease, but I have seen decreases in the past as well.
Not sure what to try at this point. The "custom" things I do are:
   - language-specific analysis using packaged tokenizers/filters and the
ICU, smart-cn, kuromoji, and stempel plugins
   - eager field data loading turned on for a long field. However, the
field data cache size seems OK: it was around 600MB and the breaker size
is 3.6GB. Also, the size seemed to be around 100MB the last time the
problem happened, so it doesn't seem to be related.
   - a fair number of different indices (we use the index-per-time-period
pattern)
   - create/update documents via the bulk endpoint (a rough sketch is below)
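For concreteness, here is a rough sketch of bulk create/update with the
1.x Java client (the cluster name, host, index, type, fields, and the
choice of client here are illustrative, not necessarily what we actually
run):

import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.action.bulk.BulkResponse;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

public class BulkIndexSketch {
    public static void main(String[] args) {
        TransportClient client = new TransportClient(
                ImmutableSettings.settingsBuilder().put("cluster.name", "mycluster").build());
        client.addTransportAddress(new InetSocketTransportAddress("localhost", 9300));

        BulkRequestBuilder bulk = client.prepareBulk();
        for (int i = 0; i < 1000; i++) {
            // Indexing with an explicit id gives create-or-update semantics.
            bulk.add(client.prepareIndex("entry_20140702", "entry", "doc-" + i)
                    .setSource("title", "example title " + i, "body", "example body"));
        }
        BulkResponse response = bulk.execute().actionGet();
        if (response.hasFailures()) {
            System.err.println(response.buildFailureMessage());
        }
        client.close();
    }
}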
Everything else seems pretty much by the book. It seems like I could stop
using the language-specific analyzers or turn off the eager field data
loading, but both of those seem unlikely to be the issue to me. If it were
an analyzer, this should happen for the replica index as well. If it were
field data, the problem shouldn't go away unless we reduce the data size.
Answers to your questions are inline below. Thanks for the help! It's very
much appreciated. This is a really tough one...
On Thursday, July 10, 2014 2:59:40 AM UTC-7, Michael McCandless wrote:
Another question: have you disabled merge throttling? And, which version
of ES are you using?
We are using 1.2.1 (we upgraded to get the infoStream logs). Merge
throttling is enabled with the default settings, but it doesn't seem to be
an issue (see graph).
Mike McCandless
http://blog.mikemccandless.com
On Thu, Jul 10, 2014 at 5:49 AM, Michael McCandless <
mi...@elasticsearch.com> wrote:
Indeed the hot threads on node5 didn't reveal anything unexpected: they
are busy merging and refreshing. One thread was in SmartChineseAnalyzer...
Very strange that young gen GC is suddenly so costly. This is much more
time spent in GC than you ever see on healthy nodes?
Yes, I see monitor.jvm lines in the log like:
[gc][young][593516][88318] duration [2.2s], collections [1]/[11.8s],
total [2.2s]/[3.6h], memory [17.2gb]->[16.8gb]/[31.9gb], all_pools {[young]
[631.6mb]->[61.6mb]/[665.6mb]}{[survivor]
[83.1mb]->[83.1mb]/[83.1mb]}{[old] [16.5gb]->[16.7gb]/[31.1gb]}
It's always the young gen, and it seems to happen every 5-10 seconds, with
durations ranging from a few hundred ms to 3 seconds. I don't see any such
lines in the log files of the other nodes. I changed the config so that I
get debug lines for all collections instead of just the slow ones. I'll try
to confirm that the collections are just slower, and not more frequent, on
the loaded node.
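To sanity-check that, something like this small harness could log
per-collector deltas of count vs. cumulative time (just a sketch using the
standard java.lang.management API; it only sees the JVM it runs in, so it
would have to run inside the elasticsearch process, e.g. via remote JMX or
an agent):

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.HashMap;
import java.util.Map;

public class GcDeltaLogger {
    public static void main(String[] args) throws InterruptedException {
        Map<String, long[]> previous = new HashMap<String, long[]>();
        while (true) {
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                long count = gc.getCollectionCount(); // collections since JVM start
                long timeMs = gc.getCollectionTime(); // cumulative collection time in ms
                long[] prev = previous.get(gc.getName());
                if (prev != null) {
                    System.out.printf("%s: +%d collections, +%d ms over the last 10s%n",
                            gc.getName(), count - prev[0], timeMs - prev[1]);
                }
                previous.put(gc.getName(), new long[] { count, timeMs });
            }
            Thread.sleep(10000);
        }
    }
}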
If you restart the JVM on that node, does the problem immediately start
happening again? (Hmm: does restarting the JVM cause the others to start
building new replicas... which would mess up this experiment). Separately,
if you reboot the node and bring it back up, does the problem start
happening again?
Stopping indexing for an hour seemed to "fix" the issue without a JVM
restart. We also just restarted our cluster to add the extra GC logging;
things are fine so far.
Does vmstat/iostat reveal anything? E.g. is the OS doing a lot of
swapping or something? Do you mlockall your JVM heap (bootstrap.mlockall)?
What is swappiness set to?
mlockall is true, and there is no swap used. Everything seems normal at the
OS level: low I/O, 50% CPU usage, an acceptable number of file descriptors,
etc.
Which Java / OS versions are you using?
Java: 1.7.0_55
OS: Linux (CentOS 6.4, 64-bit)
Mike McCandless
http://blog.mikemccandless.com
On Wed, Jul 9, 2014 at 5:13 PM, Kireet Reddy kir...@feedly.com wrote:
Sorry, here it is:
Dropbox - Error - Simplify your life
Also a couple of graphs of the memory usage.
On Wednesday, July 9, 2014 2:10:49 PM UTC-7, Michael McCandless wrote:
Hmm link doesn't seem to work?
Mike McCandless
http://blog.mikemccandless.com
On Wed, Jul 9, 2014 at 4:06 PM, Kireet Reddy kir...@feedly.com
wrote:
Sorry, forgot the link
Dropbox - Error - Simplify your life
1229.txt
On Wednesday, July 9, 2014 1:05:56 PM UTC-7, Kireet Reddy wrote:
The problem is happening again, this time on node 5. I have captured
a few hot thread requests here. I also included one from node 6 (which is
now fine). There are merge-related stacks, but it seems like everything is
taking a lot more CPU than usual. I did a few type=wait and type=block
dumps and the result was always 0% usage there. Also, young gen GC activity
has again gone crazy (old gen/heap size seems fine). Would hot thread
measurements be distorted if GC activity is very high?
It seems strange to me that this would only happen on one node while we
have replicas set to at least 1 for all our indices. It seems like the
problems should happen on a couple of nodes simultaneously.
--Kireet
On Monday, July 7, 2014 3:51:35 PM UTC-7, Michael McCandless wrote:
Could you pull all hot threads next time the problem happens?
Mike McCandless
http://blog.mikemccandless.com
On Mon, Jul 7, 2014 at 3:47 PM, Kireet Reddy kir...@feedly.com
wrote:
All that seems correct (except I think this is for node 6, not
node 5). We don't delete documents, but we do some updates. The vast
majority of documents get indexed into the large shards, but the smaller
ones take some writes as well.
We aren't using virtualized hardware, and elasticsearch is the only
thing running on the machines (no scheduled jobs, etc.). I don't think
something is interfering; in fact, the overall disk I/O rate and operation
count on the machine go down quite a bit during the problematic period,
which is consistent with your observations about things taking longer.
I went back and checked all our collected metrics again. I noticed
that even though the heap usage and GC count seem smooth during the period
in question, the GC time spent goes way up. The number of active indexing
threads also goes up, but since our ingest rate didn't go up I assumed this
was a side effect. During a previous occurrence a few days ago on node5, I
stopped all indexing activity for 15 minutes. Active merges and indexing
requests went to zero as expected. Then I re-enabled indexing, and
immediately the CPU usage, GC time, and active merges went back up to the
problematic rates.
Overall it's pretty confusing to me what is a symptom and what is a root
cause here. A summary of what I think I know:
   - Every few days, CPU usage on a node goes way above that of the
other nodes and doesn't recover. We've let the node run in the elevated CPU
state for a day with no improvement.
   - It doesn't seem likely that it's data related. We use
replicas=1 and no other nodes have issues.
   - It doesn't seem hardware related. We run on dedicated h/w
with elasticsearch being the only thing running. Also, the problem appears
on various nodes, and machine load seems tied directly to the elasticsearch
process.
   - During the problematic period: CPU usage, active merge
threads, active bulk (indexing) threads, and GC time are elevated.
   - During the problematic period: I/O ops and I/O throughput
decrease.
   - Overall heap usage seems to increase smoothly; the extra GC
time seems to be spent on the new gen. Interestingly, the GC count didn't
seem to increase.
   - In the hours beforehand, GC behavior of the problematic
node was similar to that of the other nodes.
   - If I pause indexing, machine load quickly returns to
normal, and merges and indexing requests complete. If I then restart
indexing, the problem reoccurs immediately.
   - If I disable automatic refreshes, the problem disappears
within an hour or so (a rough sketch of toggling this setting follows this
list).
   - Hot threads output shows merging activity as the hot threads.
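For reference, toggling the refresh setting from the 1.x Java client looks
roughly like this (the helper and index name are made up, and we don't
necessarily do it this exact way; "-1" disables automatic refresh, "1s"
restores the default):

import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.ImmutableSettings;

public class RefreshToggle {
    // Hypothetical helper: pass "-1" to disable automatic refresh, "1s" to re-enable it.
    public static void setRefreshInterval(Client client, String index, String interval) {
        client.admin().indices().prepareUpdateSettings(index)
                .setSettings(ImmutableSettings.settingsBuilder()
                        .put("index.refresh_interval", interval)
                        .build())
                .execute().actionGet();
    }
}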
The first few points make me think the increased number of active merges
is perhaps a side effect, but the last three make me think merging is the
root cause. The only additional things I can think of that may be relevant
are:
   - Our documents can vary greatly in size; they average a
couple of KB but can occasionally be several MB.
   - We do use language analysis plugins; perhaps one of these
is acting up?
   - We eagerly load one field into the field data cache. But
the cache size is OK and the overall heap behavior is OK, so I don't think
this is the problem.
That's a lot of information, but I am not sure where to go from here...
On Monday, July 7, 2014 8:23:20 AM UTC-7, Michael McCandless wrote:
Indeed there are no big merges during that time ...
I can see that on node5, at ~14:45, merges suddenly start taking a long
time, refresh takes much longer (4-5 seconds instead of < 0.4 sec), commit
time goes up from < 0.5 sec to ~1-2 sec, etc., but other metrics are fine;
e.g. total merging GB and the number of commits/refreshes are very low
during this time.
Each node has 2 biggish (~17 GB) shards and then ~50 tiny shards.
The biggish shards are indexing at a very slow rate and only have ~1%
deletions. Are you explicitly deleting docs?
I suspect something is suddenly cutting into the IO perf of this
box, and because merging/refreshing is so IO intensive, it causes these
operations to run slower / backlog.
Are there any scheduled jobs, e.g. backups/snapshots, that start
up? Are you running on virtualized hardware?
Mike McCandless
http://blog.mikemccandless.com
On Sun, Jul 6, 2014 at 8:23 PM, Kireet Reddy kir...@feedly.com
wrote:
Just to reiterate, the problematic period is from 07/05 14:45
to 07/06 02:10. I included a couple hours before and after in the logs.
On Sunday, July 6, 2014 5:17:06 PM UTC-7, Kireet Reddy wrote:
They are linked below (node5 is the log of the normal node,
node6 is the log of the problematic node).
I don't think it was doing big merges; otherwise, during the
high-load period, the merges graph line would have had a "floor" > 0,
similar to the time period after I disabled refresh. We don't do routing
and use mostly default settings. I think the only settings we changed are:
indices.memory.index_buffer_size: 50%
index.translog.flush_threshold_ops: 50000
We are running on a 6-CPU/12-core machine with a 32GB heap, 96GB of
memory, and 4 spinning disks.
node 5 log (normal)
https://www.dropbox.com/s/uf76m58nf87mdmw/node5.zip
node 6 log (high load)
https://www.dropbox.com/s/w7qm2v9qpdttd69/node6.zip
On Sunday, July 6, 2014 4:23:19 PM UTC-7, Michael McCandless
wrote:
Can you post the IndexWriter infoStream output? I can see if
anything stands out.
Maybe it was just that this node was doing big merges? I.e.,
if you waited long enough, the other shards would eventually do their big
merges too?
Have you changed any default settings, do you use custom routing,
etc.? Is there any reason to think that the docs that land on this node
are "different" in any way?
Mike McCandless
http://blog.mikemccandless.com
On Sun, Jul 6, 2014 at 6:48 PM, Kireet Reddy <
kir...@feedly.com> wrote:
From all the information I've collected, it seems to be the
merging activity:
   - We capture the cluster stats into graphite, and the
current merges stat seems to be about 10x higher on this node.
   - The problem has occurred on different physical machines, so
a h/w issue seems unlikely. Once the problem starts it doesn't seem to
stop. We have blown away the indices in the past and started indexing
again after enabling more logging/stats.
   - I've stopped executing queries, so the only thing
that's happening on the cluster is indexing.
   - Last night when the problem was ongoing, I disabled
refresh (index.refresh_interval = -1) around 2:10am. Within 1 hour, the
load returned to normal. The merge activity seemed to decrease; it looks
like 2 very long-running merges are executing, but not much else.
   - I grepped an hour of logs on the 2 machines for "add
merge=": the count was 540 on the high-load node and 420 on a normal node.
I pulled out the size value from the log message, and the merges seemed to
be much smaller on the high-load node.
I just created the indices a few days ago, so the shards of
each index are balanced across the nodes. We have external metrics around
the document ingest rate, and there was no spike during this time period.
Thanks
Kireet
On Sunday, July 6, 2014 1:32:00 PM UTC-7, Michael McCandless
wrote:
It's perfectly normal/healthy for many small merges below
the floor size to happen.
I think you should first figure out why this node is
different from the others. Are you sure it's merging CPU cost that's
different?
Mike McCandless
http://blog.mikemccandless.com
On Sat, Jul 5, 2014 at 9:51 PM, Kireet Reddy <
kir...@feedly.com> wrote:
We have a situation where one of the four nodes in our
cluster seems to get caught up endlessly merging. However, it seems to be
CPU-bound rather than I/O-constrained. I have enabled
the IndexWriter infoStream logs, and often it seems to do merges of
quite small segments (100KB) that are much below the floor size (2MB). I
suspect this is due to frequent refreshes and/or using lots of threads
concurrently to do indexing. Is this true?
My supposition is that this is leading to the merge policy
doing lots of merges of very small segments into another small segment,
which will again require a merge to even reach the floor size. My index
has 64 segments and 25 are below the floor size. I am wondering if there
should be an exception to the maxMergeAtOnce parameter for the first
level, so that many small segments could be merged at once in this case.
I am considering changing the other parameters (wider
tiers, lower floor size, more concurrent merges allowed) but these all seem
to have side effects I may not necessarily want. Is there a good solution
here?
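For what it's worth, at the Lucene level the knobs I mean live on
TieredMergePolicy; here is a minimal sketch with made-up values
(elasticsearch exposes the same knobs as index.merge.policy.* settings, if
I remember correctly):

import org.apache.lucene.index.TieredMergePolicy;

public class MergePolicyKnobs {
    public static TieredMergePolicy tuned() {
        TieredMergePolicy tmp = new TieredMergePolicy();
        // Values are illustrative only, not recommendations.
        tmp.setFloorSegmentMB(2.0);            // segments below this are treated as floor-sized
        tmp.setMaxMergeAtOnce(10);             // max number of segments merged in one merge
        tmp.setSegmentsPerTier(10.0);          // wider tier => more segments tolerated per level
        tmp.setMaxMergedSegmentMB(5 * 1024.0); // cap on the size of a merged segment
        return tmp;
    }
}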