The problem reappeared. We did some tests today around copying a large file
on the nodes to test I/O throughput. On the loaded node, the copy was
really slow, maybe 30x slower. So it seems your suspicion about something
external interfering with I/O was right after all, even though nothing else
is running on the machines. We will investigate our setup further, but this
doesn't seem to be a Lucene/Elasticsearch issue in the end.
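(For reference, the check can be as crude as timing one big sequential copy; the sketch below is only illustrative, the class name and arguments are invented, and a plain cp wrapped in time tells the same story.)

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Times a single large sequential copy and prints the effective throughput.
public class CopyTimer {
    public static void main(String[] args) throws Exception {
        Path src = Paths.get(args[0]); // e.g. an existing multi-GB file on the data disk
        Path dst = Paths.get(args[1]);

        long start = System.nanoTime();
        Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING);
        double secs = (System.nanoTime() - start) / 1e9;

        double mb = Files.size(dst) / (1024.0 * 1024.0);
        System.out.printf("copied %.0f MB in %.1f s (%.1f MB/s)%n", mb, secs, mb / secs);
        // Use a file larger than RAM (or drop the page cache first) so the number
        // reflects the disks rather than cached reads.
    }
}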
For the index close, I didn't issue any command; Elasticsearch seemed to do
that on its own. The code is in IndexingMemoryController. The triggering
event seems to be the RAM buffer size change, which triggers a call to
InternalEngine.updateIndexingBufferSize(); that in turn calls flush with type
NEW_WRITER, which seems to close the Lucene IndexWriter.
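At the plain Lucene level, the sequence in the trace seems to boil down to something like the sketch below (Lucene 4.8-era API; the class name and directory path are invented for illustration, and this is my paraphrase of the flow, not the actual Elasticsearch code path):

import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

// Standalone sketch of what the NEW_WRITER flush appears to amount to at the
// Lucene level: close the current IndexWriter, open a new one with the new
// (tiny) RAM buffer, and the next getReader() has no NRT reader to reuse.
public class NewWriterFlushSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = FSDirectory.open(new File("/tmp/example-index")); // invented path

        // writer created while the shard is "active" (512mb indexing buffer)
        IndexWriterConfig activeCfg =
            new IndexWriterConfig(Version.LUCENE_48, new StandardAnalyzer(Version.LUCENE_48));
        activeCfg.setRAMBufferSizeMB(512);
        IndexWriter writer = new IndexWriter(dir, activeCfg);

        // ... indexing happens; refreshes reuse NRT readers from this writer ...

        // shard marked inactive: buffer drops to 500kb and the writer is swapped.
        // "now flush at close waitForMerges=false" in the trace corresponds to:
        writer.close(false);

        IndexWriterConfig inactiveCfg =
            new IndexWriterConfig(Version.LUCENE_48, new StandardAnalyzer(Version.LUCENE_48));
        inactiveCfg.setRAMBufferSizeMB(0.5); // ~500kb
        writer = new IndexWriter(dir, inactiveCfg);

        // the first getReader() on the fresh writer must open every segment from
        // disk -- the expensive 13-second step in the log
        DirectoryReader reader = DirectoryReader.open(writer, true);

        reader.close();
        writer.close();
        dir.close();
    }
}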
Thanks!
Kireet
On Friday, July 11, 2014 4:13:52 AM UTC-7, Michael McCandless wrote:
Hmm this is interesting: it looks like, after 30 minutes of inactivity, ES
drops the RAM buffer size for this index to a tiny value (500 KB vs 500 MB
before), but then the IW is closed (I don't think ES does this by default,
but I'm not sure: do you have a client that may have closed the index?).

Then, right after that (~200 msec later), it opens the IndexWriter again and
calls getReader().

That new getReader is expected to be costly (13 seconds) because the IW
must open every segment, vs the more typical case where there was already
an NRT reader open and now, having written a few new segments or completed
a merge or two, it only needs to open those new segments.

So the question is: why is this IW closed and immediately opened again?
Mike McCandless
http://blog.mikemccandless.com
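For concreteness, the two reader-opening paths Mike describes look roughly like this against the Lucene 4.x API (a sketch only; the helper method names are invented, but the two DirectoryReader calls are the real ones):

import java.io.IOException;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;

// Sketch of the cheap vs. expensive reader-opening paths (Lucene 4.x API).
class ReaderReopenSketch {

    // Normal NRT refresh: the old reader and the long-lived writer share
    // already-open segments, so only newly flushed/merged segments are loaded.
    static DirectoryReader refresh(DirectoryReader current, IndexWriter writer)
            throws IOException {
        DirectoryReader newer = DirectoryReader.openIfChanged(current, writer, true);
        return newer != null ? newer : current; // null means nothing changed
    }

    // Right after the writer has been closed and re-created there is no old NRT
    // reader to share with, so all segments (27 of them here) are opened cold.
    static DirectoryReader coldOpen(IndexWriter freshWriter) throws IOException {
        return DirectoryReader.open(freshWriter, true);
    }
}

The cheap path works because the old NRT reader and the still-open writer share segment readers; once the writer has been closed and re-created, that sharing is gone, which matches the 13-second cold getReader in the trace.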
On Thu, Jul 10, 2014 at 8:12 PM, Kireet Reddy <kir...@feedly.com> wrote:

The below log snippet looked strange to me. It occurs after indexing has
stopped for a while. It looks like Elasticsearch closes the index for
writing (?) and that triggers some IndexWriter operation. The getReader
message reports it took 13 seconds, which seems like a long time. Nothing
else is really going on at this point: no indexing, no heavy GC, etc. The
same call for this index on the replica took 2 seconds.

I am trying to follow the code. It seems that since packetCount=0 here,
the prune() method doesn't do anything. The only other thing of
significance between that message and the "return reader ..." message is
the StandardDirectoryReader.open(this, segmentInfos, applyAllDeletes) call,
which unfortunately doesn't log anything. The getReader call on the replica
(on a healthy node) took 2 seconds and happened around the same time. Does
this provide any clues? Like everything else, the getReader calls seem to
take much longer on the loaded node.

[2014-07-10 08:17:58,281][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
BD: prune sis=org.apache.lucene.index.SegmentInfos@39c19aec minGen=0
packetCount=0
[2014-07-10 08:18:11,359][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IW: return reader version=155332
reader=StandardDirectoryReader(segments_la:155332:nrt
_thy(4.8):C7093414/62450:delGen=393 _cfi(4.7):C4517193/37932:delGen=622
_mrx(4.8):C5257709/46074:delGen=509 _zqo(4.8):C4946049/46139:delGen=293
_zte(4.8):C428210/13793:delGen=267 _10jx(4.8):C562710/16328:delGen=266
_11aw(4.8):C554410/16793:delGen=266 _1245(4.8):C670233/18142:delGen=247
_131a(4.8):C742879/18107:delGen=224 _13tl(4.8):C595917/15685:delGen=215
_1aw7(4.8):C517272/3261:delGen=87 _167b(4.8):C97197/12459:delGen=171
_1amh(4.8):C15328/1829:delGen=87 _1a0t(4.8):C9524/1392:delGen=92
_1acr(4.8):C8324/1421:delGen=87 _1atp(4.8):C5689/566:delGen=69
_1avx(4.8):C5077/244:delGen=60 _1avn(4.8):C5561/304:delGen=48
_1emy(4.8):C10448/482:delGen=14 _1csj(4.8):C8313/907:delGen=48
_1bu3(4.8):C3357/896:delGen=61 _1d3d(4.8):C3442/643:delGen=42
_1e5f(4.8):C5228/494:delGen=22 _1ex8(4.8):C2400/220:delGen=7
_1f1e(4.8):C957/50:delGen=2 _1f29(4.8):C187/1:delGen=1 _1f2i(4.8):C147)
[2014-07-10 08:18:11,362][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
DW: elasticsearch[se5int-sv2][scheduler][T#1] finishFullFlush success=true
[2014-07-10 08:18:11,363][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IW: getReader took 13085 msec

Full log for that operation:
[2014-07-10 08:17:57,928][DEBUG][indices.memory ] [se5int-sv2]
marking shard [entry_20140702][0] as inactive (inactive_time[30m]) indexing
wise, setting size to [500kb]
[2014-07-10 08:17:57,929][DEBUG][index.engine.internal ] [se5int-sv2]
[entry_20140702][0] updating index_buffer_size from [512mb] to (inactive)
[500kb]
[2014-07-10 08:17:57,930][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IW: now flush at close waitForMerges=false
[2014-07-10 08:17:57,931][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IW: start flush: applyAllDeletes=true
[2014-07-10 08:17:57,933][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IW: index before flush _thy(4.8):C7093414/62450:delGen=393
_cfi(4.7):C4517193/37932:delGen=622 _mrx(4.8):C5257709/46074:delGen=509
_zqo(4.8):C4946049/46139:delGen=293 _zte(4.8):C428210/13793:delGen=267
_10jx(4.8):C562710/16328:delGen=266 _11aw(4.8):C554410/16793:delGen=266
_1245(4.8):C670233/18142:delGen=247 _131a(4.8):C742879/18107:delGen=224
_13tl(4.8):C595917/15685:delGen=215 _1aw7(4.8):C517272/3261:delGen=87
_167b(4.8):C97197/12459:delGen=171 _1amh(4.8):C15328/1829:delGen=87
_1a0t(4.8):C9524/1392:delGen=92 _1acr(4.8):C8324/1421:delGen=87
_1atp(4.8):C5689/566:delGen=69 _1avx(4.8):C5077/244:delGen=60
_1avn(4.8):C5561/304:delGen=48 _1emy(4.8):C10448/482:delGen=14
_1csj(4.8):C8313/907:delGen=48 _1bu3(4.8):C3357/896:delGen=61
_1d3d(4.8):C3442/643:delGen=42 _1e5f(4.8):C5228/494:delGen=22
_1ex8(4.8):C2400/220:delGen=7 _1f1e(4.8):C957/50:delGen=2
_1f29(4.8):C187/1:delGen=1 _1f2i(4.8):C147
[2014-07-10 08:17:57,934][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
DW: startFullFlush
[2014-07-10 08:17:57,934][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
DW: anyChanges? numDocsInRam=0 deletes=false hasTickets:false
pendingChangesInFullFlush: false
[2014-07-10 08:17:57,936][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
DW: elasticsearch[se5int-sv2][scheduler][T#1] finishFullFlush success=true
[2014-07-10 08:17:57,936][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IW: apply all deletes during flush
[2014-07-10 08:17:57,938][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
BD: applyDeletes: no deletes; skipping
[2014-07-10 08:17:57,938][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
BD: prune sis=org.apache.lucene.index.SegmentInfos@1b8d1c15 minGen=76876
packetCount=0
[2014-07-10 08:17:57,939][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IW: all running merges have aborted
[2014-07-10 08:17:57,940][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IW: now call final commit()
[2014-07-10 08:17:57,941][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IW: commit: start
[2014-07-10 08:17:57,941][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IW: commit: enter lock
[2014-07-10 08:17:57,942][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IW: commit: now prepare
[2014-07-10 08:17:57,942][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IW: prepareCommit: flush
[2014-07-10 08:17:57,943][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IW: index before flush _thy(4.8):C7093414/62450:delGen=393
_cfi(4.7):C4517193/37932:delGen=622 _mrx(4.8):C5257709/46074:delGen=509
_zqo(4.8):C4946049/46139:delGen=293 _zte(4.8):C428210/13793:delGen=267
_10jx(4.8):C562710/16328:delGen=266 _11aw(4.8):C554410/16793:delGen=266
_1245(4.8):C670233/18142:delGen=247 _131a(4.8):C742879/18107:delGen=224
_13tl(4.8):C595917/15685:delGen=215 _1aw7(4.8):C517272/3261:delGen=87
_167b(4.8):C97197/12459:delGen=171 _1amh(4.8):C15328/1829:delGen=87
_1a0t(4.8):C9524/1392:delGen=92 _1acr(4.8):C8324/1421:delGen=87
_1atp(4.8):C5689/566:delGen=69 _1avx(4.8):C5077/244:delGen=60
_1avn(4.8):C5561/304:delGen=48 _1emy(4.8):C10448/482:delGen=14
_1csj(4.8):C8313/907:delGen=48 _1bu3(4.8):C3357/896:delGen=61
_1d3d(4.8):C3442/643:delGen=42 _1e5f(4.8):C5228/494:delGen=22
_1ex8(4.8):C2400/220:delGen=7 _1f1e(4.8):C957/50:delGen=2
_1f29(4.8):C187/1:delGen=1 _1f2i(4.8):C147
[2014-07-10 08:17:57,943][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
DW: startFullFlush
[2014-07-10 08:17:57,944][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
DW: anyChanges? numDocsInRam=0 deletes=false hasTickets:false
pendingChangesInFullFlush: false
[2014-07-10 08:17:57,945][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IW: apply all deletes during flush
[2014-07-10 08:17:57,945][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
BD: applyDeletes: no deletes; skipping
[2014-07-10 08:17:57,946][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
BD: prune sis=org.apache.lucene.index.SegmentInfos@1b8d1c15 minGen=76876
packetCount=0
[2014-07-10 08:17:57,953][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
DW: elasticsearch[se5int-sv2][scheduler][T#1] finishFullFlush success=true
[2014-07-10 08:17:57,953][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IW: startCommit(): start
[2014-07-10 08:17:57,954][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IW: skip startCommit(): no changes pending
[2014-07-10 08:17:57,955][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IW: commit: pendingCommit == null; skip
[2014-07-10 08:17:57,956][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IW: commit: done
[2014-07-10 08:17:57,963][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IW: at close: _thy(4.8):C7093414/62450:delGen=393
_cfi(4.7):C4517193/37932:delGen=622 _mrx(4.8):C5257709/46074:delGen=509
_zqo(4.8):C4946049/46139:delGen=293 _zte(4.8):C428210/13793:delGen=267
_10jx(4.8):C562710/16328:delGen=266 _11aw(4.8):C554410/16793:delGen=266
_1245(4.8):C670233/18142:delGen=247 _131a(4.8):C742879/18107:delGen=224
_13tl(4.8):C595917/15685:delGen=215 _1aw7(4.8):C517272/3261:delGen=87
_167b(4.8):C97197/12459:delGen=171 _1amh(4.8):C15328/1829:delGen=87
_1a0t(4.8):C9524/1392:delGen=92 _1acr(4.8):C8324/1421:delGen=87
_1atp(4.8):C5689/566:delGen=69 _1avx(4.8):C5077/244:delGen=60
_1avn(4.8):C5561/304:delGen=48 _1emy(4.8):C10448/482:delGen=14
_1csj(4.8):C8313/907:delGen=48 _1bu3(4.8):C3357/896:delGen=61
_1d3d(4.8):C3442/643:delGen=42 _1e5f(4.8):C5228/494:delGen=22
_1ex8(4.8):C2400/220:delGen=7 _1f1e(4.8):C957/50:delGen=2
_1f29(4.8):C187/1:delGen=1 _1f2i(4.8):C147
[2014-07-10 08:17:58,157][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IFD: init: current segments file is "segments_la";
deletionPolicy=org.elasticsearch.index.deletionpolicy.SnapshotDeletionPolicy@667b43fc
[2014-07-10 08:17:58,161][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IFD: init: load commit "segments_la"
[2014-07-10 08:17:58,215][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IFD: init: removing unreferenced file "_1ex8_6.del"
[2014-07-10 08:17:58,216][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IFD: delete "_1ex8_6.del"
[2014-07-10 08:17:58,219][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IFD: init: removing unreferenced file "_thy_aw.del"
[2014-07-10 08:17:58,220][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IFD: delete "_thy_aw.del"
[2014-07-10 08:17:58,236][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IFD: init: removing unreferenced file "_1aw7_2e.del"
[2014-07-10 08:17:58,237][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IFD: delete "_1aw7_2e.del"
[2014-07-10 08:17:58,240][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IFD: init: removing unreferenced file "_1emy_d.del"
[2014-07-10 08:17:58,241][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IFD: delete "_1emy_d.del"
[2014-07-10 08:17:58,242][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IFD: init: removing unreferenced file "_cfi_h9.del"
[2014-07-10 08:17:58,243][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IFD: delete "_cfi_h9.del"
[2014-07-10 08:17:58,251][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IFD: init: removing unreferenced file "_mrx_e4.del"
[2014-07-10 08:17:58,252][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IFD: delete "_mrx_e4.del"
[2014-07-10 08:17:58,262][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IFD: init: removing unreferenced file "_1f1e_1.del"
[2014-07-10 08:17:58,263][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IFD: delete "_1f1e_1.del"
[2014-07-10 08:17:58,266][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IFD: now checkpoint "_thy(4.8):C7093414/62450:delGen=393
_cfi(4.7):C4517193/37932:delGen=622 _mrx(4.8):C5257709/46074:delGen=509
_zqo(4.8):C4946049/46139:delGen=293 _zte(4.8):C428210/13793:delGen=267
_10jx(4.8):C562710/16328:delGen=266 _11aw(4.8):C554410/16793:delGen=266
_1245(4.8):C670233/18142:delGen=247 _131a(4.8):C742879/18107:delGen=224
_13tl(4.8):C595917/15685:delGen=215 _1aw7(4.8):C517272/3261:delGen=87
_167b(4.8):C97197/12459:delGen=171 _1amh(4.8):C15328/1829:delGen=87
_1a0t(4.8):C9524/1392:delGen=92 _1acr(4.8):C8324/1421:delGen=87
_1atp(4.8):C5689/566:delGen=69 _1avx(4.8):C5077/244:delGen=60
_1avn(4.8):C5561/304:delGen=48 _1emy(4.8):C10448/482:delGen=14
_1csj(4.8):C8313/907:delGen=48 _1bu3(4.8):C3357/896:delGen=61
_1d3d(4.8):C3442/643:delGen=42 _1e5f(4.8):C5228/494:delGen=22
_1ex8(4.8):C2400/220:delGen=7 _1f1e(4.8):C957/50:delGen=2
_1f29(4.8):C187/1:delGen=1 _1f2i(4.8):C147" [27 segments ; isCommit = false]
[2014-07-10 08:17:58,272][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IFD: 7 msec to checkpoint
[2014-07-10 08:17:58,273][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IW: init: create=false
[2014-07-10 08:17:58,277][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IW:
[2014-07-10 08:17:58,278][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IW: flush at getReader
[2014-07-10 08:17:58,279][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
DW: startFullFlush
[2014-07-10 08:17:58,279][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
DW: anyChanges? numDocsInRam=0 deletes=false hasTickets:false
pendingChangesInFullFlush: false
[2014-07-10 08:17:58,280][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IW: apply all deletes during flush
[2014-07-10 08:17:58,280][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
BD: applyDeletes: no deletes; skipping
[2014-07-10 08:17:58,281][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
BD: prune sis=org.apache.lucene.index.SegmentInfos@39c19aec minGen=0
packetCount=0
[2014-07-10 08:18:11,359][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IW: return reader version=155332
reader=StandardDirectoryReader(segments_la:155332:nrt
_thy(4.8):C7093414/62450:delGen=393 _cfi(4.7):C4517193/37932:delGen=622
_mrx(4.8):C5257709/46074:delGen=509 _zqo(4.8):C4946049/46139:delGen=293
_zte(4.8):C428210/13793:delGen=267 _10jx(4.8):C562710/16328:delGen=266
_11aw(4.8):C554410/16793:delGen=266 _1245(4.8):C670233/18142:delGen=247
_131a(4.8):C742879/18107:delGen=224 _13tl(4.8):C595917/15685:delGen=215
_1aw7(4.8):C517272/3261:delGen=87 _167b(4.8):C97197/12459:delGen=171
_1amh(4.8):C15328/1829:delGen=87 _1a0t(4.8):C9524/1392:delGen=92
_1acr(4.8):C8324/1421:delGen=87 _1atp(4.8):C5689/566:delGen=69
_1avx(4.8):C5077/244:delGen=60 _1avn(4.8):C5561/304:delGen=48
_1emy(4.8):C10448/482:delGen=14 _1csj(4.8):C8313/907:delGen=48
_1bu3(4.8):C3357/896:delGen=61 _1d3d(4.8):C3442/643:delGen=42
_1e5f(4.8):C5228/494:delGen=22 _1ex8(4.8):C2400/220:delGen=7
_1f1e(4.8):C957/50:delGen=2 _1f29(4.8):C187/1:delGen=1 _1f2i(4.8):C147)
[2014-07-10 08:18:11,362][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
DW: elasticsearch[se5int-sv2][scheduler][T#1] finishFullFlush success=true
[2014-07-10 08:18:11,363][TRACE][lucene.iw ]
[se5int-sv2][entry_20140702][0] elasticsearch[se5int-sv2][scheduler][T#1]
IW: getReader took 13085 msec

On Thursday, July 10, 2014 12:28:32 PM UTC-7, Kireet Reddy wrote:
We had an unrelated issue this morning where we had to stop indexing for
a while. The problematic node remained busy for an hour afterwards, but then
became idle. We then restarted indexing and the node was fine. We didn't
restart the JVM; it seemed like something recovered and the node was able
to operate normally afterwards. It's very strange. I have copied the logs
for the time period here
https://www.dropbox.com/s/gb6xfxldryq80il/nodelogs.zip (has log for
problematic node5 and "normal" node6). Indexing stopped around 7:22, but
the node didn't seem to recover until 8:25. Unfortunately I don't have any
hot threads output from that time. I went through all our metrics; pretty
much all metric activity stops at 7:25 (see graphs). The only thing that
didn't was the merge total docs counter. I don't know why this would ever
decrease, but I have seen decreases in the past as well.

Not sure what to try at this point. The "custom" things I do are:
- language-specific analysis using packaged tokenizers/filters and the ICU, smart-cn, kuromoji, and stempel plugins
- eager field data loading turned on for a long field. However, the field data cache size seems OK; it was around 600MB and the breaker size is 3.6GB. Also, the size seemed to be around 100MB the last time the problem happened, so it doesn't seem to be related.
- a fair number of different indices (we use the index-per-time-period pattern)
- create/update documents via the bulk endpoint
Everything else seems pretty much by the book. It seems like I could stop
using the language-specific analyzers or stop the eager field data loading,
but both of those seem unlikely to be the issue to me. If it were an
analyzer, this should happen for the replica index as well. If it were field
data, the problem shouldn't go away unless we reduce the data size.

Answers to your questions inline below. Thanks for the help! It's very
much appreciated. This is a really tough one...
On Thursday, July 10, 2014 2:59:40 AM UTC-7, Michael McCandless wrote:

Another question: have you disabled merge throttling? And which version
of ES are you using?

We are using 1.2.1 (we upgraded to get the infoStream logs). Merge
throttling is enabled with the default settings but doesn't seem to be an
issue (see graph).

Mike McCandless
http://blog.mikemccandless.com
On Thu, Jul 10, 2014 at 5:49 AM, Michael McCandless <mi...@elasticsearch.com> wrote:

Indeed the hot threads on node5 didn't reveal anything unexpected: they
are busy merging and refreshing. One thread was in SmartChineseAnalyzer...

Very strange that young gen GC is suddenly so costly. This is much more
time spent in GC than you ever see on healthy nodes?

Yes, I see monitor.jvm lines in the log like:
[gc][young][593516][88318] duration [2.2s], collections [1]/[11.8s],
total [2.2s]/[3.6h], memory [17.2gb]->[16.8gb]/[31.9gb], all_pools {[young]
[631.6mb]->[61.6mb]/[665.6mb]}{[survivor] [83.1mb]->[83.1mb]/[83.1mb]}{[old]
[16.5gb]->[16.7gb]/[31.1gb]}
It's always the young gen, and seems to be every 5-10 seconds, with
duration being a few hundred ms to 3 seconds. I don't see any lines in the
log files of the other nodes. I changed the config so that I get debug lines
for all collections instead of just the slow ones. I'll try to confirm that
the collections are just slower, not more frequent, on the loaded node.

If you restart the JVM on that node, does the problem immediately start
happening again? (Hmm: does restarting the JVM cause the others to start
building new replicas... which would mess up this experiment). Separately,
if you reboot the node and bring it back up, does the problem start
happening again?

Stopping indexing for an hour seemed to "fix" the issue without a JVM
restart. We also just restarted our cluster to add the extra GC logging;
things are fine thus far.

Does vmstat/iostat reveal anything? E.g. is the OS doing a lot of
swapping or something? Do you mlockall your JVM heap (bootstrap.mlockall)?
What is swappiness set to?

mlockall is true and there is no swap used. Everything seems normal at the
OS level: low I/O, 50% CPU usage, acceptable file descriptor counts, etc.

Which Java / OS versions are you using?

Java: 1.7.0_55
OS: Linux (CentOS 6.4, 64-bit)

Mike McCandless
http://blog.mikemccandless.com
On Wed, Jul 9, 2014 at 5:13 PM, Kireet Reddy kir...@feedly.com wrote:
Sorry, here it is:
[Dropbox link]
Also a couple of graphs of the memory usage.
On Wednesday, July 9, 2014 2:10:49 PM UTC-7, Michael McCandless wrote:
Hmm link doesn't seem to work?
Mike McCandless
http://blog.mikemccandless.com
On Wed, Jul 9, 2014 at 4:06 PM, Kireet Reddy kir...@feedly.com wrote:
Sorry, forgot the link
[Dropbox link: 1229.txt]

On Wednesday, July 9, 2014 1:05:56 PM UTC-7, Kireet Reddy wrote:
The problem is happening again, this time on node 5. I have captured a few
hot thread requests here. I also included one from node 6 (which is now
fine).

There are merge-related stacks, but it seems like everything is taking a
lot more CPU than usual. I did a few type=wait and type=block dumps and the
result was always 0% usage there. Also, young gen GC activity has again
gone crazy (old gen/heap size seems fine). Would hot thread measurements be
distorted if GC activity is very high?

It seems strange to me that this would only happen on one node while we
have replicas set to at least 1 for all our indices. It seems like the
problems should happen on a couple of nodes simultaneously.

--Kireet
On Monday, July 7, 2014 3:51:35 PM UTC-7, Michael McCandless wrote:
Could you pull all hot threads next time the problem happens?
Mike McCandless
http://blog.mikemccandless.com
On Mon, Jul 7, 2014 at 3:47 PM, Kireet Reddy kir...@feedly.com wrote:
All that seems correct (except I think this is for node 6, not node 5).
We don't delete documents, but we do some updates. The vast majority of
documents get indexed into the large shards, but the smaller ones take some
writes as well.

We aren't using virtualized hardware, and Elasticsearch is the only thing
running on the machines: no scheduled jobs, etc. I don't think something is
interfering; in fact, the overall disk I/O rate and operation count on the
machine go down quite a bit during the problematic period, which is
consistent with your observations about things taking longer.

I went back and checked all our collected metrics again. I noticed that
even though the heap usage and GC count seem smooth during the period in
question, GC time spent goes way up. The number of active indexing threads
also goes up, but since our ingest rate didn't go up, I assumed this was a
side effect. During a previous occurrence a few days ago on node5, I stopped
all indexing activity for 15 minutes. Active merges and indexing requests
went to zero as expected. Then I re-enabled indexing, and the elevated
CPU/GC/active merges immediately went back up to the problematic rates.

Overall it's pretty confusing to me what is a symptom vs. a root cause
here. A summary of what I think I know:
- Every few days, CPU usage on a node goes way above the other nodes and doesn't recover. We've let the node run in the elevated-CPU state for a day with no improvement.
- It doesn't seem likely that it's data related. We use replicas=1 and no other nodes have issues.
- It doesn't seem hardware related. We run on dedicated hardware with Elasticsearch being the only thing running. Also, the problem appears on various nodes, and machine load seems tied directly to the Elasticsearch process.
- During the problematic period: CPU usage, active merge threads, active bulk (indexing) threads, and GC time are elevated.
- During the problematic period: I/O ops and I/O throughput decrease.
- Overall heap usage seems to increase smoothly; the extra GC time seems to be spent on the new gen. Interestingly, the GC count didn't seem to increase.
- In the hours beforehand, GC behavior of the problematic node was similar to the other nodes.
- If I pause indexing, machine load quickly returns to normal and merges and indexing requests complete. If I then restart indexing, the problem recurs immediately.
- If I disable automatic refreshes, the problem disappears within an hour or so.
- Hot threads show merging activity as the hot threads.

The first few points make me think the increased active merges is perhaps
a side effect, but the last three make me think merging is the root cause.
The only additional things I can think of that may be relevant are:
- Our documents can vary greatly in size; they average a couple of KB but can, on rare occasions, be several MB.
- We do use language analysis plugins; perhaps one of these is acting up?
- We eagerly load one field into the field data cache. But the cache size is OK and the overall heap behavior is OK, so I don't think this is the problem.

That's a lot of information, but I am not sure where to go next from here...

On Monday, July 7, 2014 8:23:20 AM UTC-7, Michael McCandless wrote:
Indeed there are no big merges during that time ...
I can see that on node5, at ~14:45, merges suddenly start taking a long
time, refresh is taking much longer (4-5 seconds instead of < 0.4 sec),
commit time goes up from < 0.5 sec to ~1-2 sec, etc., but other metrics are
fine, e.g. total merging GB and the number of commits/refreshes are very low
during this time.

Each node has 2 biggish (~17 GB) shards and then ~50 tiny shards. The
biggish shards are indexing at a very slow rate and only have ~1%
deletions. Are you explicitly deleting docs?

I suspect something is suddenly cutting into the IO perf of this box, and
because merging/refreshing is so IO intensive, it causes these operations
to run slower / backlog.

Are there any scheduled jobs, e.g. backups/snapshots, that start up? Are
you running on virtualized hardware?

Mike McCandless
http://blog.mikemccandless.com
On Sun, Jul 6, 2014 at 8:23 PM, Kireet Reddy kir...@feedly.com wrote:
Just to reiterate, the problematic period is from 07/05 14:45 to 07/06
02:10. I included a couple of hours before and after in the logs.

On Sunday, July 6, 2014 5:17:06 PM UTC-7, Kireet Reddy wrote:
They are linked below (node5 is the log of the normal node, node6 is the
log of the problematic node).

I don't think it was doing big merges; otherwise, during the high-load
period, the merges graph line would have had a "floor" > 0, similar to the
time period after I disabled refresh. We don't do routing and use mostly
default settings. I think the only settings we changed are
indices.memory.index_buffer_si...
...