Confused about Segments: Searchable vs Committed vs Uncommitted

Thanks for the help Simon! I've spent all morning digging through the ES
source code (segments method in RobinEngine.java, https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/elasticsearch/index/engine/robin/RobinEngine.java#L1161)
trying to fully wrap my head around what's going on. Can you confirm that
my logic is now correct?

  1. Foreach segment in NRT IndexReader (because SearcherManager was built
    with an IndexWriter instead of Directory)
    • segment.search = true
    • Add to segments
  2. Foreach segment in lastCommittedSegmentInfos
    • If not in segments
      • search = false
      • committed = true
      • Add to segments
    • Else
      • committed = true
      • (and search = true because it was set in step 1)

Step 1 finds all segments that exist in the NRT IndexReader, marks them
{search:true}. Step 2 finds all segments at the last Commit point, and if
they are not in our segment list yet mark them {search:false,
committed:true}. If they do exist, they are both in the IndexReader and
committed, so mark both search/committed as true.
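
To double-check my reading, here is a rough Python sketch of those two steps. It is only an illustration of the logic as I understand it, not the actual RobinEngine code: `merge_segment_flags` and its arguments are hypothetical names, and segments are reduced to plain name strings.

```python
def merge_segment_flags(reader_segments, committed_segments):
    """Combine NRT-reader segments and last-commit segments into one report.

    Both arguments are iterables of segment names (an assumption made for
    this sketch; the real code works with Lucene segment objects).
    """
    segments = {}

    # Step 1: every segment visible to the NRT IndexReader is search=True.
    for name in reader_segments:
        segments[name] = {"search": True, "committed": False}

    # Step 2: every segment in the last commit point is committed=True.
    for name in committed_segments:
        if name not in segments:
            # Committed, but the current reader does not see it.
            segments[name] = {"search": False, "committed": True}
        else:
            # Already seen in step 1, so search stays True.
            segments[name]["committed"] = True

    return segments


report = merge_segment_flags(["_1", "_2", "_3"], ["_0", "_1", "_2"])
for name in sorted(report):
    print(name, report[name])
```

With this made-up input, `_0` comes out committed but not searchable, `_3` searchable but not committed, and `_1`/`_2` both.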

I can see why "Searchable" is a misnomer then. What it really refers to is
whether the segment is represented in the IndexReader yet. Committed
simply refers to the "Lucene Committed" status.

Thanks for the help!
-Zach

On Wednesday, February 20, 2013 5:50:37 PM UTC-5, simonw wrote:

hey,

I didn't go through all your graphs but let me explain how this works on a
lower level in lucene... so an index (lucene index) consists of multiple
segments. Segments are written by flushing ram buffers to disk (lucene
flush not ES) or by merge processes. Now if you commit a lucene index, lucene:

  1. flushes everything to disk,
  2. writes a commit point (listing all segments belonging to this commit),
  3. calls fsync.

If you open a new IndexReader on this commit, all its segments are
"searchable". Now ES uses a feature called NRT (near realtime) that is
similar to a commit since it flushes to disk (ES refresh), but it doesn't
fsync nor does it write a commit point. You can open an NRT Reader on top of
an uncommitted index so those segments can be searchable (not sure if I like
this term). I think the "uncommitted" part corresponds to not yet flushed
into a segment on the lucene level, which means it's still in memory and
written to the translog.
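
A toy model may make the difference between a refresh and a commit easier to see. The Python sketch below is purely illustrative: `ToyIndex` and all of its members are hypothetical, it ignores merges, the translog and fsync, and it only tracks which segment names the current reader sees versus which ones the last commit point lists.

```python
class ToyIndex:
    """Very rough stand-in for a lucene index (illustration only)."""

    def __init__(self):
        self.buffer = []           # docs indexed but not yet in any segment
        self.segments = []         # names of segments written so far
        self.reader_view = set()   # segments visible to the current reader
        self.commit_point = set()  # segments listed in the last commit point
        self._next_id = 0

    def index(self, doc):
        self.buffer.append(doc)

    def _flush_buffer(self):
        # Turn buffered docs into a new segment, if there are any.
        if self.buffer:
            self.segments.append("_%d" % self._next_id)
            self._next_id += 1
            self.buffer = []

    def refresh(self):
        # NRT-style reopen: new segment becomes visible, no commit point.
        self._flush_buffer()
        self.reader_view = set(self.segments)

    def commit(self):
        # Commit: new segment is listed in a commit point, reader untouched.
        self._flush_buffer()
        self.commit_point = set(self.segments)

    def report(self):
        names = self.reader_view | self.commit_point
        return {
            n: {"search": n in self.reader_view,
                "committed": n in self.commit_point}
            for n in sorted(names)
        }


idx = ToyIndex()
idx.index("a")
idx.commit()
idx.refresh()
idx.index("b")
idx.refresh()
after_refresh = idx.report()   # _1 is visible but not committed
idx.index("c")
idx.commit()
after_commit = idx.report()    # _2 is committed but not yet visible
print(after_refresh)
print(after_commit)
```

In this model a refresh without a commit gives a segment that is searchable but uncommitted, and a commit without a reader reopen gives one that is committed but not searchable.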

hope this clarifies it a bit.

simon

On Wednesday, February 20, 2013 5:09:00 PM UTC+1, Zachary Tong wrote:

One-week-later bump to see if anyone can clear this up for me =)

Cheers,
-Zach

On Thursday, February 14, 2013 9:19:57 AM UTC-5, Zachary Tong wrote:

Hmm, perhaps I'm misunderstanding the results of the Segments API. I
assumed that:

search: true && committed: true == Searchable
search: false && committed: true == Committed
search: false && committed: false == Uncommitted
search: true && committed: false == Uncommitted (although I'm not sure
this case ever happens...searchable but not committed?)
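
In code, the mapping I assumed looks roughly like this (`label_segment` is a hypothetical helper written for illustration; it only encodes my assumption above, not what ES actually means by these flags):

```python
def label_segment(search, committed):
    """Map the two Segments API flags to the label I assumed for them."""
    if search and committed:
        return "Searchable"
    if committed:
        return "Committed"
    # Covers both search=False and the odd search=True, committed=False case.
    return "Uncommitted"


for flags in [(True, True), (False, True), (False, False), (True, False)]:
    print(flags, label_segment(*flags))
```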

-Zach

On Thursday, February 14, 2013 9:12:55 AM UTC-5, Zachary Tong wrote:

I'll preface this question by saying it is purely academic. I thought
these terms meant one thing, but upon watching a live index I'm no longer
sure. Are the following definitions correct?

  • Searchable: Segment is on disk as a Lucene segment and is marked
    as searchable. A segment is marked searchable by the periodic
    refresh_interval or by the Refresh API.
  • Committed: Segment is on disk as a Lucene segment, but has not
    been marked as searchable yet. A segment is committed to disk when the
    translog defaults are reached (5000 ops, 200mb or 30min, whichever comes
    first), or by the Flush API.
  • Uncommitted: Segment (or operations?) live only in the translog
    and have not been written to a Lucene segment yet.

With those definitions in mind, I started looking at a live index
(default settings) and was surprised to see something like this:

https://lh3.googleusercontent.com/-k0UHo6tSpCM/URzuKlNIcMI/AAAAAAAABEQ/IOLWzmm114k/s1600/segments1.PNG

I verified these graphs with the raw Segments API, to make sure it
wasn't my plugin that was being odd. The presence of very large
Uncommitted segments (1 million docs) that are very long lived (_130 at the
far left was very old and persistent) confuses me. Ditto for Committed
segments...shouldn't those be changed to search:true every second under
default settings?

I created a simpler index with 1 shard, 0 replicas and repeated the
experiment:

https://lh6.googleusercontent.com/-rLao-if5yGM/URzu4HHtJZI/AAAAAAAABEY/S1EBOsI_fMQ/s1600/segments_refreshinterval-1s.PNG

The results are similar, where there are relatively large segments
(>5000 translog limit) that remain uncommitted. This is under heavy
indexing load from a JMeter benchmark. If I stop the indexing, this index
will eventually switch over to fully Searchable. Even stranger, if I add a
replica while continuing to index, the segments are all marked searchable
as soon as the replica is initialized:

https://lh6.googleusercontent.com/-M_UjylMUvWg/URzverP3DwI/AAAAAAAABEo/7QlyLE05x0I/s1600/segments_refreshinterval-1s.replica.PNG

After seeing these graphs, I'm convinced I have no idea what a
Searchable/Committed/Uncommitted segment actually is. Could someone shed
some light on where I'm misunderstanding? Thanks!
-Zach
