Why do the docs.count and docs.deleted of already-generated segments change?

Background: In version 7.6.0 of ES, an external client is continuously executing update_by_query on an index.

Observation: While this is running, /_cat/segments shows that the docs.count and docs.deleted of many existing segments in the index keep changing.

For example, the segment information is as follows:

 segment generation docs.count docs.deleted     size size.memory committed searchable version compound
 _mqg         29464    2683624       802282    2.5gb     2119594 true      true       8.4.0   false
 _oxd         32305    1250591       632447    1.3gb     1138511 true      true       8.4.0   false
 _oo5         31973    1434271       472773    1.3gb     1158523 true      true       8.4.0   false
 _v6k         40412    1209509       320023    1.1gb      891519 true      true       8.4.0   false
 _1w5          2453     539240       284964  629.9mb      584760 true      true       8.4.0   false
 _v0q         40202     982360       266487  929.5mb      733621 true      true       8.4.0   false
 _20b          2603    1367294       225623    1.1gb     1019214 true      true       8.4.0   false
 _bu7         15343    1144383       210547 1007.5mb      846393 true      true       8.4.0   false
 _733          9183    1875493       166523    1.5gb     1323250 true      true       8.4.0   false

After a few seconds, the segment information is as follows:

 segment generation docs.count docs.deleted     size size.memory committed searchable version compound
 _mqg         29464    2683135       802771    2.5gb     2119594 true      true       8.4.0   false
 _oxd         32305    1250591       632447    1.3gb     1138511 true      true       8.4.0   false
 _oo5         31973    1434271       472773    1.3gb     1158523 true      true       8.4.0   false
 _v6k         40412    1208615       320917    1.1gb      891519 true      true       8.4.0   false
 _1w5          2453     537834       286370  629.9mb      584760 true      true       8.4.0   false
 _v0q         40202     973870       274977  929.5mb      733621 true      true       8.4.0   false
 _20b          2603    1361957       230960    1.1gb     1019214 true      true       8.4.0   false
 _bu7         15343    1144383       210547 1007.5mb      846393 true      true       8.4.0   false
 _733          9183    1870996       171020    1.5gb     1323250 true      true       8.4.0   false

The docs.count and docs.deleted of segments such as _mqg, _v6k, _1w5, etc. have changed.

Question: As I understand it, a Lucene segment is immutable once it has been flushed, and new data should go into a new segment.
Why, then, do the counts of existing segments change?

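Each updated document is written as a new document in a new segment. At the same time, the old version is marked as deleted in its old segment, so it shows up there as a delete: docs.count drops and docs.deleted rises by the same amount. The segment's data files stay immutable; only the record of which documents are still live changes. When a merge occurs, deleted documents are removed.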
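To make the bookkeeping concrete, here is a toy Python model. It is only a sketch of the idea, not Lucene's actual implementation: `Segment`, `live`, and `update` are made-up names, and real segments store far more than (id, source) pairs. The point it illustrates is that the stored documents of the old segment never change, while its liveness flags do.

```python
class Segment:
    """Toy stand-in for a segment: stored docs plus one liveness flag per doc."""

    def __init__(self, name, docs=()):
        self.name = name
        self.docs = list(docs)               # (doc_id, source) pairs
        self.live = [True] * len(self.docs)  # the only part that changes on delete

    @property
    def docs_count(self):
        return sum(self.live)

    @property
    def docs_deleted(self):
        return len(self.live) - sum(self.live)

    def add(self, doc_id, source):
        self.docs.append((doc_id, source))
        self.live.append(True)


def update(doc_id, new_source, segments, target):
    """Mark every live copy of doc_id as deleted, then write the new version to target."""
    for seg in segments:
        for i, (existing_id, _) in enumerate(seg.docs):
            if existing_id == doc_id and seg.live[i]:
                seg.live[i] = False          # stored doc is untouched, only the flag flips
    target.add(doc_id, new_source)


old = Segment("_old", [(1, "a"), (2, "b"), (3, "c")])  # hypothetical flushed segment
new = Segment("_new")                                   # hypothetical segment receiving writes

update(2, "b2", [old, new], new)

print(old.name, old.docs_count, old.docs_deleted)  # _old 2 1
print(new.name, new.docs_count, new.docs_deleted)  # _new 1 0
```

In this model `old.docs_count + old.docs_deleted` stays at 3 no matter how many updates hit the segment, which matches the constant per-segment totals in the listings above.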


Interestingly, in this specific case, where f1 is the first output pasted above and f2 the second:

% awk 'NR>1{count+=$3;deleted+=$4} END{print count,deleted,count+deleted}' f1
12486765 3381669 15868434

% awk 'NR>1{count+=$3;deleted+=$4} END{print count,deleted,count+deleted}' f2
12465652 3402782 15868434

So docs.count + docs.deleted has remained constant at the index level (assuming no other segments were part of the index). The same holds on a per-segment basis:

% awk '{sum = ($3+$4 > 0) ? $3+$4 : "docs+deleted"; printf "%2s %8s %16s %16s %16s\n",FILENAME,$1,$3,$4,sum}' f1 f2 | sort -s -rk2.2
f1  segment       docs.count     docs.deleted     docs+deleted
f2  segment       docs.count     docs.deleted     docs+deleted
f1     _v6k          1209509           320023          1529532
f2     _v6k          1208615           320917          1529532
f1     _v0q           982360           266487          1248847
f2     _v0q           973870           274977          1248847
f1     _oxd          1250591           632447          1883038
f2     _oxd          1250591           632447          1883038
f1     _oo5          1434271           472773          1907044
f2     _oo5          1434271           472773          1907044
f1     _mqg          2683624           802282          3485906
f2     _mqg          2683135           802771          3485906
f1     _bu7          1144383           210547          1354930
f2     _bu7          1144383           210547          1354930
f1     _733          1875493           166523          2042016
f2     _733          1870996           171020          2042016
f1     _20b          1367294           225623          1592917
f2     _20b          1361957           230960          1592917
f1     _1w5           539240           284964           824204
f2     _1w5           537834           286370           824204
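The same comparison can be sketched in Python to show how many documents moved from docs.count to docs.deleted in each segment between the two snapshots. The three rows embedded below are copied from the listings above and trimmed to the first four columns; `parse` and `newly_deleted` are illustrative helper names, and the sketch assumes both snapshots contain the same segments.

```python
F1 = """\
segment generation docs.count docs.deleted
_mqg 29464 2683624 802282
_oxd 32305 1250591 632447
_v6k 40412 1209509 320023
"""

F2 = """\
segment generation docs.count docs.deleted
_mqg 29464 2683135 802771
_oxd 32305 1250591 632447
_v6k 40412 1208615 320917
"""


def parse(listing):
    """Map segment name -> (docs.count, docs.deleted), skipping the header row."""
    rows = {}
    for line in listing.splitlines()[1:]:
        cols = line.split()
        rows[cols[0]] = (int(cols[2]), int(cols[3]))
    return rows


def newly_deleted(before, after):
    """Per segment, return how many docs became deleted between the two snapshots."""
    moved = {}
    for name, (count0, deleted0) in before.items():
        count1, deleted1 = after[name]
        if count0 + deleted0 != count1 + deleted1:
            raise ValueError(f"total changed for {name}; it was probably merged")
        moved[name] = deleted1 - deleted0
    return moved


moved = newly_deleted(parse(F1), parse(F2))
print(moved)  # {'_mqg': 489, '_oxd': 0, '_v6k': 894}
```

A segment with a delta of 0, such as _oxd here, simply had none of its documents updated in the interval.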

I hadn't thought of using this method for statistics, good idea.

No problem, awk is fantastic for such little things. If you are me at least. :face_with_raised_eyebrow: