Merge policy and segments count

Hi

I am setting up a 6.5.x cluster and need to index a few million docs for it.

I have disabled index refresh during initial indexing. All indices have 5 shards.
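
For reference, disabling refresh for a bulk load and restoring it afterwards comes down to two request bodies for the index settings API (`PUT /<index>/_settings`). This is a minimal sketch that only builds the bodies; the helper name `refresh_settings` is made up for illustration, and it assumes that sending `null` resets the setting to its default.

```python
import json

def refresh_settings(interval):
    """Build a body for PUT /<index>/_settings; None becomes JSON null."""
    return json.dumps({"index": {"refresh_interval": interval}})

# Before the bulk load: disable periodic refresh.
disable_body = refresh_settings("-1")
# After the bulk load: null asks Elasticsearch to fall back to the default.
restore_body = refresh_settings(None)

print(disable_body)
print(restore_body)
```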

Question: when doing the initial indexing without refresh, are these segments sized well enough?
Or are there some rules of thumb, for example:

  • "single segment should not be greater than X"
  • "or segment count per shard should not be greater than Y"
  • "or are there some ratios.."

This is how it looks at the moment. Bear in mind that I am currently doing the initial indexing into the index named index7. The other indices are going to stay pretty much the same size as they are now.

curl -sXGET `hostname`':9200/_cat/indices?v&s=health,status,index:desc'
health status index     pri rep docs.count docs.deleted store.size pri.store.size
green  open   index1    5   1      23907            0      2.9gb          1.4gb
green  open   index2    5   1        769            0    548.2kb          274kb
green  open   index3    5   1          1            0    394.3kb        197.1kb
green  open   index4    5   1       1259            0      1.4mb        760.3kb
green  open   index5    5   1      13372            0     13.5mb          6.7mb
green  open   index6    5   1     533808            0      105gb         52.5gb
green  open   index7    5   0    1953725         2374      3.3gb          3.3gb <-- currently indexing this, total would be 5gb per node, so with replica, it sums up to 10gb, refresh_interval = -1 at the moment

The default merge policy is as follows for all the indices:

  "merge": {
    "scheduler": {
      "max_thread_count": "1",
      "auto_throttle": "true",
      "max_merge_count": "6"
    },
    "policy": {
      "reclaim_deletes_weight": "2.0",
      "floor_segment": "2mb",
      "max_merge_at_once_explicit": "30",
      "max_merge_at_once": "10",
      "max_merged_segment": "5gb",
      "expunge_deletes_allowed": "10.0",
      "segments_per_tier": "10.0",
      "deletes_pct_allowed": "33.0"
    }
  }

Should I change anything in the merge policy for some of these indices?
What I mean is that some indices are going to be quite small compared to others, so the default policy will probably not work well for all of them.

Regards
Raul

Also,

how would it influence search and indexing operations if I have:

  • too many very small segments?
  • too few very big segments?

And what would be the optimal segment count for the previously listed indices (1-7)?

Thanks.

Regards
Raul

I think you probably want fewer shards. "A few million docs" is not very many. I would have thought that one shard per index would be fine.

Almost certainly not, and certainly not without experiments showing that any changes make a reliable improvement. The defaults are pretty good, and it's easy to make things much worse.

... I would have thought that one shard per index would be fine.

Even for the one that has 50+gb of data?

Also, any thoughts about this?

How would it influence search and indexing operations if I have:

  • too many very small segments?
  • too few very big segments?

And what would be the optimal segment count for the previously listed indices (1-7)?

Thanks.

Regards
Raul

You must benchmark your system and workload to be sure, but 50GB is a reasonable size for a shard, and 5*10GB shards sounds like too many too-small shards.
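
The shard-size arithmetic behind this can be checked against the `_cat/indices` listing earlier in the thread. A small sketch, assuming data is spread evenly across primaries; the function name is illustrative only.

```python
def avg_shard_size_gb(pri_store_size_gb, primary_shards):
    """Average primary shard size, assuming an even spread across shards."""
    return pri_store_size_gb / primary_shards

# index6 has pri.store.size = 52.5gb in the listing above.
print(avg_shard_size_gb(52.5, 5))  # as currently configured (5 primaries)
print(avg_shard_size_gb(52.5, 1))  # with a single primary shard
```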

Maybe this article will help: "How many shards should I have in my Elasticsearch cluster?" (Elastic Blog)

My main thought is that I don't understand your level of concern at this stage in your project. This kind of low-level fine-tuning is only really feasible with careful experimentation with realistic workloads to show that you can make a significant improvement. Even then, simply going from 1 replica to 2 will give you 50% more search power, so that's the first thing I'd try if I had proved that I needed more performance.
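
The 50% figure follows from counting shard copies. A sketch of that arithmetic, under the assumption that search capacity scales linearly with the number of copies and that there are enough nodes to host them; the function name is hypothetical.

```python
def relative_search_capacity(old_replicas, new_replicas):
    """Ratio of shard copies after vs. before changing the replica count."""
    # Every shard has one primary plus its replicas, and any copy can serve a search.
    return (1 + new_replicas) / (1 + old_replicas)

# Going from 1 replica (2 copies) to 2 replicas (3 copies).
print(relative_search_capacity(1, 2))
```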

Ok, thanks, I will try with 1 shard then.
But for the 50-60gb index, what do you think the average segment count per shard would be (with one shard) under the default policy settings?

Also, I am having difficulty understanding the merge policy settings. Would it be possible to explain in a few words what they actually mean? I did not find good documentation for them.

  "merge": {
    "scheduler": {
      "max_thread_count": "1",
      "auto_throttle": "true",
      "max_merge_count": "6"
    },
    "policy": {
      "reclaim_deletes_weight": "2.0",
      "floor_segment": "2mb",
      "max_merge_at_once_explicit": "30",
      "max_merge_at_once": "10",
      "max_merged_segment": "5gb",
      "expunge_deletes_allowed": "10.0",
      "segments_per_tier": "10.0",
      "deletes_pct_allowed": "33.0"
    }
  }
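
One rough way to reason about four of the policy settings above (`floor_segment`, `segments_per_tier`, `max_merge_at_once`, `max_merged_segment`) is as a budget of segments arranged in tiers: each tier tolerates about `segments_per_tier` segments, and each tier's segments are about `max_merge_at_once` times larger than the previous tier's, capped at `max_merged_segment`. The sketch below is a simplified model loosely based on the budget calculation in Lucene's TieredMergePolicy. It is an assumption for illustration, not the actual implementation, which also accounts for deletes, oversized segments and running merges, so it gives a ballpark at best.

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3

def allowed_segment_count(total_bytes, floor_segment=2 * MB, segments_per_tier=10,
                          max_merge_at_once=10, max_merged_segment=5 * GB):
    """Estimate how many segments a shard may hold before merges are triggered.

    Simplified model only: the defaults mirror the settings shown in this thread.
    """
    level_size = floor_segment   # segment size of the current tier
    bytes_left = total_bytes     # data not yet assigned to a tier
    allowed = 0
    while True:
        seg_count_level = bytes_left / level_size
        if seg_count_level < segments_per_tier:
            # The remaining data fits in a partially filled tier.
            allowed += math.ceil(seg_count_level)
            break
        allowed += segments_per_tier
        bytes_left -= segments_per_tier * level_size
        # The next tier holds larger segments, capped at the maximum merged size.
        level_size = min(max_merged_segment, level_size * max_merge_at_once)
    return allowed

# Estimate for a single shard of roughly 55gb with the default settings.
print(allowed_segment_count(55 * GB))
```

The real segment count of a shard moves around this kind of budget as new segments are flushed and merged, so only measuring with `_cat/segments` tells you what a given index actually has.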

Thanks.

Regards
Raul

I don't know. Is it important to know this? Why?

They are documented in the source code so that developers can experiment with them, but the lack of user-facing documentation is deliberate: a statement that the defaults are the right choice.

Ok, yes, I found some documentation about that in the MergePolicyConfig.java file.

So I am experimenting with a single shard at the moment. For the initial indexing I have disabled refresh ("index.refresh_interval": "-1"), but I am still getting some very tiny segments. I read somewhere that if index refresh is disabled, I should not get these tiny segments.

index      shard prirep ip        segment generation docs.count docs.deleted    size size.memory committed searchable version compound
indexX     0     p      x.y.z.c   _1               1      50484          345  85.2mb      217759 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _2               2         75            0 144.5kb        5460 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _5               5        113            0 288.7kb        8532 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _6               6      41508           29  59.5mb      162002 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _8               8        192            0 440.4kb       11564 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _9               9          8            0  29.1kb        3444 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _c              12         16            0  59.2kb        4140 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _d              13      51547           30  79.6mb      197642 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _f              15         75            0 148.3kb        5735 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _g              16      47746           42  67.8mb      167895 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _h              17      60297           40  85.5mb      195991 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _i              18        175            0 412.4kb       11075 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _k              20      54177           35  77.7mb      183248 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _l              21        125            0 247.7kb        7741 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _n              23      54035           74  79.4mb      188466 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _o              24         25            0 107.6kb        4764 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _p              25      51732           83  78.5mb      184088 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _r              27        174            0 483.1kb       11241 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _s              28     520174            0 719.2mb      858833 true      true       7.5.0   false
indexX     0     p      x.y.z.c   _t              29      53718          158  80.6mb      181491 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _u              30      47322           70  70.7mb      166145 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _v              31         48            0 146.8kb        5879 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _w              32      44890          171    71mb      179983 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _x              33      54042          138  81.2mb      177416 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _y              34        100            0 356.2kb        8526 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _z              35      48137           72  76.6mb      170435 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _10             36      47828           43  75.1mb      172433 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _11             37        150            0 552.9kb       11773 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _12             38      46300          111  73.3mb      183148 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _13             39      48582          101  79.9mb      186582 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _14             40        149            0 284.4kb        9282 false     true       7.5.0   true
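
To count the tiny segments in a listing like the one above without reading it row by row, the text output of `_cat/segments` can be parsed. A minimal sketch, assuming the human-readable size units shown here and using 2mb (the `floor_segment` value from the settings earlier in the thread) as the "tiny" threshold; `parse_size` and `tiny_segments` are illustrative helper names, and the sample holds three rows copied from the listing.

```python
import re

UNITS = {"b": 1, "kb": 1024, "mb": 1024 ** 2, "gb": 1024 ** 3, "tb": 1024 ** 4}

def parse_size(text):
    """Convert a size such as '144.5kb' into bytes."""
    match = re.fullmatch(r"([\d.]+)(b|kb|mb|gb|tb)", text)
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    return float(match.group(1)) * UNITS[match.group(2)]

def tiny_segments(cat_output, threshold_bytes=2 * 1024 ** 2):
    """Return the names of segments smaller than threshold_bytes."""
    lines = cat_output.strip().splitlines()
    header = lines[0].split()
    name_col, size_col = header.index("segment"), header.index("size")
    tiny = []
    for line in lines[1:]:
        cols = line.split()
        if parse_size(cols[size_col]) < threshold_bytes:
            tiny.append(cols[name_col])
    return tiny

SAMPLE = """\
index  shard prirep ip      segment generation docs.count docs.deleted size    size.memory committed searchable version compound
indexX 0     p      x.y.z.c _1      1          50484      345          85.2mb  217759      true      true       7.5.0   true
indexX 0     p      x.y.z.c _2      2          75         0            144.5kb 5460        true      true       7.5.0   true
indexX 0     p      x.y.z.c _s      28         520174     0            719.2mb 858833      true      true       7.5.0   false
"""

print(tiny_segments(SAMPLE))
```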

Bear in mind that the initial indexing above takes 1-2 days.

I have also discovered that if the initial indexing takes little time (1 hour at most), I get evenly sized segments. For example, this one took about 20 minutes:

index   shard prirep ip        segment generation docs.count docs.deleted    size size.memory committed searchable version compound
indexY  0     p      x.y.z.c   _0               0       5108            0 420.6mb       69735 true      true       7.5.0   true
indexY  0     p      x.y.z.c   _1               1       6955            0 400.9mb       72663 true      true       7.5.0   true
indexY  0     p      x.y.z.c   _2               2       5915            0 387.8mb       74351 true      true       7.5.0   true
indexY  0     p      x.y.z.c   _3               3       5939            0   312mb       69754 true      true       7.5.0   true
indexY  0     r      x.y.z.c   _0               0       5108            0 420.6mb       69735 true      true       7.5.0   true
indexY  0     r      x.y.z.c   _1               1       6955            0 400.9mb       72663 true      true       7.5.0   true
indexY  0     r      x.y.z.c   _2               2       5915            0 387.8mb       74351 true      true       7.5.0   true
indexY  0     r      x.y.z.c   _3               3       5939            0   312mb       69904 true      true       7.5.0   true

Regards
Raul
