Merge policy and segments count


(Raul Kaubi) #1

Hi

I am setting up a 6.5.x cluster and need to index a few million docs into it.

I have disabled index refresh during initial indexing. All indices have 5 shards.
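For readers following along, here is a minimal sketch of the request bodies involved in turning refresh off for a bulk load and back on afterwards. It only builds the JSON for a PUT to /<index>/_settings; the helper name refresh_settings_body is made up for this illustration, and sending the request is left out.

```python
import json


def refresh_settings_body(interval):
    """Build the JSON body for PUT /<index>/_settings.

    interval: a time string such as "1s", "-1" to disable periodic
    refresh, or None to reset the setting to its default.
    (Hypothetical helper, named here for illustration only.)
    """
    return json.dumps({"index": {"refresh_interval": interval}})


# Before the initial load: disable periodic refresh.
print(refresh_settings_body("-1"))

# After the initial load: reset refresh_interval to the default.
print(refresh_settings_body(None))
```

Passing None serializes to JSON null, which resets the setting instead of hard-coding a new interval.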

Question: when doing initial indexing without refresh, are these segments sized well enough?
Or are there some rules of thumb, such as:

  • "single segment should not be greater than X"
  • "or segment count per shard should not be greater than Y"
  • "or are there some ratios.."

At the moment, it looks like this. Bear in mind that I am currently doing the initial indexing into the index named index7. The other indices are going to stay pretty much the same size as they are now.

curl -sXGET `hostname`':9200/_cat/indices?v&s=health,status,index:desc'
health status index     pri rep docs.count docs.deleted store.size pri.store.size
green  open   index1    5   1      23907            0      2.9gb          1.4gb
green  open   index2    5   1        769            0    548.2kb          274kb
green  open   index3    5   1          1            0    394.3kb        197.1kb
green  open   index4    5   1       1259            0      1.4mb        760.3kb
green  open   index5    5   1      13372            0     13.5mb          6.7mb
green  open   index6    5   1     533808            0      105gb         52.5gb
green  open   index7    5   0    1953725         2374      3.3gb          3.3gb <-- currently indexing this, total would be 5gb per node, so with replica, it sums up to 10gb, refresh_interval = -1 at the moment

The default merge policy is as follows for all the indices:

  "merge": {
    "scheduler": {
      "max_thread_count": "1",
      "auto_throttle": "true",
      "max_merge_count": "6"
    },
    "policy": {
      "reclaim_deletes_weight": "2.0",
      "floor_segment": "2mb",
      "max_merge_at_once_explicit": "30",
      "max_merge_at_once": "10",
      "max_merged_segment": "5gb",
      "expunge_deletes_allowed": "10.0",
      "segments_per_tier": "10.0",
      "deletes_pct_allowed": "33.0"
    }
  }

Should I change any of these policy settings for some of the indices?
What I mean is that some indices are going to be quite small compared to others, so the default policy will probably not suit all of them.

Regards
Raul


(Raul Kaubi) #2

Also

How would it affect search and indexing operations if I have:

  • too many very small segments
  • too few very big segments

And what would be the optimal segment count for the previously listed indices (1-7)?

Thanks.

Regards
Raul


(David Turner) #3

I think you probably want fewer shards. "A few million docs" is not very many. I would have thought that one shard per index would be fine.

As for changing the merge policy settings: almost certainly not, and certainly not without experiments showing that any changes make a reliable improvement. The defaults are pretty good, and it's easy to make things much worse.


(Raul Kaubi) #4

... I would have thought that one shard per index would be fine.

Even for the one that has 50+ GB of data?

Also, any thoughts about this..?

How would it affect search and indexing operations if I have:

  • too many very small segments
  • too few very big segments

And what would be the optimal segment count for the previously listed indices (1-7)?

Thanks.

Regards
Raul


(David Turner) #5

You must benchmark your system and workload to be sure, but 50GB is a reasonable size for a shard, and 5*10GB shards sounds like too many too-small shards.
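The shard-size arithmetic behind this can be sketched as follows. It is a rough illustration that divides pri.store.size by the number of primaries, using values copied from the _cat/indices listing earlier in the thread; the helper names are invented here, and it assumes data is spread evenly across shards.

```python
def to_gb(size):
    """Convert a _cat-style size string ('52.5gb', '6.7mb', '274kb') to GB.

    Only the kb/mb/gb suffixes that appear in the listing are handled.
    """
    units = {"kb": 1 / 1024**2, "mb": 1 / 1024, "gb": 1.0}
    return float(size[:-2]) * units[size[-2:]]


def avg_primary_shard_gb(pri_store_size, primaries):
    """Average primary shard size, assuming an even spread across shards."""
    return to_gb(pri_store_size) / primaries


# (index, pri.store.size, pri) taken from the listing in post #1
listing = [("index1", "1.4gb", 5), ("index6", "52.5gb", 5), ("index7", "3.3gb", 5)]

for name, size, primaries in listing:
    per_shard = avg_primary_shard_gb(size, primaries)
    print(f"{name}: {per_shard:.2f} GB per primary shard with {primaries} shards")
```

With a single primary, the per-shard size is simply pri.store.size, which for index6 is the roughly 50GB shard discussed here.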

Maybe this article will help: https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster

My main thought is that I don't understand your level of concern at this stage in your project. This kind of low-level fine-tuning is only really feasible with careful experimentation with realistic workloads to show that you can make a significant improvement. Even then, simply going from 1 replica to 2 will give you 50% more search power, so that's the first thing I'd try if I had proved that I needed more performance.
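The 50% figure is just a ratio of shard copies, which can be checked with a small sketch. It assumes that every copy (the primary plus each replica) can serve searches and that throughput scales linearly with the number of copies, which a real cluster may not achieve; the function name is made up for illustration.

```python
def search_capacity_gain(replicas_before, replicas_after):
    """Relative gain in shard copies available to serve searches.

    Each shard has 1 primary plus its replicas. Linear scaling with the
    number of copies is an assumption, not a measured result.
    """
    copies_before = 1 + replicas_before
    copies_after = 1 + replicas_after
    return copies_after / copies_before - 1


# Going from 1 replica (2 copies) to 2 replicas (3 copies).
print(f"{search_capacity_gain(1, 2):.0%}")
```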


(Raul Kaubi) #6

OK, thanks, I will try with 1 shard then.
But for the 50-60 GB index, what do you think the average segment count per shard would be (with one shard) under the default policy settings?

Also, I am having difficulty understanding the merge policy settings. Could you perhaps explain in a few words what they actually mean? I did not find good documentation for them.

  "merge": {
    "scheduler": {
      "max_thread_count": "1",
      "auto_throttle": "true",
      "max_merge_count": "6"
    },
    "policy": {
      "reclaim_deletes_weight": "2.0",
      "floor_segment": "2mb",
      "max_merge_at_once_explicit": "30",
      "max_merge_at_once": "10",
      "max_merged_segment": "5gb",
      "expunge_deletes_allowed": "10.0",
      "segments_per_tier": "10.0",
      "deletes_pct_allowed": "33.0"
    }
  }

Thanks.

Regards
Raul


(David Turner) #7

On the expected segment count: I don't know. Is it important to know this? Why?

The merge policy settings are documented in the source code so that developers can experiment with them, but the lack of user-facing documentation is deliberate: a statement that the defaults are the right choice.
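One thing that can be said without benchmarking is a rough lower bound on the segment count, sketched below. It assumes that normal merging does not produce segments larger than max_merged_segment (the limit is approximate in practice) and that no force merge is run; the actual count will typically be higher because of the smaller tiers, and this says nothing about what count is optimal.

```python
import math


def min_segments(shard_gb, max_merged_segment_gb=5.0):
    """Lower bound on segments in a shard of the given size.

    Assumes no segment exceeds max_merged_segment (5gb in the settings
    quoted above) and that the shard has not been force-merged.
    """
    return math.ceil(shard_gb / max_merged_segment_gb)


# index6 holds 52.5gb of primary data; as a single shard:
print(min_segments(52.5))
```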


(Raul Kaubi) #8

OK, yes, I found some documentation on them in the MergePolicyConfig.java file.

I am experimenting with a single shard at the moment. For the initial indexing I have disabled refresh ("index.refresh_interval": "-1"), but I am still getting some very tiny segments. I read somewhere that if index refresh is disabled, I should not get these tiny segments.

index      shard prirep ip        segment generation docs.count docs.deleted    size size.memory committed searchable version compound
indexX     0     p      x.y.z.c   _1               1      50484          345  85.2mb      217759 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _2               2         75            0 144.5kb        5460 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _5               5        113            0 288.7kb        8532 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _6               6      41508           29  59.5mb      162002 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _8               8        192            0 440.4kb       11564 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _9               9          8            0  29.1kb        3444 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _c              12         16            0  59.2kb        4140 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _d              13      51547           30  79.6mb      197642 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _f              15         75            0 148.3kb        5735 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _g              16      47746           42  67.8mb      167895 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _h              17      60297           40  85.5mb      195991 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _i              18        175            0 412.4kb       11075 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _k              20      54177           35  77.7mb      183248 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _l              21        125            0 247.7kb        7741 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _n              23      54035           74  79.4mb      188466 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _o              24         25            0 107.6kb        4764 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _p              25      51732           83  78.5mb      184088 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _r              27        174            0 483.1kb       11241 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _s              28     520174            0 719.2mb      858833 true      true       7.5.0   false
indexX     0     p      x.y.z.c   _t              29      53718          158  80.6mb      181491 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _u              30      47322           70  70.7mb      166145 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _v              31         48            0 146.8kb        5879 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _w              32      44890          171    71mb      179983 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _x              33      54042          138  81.2mb      177416 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _y              34        100            0 356.2kb        8526 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _z              35      48137           72  76.6mb      170435 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _10             36      47828           43  75.1mb      172433 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _11             37        150            0 552.9kb       11773 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _12             38      46300          111  73.3mb      183148 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _13             39      48582          101  79.9mb      186582 true      true       7.5.0   true
indexX     0     p      x.y.z.c   _14             40        149            0 284.4kb        9282 false     true       7.5.0   true

Bear in mind that the initial indexing above takes 1-2 days.

I have also found that if the initial indexing takes little time (1 hour at most), I get evenly sized segments. For example, this one took about 20 minutes:

index   shard prirep ip        segment generation docs.count docs.deleted    size size.memory committed searchable version compound
indexY  0     p      x.y.z.c   _0               0       5108            0 420.6mb       69735 true      true       7.5.0   true
indexY  0     p      x.y.z.c   _1               1       6955            0 400.9mb       72663 true      true       7.5.0   true
indexY  0     p      x.y.z.c   _2               2       5915            0 387.8mb       74351 true      true       7.5.0   true
indexY  0     p      x.y.z.c   _3               3       5939            0   312mb       69754 true      true       7.5.0   true
indexY  0     r      x.y.z.c   _0               0       5108            0 420.6mb       69735 true      true       7.5.0   true
indexY  0     r      x.y.z.c   _1               1       6955            0 400.9mb       72663 true      true       7.5.0   true
indexY  0     r      x.y.z.c   _2               2       5915            0 387.8mb       74351 true      true       7.5.0   true
indexY  0     r      x.y.z.c   _3               3       5939            0   312mb       69904 true      true       7.5.0   true
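To compare listings like the two above without eyeballing them, a small sketch such as the following can count how many segments fall below a size threshold. It parses whitespace-separated _cat/segments rows pasted as text; the 2 MB threshold simply reuses the floor_segment value from the settings as a convenient cut-off for "tiny", and the function names are invented for this example.

```python
from collections import Counter

FLOOR_MB = 2.0  # same value as "floor_segment": "2mb"; used only as a cut-off


def size_mb(size):
    """Convert a _cat-style size string ('144.5kb', '85.2mb') to MB."""
    units = {"kb": 1 / 1024, "mb": 1.0, "gb": 1024.0}
    return float(size[:-2]) * units[size[-2:]]


def bucket_segments(rows):
    """Count segments below and at/above FLOOR_MB.

    Each row is one line of _cat/segments output with the column order
    shown in the listings above; the size is the 9th column.
    """
    counts = Counter()
    for row in rows:
        fields = row.split()
        bucket = "below floor" if size_mb(fields[8]) < FLOOR_MB else "at or above floor"
        counts[bucket] += 1
    return counts


# Four rows copied from the indexX listing.
sample = """\
indexX 0 p x.y.z.c _1 1 50484 345 85.2mb 217759 true true 7.5.0 true
indexX 0 p x.y.z.c _2 2 75 0 144.5kb 5460 true true 7.5.0 true
indexX 0 p x.y.z.c _s 28 520174 0 719.2mb 858833 true true 7.5.0 false
indexX 0 p x.y.z.c _14 40 149 0 284.4kb 9282 false true 7.5.0 true
""".splitlines()

print(bucket_segments(sample))
```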

Regards
Raul