Cannot reduce number of segments during indexing

I have followed the advice in the approximate kNN (aKNN) search tuning guide.

But no matter the settings, the indexing process still creates a huge tail of tiny segments.

Setup:

  • New dev deployment
  • Zero search traffic
  • 64GB RAM, CPU-optimized node
  • "indices.memory.index_buffer_size": "10%"
  • "index.translog.flush_threshold_size": "10gb"
  • "index.refresh_interval": "-1"

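For reference, here's a minimal sketch of how the index-level settings are applied with the Python client (the index name and the local URL are placeholders; indices.memory.index_buffer_size is a static node setting, so that one lives in elasticsearch.yml rather than going through an API call):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder: dev cluster URL

# indices.memory.index_buffer_size: 10%   <- static, set in elasticsearch.yml on each data node

# The per-index settings can be applied dynamically:
es.indices.put_settings(
    index="my-knn-index",  # placeholder index name
    settings={
        "index.translog.flush_threshold_size": "10gb",
        "index.refresh_interval": "-1",
    },
)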
On a 64GB node, I understand roughly half the RAM is allocated to the JVM heap, so that's about 32GB of heap. If index_buffer_size is 10%, that should give around 3.2GB of indexing buffer.
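Spelled out (assuming the default of roughly half the RAM going to heap; note the buffer is shared by every shard actively indexing on the node):

ram_gb = 64
heap_gb = ram_gb * 0.5        # default: roughly half of RAM for the JVM heap
buffer_gb = heap_gb * 0.10    # indices.memory.index_buffer_size = 10%
print(buffer_gb)              # 3.2 (GB), shared across all actively indexing shards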

With these settings, how in the world do I end up with dozens of segments smaller than 100MB?

Here's an example of one shard's segments:

s segment     size docs.count
0 _co2     510.6kb         31
0 _cou     716.3kb         44
0 _coq       1.7mb        111
0 _cot       3.1mb        199
0 _cop       5.5mb        354
0 _cnx       5.8mb        376
0 _co7       7.3mb        471
0 _co6       8.4mb        542
0 _cor      11.7mb       1986
0 _co1      12.2mb       1972
0 _cos      13.7mb        881
0 _cog      14.1mb        907
0 _cnv      15.3mb        981
0 _coo      18.3mb       1175
0 _cow      20.2mb       1296
0 _com      31.6mb       2030
0 _cov      40.1mb       2577
0 _cod      81.5mb       5234
0 _coe      84.8mb       5453
0 _col      87.7mb       5641
0 _cmv     106.6mb       6852
0 _cok     109.5mb       7040
0 _cja       131mb       8419
0 _coj     150.7mb      30500
0 _cob     335.3mb      54380
0 _cjr       584mb      48689
0 _cmn     707.9mb      59709
0 _cgn     732.6mb     198037
0 _ckb     868.9mb     241883
0 _c8x     962.4mb      90502
0 _c60    1021.7mb      97412
0 _cn6       1.3gb     377820
0 _2ut       1.4gb     393405
0 _a3v       1.5gb     438436
0 _bcu       1.6gb     469321
0 _7e4       1.6gb     483486
0 _2oa       1.7gb     492613
0 _8rc       1.7gb     500500
0 _32p       1.8gb     499522
0 _a7z       1.9gb     546017
0 _bzy       1.9gb     556732
0 _73j         2gb     580960
0 _4o6         2gb     592797
0 _6k3       2.3gb     660575
0 _9t0       2.4gb     676770
0 _5lu       2.5gb     723636
0 _7oi       2.6gb     752804
0 _57c       2.6gb     756930
0 _aq8       2.6gb     762371
0 _96j       2.7gb     776362
0 _3k1       2.8gb     795040
0 _ccf       2.8gb     816245
0 _wp        3.1gb     889646
0 _axy       3.1gb     892622
0 _bng       3.3gb     921377
0 _xc        3.4gb     972174
0 _9ai       3.8gb    1103511
0 _628         4gb    1142051
0 _310       4.2gb    1199113
0 _agl       4.5gb    1284416
0 _9td       4.5gb    1299733
0 _67z       4.6gb    1317589
0 _86m       4.6gb    1316734
0 _4xf       4.7gb    1353413
0 _c12       4.7gb    1351662
0 _75i       4.8gb    1362717
0 _bcn       4.8gb    1372132
0 _428       4.9gb    1394784
0 _83p       4.9gb    1404350
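
(For context, the listing above is _cat/segments output; a sketch like the following reproduces that kind of view with the Python client — the index name, column list, and sort order are assumptions:)

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder: dev cluster URL

# shard, segment name, on-disk size, live doc count, sorted by size
print(es.cat.segments(index="my-knn-index", h="shard,segment,size,docs.count", s="size", v=True))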

How are you indexing into Elasticsearch? It's a long shot, but verify you are not passing a refresh parameter when indexing, as that is one way you could end up with a lot of small segments.
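
For illustration, this is the pattern to watch out for (a sketch; the client setup and index name are assumptions): refreshing on every write forces a new searchable segment per request, which produces exactly this kind of long tail of tiny segments.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumption: dev cluster URL

# refresh on every write -> a tiny new segment per request
es.index(index="my-knn-index", document={"field": "value"}, refresh=True)

# no refresh parameter -> segments are cut by refresh_interval / the indexing buffer instead
es.index(index="my-knn-index", document={"field": "value"})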

Great question! We're using the streaming_bulk helper from the python client:

from elasticsearch.helpers import streaming_bulk

# streaming_bulk returns a generator, so it has to be consumed for the bulk
# requests to actually be sent; with yield_ok=False only failed items come back.
for ok, item in streaming_bulk(
    client=es,
    actions=index_actions,
    chunk_size=150,
    max_retries=3,
    initial_backoff=1,
    yield_ok=False,
    raise_on_error=False,
    raise_on_exception=False,
):
    ...  # log/handle the failed item

And the index actions are constructed as:

{
    "_index": index_name,
    "_source": document_dict,
}
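
A sketch of the surrounding generator, for completeness (generate_actions, the index name, and the sample document are placeholders, not our actual code):

def generate_actions(index_name, documents):
    # one bulk "index" action per document, in the shape shown above
    for document_dict in documents:
        yield {
            "_index": index_name,
            "_source": document_dict,
        }

index_actions = generate_actions("my-knn-index", [{"title": "example"}])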

Looking back at this, our chunk_size is quite small (IIRC this was due to a throughput issue, but it's worth reassessing). Regardless, with a large index_buffer_size, I don't think the chunk_size of our requests is relevant.

What do you think?
