How does the Elasticsearch index (Lucene) work under the covers?

Hello ES gurus,
I have been testing an ES index by indexing a bunch of JSON files, each containing many documents (> 1000 documents per file). I preprocessed these documents with the Bulk API action line ({"index":{}}) so I could run the bulk indexing in parallel using xargs. During the indexing operation, I noticed the following from a space-consumption perspective.
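For reference, this is roughly how I'm driving the indexing; the file names, host, and concurrency level are simplified placeholders:

```
# Each bulk file alternates an action line and a document line:
#   {"index":{}}
#   {"field1":"value1", ...}
# Send all files to the das_index bulk endpoint, 4 uploads at a time.
ls bulk_*.json | xargs -P 4 -I {} \
  curl -s -H 'Content-Type: application/x-ndjson' \
       -XPOST 'http://localhost:9200/das_index/_bulk' \
       --data-binary '@{}'
```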

I polled /_cat/indices?v and saw pri.store.size growing, which is normal, but then after a few minutes I saw pri.store.size drop. Does this mean that ES is compressing the segments, or that the segments are being merged?

For example:
Wed Apr 24 18:11:25 UTC 2024
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size dataset.size
green open das_index YanBOYiXSw2hvKyOtjW2PQ 1 0 7044170 0 10.6gb 10.6gb 10.6gb
Wed Apr 24 18:11:31 UTC 2024
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size dataset.size
green open das_index YanBOYiXSw2hvKyOtjW2PQ 1 0 7044170 0 10.6gb 10.6gb 10.6gb
Wed Apr 24 18:11:36 UTC 2024
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size dataset.size
green open das_index YanBOYiXSw2hvKyOtjW2PQ 1 0 7044170 0 11.4gb 11.4gb 11.4gb
Wed Apr 24 18:11:41 UTC 2024
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size dataset.size
green open das_index YanBOYiXSw2hvKyOtjW2PQ 1 0 8583386 0 11.4gb 11.4gb 11.4gb
Wed Apr 24 18:11:46 UTC 2024
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size dataset.size
green open das_index YanBOYiXSw2hvKyOtjW2PQ 1 0 8583386 0 9.6gb 9.6gb 9.6gb
Wed Apr 24 18:11:51 UTC 2024
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size dataset.size
green open das_index YanBOYiXSw2hvKyOtjW2PQ 1 0 8583386 0 9.6gb 9.6gb 9.6gb
Wed Apr 24 18:11:56 UTC 2024
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size dataset.size
green open das_index YanBOYiXSw2hvKyOtjW2PQ 1 0 8583386 0 10.3gb 10.3gb 10.3gb
Wed Apr 24 18:12:01 UTC 2024
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size dataset.size
green open das_index YanBOYiXSw2hvKyOtjW2PQ 1 0 9082633 0 10.3gb 10.3gb 10.3gb
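(The snapshots above come from polling the cat indices endpoint every five seconds; a minimal sketch of that loop, with the host as a placeholder:)

```
# Print a UTC timestamp and the index stats every 5 seconds.
while true; do
  date -u
  curl -s 'http://localhost:9200/_cat/indices/das_index?v'
  sleep 5
done
```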

In other words, when does ES perform this compression or merging, and what triggers it? BTW, I created my index with default settings (nothing special).

Another question I want to ask: does ES (the Lucene index) take up additional space while building the inverted index and running documents through processing steps such as tokenization and text normalization?
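To make the question concrete, this is the kind of processing I mean, shown with the _analyze API (the sample text is made up):

```
# Ask ES how the standard analyzer tokenizes and normalizes a string.
curl -s -H 'Content-Type: application/json' \
     -XGET 'http://localhost:9200/_analyze' \
     -d '{"analyzer": "standard", "text": "The Quick Brown Foxes!"}'
# Returns lowercased tokens: "the", "quick", "brown", "foxes"
```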

Any insights from the ES gurus are much appreciated. I'd like to understand how the Lucene index works under the covers and how it affects space usage in general.

Thanks!

That's correct: the drops in pri.store.size are the result of segments being merged.
You can watch this happening, I believe, via the Index segments API | Elasticsearch Guide [8.13] | Elastic or the cat segments API | Elasticsearch Guide [8.13] | Elastic.
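For example (index name taken from your output, host is a placeholder):

```
# Per-shard segment details (sizes, doc counts, searchable or not):
curl -s 'http://localhost:9200/das_index/_segments'

# Compact tabular view of the same information:
curl -s 'http://localhost:9200/_cat/segments/das_index?v'
```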

Also, some relevant information from Size your shards | Elasticsearch Guide [8.13] | Elastic:

Segments play a big role in a shard’s resource usage. Most shards contain several segments, which store its index data. Elasticsearch keeps some segment metadata in heap memory so it can be quickly retrieved for searches. As a shard grows, its segments are merged into fewer, larger segments. This decreases the number of segments, which means less metadata is kept in heap memory.

Every mapped field also carries some overhead in terms of memory usage and disk space.
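Two related APIs that can help when exploring this (index name from the thread, host is a placeholder): force merge lets you trigger merging explicitly instead of waiting for it to happen in the background, and the (experimental) disk usage API breaks down on-disk size per mapped field.

```
# Explicitly merge the index down to one segment (best done only on an
# index that has stopped receiving writes):
curl -s -XPOST 'http://localhost:9200/das_index/_forcemerge?max_num_segments=1'

# Analyze on-disk size per mapped field (experimental API, ES 7.15+):
curl -s -XPOST 'http://localhost:9200/das_index/_disk_usage?run_expensive_tasks=true'
```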

There are also some nice videos that show what happens at merge time: Changing Bits: Visualizing Lucene's segment merges

Thanks, David!