Greatly increased disk usage since upgrade to 8.17.1

Hi, we upgraded our production stack to 8.17.1 at the beginning of the week.

Now, after it has run for a few days, the amount of data (especially Kubernetes cluster metrics) has increased enormously since the upgrade.

Please focus on the index creation date and the number of documents in comparison to the storage size.

Have there been any changes affecting indexing or data collection in general? I've reviewed the changelogs (both Agent and Elasticsearch) and could not find any significant changes in this regard.

I also want to make clear that this is urgent for us: we have seen a data size increase of over 500 GB in two days, which is very unusual, as no changes have been made to ILM etc.

Any advice from the team would be greatly appreciated. Thanks


I do not know the answer but suspect it would help to know which version you upgraded from.

Ah, thanks for the reminder, I forgot to include that. We upgraded from 8.16.1 (both Agents and Elasticsearch).

Have you been using synthetic source? There is a note in the release notes about a change that can result in increased storage size if you are on the Basic license.

No, we don't use Synthetics at all at the moment. We only used it for evaluation, but Synthetics creates separate synthetic-* data streams and does not touch any metrics-* data streams.

Or does that apply to all indices? In that case, the naming is very unfortunate.

Check the mappings for the indices where size has changed.

I have checked the Fleet-managed index settings. The old ones use synthetic mode, while the new ones don't support it anymore. But just to make sure: do the default Fleet dashboards require the _source field, or could we disable it?
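
For reference, this is roughly how I compared them; depending on the version, the mode shows up either as _source.mode in the mappings or as the index.mapping.source.mode index setting (the index names below are placeholders):

GET <old-backing-index>/_settings?include_defaults=true

GET <new-backing-index>/_settings?include_defaults=true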

Secondly, the docs say "increased storage requirements", but an increase of 500% or more could have been stated specifically; we would have accepted, e.g., a 50% increase.
[Screenshot: index list with creation dates, document counts, and storage sizes]

Edit: Based on the disk usage of other indices, the increase is closer to around 1000%.

I have not used synthetic source, so I do not know how much impact it has. It may be worthwhile checking whether there are any other changes in the mappings. There may be other changes as well, but that was one I quickly found. Maybe others have experienced similar issues and have some feedback.

Unfortunately, according to the docs, disabling the source field prevents Elasticsearch from performing the reindex tasks needed for a major upgrade (one version of backwards compatibility): _source field | Elasticsearch Guide [8.17] | Elastic

Also, if you use Kibana you cannot disable the _source field as it will break a lot of things.

Your issue is probably related to the license change for synthetic source. Before version 8.17.1, using synthetic source didn't require a license, and if I'm not mistaken, the metrics data streams have used synthetic source by default for a long time.

Now, with the license change, you cannot use synthetic source anymore, and this will lead to increased disk usage.
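
You can check which license level the cluster is running on with the license API:

GET _license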

Can you provide some evidence of this? This may be another issue; without synthetic source you might expect a storage increase of 50% or maybe a little more, but I would not expect it to even double the size required.

Thanks for the reply.

As you can see in the attached screenshot, if you compare the storage per document, you can calculate an increase of approximately 1000% compared to the old indices.

In detail:

old index: 13.24 GiB * 1024 * 1024 * 1024 / 47937749 ~= 297 bytes/document

new index: 20.53 GiB * 1024 * 1024 * 1024 / 7241436 ~= 3044 bytes/document

Percentage: 3044/297 ~= 1025%

I can confirm similar results in other indices; the ones shown above have already been manually deleted to free up storage.

As additional info: the upgrade was performed on 10 Feb 2025. Therefore, the index with the 10th as its timestamp has the new settings, while the one from 4 Feb still contains the old settings.

I'm not sure whether the synthetic _source change is the only thing that has drastically increased storage since the upgrade, but it is the only thing I can relate it to according to the changelogs.


Perhaps you should run the _disk_usage API; it will show you the difference...

POST <new-index>/_disk_usage?run_expensive_tasks=true

POST <old-index>/_disk_usage?run_expensive_tasks=true
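
If you only want the per-field _source numbers, filter_path should trim the responses down (the exact response path here is an assumption based on the per-field layout of the disk usage output, so double-check it against the full response):

POST <new-index>/_disk_usage?run_expensive_tasks=true&filter_path=*.fields._source

POST <old-index>/_disk_usage?run_expensive_tasks=true&filter_path=*.fields._source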

Synthetic source can have a significant impact, perhaps 30-40%, but not the difference you are showing. I suspect there is something else going on.

For example, new fields being collected... or a flattened field no longer being flattened.

I suggest, like @leandrojmp, looking closely at the mappings and the disk usage API.


Did you upgrade the integration package as well?

Can you share the complete mapping for both an old index and a new one?

Just run a GET <index-name>/_mapping in Kibana Dev Tools and share the results.


Thanks for the suggestions @stephenb & @leandrojmp

I did what you requested for the following indices:

I appended the corresponding JSON responses from the requests. Just for reference, I included the index settings as well. There you can see that both indices used Kubernetes integration version 1.80.0.

And just to answer the other question: yes, we did upgrade the integration(s), but that happened before the stack upgrade, around the 4th or 5th of February.
Just for completeness, Kubernetes is not the only integration affected by this issue. The Elasticsearch index metrics data stream from Stack Monitoring is also affected, but we have already deleted its indices from before the upgrade, so I cannot share a reference for it anymore.

Old index disk usage: Old index disk usage - Pastebin.com

New index disk usage: New index disk usage - Pastebin.com

Old index mappings: Old index mappings - Pastebin.com

New index mappings: New index mappings - Pastebin.com

Old index settings: Old index setting - Pastebin.com

New index settings: New index settings - Pastebin.com

The mappings are the same; the only difference is the source mode.

The settings are also basically the same; the only difference, again, is related to the source mode changes.

But the size is really different. The indices do not have the same number of documents, but I think they are close enough to be compared.

The disk usage response reports the size of the primaries, so the sizes would be roughly 750 MB and 10 GB; there is a difference of about 2 million events, but a difference of more than 10 times in size.

Some fields have a similar size, others are 2 to 3 times bigger, and others are 10 or more times bigger.

The _source of the new index is responsible for almost all the size of the index:

      "_source": {
        "total": "8.8gb",
        "total_in_bytes": 9487965794,
        "inverted_index": { 
          "total": "0b",
          "total_in_bytes": 0 
        },
        "stored_fields": "8.8gb",
        "stored_fields_in_bytes": 9487965794,
        "doc_values": "0b",
        "doc_values_in_bytes": 0,
        "points": "0b",
        "points_in_bytes": 0,
        "norms": "0b",
        "norms_in_bytes": 0,
        "term_vectors": "0b",
        "term_vectors_in_bytes": 0,
        "knn_vectors": "0b",
        "knn_vectors_in_bytes": 0 
      }

Unless there is a bug somewhere, I would consider this increase totally unexpected. I would expect the index to go from 750 MB to maybe 2 GB, even 2.5 GB, more than doubling, but going from 750 MB to 10 GB is not expected.

Maybe @stephenb can get some internal insight on this, but this difference is too big, to the point that it could lead someone to ditch Elasticsearch and look for another tool.

Do you have logs data streams or only metrics? Can you share a similar comparison for logs data streams?

If you are storing the source field and want to minimize size, make sure you use the best_compression codec.
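
For Fleet-managed data streams, that usually means putting the setting into the relevant @custom component template, roughly like this (the template name below is just a placeholder following the Fleet @custom convention, and the codec only takes effect on backing indices created after the next rollover):

    // placeholder name; adjust to your data stream's @custom component template
    PUT _component_template/metrics-<dataset>@custom
    {
      "template": {
        "settings": {
          "index.codec": "best_compression"
        }
      }
    }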

It is also worth noting that storage efficiency tends to improve with index size, which is why I have generally recommended not comparing indices with very different document counts.

Another thing that can result in size differences is the number of segments and where in the merging cycle an index is. For a proper comparison I would recommend forcemerging both indices down to a single segment once they have roughly the same number of documents.
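
For reference, that would be something like this (index names are placeholders; forcemerge is expensive, so ideally run it off-peak):

POST <old-index>/_forcemerge?max_num_segments=1

POST <new-index>/_forcemerge?max_num_segments=1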

Having said that I agree that the size difference you are seeing per document is extreme.

The thing is that these are all the default templates from the default integrations; the compression is already best_compression, at least for the logs data streams.

You cannot remove the _source; this is not an option when you use Kibana, as it breaks Kibana.

The size difference is really extreme and unexpected; it seems that something is wrong.


@leandrojmp @Christian_Dahlqvist Thanks for the replies and professional insights.

First: yes, I can look for a logs-related data stream. However, I assume I won't be able to provide it earlier than tomorrow.

I also thought about increasing compression in general, as it is stated as a valid workaround in the docs. Unfortunately, force merging will not be possible, as our ILM triggers a rollover once the primary shard reaches a size of 20 GB (this is to achieve a smooth data lifecycle without enormously increasing single-node storage).
Therefore, with the now increased storage requirements, indices will never reach such a high document count anymore, as they get rolled over beforehand.
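
For context, the rollover condition in our ILM policy is roughly the following (the policy name is a placeholder and the remaining phases are omitted):

    // placeholder policy name; other phases omitted for brevity
    PUT _ilm/policy/<metrics-policy>
    {
      "policy": {
        "phases": {
          "hot": {
            "actions": {
              "rollover": {
                "max_primary_shard_size": "20gb"
              }
            }
          }
        }
      }
    }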