Extreme increase in disk usage since upgrade to 8.17.1

This is an interesting thread, and I am not sure I follow completely, so excuse the contribution if it's unhelpful:

The old index in this case is .ds-metrics-kubernetes.container-inf067_osprod_prod-2025.02.09-000101

it has

  "_source": {
    "mode": "synthetic"
  }

in the mapping and does not use massive disk space per doc.

The new index in this case is .ds-metrics-kubernetes.container-inf067_osprod_prod-2025.02.10-000103

it has

  "_source": {
    "mode": "stored"
  },

in the mapping and uses a lot more disk space per doc.

So the synthetic source setting was far more efficient, for these specific docs, in terms of disk usage. This is as documented:

While this on-the-fly reconstruction is generally slower than saving the source documents verbatim and loading them at query time, it saves a lot of storage space

So we are only surprised at how much bigger stored _source is compared to synthetic source. Er, why? Surely that depends completely on the actual data, the relative cardinalities of the various fields, how close to "random" the data is, and so on.

And why not just enable synthetic source again? By the way, @Alphayeeeet has not yet commented on which license he/she is using. And even if that's blocked by a Basic license, isn't this sort of thing, er, the reason some "nice" things are paid features? I mean, it was a commercial decision to make synthetic _source Enterprise-only, right?
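For reference, a minimal, untested sketch of how synthetic source could be switched back on for the affected data stream. The component template name metrics-kubernetes.container@custom is an assumption (adjust to the integration's actual @custom template), the ${EHOST}/${EUSER} variables are just connection placeholders, and the change only applies to backing indices created after the next rollover, and only if the license permits synthetic _source:

  # Hypothetical @custom component template; mirrors the "_source": {"mode": "synthetic"}
  # mapping shown for the old index above.
  curl -s -k -u "${EUSER}":"${EPASS}" -X PUT \
    -H 'Content-Type: application/json' \
    "https://${EHOST}:${EPORT}/_component_template/metrics-kubernetes.container@custom" \
    -d '{
      "template": {
        "mappings": {
          "_source": { "mode": "synthetic" }
        }
      }
    }'

After updating the template, a manual rollover (POST metrics-kubernetes.container-inf067_osprod_prod/_rollover) would create a new backing index with the new mapping.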

I had a look at my own data, nothing serious, but found this useful to see the average storage_size per doc of my indices.

# List indices with doc count and primary store size (in bytes), compute the
# average bytes per doc in awk, then show the 10 indices with the largest
# average doc size (field 8 of the printed output).
curl -s -k -u "${EUSER}":"${EPASS}" \
  "https://${EHOST}:${EPORT}/_cat/indices?index=.*&bytes=b&format=json" \
  | jq -r '.[] | [ .index , ."docs.count" , ."pri.store.size" ] | @tsv' \
  | awk '$2>0{printf "index: %-72s doc_count: %12d pri_store_size %12d average_doc_size %12.0f\n",$1,$2,$3,$3/$2}' \
  | sort -k8nr | head

Synthetic source is enabled by default for TSDS; if it reverted to stored source after the update, this means that the license does not support synthetic source.
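For anyone who wants to confirm this quickly, a sketch of checking the effective source mode of a data stream's backing indices, assuming the setting is exposed as index.mapping.source.mode as in the settings dumps later in this thread (if nothing comes back because the setting was never set explicitly, add include_defaults=true and filter on the defaults section too):

  # Show index.mapping.source.mode for every backing index of the data stream.
  curl -s -k -u "${EUSER}":"${EPASS}" \
    "https://${EHOST}:${EPORT}/metrics-kubernetes.container-inf067_osprod_prod/_settings?filter_path=*.settings.index.mapping.source.mode&pretty"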

The issue is that the size difference with and without synthetic source is unexpected and extreme.

An index doubling the size without synthetic source could be expected, but an index getting 10 times bigger is not expected and should be investigated.
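One way to investigate where the extra bytes actually go is the analyze index disk usage API (a technical-preview API). The call below is only a sketch against the "new", large backing index named earlier; run_expensive_tasks=true is required and makes the call resource intensive, so run it off-peak:

  # Per-field disk usage breakdown; the _source entry shows how much of the
  # index is taken up by stored source.
  curl -s -k -u "${EUSER}":"${EPASS}" -X POST \
    "https://${EHOST}:${EPORT}/.ds-metrics-kubernetes.container-inf067_osprod_prod-2025.02.10-000103/_disk_usage?run_expensive_tasks=true&pretty"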

2 Likes

OK. 2x is plucked from the air a bit; it might be 1.2x, might be 2.5x, might be anything really. Of course I don't disagree with investigating; 10x is extreme and warrants some urgency.

As I understand it, for now @Alphayeeeet just has to either live with it (for the time being at least) or get a license which might (likely would) resolve the issue. There's no workaround. Theoretically one could maybe downgrade, but not without a lot of hassle.

Thanks for the reply. Yes, we are currently using the Basic license, but have been thinking about upgrading to Platinum for a while now. Now that this feature is Enterprise-only and has effectively become a must-buy, we are currently investigating what our way forward is. In general, this is not a "nice" thing, but rather a critical change that hasn't been communicated with the seriousness it should have been.
Downgrading in general isn't supported, as far as I know?! Maybe you could give some advice here. Thanks

Not suggested or supported.

My $0.02 - an "effectively downgrading" scheme would be a (temporary) hack: spin up a new cluster on an older version for just the impacted data, sync stuff, run in parallel for a bit, point some stuff here and some stuff there, ... in short, a royal PITA.

If it's "bedroom IT", a hobby project, possibly doable. Many will have done more hacky things.

In any "someone is being paid" scenario, forget it.

As it's currently in production use, we are rather in the "forget it" category. Also, the sheer amount of data and configuration would be unsuitable to migrate to another cluster. But thanks for mentioning it anyway.

@leandrojmp In response to the question about logs: yes, we have logs, and after comparing rollover indices before and after the upgrade, I cannot see any major size increases in general. So it seems that it only affects metrics (especially Kubernetes and Stack Monitoring from my view). We use other metric integrations too, but in terms of size they are not relevant (1-2 GB in comparison to 10-20 GB for Kubernetes/Elasticsearch).

Unfortunately I can only share very minimal details, due to confidentiality. I can say that our log indices from before the upgrade were also using source.mode: "STORED" in the index settings.

Before upgrade:

...
"mapping": {
        "coerce": "false",
        "nested_fields": {
          "limit": "50"
        },
        "synthetic_source_keep": "none",
        "depth": {
          "limit": "20"
        },
        "field_name_length": {
          "limit": "9223372036854775807"
        },
        "ignore_above": "2147483647",
        "nested_objects": {
          "limit": "10000"
        },
        "source": {
          "mode": "STORED"
        },
        "dimension_fields": {
          "limit": "32768"
        },
        "synthetic_source": {
          "skip_ignored_source_read": "false",
          "skip_ignored_source_write": "false"
        }
      },
      "source_only": "false",
...

After upgrade:

...
"mapping": {
        "coerce": "false",
        "nested_fields": {
          "limit": "50"
        },
        "synthetic_source_keep": "none",
        "depth": {
          "limit": "20"
        },
        "field_name_length": {
          "limit": "9223372036854775807"
        },
        "ignore_above": "2147483647",
        "nested_objects": {
          "limit": "10000"
        },
        "source": {
          "mode": "STORED"
        },
        "dimension_fields": {
          "limit": "32768"
        },
        "synthetic_source": {
          "skip_ignored_source_read": "false",
          "skip_ignored_source_write": "false"
        }
      },
      "source_only": "false",
...

However, I can confirm no significant growth of the _source field in comparison to metrics. Is this somehow expected? Also, we benefit a lot from index compression in the log indices, as far as I can tell.
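Not central to the thread, but since compression came up: a quick sketch for checking which compression codec an index is using (replace <index-name> with a real index; include_defaults=true shows the value even when it was never set explicitly, and if best_compression is in play it will show up here):

  # Show index.codec from both the explicit settings and the defaults.
  curl -s -k -u "${EUSER}":"${EPASS}" \
    "https://${EHOST}:${EPORT}/<index-name>/_settings?include_defaults=true&filter_path=*.*.index.codec&pretty"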

I assume, as @RainTown already mentioned, that we only have 2 options currently:

  • Live with it and extend storage space drastically or reduce data retention and replication
  • Activate the trial license and hope for a fix during that period. (Unfortunately, we wanted to save the trial license for a focus period to evaluate the paid features in general, which in this case wouldn't be possible anymore.)

FYI: you can always start a new cluster (from scratch) and activate the trial on it for 30 days. And send your data to this cluster for evaluation...

I assume what I meant didn't quite come across. Our agents are connected to the Fleet Servers in the production cluster (which currently has no license). As a second data output is also a paid feature, I'm not sure how we could send the data from one cluster to another without building a middleware system that does exactly this. We need the data in the production cluster.

The features we wanted to evaluate are especially the alerting connectors, which require additional custom middleware to be able to send to our central alerting system. That's the main reason why we need a focus period to get that checked.

But this is only a side-topic and not in the focus of this thread. Therefore I would like to focus on the main issue here.

A fix assumes a bug.

Maybe there is one, but it's not yet confirmed that there is. Even if there is, you have no a priori way to know how effective it would be, and anyway your storage needs will still have increased, maybe significantly, hopefully not as extremely.

In short, speaking personally, I would not plan on the assumption you only need a bridging period. YMMV.

1 Like

If they weren't using synthetic source, then no difference in size is expected.

The increase in size for the metrics is really unexpected and extreme; I suggest that you open a GitHub issue in the Elastic repository so Elastic can check whether this is a bug or not.

I think for logs you are right with the ~50% difference, but for metrics it is considerably more. Storage wins for time-series data in Elasticsearch - Elasticsearch Labs has the full blog post with a lot more details, but it mentions (for one dataset, and there are more optimizations):

This is evident in ES 8.7 that uses synthetic source for TSDS by default. In this case, the storage footprint drops to 6.5GB - a 8.75x improvement in storage efficiency.

2 Likes

I think this answers the question then; this huge increase is expected for metrics data streams.

@Alphayeeeet I think you have a couple of options: increase the storage, decrease the retention, see if acquiring a license is justified, or change observability tools.

Jesus! We just renewed our Platinum license in November for two years; they never said that the space would increase by 10x! This is messed up! Really messed up. I'm gonna open a support ticket, but if this is true, Elastic, you really messed up. Like, big time. I'm not going to tell my bosses that we need to increase our storage 10x because of a field!

1 Like

I think the thread explains the specific situation that @Alphayeeeet has encountered fairly well.

But it seems you have perhaps slightly misunderstood some nuances; "because of a field" is certainly not an accurate way to summarise it.

In any case, as I wrote above, storage requirement increases will vary, dependent on the actual data.

All that said, it seems quite an impactful commercial decision to make such a significant licensing change in the midst of the 8.x release cycle, when one of the most impacted use cases (metrics) is so common out in the field (excuse the pun). The negative impression this could leave on some potential customers has also been noted above.

Reminder that I don't work for, and never have worked for, Elastic; I'm just a curious user and volunteer who tries to help out on the forums sometimes.

1 Like

Given that this dramatic increase in storage size only applies to metrics, what is your retention period for these indices? If it is reasonably short, I do not see the need to reindex for major version upgrades as a good reason to keep the source available. Indices created in one major version can always be read in the next major version, so this is only an issue when you have a metrics index that needs to be read in a cluster running two major versions higher. If I were in your shoes, I would consider disabling source for this data, as it is almost always analysed through aggregations in Kibana, where the availability of the source does not matter.

Yes, we already thought about that. Currently we have different retention periods for metrics (but none of them are long enough to span two major upgrades). At the time I wrote that post, I thought logs were also affected. However, as @leandrojmp mentioned, apparently it is not possible to disable _source in general, even if we do not need to use the reindex task.

1 Like

It is surprising to me that this feature is not available with a Platinum level license, as that seems to make Elasticsearch and TSDS uneconomical for large metrics use cases without an Enterprise subscription. I can see why disabling source is not allowed, as a lot of users would not mind this and TSDS is available at the Basic level. To me it would have made more sense to make TSDS a commercial feature at some license level and not have it available at the Basic level, but I do not know the reasoning behind this as I no longer work for Elastic.

You do get some performance improvements by using TSDS, but it may be worthwhile looking into indexing your metrics into a standard data stream, where you can disable source, to see what impact that has on performance, usability and storage size.
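For what it's worth, a minimal, untested sketch of that suggestion, with hypothetical names (metrics-custom-nosource / metrics-custom.nosource-*). Getting the agents to actually write to such a data stream is a separate problem, and without _source you lose reindex, update-by-query and the ability to view original documents, while aggregations on doc_values still work:

  # Hypothetical index template for a standard (non-TSDS) data stream whose
  # mappings disable _source entirely; priority 500 is arbitrary, chosen to
  # win over the built-in metrics templates.
  curl -s -k -u "${EUSER}":"${EPASS}" -X PUT \
    -H 'Content-Type: application/json' \
    "https://${EHOST}:${EPORT}/_index_template/metrics-custom-nosource" \
    -d '{
      "index_patterns": ["metrics-custom.nosource-*"],
      "priority": 500,
      "data_stream": {},
      "template": {
        "mappings": {
          "_source": { "enabled": false }
        }
      }
    }'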

2 Likes

To clear up any confusion for existing customers: communication describing the change was sent via email to all paying customers in December 2024, including information about the grandfathering of current paid subscriptions (link accessible with active support contracts).

1 Like