This is an interesting thread, and I'm not sure I follow completely, so excuse this contribution if it's unhelpful:
The old index in this case is .ds-metrics-kubernetes.container-inf067_osprod_prod-2025.02.09-000101. It has
"_source": {
"mode": "synthetic"
}
in its mapping, and uses relatively little disk space per doc.
The new index in this case is .ds-metrics-kubernetes.container-inf067_osprod_prod-2025.02.10-000103. It has
"_source": {
"mode": "stored"
},
in its mapping, and uses a lot more disk space per doc.
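For anyone wanting to check their own backing indices, something like this should show which mode the mapping has (same EUSER/EHOST-style variables as my command further down; filter_path just trims the response):

# Show just the _source section of a backing index's mapping.
# Index name taken from this thread; adjust for your own.
curl -s -k -u "${EUSER}":"${EPASS}" \
  "https://${EHOST}:${EPORT}/.ds-metrics-kubernetes.container-inf067_osprod_prod-2025.02.10-000103/_mapping?filter_path=*.mappings._source"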
So for these specific docs, synthetic source was far more efficient in disk usage. This is as documented:
"While this on-the-fly reconstruction is generally slower than saving the source documents verbatim and loading them at query time, it saves a lot of storage space."
So we are only surprised at how much more disk storing the _source takes compared to synthetic source. Er, why? Surely that depends entirely on the actual data: the relative cardinalities of the various fields, how close to "random" the data is, and so on.
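If someone wants to see exactly where the bytes go for this data, the disk usage API (still a technical preview feature, last I looked) can break an index down per field, including what the stored _source itself costs. A sketch, untested against that cluster:

# Per-field disk usage breakdown; run_expensive_tasks=true is required,
# and this scans the whole index, so expect it to be slow / IO-heavy.
# Look at the "_source" entry under "fields" in the response.
curl -s -k -u "${EUSER}":"${EPASS}" -X POST \
  "https://${EHOST}:${EPORT}/.ds-metrics-kubernetes.container-inf067_osprod_prod-2025.02.10-000103/_disk_usage?run_expensive_tasks=true" | jq .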
And why not just enable synthetic source again? BTW, @Alphayeeeet has not yet commented on which license he/she is using. And even if that's blocked by a Basic license, isn't this sort of thing, er, the reason some "nice" things are paid features? I mean, it was a commercial decision to make synthetic _source Enterprise-only, right?
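If the license does allow it, re-enabling should just be a matter of putting the mapping back in the data stream's index template and rolling over; new backing indices pick it up, existing ones don't change. A rough sketch, with the template name being a guess on my part (check GET _index_template for the real one):

# 1. Find the template that covers the data stream (name below is hypothetical):
curl -s -k -u "${EUSER}":"${EPASS}" \
  "https://${EHOST}:${EPORT}/_index_template/metrics-kubernetes.container*"

# 2. PUT the template back with this inside "template.mappings":
#      "_source": { "mode": "synthetic" }

# 3. Roll over so a fresh backing index is created with the new mapping
#    (data stream name derived from the backing index names above):
curl -s -k -u "${EUSER}":"${EPASS}" -X POST \
  "https://${EHOST}:${EPORT}/metrics-kubernetes.container-inf067_osprod_prod/_rollover"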
I had a look at my own data, nothing serious, but found the following useful for seeing the average storage size per doc of my indices:
curl -s -k -u "${EUSER}":"${EPASS}" \
  "https://${EHOST}:${EPORT}/_cat/indices?index=.*&bytes=b&format=json" \
  | jq -r '.[] | [ .index, ."docs.count", ."pri.store.size" ] | @tsv' \
  | awk '$2>0 {printf "index: %-72s doc_count: %12d pri_store_size %12d average_doc_size %12.0f\n", $1, $2, $3, $3/$2}' \
  | sort -k8nr \
  | head
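The same idea, narrowed to just the two backing indices from this thread (plain-text cat output this time, so no jq; the h= parameter picks the columns), would make the synthetic-vs-stored difference per doc obvious:

# Average doc size for just the backing indices discussed above.
curl -s -k -u "${EUSER}":"${EPASS}" \
  "https://${EHOST}:${EPORT}/_cat/indices/.ds-metrics-kubernetes.container-inf067_osprod_prod-*?bytes=b&h=index,docs.count,pri.store.size" \
  | awk '$2>0 {printf "%-72s docs: %12d avg_doc_bytes: %8.0f\n", $1, $2, $3/$2}'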