Huge amount of Space not released for _recovery_source

Wei_Chen · March 5, 2025, 6:26pm

Hi, I am trying to reduce the disk space my dense vectors take up in the _source field, but have had no luck so far. I turned on synthetic source, but now I see a new _recovery_source field taking about 4x the space of my vectors. I tried both shortening the soft-delete retention with index.soft_deletes.retention_lease.period and force-merging my index using

client.indices.forcemerge(index=my_index)

Neither of these methods helps: the _recovery_source remains even a few days after the index was force-merged. What should I look into here?

This is costing us a horrible amount of budget because Elasticsearch is storing 5x what it's supposed to store, and so far I have found no obvious documentation on optimizing this. We are considering other vector DB providers if we can't solve this issue.

Appendix:
My server version: v8.17.2
Disk usage breakdown:

{
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "54.346_embedding_1740616609": {
    "store_size": "11.9mb",
    "store_size_in_bytes": 12546243,
    "all_fields": {
      "total": "11.9mb",
      "total_in_bytes": 12527444,
      "inverted_index": {
        "total": "12.4kb",
        "total_in_bytes": 12753
      },
      "stored_fields": "9mb",
      "stored_fields_in_bytes": 9444229,
      "doc_values": "446b",
      "doc_values_in_bytes": 446,
      "points": "613b",
      "points_in_bytes": 613,
      "norms": "0b",
      "norms_in_bytes": 0,
      "term_vectors": "0b",
      "term_vectors_in_bytes": 0,
      "knn_vectors": "2.9mb",
      "knn_vectors_in_bytes": 3069403
    },
    "fields": {
      "__soft_deletes": {
        "total": "18b",
        "total_in_bytes": 18,
        "inverted_index": {
          "total": "0b",
          "total_in_bytes": 0
        },
        "stored_fields": "0b",
        "stored_fields_in_bytes": 0,
        "doc_values": "18b",
        "doc_values_in_bytes": 18,
        "points": "0b",
        "points_in_bytes": 0,
        "norms": "0b",
        "norms_in_bytes": 0,
        "term_vectors": "0b",
        "term_vectors_in_bytes": 0,
        "knn_vectors": "0b",
        "knn_vectors_in_bytes": 0
      },
      "_id": {
        "total": "20.5kb",
        "total_in_bytes": 21025,
        "inverted_index": {
          "total": "10.4kb",
          "total_in_bytes": 10667
        },
        "stored_fields": "10.1kb",
        "stored_fields_in_bytes": 10358,
        "doc_values": "0b",
        "doc_values_in_bytes": 0,
        "points": "0b",
        "points_in_bytes": 0,
        "norms": "0b",
        "norms_in_bytes": 0,
        "term_vectors": "0b",
        "term_vectors_in_bytes": 0,
        "knn_vectors": "0b",
        "knn_vectors_in_bytes": 0
      },
      "_primary_term": {
        "total": "0b",
        "total_in_bytes": 0,
        "inverted_index": {
          "total": "0b",
          "total_in_bytes": 0
        },
        "stored_fields": "0b",
        "stored_fields_in_bytes": 0,
        "doc_values": "0b",
        "doc_values_in_bytes": 0,
        "points": "0b",
        "points_in_bytes": 0,
        "norms": "0b",
        "norms_in_bytes": 0,
        "term_vectors": "0b",
        "term_vectors_in_bytes": 0,
        "knn_vectors": "0b",
        "knn_vectors_in_bytes": 0
      },
      "_recovery_source": {
        "total": "8.9mb",
        "total_in_bytes": 9416321,
        "inverted_index": {
          "total": "0b",
          "total_in_bytes": 0
        },
        "stored_fields": "8.9mb",
        "stored_fields_in_bytes": 9416321,
        "doc_values": "0b",
        "doc_values_in_bytes": 0,
        "points": "0b",
        "points_in_bytes": 0,
        "norms": "0b",
        "norms_in_bytes": 0,
        "term_vectors": "0b",
        "term_vectors_in_bytes": 0,
        "knn_vectors": "0b",
        "knn_vectors_in_bytes": 0
      },
      "_seq_no": {
        "total": "1012b",
        "total_in_bytes": 1012,
        "inverted_index": {
          "total": "0b",
          "total_in_bytes": 0
        },
        "stored_fields": "0b",
        "stored_fields_in_bytes": 0,
        "doc_values": "399b",
        "doc_values_in_bytes": 399,
        "points": "613b",
        "points_in_bytes": 613,
        "norms": "0b",
        "norms_in_bytes": 0,
        "term_vectors": "0b",
        "term_vectors_in_bytes": 0,
        "knn_vectors": "0b",
        "knn_vectors_in_bytes": 0
      },
      "_source": {
        "total": "17.1kb",
        "total_in_bytes": 17550,
        "inverted_index": {
          "total": "0b",
          "total_in_bytes": 0
        },
        "stored_fields": "17.1kb",
        "stored_fields_in_bytes": 17550,
        "doc_values": "0b",
        "doc_values_in_bytes": 0,
        "points": "0b",
        "points_in_bytes": 0,
        "norms": "0b",
        "norms_in_bytes": 0,
        "term_vectors": "0b",
        "term_vectors_in_bytes": 0,
        "knn_vectors": "0b",
        "knn_vectors_in_bytes": 0
      },
      "_version": {
        "total": "29b",
        "total_in_bytes": 29,
        "inverted_index": {
          "total": "0b",
          "total_in_bytes": 0
        },
        "stored_fields": "0b",
        "stored_fields_in_bytes": 0,
        "doc_values": "29b",
        "doc_values_in_bytes": 29,
        "points": "0b",
        "points_in_bytes": 0,
        "norms": "0b",
        "norms_in_bytes": 0,
        "term_vectors": "0b",
        "term_vectors_in_bytes": 0,
        "knn_vectors": "0b",
        "knn_vectors_in_bytes": 0
      },
      "content_id": {
        "total": "2kb",
        "total_in_bytes": 2086,
        "inverted_index": {
          "total": "2kb",
          "total_in_bytes": 2086
        },
        "stored_fields": "0b",
        "stored_fields_in_bytes": 0,
        "doc_values": "0b",
        "doc_values_in_bytes": 0,
        "points": "0b",
        "points_in_bytes": 0,
        "norms": "0b",
        "norms_in_bytes": 0,
        "term_vectors": "0b",
        "term_vectors_in_bytes": 0,
        "knn_vectors": "0b",
        "knn_vectors_in_bytes": 0
      },
      "embedding": {
        "total": "2.9mb",
        "total_in_bytes": 3069403,
        "inverted_index": {
          "total": "0b",
          "total_in_bytes": 0
        },
        "stored_fields": "0b",
        "stored_fields_in_bytes": 0,
        "doc_values": "0b",
        "doc_values_in_bytes": 0,
        "points": "0b",
        "points_in_bytes": 0,
        "norms": "0b",
        "norms_in_bytes": 0,
        "term_vectors": "0b",
        "term_vectors_in_bytes": 0,
        "knn_vectors": "2.9mb",
        "knn_vectors_in_bytes": 3069403
      }
    }
  }
}
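
To read a breakdown like the one above at a glance, a small helper can rank the fields by size. This is a minimal sketch, assuming a response shaped like the JSON in this post; the function name `summarize_disk_usage` and the trimmed `sample` dict are illustrative and not part of any client library. Only the byte counts are copied from the post.

```python
from typing import Dict, List, Tuple


def summarize_disk_usage(response: Dict) -> List[Tuple[str, int, float]]:
    """Return (field, bytes, share_of_all_fields) rows, largest first."""
    rows = []
    for index_name, body in response.items():
        if index_name == "_shards":  # shard summary, not an index entry
            continue
        total = body["all_fields"]["total_in_bytes"]
        for field, stats in body["fields"].items():
            size = stats["total_in_bytes"]
            rows.append((field, size, size / total if total else 0.0))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Trimmed copy of the response above: only the per-field totals are kept.
sample = {
    "_shards": {"total": 1, "successful": 1, "failed": 0},
    "54.346_embedding_1740616609": {
        "all_fields": {"total_in_bytes": 12527444},
        "fields": {
            "_id": {"total_in_bytes": 21025},
            "_recovery_source": {"total_in_bytes": 9416321},
            "_source": {"total_in_bytes": 17550},
            "content_id": {"total_in_bytes": 2086},
            "embedding": {"total_in_bytes": 3069403},
        },
    },
}

rows = summarize_disk_usage(sample)

if __name__ == "__main__":
    for field, size, share in rows:
        print(f"{field:20s} {size:>10d} bytes  {share:6.1%}")
```

On the numbers in this post, _recovery_source comes out as roughly three quarters of the indexed bytes, with the embedding field itself a distant second.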

RainTown · March 5, 2025, 7:04pm

What license do you have?

Reason I ask is the synthetic source licensing changed in 8.17. It’s in the release notes.

Wei_Chen · March 5, 2025, 8:20pm

Hi Kevin, thanks for the quick response! We are on 'Monthly Enterprise' plan and our subscription is 'Enterprise with premium support'. Is that the 'license' you are asking about?

Wei_Chen · March 5, 2025, 8:24pm

Additionally, we tried synthetic source mode and it did successfully remove the _source field. However, it creates another _recovery_source field, which we couldn't manage to remove.

nhat · March 5, 2025, 8:25pm

Hello @Wei_Chen

This is a known issue. See: https://github.com/elastic/elasticsearch/issues/116726 and https://github.com/elastic/elasticsearch/issues/41628.

In your setup, the segment was already fully merged when the retention lease advanced. However, if more data is ingested and that segment becomes mergeable again with other new segments, the _recovery_source will eventually be pruned by merges. There is one exception: if the segment is very large (>5GB) and fully merged, the _recovery_source won't be removed.

In Elasticsearch 8.18 or later, you can disable _recovery_source via a new index setting: index.recovery.use_synthetic_source. See: https://github.com/elastic/elasticsearch/pull/114618/
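
As a rough illustration of the setting mentioned above, the sketch below builds a request body for a new index. Only `index.recovery.use_synthetic_source` comes from this reply. The `index.mapping.source.mode` setting, the single-field mapping, the 768 dimensions, and the helper name `build_index_body` are assumptions made for the example, and I am also assuming the setting has to be supplied when the index is created. Please check the linked PR and the 8.18 documentation before relying on any of it.

```python
from typing import Any, Dict


def build_index_body(dims: int) -> Dict[str, Any]:
    """Build settings and mappings for a synthetic-source vector index."""
    settings = {
        # Assumption: synthetic _source is enabled through this index setting.
        "index.mapping.source.mode": "synthetic",
        # From the reply above (8.18 or later): avoid storing _recovery_source.
        "index.recovery.use_synthetic_source": True,
    }
    mappings = {
        "properties": {
            # Hypothetical mapping; the real index has more fields.
            "embedding": {"type": "dense_vector", "dims": dims},
        }
    }
    return {"settings": settings, "mappings": mappings}


body = build_index_body(768)  # 768 is a placeholder dimension count
```

With the official Python client, the result could be passed as `client.indices.create(index="my_index_v2", **body)`, where `my_index_v2` is a placeholder name. Under the assumption that the setting only applies at creation time, existing documents would then have to be reindexed into the new index.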

RainTown · March 5, 2025, 8:47pm

Yes. It was just a guess, and wrong in this case.

Wei_Chen · March 5, 2025, 9:25pm

Thank you Nhat! Our version is 8.17.2 and we don't seem to be able to upgrade to 8.18. The only upgrade option offered is 8.17.3. What should we do to upgrade?

Wei_Chen · March 5, 2025, 9:29pm

It's also fine if you have any suggestions that work within 8.17 (so that we don't have to upgrade). For example, can I merge twice and trick the system into purging the larger _recovery_source field?

leandrojmp · March 5, 2025, 9:30pm

You need to wait for it to be released; 8.18 has not been released yet.

Wei_Chen · March 11, 2025, 9:41pm

Great to know, thanks @leandrojmp. @nhat, when can we expect the release? This issue is costing us a lot in storage, so it'd be great to have an ETA.

Wei_Chen · April 2, 2025, 9:14pm

@nhat would you mind providing a timeline for the 8.18 release? We are spending over $2k per month and are reaching the limit, and the next tier will double the cost. It doesn't make sense for us to stay on this plan because the storage should take only 30% of the current size.
