How to exclude dense_vector field from being stored

How can I exclude a dense_vector field from being stored in the _source?

I ran an experiment, indexing approximately 6,000,000 documents, and here is what I found after running:

curl --location --request POST 'http://127.0.0.1:9200/index_name.vec/_disk_usage?run_expensive_tasks=true'

The dense_vector field indeed takes approximately 4 KB per document in the index:
(4 × 1024 + 4) / 1024 / 1024 / 1024 × (5872381 + 777907) ≈ 25.3 GB
which is confirmed by:
"knn_vectors": "26.2gb"

But it is also storing the dense vectors as raw floats, without any optimization, in the _source field, because if I save a plain text file with 1024 floats I get approximately a 21 KB file:
21 / 1024 / 1024 × (5872381 + 777907) ≈ 133 GB
which is somewhat confirmed by
"stored_fields": "107.9gb"
and I can also see the vectors in the search output.
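A query along these lines is enough to see the raw floats echoed back in _source (just a sketch, reusing the index and field names from above):

    GET index_name.vec/_search
    {
      "size": 1,
      "_source": ["title_vector"]
    }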

So how do I exclude dense_vector from _source, since I do not need this field in the document representation anyway?
I tried setting:

        "title_vector": {
            "type": "dense_vector",
            "dims": 1024,
            "index": true,
            "similarity": "dot_product",
            "store": false
        },

But I got
"reason": "unknown parameter [store] on mapper [title_vector] of type [dense_vector]"

@ruslaniv take a look here: _source field | Elasticsearch Guide [8.5] | Elastic

There are many options, including a new one called synthetic _source: _source field | Elasticsearch Guide [8.5] | Elastic

Though the include/exclude option may be the one you want.
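For example, a create-index request along these lines should keep the vector indexed for kNN search while dropping it from the stored _source (a sketch only, reusing the field name and settings from your mapping):

    PUT index_name.vec
    {
      "mappings": {
        "_source": {
          "excludes": ["title_vector"]
        },
        "properties": {
          "title_vector": {
            "type": "dense_vector",
            "dims": 1024,
            "index": true,
            "similarity": "dot_product"
          }
        }
      }
    }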


I just don't understand what is going on with Elastic!

I created two indexes with exactly the same mappings except for one parameter: one mapping had the dense_vector excluded from the _source and the other did not:

"mappings": {
        "_source": {"excludes": ["title_vector"]},
        "properties": {
        ...}

Then I indexed the same 1,000 documents into both indexes:

vector_in_source       1000            0     21.5mb         21.5mb
no_vector_in_source    1000            0     21.2mb         21.2mb
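(Output in that shape can be produced with the cat indices API; the exact column selection below is my assumption:)

    GET _cat/indices/vector_in_source,no_vector_in_source?v&h=index,docs.count,docs.deleted,store.size,pri.store.size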

Upon inspecting the indexes, it turns out that:

  1. The index with vectors in source stores the dense_vector as plain floats in _source, as expected.
  2. The index with no vectors in source does not store the dense vectors, BUT it creates a new field called _recovery_source whose size is equal to what 1,000 1024-dim vectors stored as plain floats would occupy.

So even though I explicitly excluded the dense vectors from being stored, Elasticsearch still stores them, just under a different field!
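The per-field breakdown that exposes _recovery_source comes from the same disk usage analysis as before; a sketch against the test index:

    POST no_vector_in_source/_disk_usage?run_expensive_tasks=true

The response breaks disk usage down per field, including internal fields, which is where _recovery_source shows up.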

Looks like you already found this bug: `_recovery_source` sometimes remains after merge · Issue #82595 · elastic/elasticsearch · GitHub

Here is another issue explaining the behavior: Indices with "_source.enabled: false" same size as indices with "_source.enabled: true" · Issue #41628 · elastic/elasticsearch · GitHub

@ruslaniv could you try running POST no_vector_in_source/_forcemerge? That may remove it.

Another option is to try synthetic source: _source field | Elasticsearch Guide [8.5] | Elastic. But that is currently an "all or nothing" deal. I am not sure if it runs into the _recovery_source issue or not.
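If you want to experiment with it, the mapping change itself is small; a minimal sketch for 8.5 (the index name is a placeholder, and I have not checked how this interacts with dense_vector or _recovery_source):

    PUT test_synthetic_source
    {
      "mappings": {
        "_source": {
          "mode": "synthetic"
        },
        "properties": {
          "title_vector": {
            "type": "dense_vector",
            "dims": 1024,
            "index": true,
            "similarity": "dot_product"
          }
        }
      }
    }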


Ben, thank you for your help.

  1. Unfortunately, synthetic source is not an option, since we extensively use the flattened field type, which is not supported by synthetic source;

  2. I did try running _forcemerge on that index, but it did not resolve the problem: the _recovery_source field is still present in the analysis results and the index still takes the same amount of space.

Did you forcemerge down to a single segment so you know all segments have been processed?


Yes, I did run

POST no_vector_in_source/_forcemerge?max_num_segments=1

but it did not help
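(One way to double-check that the merge actually left a single segment per shard is the cat segments API; a sketch:)

    GET _cat/segments/no_vector_in_source?v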

