How to exclude dense_vector field from being stored

ruslaniv · November 29, 2022, 11:26am

How can I exclude a dense_vector field from being stored in _source?

I ran an experiment indexing approximately 6_000_000 documents, and here is what I found after running:

curl --location --request POST 'http://127.0.0.1:9200/index_name.vec/_disk_usage?run_expensive_tasks=true'

The dense_vector field indeed takes approximately 4 KB per document in the index:
(4 × 1024 + 4) / 1024 / 1024 / 1024 × (5872381 + 777907) ≈ 25.4 GB
which is confirmed by:
"knn_vectors": "26.2gb"

But it also stores the dense vectors as raw floats, without any optimization, in the _source field. If I save 1024 floats to a plain text file, the file is approximately 21 KB:
21 / 1024 / 1024 × (5872381 + 777907) ≈ 133 GB
which is somewhat confirmed by
"stored_fields": "107.9gb"
and I can also see the vectors in the search output.
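
The two estimates above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic from the post: the document counts, the "+ 4" bytes per vector, and the 21 KB text size are the figures quoted above, and the variable names are made up for illustration.

```python
# Back-of-the-envelope check of the two size estimates from the post.
DIMS = 1024
DOCS = 5_872_381 + 777_907  # document counts quoted in the post

# Vector index: 4 bytes per float32 component, plus the "+ 4" bytes
# per vector that appears in the post's formula.
knn_bytes_per_doc = 4 * DIMS + 4
knn_gib = knn_bytes_per_doc * DOCS / 1024**3

# _source: roughly 21 KB per vector when the floats are written as text
# (the size the poster measured for a plain text file of 1024 floats).
source_kib_per_doc = 21
source_gib = source_kib_per_doc * DOCS / 1024**2

print(f"knn_vectors estimate: {knn_gib:.1f} GB")
print(f"_source estimate:     {source_gib:.1f} GB")
print(f"text vs. binary:      {source_gib / knn_gib:.1f}x larger")
```

The estimates are in binary units (1 GB = 1024³ bytes), which is how the formulas in the post are written.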

So how do I exclude the dense_vector field from _source, since I do not need it in the document representation anyway?
I tried setting:

        "title_vector": {
            "type": "dense_vector",
            "dims": 1024,
            "index": true,
            "similarity": "dot_product",
            "store": false
        },

But I got:
"reason": "unknown parameter [store] on mapper [title_vector] of type [dense_vector]"

BenTrent · November 29, 2022, 12:27pm

@ruslaniv take a look here: _source field | Elasticsearch Guide [8.5] | Elastic

There are many options, including a new one called: _source field | Elasticsearch Guide [8.5] | Elastic

Though the include/exclude option may be the one you want.

ruslaniv · November 30, 2022, 7:37am

I just don't understand what is going on with Elastic!

I created two indexes with identical mappings except for one parameter: one mapping excludes the dense_vector field from _source and the other does not:

"mappings": {
    "_source": {"excludes": ["title_vector"]},
    "properties": {
        ...
    }
}

Then I indexed the same 1_000 documents into both indexes:

vector_in_source       1000            0     21.5mb         21.5mb
no_vector_in_source    1000            0     21.2mb         21.2mb

Upon inspecting the indexes, it turns out that:

The index with vectors in source stores dense_vector as plain floats in _source, as expected.
The index with no vectors in source does not store the dense vectors in _source, BUT it creates a new field called _recovery_source whose size equals what 1000 1024-dim vectors stored as plain floats would occupy.

So even though I explicitly excluded dense vectors from being stored in Elastic, they are still stored, just in a new field!
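
For anyone who wants to reproduce the comparison, the two index bodies could be generated roughly like this. It is a minimal sketch: the helper name build_index_body is hypothetical, only the title_vector field is shown because the other properties are elided in the post, and the "store" parameter is left out since Elasticsearch rejected it earlier in the thread.

```python
import copy
import json

# Field definition from the first post, without the rejected "store" parameter.
# Other fields are elided in the thread and therefore not reproduced here.
PROPERTIES = {
    "title_vector": {
        "type": "dense_vector",
        "dims": 1024,
        "index": True,
        "similarity": "dot_product",
    },
}


def build_index_body(exclude_vector: bool) -> dict:
    """Return a create-index body, optionally excluding title_vector from _source."""
    mappings = {"properties": copy.deepcopy(PROPERTIES)}
    if exclude_vector:
        mappings["_source"] = {"excludes": ["title_vector"]}
    return {"mappings": mappings}


# Index names match the ones used in this thread.
bodies = {
    "vector_in_source": build_index_body(exclude_vector=False),
    "no_vector_in_source": build_index_body(exclude_vector=True),
}

for name, body in bodies.items():
    # Each body would be sent as the payload of: PUT /<name>
    print(name)
    print(json.dumps(body, indent=2))
```

The only difference between the two bodies is the "_source" entry, which keeps the comparison limited to that one setting.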

BenTrent · November 30, 2022, 2:29pm

Looks like you already found this bug: `_recovery_source` sometimes remains after merge · Issue #82595 · elastic/elasticsearch · GitHub

Here is another issue explaining the behavior: Indices with "_source.enabled: false" same size as indices with "_source.enabled: true" · Issue #41628 · elastic/elasticsearch · GitHub

@ruslaniv could you try POST no_vector_in_source/_forcemerge ? That may remove it.

Another option is to try synthetic source: _source field | Elasticsearch Guide [8.5] | Elastic But that is currently an "all or nothing" deal. I am not sure if it runs into the _recovery_source issue or not.

ruslaniv · December 1, 2022, 7:54am

Ben, thank you for your help.

Unfortunately, synthetic source is not an option, since we make extensive use of the flattened field, which synthetic source does not support.
I did try running _forcemerge on that index, but it did not resolve the problem: the _recovery_source field is still present in the analysis results and the index still takes the same amount of space.

Christian_Dahlqvist · December 1, 2022, 8:02am

Did you forcemerge down to a single segment so you know all segments have been processed?

ruslaniv · December 1, 2022, 8:20am

Yes I did run

POST no_vector_in_source/_forcemerge?max_num_segments=1

but it did not help
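
The check described in this thread (force merge to one segment, then rerun the disk usage analysis and look for _recovery_source) could be scripted roughly as below. This is a sketch under assumptions: the helper names are hypothetical, the host is the one from the curl command in the first post, the response layout (index name, then "fields", then per-field "stored_fields_in_bytes") is assumed rather than copied from a real response, and the sample values are placeholders, not measurements.

```python
from urllib.parse import urlencode

BASE_URL = "http://127.0.0.1:9200"  # host used in the curl command in the first post


def forcemerge_url(index: str, max_num_segments: int = 1) -> str:
    """URL for POST <index>/_forcemerge?max_num_segments=N."""
    query = urlencode({"max_num_segments": max_num_segments})
    return f"{BASE_URL}/{index}/_forcemerge?{query}"


def disk_usage_url(index: str) -> str:
    """URL for POST <index>/_disk_usage?run_expensive_tasks=true."""
    query = urlencode({"run_expensive_tasks": "true"})
    return f"{BASE_URL}/{index}/_disk_usage?{query}"


def stored_bytes_by_field(disk_usage: dict, index: str) -> dict:
    """Map each field to its stored-fields size (assumed response layout)."""
    fields = disk_usage[index]["fields"]
    return {name: stats.get("stored_fields_in_bytes", 0) for name, stats in fields.items()}


# Placeholder response fragment, written by hand for illustration only.
sample_response = {
    "no_vector_in_source": {
        "fields": {
            "_source": {"stored_fields_in_bytes": 1},
            "_recovery_source": {"stored_fields_in_bytes": 2},
        }
    }
}

print(forcemerge_url("no_vector_in_source"))
print(disk_usage_url("no_vector_in_source"))

sizes = stored_bytes_by_field(sample_response, "no_vector_in_source")
if sizes.get("_recovery_source", 0) > 0:
    print("_recovery_source is still present:", sizes["_recovery_source"], "bytes")
```

Both URLs are meant to be sent as POST requests, as in the commands quoted in this thread.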

system · December 29, 2022, 8:20am

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.
