Huge index size increase with vector embeddings

Hello!

I am currently running a few tests on vector search and vector embeddings. With a subset of 1,000 documents I get an index of roughly 15 MB. When I create a new index for the vector embeddings, that index ends up at roughly 300 MB.

I took another subset of data to check the sizes: this time the subset was 10,000 documents with a dataset.size of roughly 300 MB, and the resulting embedding index came out at roughly 6 GB. That is still a 20× increase.

Does this mean that for a dataset.size of 100 GB I would need 2 TB of space?

The index with vector embeddings uses passages (chunks), as explained here.

  1. Is there any way to decrease this?
    (I guess a different approach to chunking could help, but I already tried to optimize the script given in the blog.)
  2. Is this normal?
  3. Is this still more efficient than normal BM25 search?
    (considering the huge size)

Below is a more "off-topic" question:
When the embedding generation finished, the index was considerably larger (almost twice its current size); after a while it decreased to the size it is now. I saw this happening with the 10,000-document index: when the embeddings had finished it was around 7 GB, then I saw dataset.size climb to over 11 GB, and eventually it dropped back down to 6 GB.

  1. My question here is: what is happening that makes the index temporarily so much larger, and why does it shrink again?

If this is not normal, what would I be doing wrong?
(I tested with a combined index of text + vector embeddings,
and also with a separate index containing just the vector embeddings.)

P.S. I used the E5 small model. I tested with other models as well, but this was the smallest: for the same 15 MB subset, E5 large produced almost 800 MB.

To summarize my questions:

  1. Is there any way to decrease the size?
  2. Is this normal?
  3. Is this still more efficient than normal search?
  4. What is happening behind the scenes with the index size?

Any help would be appreciated!

Kind regards,
Chenko

Hi!

The size increase comes from storing the extra passages as nested fields, as well as all the actual embedding representations generated by the model.
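For illustration, a mapping along these lines shows where the extra space goes (the index and field names are just placeholders, not necessarily what the blog uses): each chunk becomes a nested document that carries its own 384-dimensional vector.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # adjust to your cluster

# Hypothetical mapping: every document stores its chunks ("passages") as
# nested documents, and each chunk carries its own E5-small embedding.
es.indices.create(
    index="my-chunked-index",
    mappings={
        "properties": {
            "title": {"type": "text"},
            "passages": {
                "type": "nested",
                "properties": {
                    "text": {"type": "text"},
                    "vector": {
                        "type": "dense_vector",
                        "dims": 384,          # E5-small output dimension
                        "index": True,
                        "similarity": "cosine",
                    },
                },
            },
        }
    },
)
```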

The small E5 model still has an output dimension of 384.

You could indeed work on your chunking strategy: rather than generating embeddings for each sentence, you could embed each paragraph, or chunks up to the maximum input length the model accepts (I think for the small E5 that's 512 tokens).
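As a rough back-of-the-envelope check (the chunk counts below are made-up numbers, and this only counts the raw float32 vectors, not the HNSW structures or the duplicated chunk text):

```python
# Rough storage estimate for the raw vectors only (float32 = 4 bytes per dimension).
dims = 384                       # E5-small
bytes_per_vector = dims * 4      # 1,536 bytes

docs = 10_000
scenarios = {
    "sentence-level chunking": 100,   # made-up average chunks per document
    "paragraph-level chunking": 10,   # made-up average chunks per document
}

for label, chunks_per_doc in scenarios.items():
    total_bytes = docs * chunks_per_doc * bytes_per_vector
    print(f"{label}: ~{total_bytes / 1024**3:.2f} GiB of raw vectors")
```

The real index will be larger than this because of the HNSW graph and the nested documents themselves, but it shows how directly the chunk count drives the size.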

You can take a look at this example for some more chunking customization available through LangChain.
Or you can of course still use the painless script and modify it to split on something other than sentence boundaries. (Each generated embedding is the same size regardless of chunk length, so 3 big chunks per document take up much less space than 30 smaller ones.)
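A sketch of coarser, paragraph-first chunking with LangChain's text splitter (the parameter values are only a starting point to tune, and the sample text is a placeholder):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

long_document_text = "First paragraph...\n\nSecond paragraph...\n\nThird paragraph..."

# Prefer paragraph breaks, falling back to sentences and words only when a
# paragraph is too long, so you end up with fewer, larger chunks (= fewer vectors).
splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " "],
    chunk_size=1000,     # characters per chunk; keep chunks under the model's 512-token limit
    chunk_overlap=100,   # small overlap so context isn't cut mid-thought
)

chunks = splitter.split_text(long_document_text)
print(len(chunks), "chunks")
```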

As for whether this is "better" than normal search: that depends on your data and use case. You can pick a few diverse queries, run normal and kNN searches in parallel, and evaluate the results. Similarly, you can try a few chunking granularities and compare those too (does it work better with sentences, paragraphs, embedding each document with no chunking, etc.?).
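For example, a quick side-by-side test could look roughly like this (index and field names match the hypothetical mapping above, and embed() is a placeholder for however you generate the E5 query vector):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")      # adjust to your cluster
query_text = "how do I configure snapshots"      # one of your test queries
query_vector = embed(query_text)                 # placeholder: your E5 embedding call (384 floats)

# Classic BM25 full-text search over the chunk text.
bm25 = es.search(
    index="my-chunked-index",
    query={"nested": {"path": "passages",
                      "query": {"match": {"passages.text": query_text}}}},
    size=10,
)

# Approximate kNN over the chunk vectors (kNN on nested vectors needs a recent 8.x version).
knn = es.search(
    index="my-chunked-index",
    knn={"field": "passages.vector", "query_vector": query_vector,
         "k": 10, "num_candidates": 100},
)

# Compare which documents each approach surfaces for the same query.
print("BM25:", [hit["_id"] for hit in bm25["hits"]["hits"]])
print("kNN: ", [hit["_id"] for hit in knn["hits"]["hits"]])
```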

Hope this helps for some direction!


Thanks for your reply!

Reading between the lines, we conclude that semantic search projects fundamentally need much more storage (roughly 20 times as much).