Hello!
I am currently running a few tests on vector search and vector embeddings. With a subset of 1,000 documents I get an index of roughly 15 MB, but when I create a new index for the vector embeddings of that subset, the index totals roughly 300 MB.
I took another subset of data to check the sizes, this time 10,000 documents with a dataset.size of roughly 300 MB; the embedding index that came out of this was roughly 6 GB. That is still a 20x increase.
Would this mean that for a dataset.size of 100 GB I would need about 2 TB of space?
The index with the vector embeddings stores passages (chunks), as explained here.
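To make the growth factor explicit, here is the back-of-envelope arithmetic behind my extrapolation (a minimal sketch; the per-vector size assumes E5 small outputs 384-dimensional float32 vectors, which is my assumption, and it ignores HNSW graph overhead):

```python
# Rough sizes from my 10,000-document test (MB).
raw_mb = 300        # dataset.size of the source documents
indexed_mb = 6000   # size of the index with vector embeddings

factor = indexed_mb / raw_mb
print(f"growth factor: ~{factor:.0f}x")  # ~20x

# Assumption: E5 small emits 384-dim float32 vectors (4 bytes per dim),
# and chunking produces several passage vectors per document.
bytes_per_vector = 384 * 4
print(f"per passage vector: {bytes_per_vector} bytes")  # ~1.5 KB

# Extrapolating the same factor to a 100 GB dataset:
print(f"100 GB -> ~{100 * factor / 1000:.1f} TB")
```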
- Is there any way to decrease this? (I guess a different approach to chunking could help, but I already tried to optimize the script given in the blog; one other idea I'm considering is in the mapping sketch below.)
- Is this normal?
- Is this even more efficient than normal BM25 search, given the huge size?
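The idea I'm considering (a sketch, not something I've verified yet): excluding the embedding field from `_source`, so the raw float arrays are not stored a second time in the stored JSON on top of the indexed vector structures. The index name and field name below are placeholders for whatever the pipeline actually writes, and I'm assuming a recent 8.x cluster:

```python
import requests

# Sketch: create the embeddings index with the vector field excluded
# from _source. "passage_embedding" and "my-embeddings-index" are
# placeholder names.
body = {
    "mappings": {
        "_source": {"excludes": ["passage_embedding"]},
        "properties": {
            "passage_embedding": {
                "type": "dense_vector",
                "dims": 384,            # E5 small (my assumption)
                "index": True,
                "similarity": "cosine",
            }
        },
    }
}
resp = requests.put("http://localhost:9200/my-embeddings-index", json=body)
print(resp.json())
```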
Below is a more "off-topic" question:
I noticed that when the embeddings finished, the index was considerably larger (almost 2x as large) than it is now; after a while the index size decreased to its current value. I saw this happen with the 10,000-document index: when the embeddings had finished it was around 7 GB, after a while I saw the dataset.size go up to 11 GB+, and then it went down to 6 GB.
- My question here is: what is happening that makes the index temporarily so much larger, and why does it decrease again? (See the monitoring sketch below for how I could watch this.)
If this is not normal, what would I be doing wrong?
(I tested with a full index containing both text and vector embeddings, and I also tested with a separate index for just the vector embeddings.)
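In case it helps to diagnose this, here is roughly how I could watch the size while it changes (a sketch using the standard _cat endpoints; the host and index name are placeholders):

```python
import requests

HOST = "http://localhost:9200"   # placeholder
INDEX = "my-embeddings-index"    # placeholder

# Overall index size over time.
print(requests.get(f"{HOST}/_cat/indices/{INDEX}?v&bytes=mb").text)

# Per-segment view: if the temporary growth comes from segments being
# rewritten and merged, old and new segments should show up side by
# side here before the old ones are deleted.
print(requests.get(f"{HOST}/_cat/segments/{INDEX}?v&bytes=mb").text)
```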
PS: I used the E5 small model. I tested with other models as well, but this was the smallest; for the same 15 MB subset, E5 large produced almost 800 MB.
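For reference, the small/large difference seems consistent with the vector dimensions alone (assuming float32 storage, and assuming from the model cards that E5 small outputs 384 dimensions and E5 large 1024; both assumptions are mine):

```python
small_dims, large_dims = 384, 1024   # my assumption from the model cards
expected_ratio = large_dims / small_dims
observed_ratio = 800 / 300           # MB, from my tests on the 15 MB subset

print(f"expected: ~{expected_ratio:.2f}x")   # ~2.67x
print(f"observed: ~{observed_ratio:.2f}x")   # ~2.67x, roughly consistent
```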
To summarize my questions:
- Is there any way to decrease the size?
- Is this normal?
- Is this still more efficient than normal search?
- What is happening behind the scenes with the index size?
Any help would be appreciated!
Kind regards,
Chenko