I am new here and trying to use Elasticsearch for vector similarity search.
My dataset has about 100M records, each containing a 384-dimensional vector and a string payload.
I am building an HNSW index, but it raises OOM errors even with a small portion (10M records) of my data in a 25 GB Docker container. Given my machine's total available RAM, it will be hard to add enough memory to fit all 100M records.
Any suggestions for loading all the data into the database with limited RAM?
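For scale, a rough back-of-envelope on why float32 vectors blow past 25 GB, and why quantization helps. This is only a sketch: the figures ignore HNSW graph overhead and JVM/heap costs, and `int8` storage assumes scalar quantization (available via the `int8_hnsw` index option in Elasticsearch 8.12+):

```python
# Back-of-envelope memory estimate for raw vector storage.
# Assumption: graph overhead and heap costs are ignored; real usage is higher.
NUM_DOCS = 100_000_000
DIMS = 384

def vector_bytes(num_docs: int, dims: int, bytes_per_dim: int) -> int:
    """Raw bytes needed to hold the vectors alone."""
    return num_docs * dims * bytes_per_dim

float32_gb = vector_bytes(NUM_DOCS, DIMS, 4) / 1e9  # float32: 4 bytes per dim
int8_gb = vector_bytes(NUM_DOCS, DIMS, 1) / 1e9     # int8-quantized: 1 byte per dim

print(f"float32: {float32_gb:.1f} GB")  # ≈ 153.6 GB
print(f"int8:    {int8_gb:.1f} GB")     # ≈ 38.4 GB
```

Even quantized, 100M vectors need well over 25 GB, so splitting across nodes or reducing what is held in memory is likely unavoidable.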
Hi @louis_sg, have you tried the solution from this Stack Overflow thread: "Uploading large 800gb json file from remote server to elasticsearch"?
Split your data into smaller chunks and send them to Elasticsearch with multiple bulk requests.
Yes, I am already using this method to upload records in batches of 64.
Could you share the mapping of the field you are creating the embeddings from, and its maximum size?
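For reference, here is a hedged example of what such a mapping might look like. The field names are illustrative, and the `int8_hnsw` index option (scalar quantization, which cuts vector memory roughly 4x) assumes Elasticsearch 8.12 or later:

```python
# Example dense_vector mapping (illustrative field names; int8_hnsw assumes ES 8.12+).
mapping = {
    "mappings": {
        "properties": {
            "embedding": {
                "type": "dense_vector",
                "dims": 384,
                "index": True,
                "similarity": "cosine",
                "index_options": {"type": "int8_hnsw"},
            },
            "payload": {"type": "keyword"},
        }
    }
}

# With a live client (not executed here):
# es.indices.create(index="vectors", body=mapping)
```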