Hello,
I am currently exploring the integration of semantic search into my web application, which already uses traditional keyword-based search. The keyword search functionality is working exceptionally well, and I want to ensure the new semantic search complements it seamlessly.
I have two main questions regarding the implementation of semantic search:
- Granularity of Vector Embeddings:
- Would you recommend transforming the entire content of the `body` field into a single vector for each document?
- Alternatively, is it better to split the content into smaller chunks (e.g., sentences or paragraphs) and generate multiple vectors for each document? I would like to understand the trade-offs in terms of granularity, retrieval precision, and overall search performance.
- Impact of Chunk-Level Processing on Document Count:
- If the `body` field is split into chunks and each chunk is indexed as a separate document with its own vector, this could significantly increase the number of documents in the index.
- Would this increase in document count have a noticeable impact on query performance or scalability?
- Are there best practices to efficiently manage such an approach, including linking chunks back to their parent documents during retrieval?
I would greatly appreciate your insights on these concerns and any recommendations for best practices in implementing semantic search while ensuring scalability and performance.
Thank you
Hi @Shirley_bk, welcome to the Elastic community!
Option 1: Single Vector for the Entire Document
- Pros:
- Easier to manage, as you’ll have one vector per document.
- Lower storage and indexing overhead since you’re only generating and storing one embedding per document.
- Simpler retrieval logic.
- Cons:
- May miss nuances in long documents where relevance might vary across different sections.
- Can reduce retrieval precision, as the vector represents the average meaning of the entire document.
Option 2: Multiple Vectors by Splitting into Chunks
- Pros:
- Improves retrieval precision since smaller chunks (e.g., sentences or paragraphs) can better match specific queries.
- Enables fine-grained semantic matching, which is particularly useful for long or diverse content.
- Cons:
- Increases the number of vectors and, consequently, storage requirements.
- Adds complexity in linking chunks back to the parent document during retrieval.
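The linking step in the last point can be kept fairly small. Here is a minimal sketch, assuming each chunk hit comes back with the id of its parent document and a similarity score. The `parent_id` and `score` keys and the `collapse_to_parents` helper are hypothetical names for illustration, not part of any Elasticsearch API. It keeps the best-scoring chunk per parent and ranks parents by that score:

```python
def collapse_to_parents(chunk_hits: list[dict]) -> list[tuple[str, float]]:
    """Rank parent documents by the score of their best-matching chunk.

    Each hit is assumed (hypothetically) to look like
    {"parent_id": "...", "score": 0.87}.
    """
    best: dict[str, float] = {}
    for hit in chunk_hits:
        pid = hit["parent_id"]
        # Keep only the highest chunk score seen for this parent.
        best[pid] = max(best.get(pid, float("-inf")), hit["score"])
    # Highest-scoring parents first.
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)


hits = [
    {"parent_id": "doc-1", "score": 0.62},
    {"parent_id": "doc-2", "score": 0.91},
    {"parent_id": "doc-1", "score": 0.88},
]
ranked = collapse_to_parents(hits)
```

Taking the maximum is only one possible choice; summing or averaging the top few chunk scores per parent is another, and which works better depends on your content.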
If your content is diverse and queries tend to focus on specific parts of documents, chunking is often the better approach. However, for short and cohesive content, a single vector per document may suffice. A good middle ground might be to experiment with splitting by paragraphs or sections and observing retrieval performance.
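For the paragraph-splitting experiment, something like the following could be a starting point. It is a rough sketch, assuming paragraphs in `body` are separated by blank lines; the `chunk_document` helper and the `parent_id`, `chunk_index`, and `text` field names are made up for illustration. Each chunk record carries its parent id so it can be linked back later:

```python
import re


def chunk_document(doc_id: str, body: str) -> list[dict]:
    """Split a document body on blank lines, one record per paragraph."""
    # One or more blank lines mark a paragraph boundary (an assumption
    # about the content; adjust the pattern for your own data).
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", body) if p.strip()]
    return [
        {"parent_id": doc_id, "chunk_index": i, "text": p}
        for i, p in enumerate(paragraphs)
    ]


chunks = chunk_document(
    "doc-1",
    "First paragraph.\n\nSecond paragraph.\n\n\nThird.",
)
```

Each record would then be embedded and indexed on its own, which is where the increase in document count comes from.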
Recommendation:
When a question is this broad and the person is just getting started, I usually say: start small, but start right away. If you build a RAG proof of concept in 3 days, your knowledge will grow significantly, and your questions will become more specific because they will be grounded in facts and analysis.
In addition to the forum, this link contains all the material you need:
I also highlight two articles that, depending on your knowledge of vector search, chunks, and RAG, might be a good starting point: