I did some testing with ELSER, and after that I used the E5 model to play around with semantic search. I use knn search to query multiple embedding fields, and I also looked at retrievers to create a hybrid query. This all works fine and I learned a lot about this topic.
Now I saw that there is a new field type "semantic_text" and read the documentation about this.
The documentation said that using "semantic_text" means that chunking will be done automatically on your behalf when indexing.
How does this work when not using "semantic_text" fields, for example with "dense_vector"?
For now I made a knn query on multiple fields and use k, num_candidates, similarity and filter to get the nearest neighbors as results.
Can you already query multiple fields with a semantic query?
If I want to use fine-tuning options like k, num_candidates, etc., is it still better to use knn instead of the semantic query?
When using a dense_vector field, you have to implement your own chunking before indexing your data in Elasticsearch. IIRC, this has been done in the past through scripts in an ingest pipeline.
No, you currently cannot query multiple fields with a single semantic query, but you also cannot query multiple fields with a single knn query. I assume you are referring to this syntax? This is effectively syntactic sugar around executing multiple knn queries and joining them with a boolean OR. You can do the same with multiple semantic queries in a bool query:
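As a minimal sketch of that bool approach (not taken from the original reply), the request body below is built in Python with one `semantic` clause per field, combined under `should`. The field names `title_semantic` and `description_semantic` and the helper name are hypothetical; substitute your own semantic_text fields.

```python
import json


def multi_field_semantic_query(query_text, fields):
    """Build a search body with one `semantic` clause per field.

    The clauses sit under bool/should, so a document matches if any
    of the fields matches (a boolean OR).
    """
    return {
        "query": {
            "bool": {
                "should": [
                    {"semantic": {"field": field, "query": query_text}}
                    for field in fields
                ]
            }
        }
    }


# Hypothetical field names, for illustration only.
body = multi_field_semantic_query(
    "how does chunking work",
    ["title_semantic", "description_semantic"],
)
print(json.dumps(body, indent=2))
```

The printed JSON can be sent as the body of a `_search` request.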
Correct. If you want to fine-tune your query, use the appropriate vector query against your semantic_text field. This is covered in our advanced search documentation.
I have documents with a title and a description that I use to create embeddings, and I didn't implement any chunking in the pipeline. If the data in the description field is very large, is the embedding generated without chunking for the entire description field, or just for a small part of it (for example 150 or 250 characters)? And what does having no chunking mean for the accuracy of my search results?
For the BM25 query I use the "multi_match" to query multiple fields. For the "knn search" I indeed use the syntax you are referring to.
If I want to query the title and description fields when I use semantic_text fields and the semantic query, do I have to do this via a bool query?
In the advanced search documentation they use the "nested" syntax.
Is this because of the chunks, and do I need to use it?
So with a normal knn query I don't use the chunks? Or do I have to use the "inner_hits" syntax (which I thought can be quite resource-intensive) with the knn query?
When I use a normal knn query (without nested syntax as above query) on a semantic_text field I get an error: "failed to create query: [knn] queries are only supported on [dense_vector] fields"
If the data in your description field is very large, it will be truncated to fit the model's window size (which varies by model) before the embedding is generated. This means that your semantic queries will match based on the truncated data. In other words, the embedding generated will only represent the contents of the description field up to the point at which it is truncated.
Correct. You are essentially already doing this with your knn search approach; you are just using some syntactic sugar to simplify the query representation a bit.
Correct. The semantic_text field uses a nested structure internally to index chunks and thus, when you want to perform a knn query on a semantic_text field, you need to account for this. We plan on simplifying this very soon so that you can transparently use a knn query on a semantic_text field without needing to manually account for these complications.
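For reference, here is a rough sketch of what such a nested knn query against a semantic_text field can look like, built as a Python dict. The internal paths (`<field>.inference.chunks` and `<field>.inference.chunks.embeddings`) follow the 8.15 advanced search documentation as far as I recall and may differ in other versions, so please verify them against your own index mapping. The field name `description` and the inference endpoint ID `my-e5-endpoint` are hypothetical.

```python
import json


def semantic_text_knn_query(field, inference_id, query_text, num_candidates=None):
    """Wrap a knn query in a nested query targeting semantic_text chunks.

    Assumption: chunk embeddings live under `<field>.inference.chunks`,
    as described for 8.15. Check your mapping before relying on this.
    """
    knn = {
        "field": f"{field}.inference.chunks.embeddings",
        "query_vector_builder": {
            "text_embedding": {
                "model_id": inference_id,
                "model_text": query_text,
            }
        },
    }
    # Only add the tuning option when the caller asks for it.
    if num_candidates is not None:
        knn["num_candidates"] = num_candidates
    return {
        "query": {
            "nested": {
                "path": f"{field}.inference.chunks",
                "query": {"knn": knn},
            }
        }
    }


# Hypothetical field and endpoint names.
body = semantic_text_knn_query(
    "description", "my-e5-endpoint", "mountain lake", num_candidates=100
)
print(json.dumps(body, indent=2))
```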
Normal knn queries do not query nested fields. They query dense_vector fields, which store only one embedding. If you wanted to store multiple embeddings manually (like you must do to store chunked embeddings), you would need to create a mapping of nested dense_vector fields (which is what the semantic_text field does internally) and you would have to write a similar query.
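To make the manual variant concrete, here is a sketch of a mapping with a nested dense_vector field and a matching nested knn query. The names `passages`, `passages.text` and `passages.vector` are hypothetical, and the query vector is a placeholder. The 384 dimensions assume a model such as multilingual-e5-small; adjust this to whatever model you use.

```python
import json


def chunked_vector_mapping(dims):
    """Index body with one nested object per chunk, each holding its own vector."""
    return {
        "mappings": {
            "properties": {
                "title": {"type": "text"},
                "passages": {
                    "type": "nested",
                    "properties": {
                        "text": {"type": "text"},
                        "vector": {
                            "type": "dense_vector",
                            "dims": dims,
                            "index": True,
                            "similarity": "cosine",
                        },
                    },
                },
            }
        }
    }


def nested_knn_query(query_vector, num_candidates=100):
    """Search the chunk vectors; hits are the parent (whole) documents."""
    return {
        "query": {
            "nested": {
                "path": "passages",
                "query": {
                    "knn": {
                        "field": "passages.vector",
                        "query_vector": query_vector,
                        "num_candidates": num_candidates,
                    }
                },
            }
        }
    }


mapping = chunked_vector_mapping(384)
search_body = nested_knn_query([0.0] * 384)  # placeholder vector, not a real embedding
print(json.dumps(mapping, indent=2))
```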
And a clarification: The query you wrote does not use inner_hits syntax. You use inner_hits syntax when you want to find and return the specific passage/chunk in the document that best matches the query. As written, the query you provided searches the nested chunks and returns the best matching whole documents (notice the slight but very important difference). There is a minor performance hit for this, but not nearly as much as using inner_hits.
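To illustrate the difference, the sketch below takes a nested query body and returns a copy with an `inner_hits` section added, which is what asks Elasticsearch to also return the best matching chunk(s) per document. The base query uses hypothetical names (`passages`, `passages.vector`) and a placeholder vector.

```python
import copy
import json


def with_inner_hits(search_body, size=1):
    """Return a copy of a nested search body that also requests matching chunks.

    Without `inner_hits`, only the parent documents are returned.
    """
    body = copy.deepcopy(search_body)
    body["query"]["nested"]["inner_hits"] = {"size": size}
    return body


# Hypothetical nested knn query; the vector is a placeholder.
base = {
    "query": {
        "nested": {
            "path": "passages",
            "query": {
                "knn": {
                    "field": "passages.vector",
                    "query_vector": [0.1, 0.2, 0.3],
                }
            },
        }
    }
}

with_chunks = with_inner_hits(base, size=1)
print(json.dumps(with_chunks, indent=2))
```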
The query syntax you provided looks correct. As I mentioned above, you will be able to simplify this in future ES releases.
Because large documents need to fit the model's window, as you said, can I configure the chunk size or method when I use a "semantic_text" or "dense_vector" field? Or is this all done automatically, so that you can't do anything about it?
Is there any Elasticsearch documentation about this chunking (characters, words, paragraphs, overlap, etc.)? I would like to know more about this topic because I think it is very important.
Chunking is done automatically when you use semantic_text. It creates 250 token chunks, with 100 token overlap between each chunk. This is not configurable in 8.15, but we are adding options to configure this through the Inference API in 8.16. We don't have docs for the chunking options yet, but you can expect to see them soon as we prepare the 8.16 stack release.
If you use the dense_vector field, chunking is completely in your hands. You can use whatever chunk size/method you want because you have to implement your own chunking for this field type.
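As a starting point for your own chunking, here is a minimal sliding-window sketch in Python. It is not how semantic_text chunks internally: it splits on whitespace as a stand-in for real model tokens, and the defaults of 250 and 100 simply mirror the numbers quoted above. The function name is hypothetical.

```python
def chunk_words(text, chunk_size=250, overlap=100):
    """Split text into overlapping chunks of whitespace-separated words.

    Assumption: words approximate tokens. A real pipeline would count
    tokens with the embedding model's own tokenizer.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


sample = " ".join(f"w{i}" for i in range(600))
pieces = chunk_words(sample)
print(len(pieces), [len(p.split()) for p in pieces])
```

Each chunk would then be embedded separately and stored as its own vector, for example as one nested object per chunk.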
It might be worth looking at this blog. It uses ELSER, but it covers chunking and multimodal search. It predates the latest releases, so it doesn't use the semantic_text type, but you might be able to use it as a base.
I used the "semantic_text" field type for an index and created an inference endpoint with the "multilingual-e5-small" model.
I did a reindex of my index with 16239 documents but when it was finished I had 101047 documents in my new index. How is this possible?
health status index                                 pri rep docs.count docs.deleted store.size
green  open   index_content-embedding_semantic_text   1   0     101047            0    908.5mb
green  open   index_content_20240704_1200             1   0      16239           24     81.9mb