I'm trying to handle long input text fields in Elasticsearch, where the token count often exceeds the model's maximum sequence length.
I have two questions:
1. Applying Vector Search with Large Input Fields:
Is there a proper way to apply vector search when the input field's token count is consistently larger than the maximum sequence length?
For example, I want to index the official documentation of Elasticsearch, specifically in the "content" field of the elasticsearch_article index. If the average token count of the content is 10,000 and the max sequence length is 2,000, how can I effectively apply vector search in this scenario?
2. Using Non-SentenceTransformer Models in the Elasticsearch Pipeline:
How can I use models that are not SentenceTransformers in the Elasticsearch pipeline?
I found this model on Hugging Face: kobigbird-bert-base. It seems suitable for my service, but it is not implemented as a SentenceTransformer.
Is there a way to use this model in Elasticsearch?
For the first question, there are two main options:

- Truncate your text. Not ideal, but this is done for you by default by the inference processor when ingesting text.
- Use chunking to divide your text into smaller passages that are indexed as a nested field in your document. Each passage gets its own embedding, and some overlap can be included so that each passage stays semantically meaningful.

To apply chunking, you can use an external process, a script processor, or the semantic_text field type (now available on serverless, and coming in 8.15), which does the chunking for you automatically.
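As a rough sketch of the external-process option: the snippet below splits the content into overlapping word windows and indexes them as a nested passages field. The chunk_words helper, the window/overlap sizes, the passages field name, and the local cluster URL are all illustrative assumptions, not Elasticsearch conventions.

```python
from elasticsearch import Elasticsearch

def chunk_words(text: str, max_words: int = 300, overlap: int = 50) -> list:
    """Split text into overlapping word windows small enough for the model."""
    words = text.split()
    step = max_words - overlap
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), step)]

client = Elasticsearch("http://localhost:9200")  # placeholder connection

article_text = "..."  # e.g. the full body of one documentation page

doc = {
    "title": "Some documentation page",
    # Each chunk becomes one element of a nested "passages" field; an
    # inference processor (or a direct call to your embedding model here)
    # would then add a dense_vector for each passage.
    "passages": [{"text": p} for p in chunk_words(article_text)],
}
client.index(index="elasticsearch_article", document=doc)
```

For this to work as a vector search target, passages would need to be mapped as nested with a dense_vector sub-field that an inference processor (or your own code) populates.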
On the second question: Elasticsearch only supports Sentence Transformers for models deployed into the Elasticsearch cluster itself. However, you can use the Inference API to refer to embedding services outside the Elastic stack, using for example Hugging Face as a service provider.
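Creating such an endpoint looks roughly like the following. This is a hedged sketch against the 8.x Inference API; the endpoint id, credentials, and Hugging Face Inference Endpoint URL are placeholders you would replace with your own, and you should check the Inference API docs for the exact settings your version expects:

```python
import requests

# PUT _inference/text_embedding/<endpoint-id> registers an external
# embedding service; all ids, credentials and URLs below are placeholders.
resp = requests.put(
    "http://localhost:9200/_inference/text_embedding/my-hf-embeddings",
    auth=("elastic", "changeme"),
    json={
        "service": "hugging_face",
        "service_settings": {
            "api_key": "<your Hugging Face access token>",
            # URL of a Hugging Face Inference Endpoint serving your
            # embedding model.
            "url": "https://<your-endpoint>.endpoints.huggingface.cloud",
        },
    },
)
print(resp.json())
```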
Extracting keywords from each passage will reduce the context available to the vector embeddings; keep in mind that models generate embeddings taking into account not just the words in isolation, but the overall sentence context. Generating embeddings from keywords will give much worse precision and recall on your search, IMO.
You could summarise your documents to a maximum number of words using OpenAI and then generate embeddings for just the summary. But again, you will be losing context and depending on a prior summarization done by another model.
Using a chunking strategy as described above should be the way to go to maximize the relevance of your search results.
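To close the loop, querying the chunked documents could look like the sketch below. It assumes the illustrative passages.vector nested mapping from the earlier snippet, a 768-dimension embedding model, and a version of Elasticsearch recent enough to support kNN search over nested dense_vector fields:

```python
from elasticsearch import Elasticsearch

client = Elasticsearch("http://localhost:9200")  # placeholder connection

# The query text must be embedded with the same model used at index time;
# the 768 dims here are just a stand-in for that model's output size.
query_vector = [0.1] * 768

resp = client.search(
    index="elasticsearch_article",
    knn={
        "field": "passages.vector",  # nested dense_vector field
        "query_vector": query_vector,
        "k": 10,
        "num_candidates": 100,
    },
    source=["title"],
)
for hit in resp["hits"]["hits"]:
    # Hits are whole documents; the best-matching passage drives the score.
    print(hit["_score"], hit["_source"]["title"])
```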