Vector search: when input is bigger than max_seq


Hello,

I'm trying to handle long input text fields in Elasticsearch, where the token count often exceeds the model's maximum sequence length.

I have two questions:

  1. Applying Vector Search with Large Input Fields:

    • Is there a proper way to apply vector search when the input field's token count is consistently larger than the maximum sequence length?
    • For example, I want to index the official documentation of Elasticsearch, specifically in the "content" field of the elasticsearch_article index. If the average token count of the content is 10,000 and the max sequence length is 2,000, how can I effectively apply vector search in this scenario?
  2. Using Non-SentenceTransformer Models in Elasticsearch Pipeline:

    • How can I use models that are not SentenceTransformers in the Elasticsearch pipeline?
    • I found this model on Hugging Face: kobigbird-bert-base. It seems suitable for my service, but it is not implemented as a SentenceTransformer.
    • Is there a way to use this model in Elasticsearch?

Here is the command I used:

eland_import_hub_model --url https://xxxxx:9200/ --es-username xxxx --es-password xxxxx --hub-model-id monologg/kobigbird-bert-base --task-type text_embedding --insecure

However, the logs indicate an issue:

No sentence-transformers model found with name monologg/kobigbird-bert-base. Creating a new one with MEAN pooling.

Despite the log message stating that a new model is created, I cannot find any machine learning model in Elasticsearch:

GET /_ml/trained_models

Hey @dan_kim!

  1. There are two options here:
  • Truncate your text. Not ideal, but this is what the inference processor does for you by default when ingesting text.
  • Use chunking to divide your text into smaller passages that are stored as a nested field in your document. Each passage gets its embeddings calculated separately, and some overlap can be included so that the passages stay semantically meaningful.

To apply chunking, you can use an external process, use a script processor, or use the semantic_text field type (now available on serverless, and coming in 8.15) to do automatic chunking for you.
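As an illustration of the script processor route, here is a minimal sketch; the pipeline name "chunk_content" and the "content"/"passages" field names are just examples. It naively splits the text into sentence-level passages; you would still need to map "passages" as a nested field and compute an embedding per passage (for example with a foreach + inference processor, or externally):

PUT _ingest/pipeline/chunk_content
{
  "processors": [
    {
      "script": {
        "description": "Naive chunking: split content into sentence-level passages",
        "source": """
          if (ctx.containsKey('content') && ctx['content'] != null) {
            ctx['passages'] = [];
            for (String sentence : ctx['content'].splitOnToken('.')) {
              String trimmed = sentence.trim();
              if (trimmed.length() > 0) {
                ctx['passages'].add(['text': trimmed]);
              }
            }
          }
        """
      }
    }
  ]
}

And a sketch of the semantic_text option (serverless / 8.15+), assuming an inference endpoint with the id "my-embedding-endpoint" already exists; chunking and per-chunk embeddings are then handled for you at ingest time:

PUT elasticsearch_article
{
  "mappings": {
    "properties": {
      "content": {
        "type": "semantic_text",
        "inference_id": "my-embedding-endpoint"
      }
    }
  }
}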

  2. Elasticsearch only supports Sentence Transformers models for deployment into the Elasticsearch cluster. However, you can use the Inference API to refer to embedding services outside of the Elastic Stack, using for example Hugging Face as a service provider.
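As a sketch of that second option (details vary with your stack version): you could expose the model, e.g. monologg/kobigbird-bert-base, as a Hugging Face Inference Endpoint and register it through the inference API; the endpoint id "my-hf-embeddings" and the placeholder values are just examples:

PUT _inference/text_embedding/my-hf-embeddings
{
  "service": "hugging_face",
  "service_settings": {
    "api_key": "<your-hugging-face-access-token>",
    "url": "<url-of-your-hugging-face-inference-endpoint>"
  }
}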

Hope that helps!


Thank you for your answer,

I will try the scripting approach soon. Thanks!

@Carlos_D

I decided to change my approach and split the content into keywords at the data pipeline stage before passing it to the model.

What do you think about this approach?

  1. Use the E5 model to handle both Korean and English content (see the import command sketched after this list).
  2. Use the OpenAI API to extract keywords from each sentence of the content.
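For step 1, assuming you go with a multilingual E5 checkpoint, it can be imported with eland the same way as before; intfloat/multilingual-e5-small below is just one candidate, pick the E5 variant that fits your quality and latency needs:

eland_import_hub_model --url https://xxxxx:9200/ --es-username xxxx --es-password xxxxx --hub-model-id intfloat/multilingual-e5-small --task-type text_embedding --insecure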

In my case, this is an early-stage service, so there isn't much traffic and there aren't many documents yet.

For example, for the content:

"Despite the log message stating that a new model is created, I cannot find any machine learning model in Elasticsearch,"

the OpenAI API extracted the keywords: ["log", "message", "new", "model", "machine", "Elasticsearch", "es"].


Hey @dan_kim:

Extracting keywords from each passage will reduce the context available to the vector embeddings. Keep in mind that models generate embeddings taking into account not just the words in isolation, but the overall sentence context. Generating embeddings from keywords alone will give much worse precision and recall on your search, IMO.

You could summarize your documents down to a maximum number of words using OpenAI and then generate embeddings only for the summary. But again, you would be losing context and depending on a prior summarization done by another model.

Using a chunking strategy as described above should be the way to go to maximize the relevance of your search results.
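For example, building on the semantic_text mapping sketched earlier (the "content" field name and the query text are just illustrations, and this requires serverless / 8.15+), searching the chunked field is then a single semantic query:

GET elasticsearch_article/_search
{
  "query": {
    "semantic": {
      "field": "content",
      "query": "how do I apply vector search to long documents"
    }
  }
}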
