Embedding token size limit for ELSER2 model

Hi,

I am trying to create an embedding for a string using the .elser_model_2 model.
My embedding gets truncated at different token counts, sometimes at 283 and sometimes at 302.
I would like to understand the maximum token limit for text embedding with the .elser_model_2 model.

The link below says that "When you use ELSER for semantic search, only the first 512 extracted tokens from each field".

I understand that the 512-token limit applies during semantic search, but my question is about the token limit during embedding vector generation.

I searched the documentation, but unfortunately I could not find where the ELSER2 token limit for embedding vector generation is discussed.

Please help me understand the token limit of the ELSER2 model while generating an embedding vector, and point me to the relevant documentation.
If the token limit during generation is less than 512, then what is the purpose of allowing up to 512 tokens during semantic search?

Hi @SanthoshKMurugadass ,

Sorry for the confusion here. ELSER is expected to create a sparse vector based on the first 512 tokens of text, at any phase and for any text.

At ingest time, this applies to the field that the Inference Processor is provided as input.

At query time, this applies to the text expansion query string.
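For reference, here's a minimal sketch of that query-time side using the Python client (the index and field names here are illustrative, not from your setup); the model_text string is what the 512-token limit applies to at search time:

resp = client.search(
    index="my-index",
    query={
        "text_expansion": {
            "content_embedding": {
                "model_id": ".elser_model_2",
                "model_text": "how do I save disk space with ELSER?"
            }
        }
    }
)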

The documentation you linked does say this:

only the first 512 extracted tokens from each field of the ingested documents that ELSER is applied to are taken into account

but then adds

for the search process

Which I think is where your confusion is coming from. This doesn't mean that the 512-token limit only applies at search time; it's just referencing the implication that you'd only do inference with ELSER for semantic search use cases, so it's important for customers with those use cases to understand that 512 tokens is the limit.

My embedding gets truncated at different token counts, sometimes at 283 and sometimes at 302.

That's unexpected, as 512 should be the limit - unless your text is smaller than 512 tokens. What's leading you to believe that your embeddings are truncated at 283 or 302?


@Sean_Story Thanks for your response.

I too expect ELSER to have a token limit of 512 while embedding, but I see it getting truncated before 512 tokens.

For example, see my sample code below, where it gets truncated at 276 tokens. For different strings it gets truncated at different values. If you can help me understand what is going on here, I would be thankful.

Here is my sample code:

# ELSER 2 embedding token size check

chunk="""

The tokens generated by ELSER must be indexed for use in the text_expansion query. However, it is not necessary to retain those terms in the document source. You can save disk space by using the source exclude mapping to remove the ELSER terms from the document source.

Reindex uses the document source to populate the destination index. Once the ELSER terms have been excluded from the source, they cannot be recovered through reindexing. Excluding the tokens from the source is a space-saving optimsation that should only be applied if you are certain that reindexing will not be required in the future! It’s important to carefully consider this trade-off and make sure that excluding the ELSER terms from the source aligns with your specific requirements and use case. Review the Disabling the _source field and Including / Excluding fields from _source sections carefully to learn more about the possible consequences of excluding the tokens from the _source.

Get the foundation for a full vector search experience and generative AI integration. Use a single platform to create, store, and search embeddings for dense retrieval and capture your unstructured data’s meaning and context — across text, images, videos, audio, geo-location, or other data. Elasticsearch goes further than other vector databases with a full suite of search capabilities: filters and faceting, document level security, on-prem or cloud deployment, and more.Get relevant semantic search out of the box across domains with the Elastic Learned Sparse Encoder model. Implement it easily with a single click when setting up your new search application. Query expansions with related keywords and relevance scores make the model easily understood and ready for prime time on any dataset — no fine-tuning required.Incorporate your proprietary, business-specific information with LLMs so that generative AI applications don’t have to simply rely on publicly trained data. Elasticsearch is your data source for highly relevant search results that enhances the quality of LLM output via context window. Integrate with generative AI or your preferred LLM using Elasticsearch’s APIs and plugins.Deliver generative AI experiences with better context for customers and employees. Elastic provides generative AI models with relevant search results from your data using retrieval augmented generation (RAG).When users query your application, Elastic provides relevant search results pulled from the data you have stored in Elasticsearch. These secure results, which contain proprietary context from your organization, get passed to the generative AI model to create more accurate responses for end-users.Create a generative AI experience that's tailored to your own business and end-user needs. Elastic connects your datastore — whether it's a database, knowledge base, or case history — with large language models like OpenAI ChatGPT, Google Bard, and Hugging Face. Have your own transformer model? Bring it and manage it within Elastic. Using Langchain to build your app? We can integrate with your preferred open source frameworks too.Use Elasticsearch with large language models (LLMs) to create powerful, new applications for your customers and employees. Tailor generative AI experiences to your business using real-time, proprietary data. Build cost-effective and secure AI apps that are accurate and relevant using Elastic’s vector database, out of the box semantic search, and transformer model flexibility. The future is possible today with Elastic.Review queues show you posts one at a time so that you can evaluate what, if any, action is needed.

"""

print ("The length of the paragraph is %s characters" % len (chunk))

docs2 = [{"text_field": chunk}]

### Check token size limit for embedding a string

ml_model=".elser_model_2"

chunk_vector = client.ml.infer_trained_model(model_id=ml_model, docs=docs2, )

print(chunk_vector['inference_results'][0])

print("Embedding size : {}".format(len(chunk_vector['inference_results'][0]['predicted_value'])))

if chunk_vector['inference_results'][0]['is_truncated']:
    print(" **** We exceeded the model token limit ******* ")
else:
    print(" **** We did NOT exceed the model token limit ******* ")

Here is the output:

The length of the paragraph is 3625 characters
{'predicted_value': {'rein': 2.1909235, 'elastic': 2.0344632, 'token': 1.8880011, 'else': 1.7904589, 'expansion': 1.7321781, '##de': 1.6252898, 'rag': 1.6193763, '##r': 1.5961617, 'genera': 1.567829, 'document': 1.5452492, '##xing': 1.537078, 'll': 1.4878062, 'sparse': 1.4574037, '##x': 1.3799739, 'exclude': 1.3786075, 'source': 1.3693296, 'text': 1.3683709, '##code': 1.3578047, '##sea': 1
....
 'certification': 0.026038108, '##d': 0.025774192, 'elimination': 0.025767686, 'html': 0.024397722, 'clicking': 0.024229601, 'scope': 0.021584367, 'rights': 0.018045416, 'managed': 0.017101327, 'log': 0.015965834, 'class': 0.008823808, 'knowledge': 0.006527294, '##ima': 0.0056298743, 'd': 0.004097282, '##ulation': 0.0025947972, 'future': 0.001865791}, 'is_truncated': True}
Embedding size : 276
 **** We exceeded the model token limit *******

Hi @Sean_Story

Hmm, I see the same thing on 8.11.4.

PUT my-index
{
  "mappings": {
    "properties": {
      "content_embedding": { 
        "type": "sparse_vector" 
      },
      "content": { 
        "type": "text" 
      }
    }
  }
}

PUT _ingest/pipeline/elser-v2-test
{
  "processors": [
    {
      "inference": {
        "model_id": ".elser_model_2",
        "input_output": [ 
          {
            "input_field": "content",
            "output_field": "content_embedding"
          }
        ]
      }
    }
  ]
}

# This is about 435 tokens, plus roughly 10 line feeds. What am I missing?
POST _ingest/pipeline/elser-v2-test/_simulate?verbose
{
  "docs": [
    {
      "_source": {
        "content": """
The tokens generated by ELSER must be indexed for use in the text_expansion query. However, it is not necessary to retain those terms in the document source. You can save disk space by using the source exclude mapping to remove the ELSER terms from the document source.
Reindex uses the document source to populate the destination index. Once the ELSER terms have been excluded from the source, they cannot be recovered through reindexing. Excluding the tokens from the source is a space-saving optimsation that should only be applied if you are certain 
that reindexing will not be required in the future! It’s important to carefully consider this trade-off and make sure that excluding the ELSER terms from the source aligns with your specific requirements and use case. Review the Disabling the _source field and Including / Excluding 
fields from _source sections carefully to learn more about the possible consequences of excluding the tokens from the _source. Get the foundation for a full vector search experience and generative AI integration. Use a single platform to create, store, and search embeddings for dense 
retrieval and capture your unstructured data’s meaning and context — across text, images, videos, audio, geo-location, or other data. Elasticsearch goes further than other vector databases with a full suite of search capabilities: filters and faceting, document level security, on-prem or 
cloud deployment, and more.Get relevant semantic search out of the box across domains with the Elastic Learned Sparse Encoder model. Implement it easily with a single click when setting up your new search application. Query expansions with related keywords and relevance scores make the model 
easily understood and ready for prime time on any dataset — no fine-tuning required.Incorporate your proprietary, business-specific information with LLMs so that generative AI applications don’t have to simply rely on publicly trained data. Elasticsearch is your data source for highly relevant 
search results that enhances the quality of LLM output via context window. Integrate with generative AI or your preferred LLM using Elasticsearch’s APIs and plugins.Deliver generative AI experiences with better context for customers and employees. Elastic provides generative AI models with relevant 
search results from your data using retrieval augmented generation (RAG).When users query your application, Elastic provides relevant search results pulled from the data you have stored in Elasticsearch. These secure results, which contain proprietary context from your organization, get passed to the
search results from your data using retrieval augmented generation (RAG).When users query your application, Elastic provides relevant search results pulled from the data you have stored in Elasticsearch. These secure results, which contain proprietary context from your organization, get passed to the
 """
      }
    }
  ]
}

# Result

{
  "docs": [
    {
      "processor_results": [
        {
          "processor_type": "inference",
          "status": "success",
          "doc": {
            "_index": "_index",
            "_version": "-3",
            "_id": "_id",
            "_source": {
              "is_truncated": true, <!--- HERE 
              "content_embedding": {
                "expanding": 0.39319664,
                "##d": 0.025774308,
......
                "user": 0.49159336,
                "customer": 0.14560525
              },
              "model_id": ".elser_model_2",
              "content": """
....
}

Thanks for looking into this. I am using '8.11.1'.

Sorry, I do not fully understand your conclusion.

Are you saying that the ELSER2 embeddings are getting truncated before 512 tokens, whereas the documentation says it can embed up to 512 tokens?

Should this be investigated by the Elastic development team to understand the cause?

@SanthoshKMurugadass , first, I want to clarify that

print ("The length of the paragraph is %s characters" % len (chunk))

is counting the characters in your text, not the tokens. Tokens are hard to compute in just a few lines of code, but if we simplify and pretend that whitespace separation makes a token:

>>> chunks = chunk.split(" ")
>>> len(chunks)
547

Since that's over 512, I would expect it to be truncated, but depending on how the tokenizer works, it may be under 512 (I'm not sure if stop words are counted, for example).

print("Embedding size : {}".format(len(chunk_vector['inference_results'][0]['predicted_value'])))

This, however, has NOTHING to do with how many tokens were analyzed. ELSER is a sparse model - you're going to have relatively few dimensions, and these aren't going to repeat. So you're not getting one output dimension for every token that went in. I'd expect you to never have 512 output dimensions.
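To illustrate (a rough sketch, reusing the chunk_vector response from your script): the predicted_value is a map of expanded vocabulary terms to weights, so its length is the number of expanded terms, not an input token count.

dims = chunk_vector['inference_results'][0]['predicted_value']
print(len(dims))  # number of expanded terms, e.g. 276 in your run
print(sorted(dims, key=dims.get, reverse=True)[:10])  # the ten highest-weighted terms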

Hmm, I see the same thing on 8.11.4.

@stephenb 's example is more clearly puzzling to me, just because it has the "is_truncated": true for some input text that should be clearly under 512 tokens. Again, the way the model does tokenization might be more nuanced than just whitespace, I'll see if I can get one of the experts to weigh in on what's happening with that example.

Thanks for your continued help.

I understand that the statement below prints the number of characters; it is just for reference. I am not expecting this number to be the same as the number of tokens after tokenizing this paragraph. Clear, no issues.

print ("The length of the paragraph is %s characters" % len (chunk))

But the print statement below should print the number of tokens as perceived by ELSER2, with whatever logic ELSER follows. Isn't it?

print("Embedding size : {}".format(len(chunk_vector['inference_results'][0]['predicted_value'])))

My confusion is the same as yours: why is is_truncated TRUE before the number of tokens hits 512?

Any expert help in bringing clarity to this would help me use ELSER2 effectively; without a clear understanding it is hard to use ELSER2 with confidence.

@SanthoshKMurugadass That is exactly what @Sean_Story is looking into internally; please be patient.

Great, thanks a lot for investigating.

No, that's not correct. That's not the number of tokens, that's the number of output dimensions. There is NOT a 1:1 ratio between input tokens and output dimensions. The is_truncated: true should relate to the size of the input, not the output. So for your example (547 input tokens) truncation is expected. In Stephenb's example (435 input tokens) truncation is not expected. As Stephen says, I'm looking into that part still, and will get back to you. :slight_smile:
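As a quick sanity-check sketch (remember, whitespace splitting is only an approximation of the real tokenization):

approx_input_tokens = len(chunk.split())
output_dims = len(chunk_vector['inference_results'][0]['predicted_value'])
truncated = chunk_vector['inference_results'][0]['is_truncated']
print(approx_input_tokens, output_dims, truncated)
# e.g. 547 approximate input tokens, 276 output dimensions, is_truncated=True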

Thanks @Sean_Story for explaining.

Splitting it by whitespace is approximate, I understand.

Is there any way to understand which 512 tokens are considered and what portion of the input is truncated? This information would help in exploring an appropriate chunking strategy.

Thanks for looking into Stephenb's example (435 input tokens); I will wait. Perhaps the understanding from that would help answer my question above as well.

The elser_model_2 uses the BERT tokenizer to convert the text inputs into numerical tokens. Hugging Face Transformers has a Python implementation of the BERT tokenizer you can use to split the text.

You will need to install Transformers in your Python env to get started.

Once installed you can use this Python snippet to tokenize your inputs

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
text = "your text input"

# Tokenize text
tokenizer.encode(text)

# number of tokens
len(tokenizer.encode(text))

# Truncate at 512 tokens
first_512_tokens = tokenizer.encode(text, max_length=512, truncation=True)

# Decode the first 512 tokens
tokenizer.decode(first_512_tokens)

When you decode the tokens you will see the special values [CLS] at the beginning and [SEP] at the end. These 2 tokens are always inserted, so the true maximum number of input tokens for elser_model_2 is 510, as we have to account for those special tokens.
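If you want to see which portion of your input is kept and which is dropped, here is a rough sketch building on the snippet above (the slicing assumes the [CLS]/[SEP] layout described here):

all_ids = tokenizer.encode(text)
kept_ids = tokenizer.encode(text, max_length=512, truncation=True)
print("total tokens:", len(all_ids))
print("kept portion:", tokenizer.decode(kept_ids, skip_special_tokens=True))
print("dropped portion:", tokenizer.decode(all_ids[len(kept_ids) - 1:-1], skip_special_tokens=True))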


Thanks @dkyle!

@SanthoshKMurugadass, using this method, we can see that @stephenb's example actually has 564 tokens (not the 435 value we got when just tokenizing by whitespace). @dkyle also shared offline that english_word_count * 1.3 is a reasonable way to estimate how many tokens will be created by the BERT tokenizer.
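In Python terms, a rough estimate along those lines would be (an approximation only, not the real BERT tokenization; chunk is the variable from your script):

approx_bert_tokens = int(len(chunk.split()) * 1.3)
print(approx_bert_tokens)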

Hopefully this has answered your questions!


Thanks @dkyle, your explanation is clear and it answers my question.
Thanks to everyone who contributed here to help me gain a clear understanding. I really appreciate the quality of the discussion.

I have a request; please see if it is possible to implement in the future.

When ingesting a document with embeddings:
response = client.index(index=index_Name_emb, body=docs2)

Please see if the response can contain the number of tokens and the is_truncated value, so that it becomes easy to keep track of this for every chunk.
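In the meantime, here is a rough workaround sketch I could use myself (it reuses the BertTokenizer setup shared above; the extra field names are purely illustrative, not an Elasticsearch feature):

token_count = len(tokenizer.encode(chunk))
docs2 = {
    "text_field": chunk,
    "token_count": token_count,            # tracked manually per chunk
    "likely_truncated": token_count > 512  # rough flag based on the 512-token budget
}
response = client.index(index=index_Name_emb, body=docs2)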


Thanks for providing that feedback! We started having the same discussion internally based on this thread. No promises, but we're definitely considering it.

Thanks for thinking along the same lines and considering the implementation possibilities.
