Embedding token size limit for ELSER2 model

The elser_model_2 model uses the BERT tokenizer to convert text inputs into numerical tokens. Hugging Face Transformers provides a Python implementation of the BERT tokenizer that you can use to split the text yourself and count the tokens.

You will need to install Transformers in your Python environment to get started.
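The library is published on PyPI, so pip install transformers is typically all you need.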

Once installed, you can use this Python snippet to tokenize your inputs:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
text = "your text input"

# Tokenize the text into token IDs
tokens = tokenizer.encode(text)

# Number of tokens
print(len(tokens))

# Truncate at 512 tokens
first_512_tokens = tokenizer.encode(text, max_length=512, truncation=True)

# Decode the first 512 tokens back into text
print(tokenizer.decode(first_512_tokens))

When you decode the tokens you will see the special values [CLS] at the beginning and [SEP] at the end. These two tokens are always inserted, so the true maximum number of your own tokens for elser_model_2 is 510, because two of the 512 positions are reserved for those special tokens.
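As a quick check (a minimal sketch; truncate_for_elser is a hypothetical helper name and the add_special_tokens handling is just one way to do it), you can confirm the two special tokens and trim text to the 510-token budget:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Even an empty string encodes to 2 tokens: [CLS] and [SEP]
print(len(tokenizer.encode("")))  # 2

# Hypothetical helper: keep at most 510 content tokens so that, with [CLS] and [SEP],
# the input stays within the 512-token window
def truncate_for_elser(text, max_content_tokens=510):
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    return tokenizer.decode(token_ids[:max_content_tokens])

print(truncate_for_elser("your long text input"))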
