Embedding token size limit for ELSER2 model

The elser_model_2 model uses the BERT tokenizer to convert text inputs into numerical tokens. Hugging Face Transformers provides a Python implementation of the BERT tokenizer that you can use to split the text yourself and count the tokens.

You will need to install Transformers in your Python environment to get started.
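The library is published on PyPI, so pip install transformers is typically all you need.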

Once installed, you can use this Python snippet to tokenize your inputs:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
text = "your text input"

# Tokenize the text into token IDs
tokens = tokenizer.encode(text)

# Number of tokens
print(len(tokens))

# Truncate at 512 tokens
first_512_tokens = tokenizer.encode(text, max_length=512, truncation=True)

# Decode the first 512 tokens back into text
print(tokenizer.decode(first_512_tokens))

When you decode the tokens you will see the special values [CLS] at the beginning and [SEP] at the end. These two tokens are always inserted, so the true maximum number of your own tokens for elser_model_2 is 510, because two of the 512 positions are reserved for those special tokens.
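As a quick check (a minimal sketch; truncate_for_elser is a hypothetical helper name and the add_special_tokens handling is just one way to do it), you can confirm the two special tokens and trim text to the 510-token budget:

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Even an empty string encodes to 2 tokens: [CLS] and [SEP]
print(len(tokenizer.encode("")))  # 2

# Hypothetical helper: keep at most 510 content tokens so that, with [CLS] and [SEP],
# the input stays within the 512-token window
def truncate_for_elser(text, max_content_tokens=510):
    token_ids = tokenizer.encode(text, add_special_tokens=False)
    return tokenizer.decode(token_ids[:max_content_tokens])

print(truncate_for_elser("your long text input"))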
