Hi everyone,
I had two quick questions about ELSER’s tokenization behavior:
- According to this Elasticsearch doc, both documents and queries have a 512-token limit for the ELSER model. Does this tokenization step happen every time a query is executed, even if I’m querying the same index on which I previously ran a semantic search?
- I’m also trying to understand how ELSER tokenizes natural language. Would using this Elasticsearch API give an accurate estimate of the token count? And does ELSER’s tokenization resemble LLM-style tokenization (e.g., ~4 characters ≈ 1 token), or is it fundamentally different?
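For context, here is the rough character-based heuristic I mean, as a small sketch. This is just the common "~4 characters per token" rule of thumb for LLM tokenizers, and I have no idea whether it approximates ELSER's tokenizer at all; the function name and sample text are my own:

```python
def estimate_tokens_llm_style(text: str, chars_per_token: float = 4.0) -> int:
    """Rough LLM-style token estimate using the common ~4-characters-per-token
    rule of thumb. Not based on ELSER's actual tokenizer."""
    return max(1, round(len(text) / chars_per_token))

sample = "Semantic search with ELSER uses learned sparse vectors."
print(estimate_tokens_llm_style(sample))
```

If ELSER instead uses a BERT-style WordPiece tokenizer, I'd expect the count to track words/subwords rather than raw characters, which is part of what I'm hoping someone can confirm.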
Thanks so much in advance! Would greatly appreciate any thoughts/insights.