Question on Semantic Search ELSER Model's Tokenization

Hi everyone,

I have two quick questions about ELSER’s tokenization behavior:

  1. According to this Elasticsearch doc, both documents and queries are subject to a 512-token limit for the ELSER model. Does this tokenization step run on every query execution, even when I’m querying the same index that I previously ran semantic search against?

  2. I’m also trying to understand how ELSER tokenizes natural language. Would using this Elasticsearch API give an accurate estimate of the token count? And does ELSER’s tokenization resemble LLM-style tokenization (e.g., ~4 characters ≈ 1 token), or is it fundamentally different?
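For context on why I’m asking: my understanding (which may be wrong) is that ELSER is a BERT-family model, so it would use WordPiece-style sub-word tokenization rather than a fixed characters-per-token ratio. Here is a toy sketch of greedy longest-match WordPiece splitting with a tiny made-up vocabulary (the real model would use a full vocabulary of tens of thousands of entries), just to illustrate why a ~4-chars-per-token heuristic may not line up with the actual count:

```python
# Toy greedy longest-match WordPiece tokenizer (the sub-word scheme used
# by BERT-family models). VOCAB is a tiny hypothetical vocabulary for
# illustration only -- not ELSER's actual vocabulary.
VOCAB = {"token", "##ization", "##s", "search", "semantic", "[UNK]"}

def wordpiece(word, vocab=VOCAB):
    """Split one lowercase word into sub-word pieces via longest match."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation pieces are prefixed
            if sub in vocab:
                match = sub
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece matches: whole word is unknown
        pieces.append(match)
        start = end
    return pieces

print(wordpiece("tokenization"))  # -> ['token', '##ization']
print(wordpiece("tokens"))        # -> ['token', '##s']
```

So a 12-character word like "tokenization" can be 2 tokens, while an out-of-vocabulary word might split into many pieces; the count depends on the vocabulary, not word length alone.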

Thanks so much in advance! Would greatly appreciate any thoughts/insights.