How are integer numbers treated in Elasticsearch?

Thank you for your reply, and sorry for my misuse of terminology. I'm new to NLP and IR.

To be more precise, here is a toy example. Suppose I want to index the document "I really like Elasticsearch" as a string. The tokenizer maps each token in the document to its ID in the vocabulary, giving, say, "1 39 32 380188802", where "I" has ID 1.
If I query "Elasticsearch", it is similarly mapped to "380188802".

So the query and the document are both mapped to their ID representations, but they are still strings when fed into Elasticsearch. What I want to know is how a document in this format ("1 39 32 380188802") is indexed. Is it split on " " and tokenized to ["1", "39", "32", "380188802"], with each integer string treated as a word to index? Or are there additional heuristics for handling this type of input?
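
To make the question concrete, here is a minimal Python sketch of what I have in mind. The vocabulary, the IDs, and the function names `encode` and `guessed_terms` are all made up for illustration, and the whitespace split is only my guess at the behavior I am asking about, not something I know Elasticsearch does.

```python
# Hypothetical toy vocabulary; the IDs are the made-up ones from the example above.
VOCAB = {"i": 1, "really": 39, "like": 32, "elasticsearch": 380188802}


def encode(text: str) -> str:
    """Map each whitespace-separated word to its vocabulary ID, joined by spaces."""
    return " ".join(str(VOCAB[word.lower()]) for word in text.split())


def guessed_terms(encoded: str) -> list:
    """My guess at the indexed terms: a plain split on whitespace (an assumption)."""
    return encoded.split()


doc = encode("I really like Elasticsearch")
query = encode("Elasticsearch")

print(doc)                          # 1 39 32 380188802
print(guessed_terms(doc))           # ['1', '39', '32', '380188802']
print(query in guessed_terms(doc))  # True
```

If indexing works the way I guess, the query term "380188802" would match the document exactly as an ordinary word would.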
