Hey, I'm new to this. I have some questions about Elasticsearch.
How are integer numbers treated in this lib? For example, year, age, month, etc.
Can I build the index with BERT-tokenizer tokenized documents, where each document is represented by its token IDs (integers ranging from 0 to 3000), and convert the queries correspondingly to do retrieval?
Welcome to our community!
Elasticsearch is not a library.
I guess that depends on the mapping (see Field data types | Elasticsearch Guide [8.0] | Elastic) and then on how they are queried.
I'm not familiar with this approach, but that terminology is not something that is native to Elasticsearch.
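For instance, something like this with the Python client; the index name, field names, and local cluster URL here are just made up for illustration:

```python
# Minimal sketch: the mapping decides how each field's values are handled.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

es.indices.create(
    index="docs",  # hypothetical index name
    mappings={
        "properties": {
            "year": {"type": "integer"},               # stored as a number, range queries work
            "token_ids_text": {"type": "text"},        # analyzed: split into terms at index time
            "token_ids_keyword": {"type": "keyword"},  # kept as one exact string
        }
    },
)
```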
Thank you for your reply! Excuse my misuse of terminology; I'm still a newcomer to NLP and IR.
To be more precise, I'll illustrate with a toy example. If I want to index the document "I really like Elasticsearch" in string format, the tokenizer maps each token in the document to its corresponding ID, say "1 39 32 380188802", where "I" gets the ID 1 in the vocabulary.
If I query "Elasticsearch", it's similarly mapped to "380188802".
So now both the query and the document are mapped to their ID representations, but are still in string format before being fed into Elasticsearch.
What I want to know is: how is a document in this format ("1 39 32 380188802") indexed? Is it split on " " and tokenized into ["1", "39", "32", "380188802"], where each string-format integer is treated as a word to index? Or are there more heuristics to handle this type of input?
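Just to make the pre-processing step concrete, here's a rough sketch of what I mean; the toy vocabulary and the `to_id_string` helper are made up (in practice a real BERT tokenizer, e.g. from the transformers library, would produce the IDs):

```python
# Toy vocabulary mapping tokens to integer IDs (made-up values from the example above).
toy_vocab = {"i": 1, "really": 39, "like": 32, "elasticsearch": 380188802}

def to_id_string(text: str) -> str:
    """Map each whitespace token to its vocabulary ID and join with spaces."""
    return " ".join(str(toy_vocab[token]) for token in text.lower().split())

doc = to_id_string("I really like Elasticsearch")   # -> "1 39 32 380188802"
query = to_id_string("Elasticsearch")               # -> "380188802"
print(doc, "|", query)
```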
It depends on how the field is mapped.
If it's a keyword, it'll be treated as one exact string; if it's text, it'll be analysed and tokenised on spaces. It might even be an array, given it's a bunch of numbers.
That's up to you to tell Elasticsearch how to handle it.
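As a rough sketch with the Python client, continuing the made-up "docs" mapping from the earlier example (local cluster assumed), you can see the difference between the two field types:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# Index the ID string into both the analyzed text field and the keyword field.
es.index(
    index="docs",
    id="1",
    document={
        "token_ids_text": "1 39 32 380188802",
        "token_ids_keyword": "1 39 32 380188802",
    },
    refresh=True,  # make the document searchable immediately for the demo
)

# The text field was analyzed into the terms ["1", "39", "32", "380188802"],
# so a match query on a single ID finds the document ...
hits = es.search(index="docs", query={"match": {"token_ids_text": "380188802"}})
print(hits["hits"]["total"]["value"])  # 1

# ... while the keyword field holds the whole string as one exact value,
# so a term query on a single ID does not match.
hits = es.search(index="docs", query={"term": {"token_ids_keyword": "380188802"}})
print(hits["hits"]["total"]["value"])  # 0
```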