Accessing Unique Token or Term ID


(Neal) #1

Does ES assign a token id to each token / term? I imagine that it does.

Is there any way to get the token ID? I've tried through the term vector API, but it seems that IDs are not a valid field.

So, I'd like to do something like:

$ curl http://localhost:9200/twitter/_term?t='test'

The response back would be something like

{ "term": "test", "_id": 12938 }

Any ideas?

Thanks!


(Vincent Tran) #2

I have never seen the _term function before. Why aren't you using _search?

How about:

GET /twitter/_search?q=term:test

FYI: _id is an absolutely valid field to search on.


(Neal) #3

I just made up the _term function as an example of what I'm looking for. I don't think it exists.

I'm not looking for a document ID. I'm looking for a token ID. There must be a token-to-ID dictionary somewhere, and I'd like to get the ID of a token.


(Colin Goodheart-Smithe) #4

AFAIK there are no term IDs in the index. Terms are uniquely identified by their field name and value. What problem are you trying to solve? What do you want to use the 'token ID' for? Maybe there is another way to achieve what you are trying to do.


(Neal) #5

Thank you for your reply!

Well, I'm sure that the underlying Lucene index converts terms and tokens to a unique ID across all of the documents. Whether by hashing or just counting, there must be a token ID symbol table somewhere.

We're using Elasticsearch as a document store / index for a lot of our machine learning / NLP analytics that we run via Spark. Instead of building and maintaining our own dictionary ID store (or worse yet, reassigning IDs for each subset of corpora that we work with), we'd like to keep all the docs in ES and have a single source for tokenization / token IDs.

While the term vectors API is useful for us, it's not enough yet. We would also need tokenized docs. Is there a way to access those outside of term vectors? Something similar to UIMA CAS?

Keeping all of our docs / index / tokenization in a single place greatly simplifies our analytic processing. So, maybe ES just isn't the way to go, or we just build a separate token ID symbol table somewhere else.
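
A separate token ID symbol table of the kind mentioned here could look roughly like the sketch below. It is a minimal, client-side illustration only: the `TokenDictionary` class and its method names are hypothetical, nothing in it is provided by Elasticsearch or Lucene, and it assumes the tokens have already been produced by whatever analyzer is in use.

```python
class TokenDictionary:
    """Client-side token <-> integer ID table (hypothetical, not part of ES)."""

    def __init__(self):
        self._id_by_token = {}   # token string -> integer ID
        self._token_by_id = []   # integer ID -> token string

    def id_of(self, token: str) -> int:
        """Return the ID for a token, assigning the next free ID on first sight."""
        token_id = self._id_by_token.get(token)
        if token_id is None:
            token_id = len(self._token_by_id)
            self._id_by_token[token] = token_id
            self._token_by_id.append(token)
        return token_id

    def token_of(self, token_id: int) -> str:
        """Return the token that was assigned the given ID."""
        return self._token_by_id[token_id]

    def encode(self, tokens):
        """Map a tokenized document to a list of integer IDs."""
        return [self.id_of(token) for token in tokens]


if __name__ == "__main__":
    dictionary = TokenDictionary()
    print(dictionary.encode(["this", "is", "a", "test", "this"]))
```

IDs assigned this way depend on the order in which tokens are first seen, so the table itself would have to be persisted and shared for the IDs to stay stable across jobs.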


(Adrien Grand) #6

Lucene and Elasticsearch do not assign numeric IDs to tokens; they just use the term bytes to identify terms. For instance, if you index an analyzed string, it is broken into tokens that are essentially a char[], and Lucene then stores them in the index using their UTF-8 representation.
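
To illustrate the point, here is a rough Python sketch of what "identified by field name and term bytes" means in practice. The helper names `term_key` and `term_id` are made up for this example. The hashing step is an assumption added for anyone who still wants a numeric ID on the client side; it is not something Lucene or Elasticsearch does, and hash collisions are possible in principle.

```python
import hashlib


def term_key(field: str, token: str):
    """Identify a term the way described above: field name plus UTF-8 bytes."""
    return (field, token.encode("utf-8"))


def term_id(field: str, token: str) -> int:
    """Derive a deterministic 64-bit ID from the term key (client-side only).

    Assumption: hashing is a stand-in for a shared dictionary, not a Lucene
    feature. The NUL byte separates the field name from the term bytes.
    """
    field_name, token_bytes = term_key(field, token)
    payload = field_name.encode("utf-8") + b"\x00" + token_bytes
    digest = hashlib.blake2b(payload, digest_size=8).digest()
    return int.from_bytes(digest, "big")


if __name__ == "__main__":
    # The same token in two different fields is two different terms.
    print(term_key("message", "caf\u00e9"))
    print(term_id("message", "test") == term_id("user", "test"))
```

Because the ID is computed from the bytes alone, any Spark job can derive the same ID for the same field and token without consulting a central table.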


(Neal) #7

I see!!!

Thank you for letting me know that! I was unaware of that approach before.
This is really helpful for me to know.

