Accessing Unique Token or Term ID

neal · November 30, 2015, 8:12pm

Does ES assign a token id to each token / term? I imagine that it does.

Is there any way to get the token ID? I've tried through the term vector API, but it seems that IDs are not a valid field.

So, I'd like to do something like:

$ curl 'http://localhost:9200/twitter/_term?t=test'

The response would be something like:

{ "term": "test", "_id": 12938 }

Any ideas?

Thanks!

vtst2412 · November 30, 2015, 8:40pm

I have never seen the _term function before. Why aren't you using _search?

How about:

GET /twitter/_search?q=term:test

FYI: _id is an absolutely valid field to search on.

neal · November 30, 2015, 9:19pm

I just made up the _term function as an example of what I'm looking for. I don't think it exists.

I'm not looking for a document ID. I'm looking for a token ID. There must be a token-to-ID dictionary somewhere, and I'd like to get the ID of a token.

colings86 · December 1, 2015, 9:11am

AFAIK there are no term IDs in the index. Terms can be uniquely identified by their field name and value. What problem are you trying to solve? What do you want to use the 'token ID' for? Maybe there is another way to achieve what you are trying to do.

neal · December 1, 2015, 2:15pm

Thank you for your reply!

Well, I'm sure that the underlying Lucene index converts terms and tokens to a unique ID across all of the documents, either via hashing or just counting. There must be a token ID symbol table somewhere.

We're using Elasticsearch as a document store / index for a lot of our machine learning / NLP analytics that we run via Spark. Instead of building and maintaining our own dictionary ID store (or worse yet, reassigning IDs for each subset of corpora that we work with), we'd like to keep all the docs in ES and have a single source for tokenization / token IDs.

While the TermVectors api is useful for us, it's not enough yet. We also would need tokenized docs. Is there a way to access those outside of term vectors? Something similar to UIMA CAS?

Keeping all of our docs / index / tokenization in a single place greatly simplifies our analytic processing. So, maybe ES just isn't the way to go, or we just build a separate token ID symbol table somewhere else.
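For what it's worth, here is a rough sketch of what such a separate symbol table could look like if it were built client-side from term vectors responses. This is not something Elasticsearch provides: the `TokenTable` class, the `token_ids_in_order` helper, and the sample response are all made up for illustration. The sample is shaped like term vectors output with positions enabled (`term_vectors`, then the field, then `terms`, then each term's `tokens` with a `position`).

```python
class TokenTable:
    """Client-side token -> integer ID table (hypothetical helper, not an ES feature)."""

    def __init__(self):
        self._ids = {}     # token string -> ID
        self._tokens = []  # ID -> token string

    def id_for(self, token):
        """Return the ID for a token, assigning the next free ID on first sight."""
        if token not in self._ids:
            self._ids[token] = len(self._tokens)
            self._tokens.append(token)
        return self._ids[token]

    def token_for(self, token_id):
        """Look up the token string for a previously assigned ID."""
        return self._tokens[token_id]


def token_ids_in_order(termvectors_response, field, table):
    """Rebuild a document's token sequence from term vector positions, as IDs."""
    terms = termvectors_response["term_vectors"][field]["terms"]
    positioned = []
    for term, info in terms.items():
        for occurrence in info["tokens"]:
            positioned.append((occurrence["position"], term))
    positioned.sort()  # order by position in the original text
    return [table.id_for(term) for _, term in positioned]


# Hand-written sample shaped like a term vectors response for the text "test a test".
sample = {
    "term_vectors": {
        "text": {
            "terms": {
                "test": {"term_freq": 2, "tokens": [{"position": 0}, {"position": 2}]},
                "a": {"term_freq": 1, "tokens": [{"position": 1}]},
            }
        }
    }
}

table = TokenTable()
ids = token_ids_in_order(sample, "text", table)
print(ids)  # [0, 1, 0]
```

IDs assigned this way depend on the order in which tokens are first seen, so the table would have to be persisted and shared between jobs to keep IDs stable across corpora.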

jpountz · December 1, 2015, 5:30pm

Lucene and Elasticsearch do not assign numeric IDs to tokens; they just use the term bytes to identify terms. For instance, if you index an analyzed string, it will be broken into tokens that are essentially a char[], and Lucene then stores them in the index using their UTF-8 representation.
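To make that concrete, here is a small illustration in plain Python (not Lucene code): a term's identity is its field name plus the UTF-8 bytes of the token, so equality and ordering fall out of the bytes without any numeric ID. The `term_key` helper and the field names are hypothetical.

```python
def term_key(field, token):
    """Identify a term by its field name and the UTF-8 bytes of the token."""
    return (field, token.encode("utf-8"))


tokens = ["test", "café", "Test"]
keys = [term_key("text", t) for t in tokens]

# The same token in the same field always yields the same key.
assert term_key("text", "test") == keys[0]
# The same token in a different field is a different term.
assert term_key("user", "test") != keys[0]

# Terms can be ordered by their raw bytes; no ID lookup is involved.
ordered = sorted(keys)
for field, raw in ordered:
    print(field, raw)
```

Note that "test" and "Test" are different terms here because their bytes differ; whether both ever reach the index depends on the analyzer (a lowercasing analyzer would fold them into one).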

neal · December 1, 2015, 6:04pm

I see!!!

Thank you for letting me know that! I was unaware of that approach before.
This is really helpful for me to know.
