AFAIK there are no term IDs in the index. Terms can be uniquely identified by their field name and value. What problem are you trying to solve? What do you want to use the 'token ID' for? Maybe there is another way to achieve what you are trying to do.
Well, I'm fairly sure that the underlying Lucene index converts terms and tokens to a unique ID across all of the documents. Either via hashing or just counting - there must be a token ID symbol table somewhere.
We're using Elasticsearch as a document store / index for a lot of our machine learning / NLP analytics that we run via Spark. Instead of building and maintaining our own dictionary ID store (or worse yet, re-assigning IDs for each subset of corpora that we work with), we'd like to keep all the docs in ES and have a single source for tokenization / token IDs.
While the term vectors API is useful for us, it's not enough yet. We would also need tokenized docs. Is there a way to access those outside of term vectors? Something similar to a UIMA CAS?
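For context, here's roughly how we pull tokens today via the term vectors API (just a sketch with the Python `elasticsearch` client; the index name, doc ID, and field name are placeholders):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Fetch term vectors for a single document. "docs" and "body" are
# placeholder index/field names; positions let us recover token order.
resp = es.termvectors(
    index="docs",
    id="1",
    fields=["body"],
    positions=True,
    offsets=True,
)

# The response maps each term string to its stats and per-occurrence
# positions within the document.
terms = resp["term_vectors"]["body"]["terms"]
for term, info in terms.items():
    print(term, [t["position"] for t in info.get("tokens", [])])
```

This gives us terms and positions per document, but the IDs we'd feed into our ML pipelines still have to come from somewhere else.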
Keeping all of our docs / index / tokenization in a single place greatly simplifies our analytic processing. So maybe ES just isn't the way to go, or we build a separate token ID symbol table somewhere else.
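To be concrete about the "separate symbol table" option, this is the kind of minimal thing I mean (a hypothetical sketch, not tied to any ES API): each distinct token gets a stable integer the first time we see it.

```python
class SymbolTable:
    """Assigns a stable integer ID to each distinct token, first come, first served."""

    def __init__(self):
        self._ids = {}

    def id_for(self, token: str) -> int:
        # Reuse the existing ID, or mint the next one.
        if token not in self._ids:
            self._ids[token] = len(self._ids)
        return self._ids[token]

table = SymbolTable()
assert table.id_for("spark") == 0
assert table.id_for("lucene") == 1
assert table.id_for("spark") == 0  # stable across calls
```

The annoying part is keeping this consistent across corpora and Spark jobs, which is exactly what we were hoping ES would do for us.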
Lucene and Elasticsearch do not assign numeric IDs to tokens; they just use the term bytes to identify terms. For instance, if you index an analyzed string, it is broken into tokens that are essentially a char[], and then Lucene stores them in the index using their UTF-8 representation.
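You can see this for yourself with the `_analyze` API, which returns the tokens produced for a given analyzer and input: token text, positions, and offsets, but no numeric term IDs anywhere. A quick sketch with the Python client:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Run the standard analyzer over a sample string; the response describes
# each token by its text and position, not by any numeric ID.
resp = es.indices.analyze(body={
    "analyzer": "standard",
    "text": "Tokens are just bytes",
})
for tok in resp["tokens"]:
    print(tok["token"], tok["position"])
# prints: tokens 0 / are 1 / just 2 / bytes 3
```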