Thank you for your reply!
Well, I'm fairly sure that the underlying Lucene index converts terms and tokens to a unique ID across all of the documents, either via hashing or just counting; there must be a token ID symbol table somewhere.
We're using Elasticsearch as a document store / index for a lot of the machine learning / NLP analytics that we run via Spark. Instead of building and maintaining our own dictionary ID store (or, worse yet, re-assigning IDs for each subset of corpora that we work with), we'd like to keep all the docs in ES and have a single source for tokenization / token IDs.
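For context, here's roughly how we pull documents into Spark today. This is just a minimal sketch using the elasticsearch-hadoop connector; the index name `docs`, the `body` field, and the host settings are placeholders for our setup:

```python
# Minimal sketch: loading ES documents into Spark via the
# elasticsearch-hadoop connector (jar assumed on the classpath).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("es-nlp")
         .config("spark.es.nodes", "localhost")   # placeholder host
         .config("spark.es.port", "9200")
         .getOrCreate())

docs = (spark.read
        .format("org.elasticsearch.spark.sql")
        .load("docs"))  # hypothetical index name

docs.select("body").show(5)  # 'body' is a placeholder text field
```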
While the Term Vectors API is useful for us, it's not enough on its own. We would also need the tokenized docs themselves. Is there a way to access those outside of term vectors? Something similar to a UIMA CAS?
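To make the gap concrete, this is roughly what we can get today, sketched with the Python client (index name `docs`, doc id `1`, and the `body` field are all placeholders). The `_analyze` call gives us a token stream, but it re-runs analysis on text we pass in rather than handing back the stored tokenized form of a document:

```python
# Sketch of what's available today (index/field names are placeholders).
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# 1) Term vectors: per-document term statistics, with positions/offsets
#    if requested, but not a plain ordered token stream per doc.
tv = es.termvectors(index="docs", id="1", fields=["body"],
                    positions=True, offsets=True)

# 2) _analyze: tokenizes arbitrary text with the field's analyzer;
#    useful, but it means re-analyzing every document on demand.
tokens = es.indices.analyze(index="docs",
                            body={"field": "body",
                                  "text": "Some document text here."})
for tok in tokens["tokens"]:
    print(tok["token"], tok["position"])
```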
Keeping all of our docs / index / tokenization in a single place greatly simplifies our analytic processing. So maybe ES just isn't the way to go, or we'll have to build a separate token ID symbol table somewhere else.
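If we do end up building that table ourselves, it would probably look something like this Spark sketch (assuming a `tokens` column of string arrays; the column and variable names are ours, not anything ES provides):

```python
# Sketch of a standalone token ID symbol table built in Spark.
# Assumes a DataFrame with a 'tokens' column (array<string>).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("token-ids").getOrCreate()

df = spark.createDataFrame(
    [(["the", "quick", "fox"],), (["the", "lazy", "dog"],)],
    ["tokens"])

# One row per distinct token, each assigned an integer ID.
vocab = (df.select(F.explode("tokens").alias("token"))
           .distinct()
           .rdd.map(lambda r: r.token)
           .zipWithIndex()          # (token, id) pairs
           .toDF(["token", "id"]))

vocab.show()
# The downside: this table has to be persisted and kept in sync with ES,
# which is exactly the bookkeeping we were hoping to avoid.
```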