I have a corpus of documents indexed, and I stored the term vectors when indexing. Now I want to retrieve the term vectors of all documents that satisfy some filter.
I was able to get the term vectors for a single document, or for a set of documents, by providing the document IDs. But is there a way to get term vectors for all the documents without providing document IDs?
Eventually what I want to do is to get the frequency counts of all the terms in a field, for all documents in an index (i.e., a bag of words matrix).
There is no way to aggregate on term_vectors. As far as I know, term_vectors can only be retrieved by document ID, so to get them for all documents matching a query you would run a scroll search and request the term_vectors for each returned ID. The multi term vectors API is a better fit here, because it retrieves term_vectors for several documents in one request, so you can batch the IDs.
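A rough sketch of that scroll-plus-batching approach in Python, in case it helps. It assumes the official `elasticsearch` client package; the function names, the default batch size, and the `index`, `query` and `field` arguments are made up for illustration, and the exact `mtermvectors` call signature may differ between client versions.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def batched(ids: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` IDs from any iterable."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(ids)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def mtermvectors_body(ids: List[str], field: str) -> Dict[str, Any]:
    """Build a multi term vectors request body for one batch of IDs."""
    return {
        "ids": ids,
        "parameters": {
            "fields": [field],
            # Only term frequencies are needed for a bag of words,
            # so the optional extras are switched off.
            "positions": False,
            "offsets": False,
            "payloads": False,
            "field_statistics": False,
            "term_statistics": False,
        },
    }


def iter_term_vector_docs(client, index: str, query: Dict[str, Any],
                          field: str, batch_size: int = 500):
    """Scroll over matching documents and fetch term vectors in batches.

    Assumption: `client` is an `elasticsearch.Elasticsearch` instance.
    """
    # Imported lazily so the helpers above work without the client installed.
    from elasticsearch.helpers import scan

    # `scan` wraps the scroll API; `_source=False` returns only metadata.
    hits = scan(client, index=index, query={"query": query}, _source=False)
    ids = (hit["_id"] for hit in hits)
    for batch in batched(ids, batch_size):
        response = client.mtermvectors(
            index=index, body=mtermvectors_body(batch, field)
        )
        yield from response["docs"]
```

Calling `iter_term_vector_docs(client, "my-index", {"match_all": {}}, "text")` would then stream one term vector document at a time, where `my-index` and `text` are placeholder names.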
Thanks!
Yes, I tried the multi term vectors approach, but I still have to provide the list of document IDs, which is huge: on the order of hundreds of millions.
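For what it is worth, the ID list does not have to exist all at once: if the IDs are streamed from a scroll and sent in batches, only one batch is held at a time. The sketch below shows the other half, turning multi term vectors responses into per-document term counts, one row of the bag of words at a time. It assumes the usual response layout (`docs` entries with `term_vectors`, `terms` and `term_freq`); the function names are hypothetical.

```python
from collections import Counter
from typing import Any, Dict, Iterable, Iterator, Tuple


def term_counts(doc: Dict[str, Any], field: str) -> Dict[str, int]:
    """Extract {term: term_freq} for one field of one response document."""
    terms = doc.get("term_vectors", {}).get(field, {}).get("terms", {})
    return {term: info["term_freq"] for term, info in terms.items()}


def iter_rows(docs: Iterable[Dict[str, Any]],
              field: str) -> Iterator[Tuple[str, Dict[str, int]]]:
    """Yield (document ID, term counts) pairs, skipping missing documents."""
    for doc in docs:
        if not doc.get("found", True):
            continue
        yield doc["_id"], term_counts(doc, field)


def corpus_totals(rows: Iterable[Tuple[str, Dict[str, int]]]) -> Counter:
    """Sum term frequencies over all rows without keeping the rows."""
    totals: Counter = Counter()
    for _doc_id, counts in rows:
        totals.update(counts)
    return totals
```

Because `iter_rows` is a generator, each row can be written to disk or appended to a sparse matrix as it arrives instead of being collected in memory.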