I posted a related question about this same issue at SO (http://bit.ly/1LIdc6y) and received some helpful advice, but I thought I'd post here just in case anybody has some more insight into this.
I'm working on building an application that uses Elasticsearch with Apache Spark. I'm trying to use ES to store/index the documents for query purposes and also use the ES analyzers to process the documents for machine learning (I know that ES is not really built specifically for this). Basically, I need to pull the (ES-analyzed) tokens from each document into Spark.
I know you can get the tokens and counts of each token per document in two ways: through the term_vector API and through the Analyze API. However, both of those are very slow and inefficient for large datasets since they require a separate REST call for each document.
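For reference, this is roughly what the per-document round trip looks like with the term vectors API (a minimal Python sketch; the cluster URL, the `docs`/`doc` index and type, the `body` field, and the document ids are all hypothetical placeholders for my setup):

```python
import requests

ES = "http://localhost:9200"     # assumed local cluster
INDEX, TYPE = "docs", "doc"      # hypothetical index/type names
DOC_IDS = ["1", "2", "3"]        # hypothetical document ids

# One round trip per document: this is what makes the approach slow at scale.
for doc_id in DOC_IDS:
    resp = requests.get(
        "{}/{}/{}/{}/_termvectors".format(ES, INDEX, TYPE, doc_id),
        json={"fields": ["body"]},
    )
    terms = resp.json()["term_vectors"]["body"]["terms"]
    # term_freq is the within-document frequency of each analyzed token
    counts = {token: info["term_freq"] for token, info in terms.items()}
    print(doc_id, counts)
```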
My question is this: Is there a way to get the information from the term_vector API returned as part of the search response itself (through some setting within the request body, for example)? I'm mainly interested in the tokens and their frequencies within each document. The closest I've seen is specifying the "fielddata_fields" option for my text field. That returns the tokens themselves but not the token frequencies within the document(s). Is there a way to return both using only the search query?
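To make that concrete, this is roughly the kind of query I mean (a sketch, assuming the same hypothetical `docs` index and `body` field as above): a single search request with "fielddata_fields" gives me the distinct tokens per hit, but no per-document counts.

```python
import requests

ES = "http://localhost:9200"     # assumed local cluster

# Single search request: fielddata_fields returns the analyzed tokens
# for each hit, but not how often each token occurs in the document.
resp = requests.post(
    "{}/docs/_search".format(ES),
    json={
        "query": {"match_all": {}},
        "fielddata_fields": ["body"],   # 'docs' and 'body' are placeholder names
    },
)
for hit in resp.json()["hits"]["hits"]:
    tokens = hit["fields"]["body"]      # list of distinct tokens, no frequencies
    print(hit["_id"], tokens)
```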