Hello,
I've got an ML pipeline that involves running a query against Elasticsearch, pulling down the resultant IDs, then passing those (in chunks) to _mtermvectors to retrieve full tokenized results. Notably, I'm not looking for the "most interesting" terms, or anything like that: I just want to benefit from Elasticsearch's awesome tokenization platform, since we've already got our data there anyway.
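To make that concrete, the ID-gathering step is roughly the Python sketch below (the host and the query are placeholders here; the real selection query is more involved):

import requests

ES = "http://localhost:9200"  # placeholder; the real cluster sits behind an /esproxy endpoint

def fetch_ids(index="cases", size=1000):
    """Run the selection query and pull back only the matching document IDs."""
    resp = requests.post(
        "{0}/{1}/_search".format(ES, index),
        params={"filter_path": "hits.hits._id"},  # strip the response down to just the IDs
        json={
            "query": {"match_all": {}},  # placeholder for the real selection query
            "size": size,
            "_source": False,            # we only need the IDs at this stage
        },
    )
    resp.raise_for_status()
    hits = resp.json().get("hits", {}).get("hits", [])
    return [hit["_id"] for hit in hits]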
I have this all in place, but I find it extremely slow. Pulling back term vectors for 1,000 documents can take minutes (around one minute is pretty typical). I understand quite a bit of data needs to be passed around here, and I probably won't get it down to milliseconds, but this surprised me: by comparison, I can pull back the raw source field for the same 1,000 documents in a few seconds.
I'm running this in a VNet on Azure. I don't know the exact specs, but it's not like it's running over 3G, so I don't think network transfer is the biggest expense here.
I set up my mapping with term_vector enabled on the text field:
...
"text": {
    "term_vector": "yes",
    "type": "string",
    "fields": { ... }
}
...
And query against it like this:
POST /esproxy/cases/_mtermvectors
        ?term_statistics=false
        &positions=false
        &field_statistics=false
        &offsets=false
        &payloads=false
        &realtime=false
        &filter_path=docs._id%2Cdocs.term_vectors
        &fields=text
{
    "docs": [{
        "_index": "cases",
        "_type": "casedocument",
        "_id": "1346619000",
        "filter": {
            "max_num_terms": 1000
        }
    }, {
        "_index": "cases",
        "_type": "casedocument",
        "_id": "1346620986",
        "filter": {
            "max_num_terms": 1000
        }
    },
    ...]
}
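In case it helps, the batching loop that issues those requests is, simplified, something like this (same URL parameters as above; the base URL is a placeholder for our proxied cluster):

import requests

ES = "http://localhost:9200/esproxy"  # placeholder for the proxied cluster endpoint

MTV_PARAMS = {
    "term_statistics": "false",
    "positions": "false",
    "field_statistics": "false",
    "offsets": "false",
    "payloads": "false",
    "realtime": "false",
    "filter_path": "docs._id,docs.term_vectors",
    "fields": "text",
}

def chunked(seq, size):
    """Yield successive slices of seq with at most size items each."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

def fetch_term_vectors(ids, batch_size=100):
    """POST each batch of IDs to _mtermvectors and collect the per-doc results."""
    results = []
    for batch in chunked(ids, batch_size):
        body = {
            "docs": [
                {
                    "_index": "cases",
                    "_type": "casedocument",
                    "_id": doc_id,
                    "filter": {"max_num_terms": 1000},
                }
                for doc_id in batch
            ]
        }
        resp = requests.post(ES + "/cases/_mtermvectors", params=MTV_PARAMS, json=body)
        resp.raise_for_status()
        results.extend(resp.json().get("docs", []))
    return results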
I've tried it with several batch sizes, including 1, 100, and 1000, and the time scales more or less linearly. I save a few seconds by doing larger batches, as expected, but nothing game-changing in either direction.
So, my questions:
- Is this sort of performance just normal? It's possible that I'm reading way too much into it, and this is just how long it takes to move data around.
- Is there a better way to get the raw tokenized data out of Elasticsearch? If nothing else, that "max_num_terms": 1000 seems insanely sketchy to me, but I haven't found a way to mark it as unlimited.
- Depending on the answers here, I might look into putting the ML into a plugin directly against Elasticsearch (which would be independently cool, but would limit our other capabilities a bit). Could that help?
 
For what it's worth, I'm potentially running this against 100,000 documents (ideally more, but I've got sign-off to limit to that), so any savings would help.
I've got a cluster with three nodes, each with 28 GB of RAM. I would understand if that were my limiting factor, and I can accept that as an answer if it is. I can't "just try more," and I'd love better solutions, but I'd at least understand.
My quick-and-dirty solution has been to run the calls in parallel, which is fine with me, but I'd love to make each thread return a little more quickly.
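For reference, the parallel wrapper is nothing clever; reusing the chunked/fetch_term_vectors helpers from the sketch above, it's roughly:

from concurrent.futures import ThreadPoolExecutor

def fetch_term_vectors_parallel(ids, batch_size=100, workers=4):
    """Fan the per-batch _mtermvectors calls out over a small thread pool."""
    batches = list(chunked(ids, batch_size))
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # each worker handles one batch; fetch_term_vectors just sees a short ID list
        for docs in pool.map(lambda b: fetch_term_vectors(b, batch_size=batch_size), batches):
            results.extend(docs)
    return results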
Thanks,
Matthew