_mtermvectors very slow?

(Matthew Haugen) #1


I've got an ML pipeline that involves running a query against Elasticsearch, pulling down the resulting IDs, and then passing those (in chunks) to _mtermvectors to retrieve the full tokenized results. Notably, I'm not looking for the "most interesting" terms or anything like that: I just want to benefit from Elasticsearch's awesome tokenization, since we've already got our data there anyway.

I have this all in place, but I find it extremely slow. Pulling back term vectors for 1,000 documents can take minutes (around one minute is typical). I understand that quite a bit of data has to move around here, and I probably won't get it down to milliseconds, but this surprised me: I can pull back the raw source fields for the same 1,000 documents in a few seconds.
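For concreteness, here's roughly what that fast baseline looks like (a simplified sketch with the official Python client; the host, the field name, and the ID list are placeholders for my actual setup, and I happen to use mget here):

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://esproxy:9200"])  # placeholder host

    # IDs pulled down by the initial query (placeholder values).
    ids = ["1346619000", "1346620986"]

    # Fetching raw _source for ~1,000 IDs like this returns in a few seconds,
    # while _mtermvectors over the same IDs takes on the order of a minute.
    resp = es.mget(index="cases", doc_type="casedocument", body={"ids": ids})
    texts = [doc["_source"]["text"] for doc in resp["docs"]]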

I'm running this in a VNet on Azure. I don't know the exact specs, but it's not like it's running over 3G, so I don't think network transfer is the biggest expense here.

I set up my mapping with term_vector enabled:

"text": {
    "term_vector": "yes",
    "type": "string",
    "fields": { ... }
}

And query against it like this:

POST /esproxy/cases/_mtermvectors

{
    "docs": [{
        "_index": "cases",
        "_type": "casedocument",
        "_id": "1346619000",
        "filter": {
            "max_num_terms": 1000
        }
    }, {
        "_index": "cases",
        "_type": "casedocument",
        "_id": "1346620986",
        "filter": {
            "max_num_terms": 1000
        }
    }]
}
I've tried several batch sizes, including 1, 100, and 1,000, and the time scales more or less linearly. Larger batches save a few seconds, as expected, but nothing game-changing in either direction.
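The batching loop itself is nothing fancy. Roughly (again a simplified sketch; host, ID list, and chunk size are placeholders):

    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://esproxy:9200"])  # placeholder host

    ids = ["1346619000", "1346620986"]  # IDs pulled down by the initial query

    def batches(seq, size):
        # Yield successive chunks of `seq` with at most `size` elements.
        for i in range(0, len(seq), size):
            yield seq[i:i + size]

    for batch in batches(ids, 100):  # tried batch sizes of 1, 100, and 1,000
        body = {"docs": [{"_index": "cases",
                          "_type": "casedocument",
                          "_id": doc_id,
                          "filter": {"max_num_terms": 1000}}
                         for doc_id in batch]}
        resp = es.mtermvectors(body=body)
        # resp["docs"][i]["term_vectors"]["text"]["terms"] maps each term to
        # its stats; that's what feeds the ML pipeline.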

So, my questions:

  1. Is this sort of performance just normal? It's possible that I'm reading into it way too much, and this is just how long it takes to move data around.
  2. Is there a better way to get the raw tokenized data out of Elasticsearch? If nothing else, that "max_num_terms": 1000 seems insanely sketchy to me, but I haven't found a way to mark it as unlimited.
  3. Depending on the answers here, I might look into putting the ML into a plugin directly against Elasticsearch (which would be independently cool, but would limit our other capabilities a bit). Could that help?

For what it's worth, I'm potentially running this against 100,000 documents (ideally more, but I've got sign-off to limit it to that), so any savings would help.

I've got a cluster with three nodes, each with 28 GB of RAM. I would understand if that were my limiting factor, and I can accept that as an answer if it is. I can't "just try more," and I'd love better solutions, but I'd at least understand.

My quick-and-dirty workaround has been to run the calls in parallel, which is fine as far as it goes, but I'd love to make each thread return a little more quickly.
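The parallel version is just a thread pool over the same batches (a simplified sketch; the worker count and ID batches are placeholders, and I'm relying on the official Python client being thread-safe so one instance can be shared):

    from concurrent.futures import ThreadPoolExecutor
    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://esproxy:9200"])  # placeholder host

    # Chunks of IDs from the initial query (placeholder values).
    id_batches = [["1346619000", "1346620986"], ["1346621001"]]

    def fetch_term_vectors(batch):
        # One _mtermvectors call per batch of IDs.
        body = {"docs": [{"_index": "cases",
                          "_type": "casedocument",
                          "_id": doc_id,
                          "filter": {"max_num_terms": 1000}}
                         for doc_id in batch]}
        return es.mtermvectors(body=body)["docs"]

    # Each map() result is the "docs" list for one batch; flatten them.
    with ThreadPoolExecutor(max_workers=4) as pool:
        all_docs = [doc for docs in pool.map(fetch_term_vectors, id_batches)
                    for doc in docs]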

