Hello,
I've got an ML pipeline that involves running a query against Elasticsearch, pulling down the resultant IDs, then passing those (in chunks) to _mtermvectors to retrieve full tokenized results. Notably, I'm not looking for the "most interesting" terms, or anything like that: I just want to benefit from Elasticsearch's awesome tokenization platform, since we've already got our data there anyway.
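To make that concrete, the ID-gathering step is roughly the Python sketch below (the host and the query are placeholders here; the real selection query is more involved):

import requests

ES = "http://localhost:9200"  # placeholder; the real cluster sits behind an /esproxy endpoint

def fetch_ids(index="cases", size=1000):
    """Run the selection query and pull back only the matching document IDs."""
    resp = requests.post(
        "{0}/{1}/_search".format(ES, index),
        params={"filter_path": "hits.hits._id"},  # strip the response down to just the IDs
        json={
            "query": {"match_all": {}},  # placeholder for the real selection query
            "size": size,
            "_source": False,            # we only need the IDs at this stage
        },
    )
    resp.raise_for_status()
    hits = resp.json().get("hits", {}).get("hits", [])
    return [hit["_id"] for hit in hits]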
I have this all in place, but I find it extremely slow. Pulling back term vectors for 1,000 documents can take minutes (around one minute is pretty typical). I understand quite a bit of data needs to be passed around here, and I probably won't get it down to milliseconds, but this surprised me: by comparison, I can pull back the raw source field for the same 1,000 documents in a few seconds.
I'm running this in a VNet on Azure. I don't know the exact specs, but it's not like it's running over 3G, so I don't think network transfer is the biggest expense here.
I set up my mapping with term_vector enabled on the text field:
...
"text": {
    "term_vector": "yes",
    "type": "string",
    "fields": { ... }
}
...
And query against it like this:
POST /esproxy/cases/_mtermvectors
        ?term_statistics=false
        &positions=false
        &field_statistics=false
        &offsets=false
        &payloads=false
        &realtime=false
        &filter_path=docs._id%2Cdocs.term_vectors
        &fields=text
{
    "docs": [{
        "_index": "cases",
        "_type": "casedocument",
        "_id": "1346619000",
        "filter": {
            "max_num_terms": 1000
        }
    }, {
        "_index": "cases",
        "_type": "casedocument",
        "_id": "1346620986",
        "filter": {
            "max_num_terms": 1000
        }
    },
    ...]
}
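In case it helps, the batching loop that issues those requests is, simplified, something like this (same URL parameters as above; the base URL is a placeholder for our proxied cluster):

import requests

ES = "http://localhost:9200/esproxy"  # placeholder for the proxied cluster endpoint

MTV_PARAMS = {
    "term_statistics": "false",
    "positions": "false",
    "field_statistics": "false",
    "offsets": "false",
    "payloads": "false",
    "realtime": "false",
    "filter_path": "docs._id,docs.term_vectors",
    "fields": "text",
}

def chunked(seq, size):
    """Yield successive slices of seq with at most size items each."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

def fetch_term_vectors(ids, batch_size=100):
    """POST each batch of IDs to _mtermvectors and collect the per-doc results."""
    results = []
    for batch in chunked(ids, batch_size):
        body = {
            "docs": [
                {
                    "_index": "cases",
                    "_type": "casedocument",
                    "_id": doc_id,
                    "filter": {"max_num_terms": 1000},
                }
                for doc_id in batch
            ]
        }
        resp = requests.post(ES + "/cases/_mtermvectors", params=MTV_PARAMS, json=body)
        resp.raise_for_status()
        results.extend(resp.json().get("docs", []))
    return results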
I've tried it with several batch sizes, including 1, 100, and 1000, and the time scales more or less linearly. I save a few seconds by doing larger batches, as expected, but nothing game-changing in either direction.
So, my questions:
- Is this sort of performance just normal? It's possible that I'm reading way too much into it, and this is just how long it takes to move data around.
- Is there a better way to get the raw tokenized data out of Elasticsearch? If nothing else, that "max_num_terms": 1000 seems insanely sketchy to me, but I haven't found a way to mark it as unlimited.
- Depending on the answers here, I might look into putting the ML into a plugin directly against Elasticsearch (which would be independently cool, but would limit our other capabilities a bit). Could that help?
 
For what it's worth, I'm potentially running this against 100,000 documents (ideally more, but I've got sign-off to limit to that), so any savings would help.
I've got a cluster with three nodes, each with 28 GB of RAM. I would understand if that were my limiting factor, and I can accept that as an answer if it is. I can't "just try more," and I'd love better solutions, but I'd at least understand.
My quick-and-dirty solution has been to run the calls in parallel, which is fine with me, but I'd love to make each thread return a little more quickly.
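For reference, the parallel wrapper is nothing clever; reusing the chunked/fetch_term_vectors helpers from the sketch above, it's roughly:

from concurrent.futures import ThreadPoolExecutor

def fetch_term_vectors_parallel(ids, batch_size=100, workers=4):
    """Fan the per-batch _mtermvectors calls out over a small thread pool."""
    batches = list(chunked(ids, batch_size))
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # each worker handles one batch; fetch_term_vectors just sees a short ID list
        for docs in pool.map(lambda b: fetch_term_vectors(b, batch_size=batch_size), batches):
            results.extend(docs)
    return results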
Thanks,
Matthew