Hello,
I've got an ML pipeline that involves running a query against Elasticsearch, pulling down the resultant IDs, then passing those (in chunks) to _mtermvectors to retrieve full tokenized results. Notably, I'm not looking for the "most interesting" terms, or anything like that: I just want to benefit from Elasticsearch's awesome tokenization platform, since we've already got our data there anyway.
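For context, the chunking step in my pipeline is nothing fancy; a minimal sketch (the helper name is mine, not anything from the Elasticsearch client):

```python
def chunked(ids, size):
    """Yield successive fixed-size chunks from a list of document IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

# e.g. 1,000 IDs in batches of 100 -> 10 _mtermvectors requests
ids = [str(i) for i in range(1000)]
batches = list(chunked(ids, 100))
```

Each batch then becomes one _mtermvectors call, as shown further down.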
I have this all in place, but I find it extremely slow. Pulling back 1,000 documents' term vectors can take minutes (around one minute is typical). I understand quite a bit of data needs to be passed around here, and I probably won't get it down to milliseconds, but this surprised me: I can pull back the same 1,000 raw source data fields in a few seconds.
I'm running this in a VNet on Azure. I don't know the exact specs, but it's not like it's running over 3G--I don't think network transfer is the biggest expense here.
I set up my mapping with term_vector enabled:
...
"text": {
  "term_vector": "yes",
  "type": "string",
  "fields": { ... }
}
...
And query against it like this:
POST /esproxy/cases/_mtermvectors
    ?term_statistics=false
    &positions=false
    &field_statistics=false
    &offsets=false
    &payloads=false
    &realtime=false
    &filter_path=docs._id%2Cdocs.term_vectors
    &fields=text
{
  "docs": [{
    "_index": "cases",
    "_type": "casedocument",
    "_id": "1346619000",
    "filter": { "max_num_terms": 1000 }
  }, {
    "_index": "cases",
    "_type": "casedocument",
    "_id": "1346620986",
    "filter": { "max_num_terms": 1000 }
  },
  ...]
}
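In code, I generate that request body per chunk roughly like this (the function name and defaults are just illustrative):

```python
import json

def mtermvectors_body(ids, index="cases", doc_type="casedocument", max_terms=1000):
    """Build the _mtermvectors request body for one chunk of document IDs."""
    return {
        "docs": [
            {
                "_index": index,
                "_type": doc_type,
                "_id": doc_id,
                "filter": {"max_num_terms": max_terms},
            }
            for doc_id in ids
        ]
    }

body = mtermvectors_body(["1346619000", "1346620986"])
payload = json.dumps(body)  # POSTed to /esproxy/cases/_mtermvectors?...
```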
I've tried it with several batch sizes, including 1, 100, and 1000, and the time scales more or less linearly. I save a few seconds by doing larger batches, as expected, but nothing game-changing in either direction.
So, my questions:
- Is this sort of performance just normal? It's possible that I'm reading too much into it, and this is simply how long it takes to move that much data around.
- Is there a better way to get the raw tokenized data out of Elasticsearch? If nothing else, that "max_num_terms": 1000 seems insanely sketchy to me, but I haven't found a way to mark it as unlimited.
- Depending on the answers here, I might look into putting the ML into a plugin directly against Elasticsearch (which would be independently cool, but would limit our other capabilities a bit). Could that help?
For what it's worth, I'm potentially running this against 100,000 documents (ideally more, but I've got sign-off to limit to that), so any savings would help.
I've got a cluster with three nodes, each with 28 GB of RAM. I would understand if that were my limiting factor, and I can accept that as an answer if it is. I can't "just try more," and I'd love better solutions, but I'd at least understand.
My quick-and-dirty solution has been to run the calls in parallel, which is fine with me, but I'd love to make each thread return a little more quickly.
Thanks,
Matthew