Stored term vectors still slow when retrieving their scores (terms filtering)

Hi all,

I want to get the most characteristic words for each document I have stored.
The most straightforward approach I tried (calling from Python, with es as my Elasticsearch client instance) was to loop over all my ids and call:
es.termvectors(index=INDEX_NAME, doc_type=TYPE_NAME, id=ii, fields=fields4TermVec, field_statistics=True, term_statistics=True, dfs=False, positions=False, offsets=False, body=bod_4TermVecs)

fields4TermVec contains 14 fields
bod_4TermVecs = {"filter": {"max_num_terms": 52, "min_term_freq": 2, "min_doc_freq": 1}}

This turned out not to be feasible, because it is far too slow.

So I thought that if I reindexed into a new index whose mapping sets "term_vector": "yes" for the 14 fields in fields4TermVec, I would get a substantial speed-up. But this was not the case. Does anyone know why?

The only reason I can come up with is that it still needs to compute the doc_freq and term_freq, so storing the term vectors makes no difference.

Right - there's information that can be stored with a document that is static (e.g. the frequency of the term in the document) and there's stuff that is dynamic as the index changes such as the number of docs that contain a term. The latter requires a lot of look-ups to gather frequencies. Lookups=random disk seeks=slow.

Ok, so it is like I thought. Thanks a lot!

@Mark_Harwood, you're right that there are lookups, but I think the implementation could be improved by keeping a per-request Map<String,TermStatistics>. If you ask for filtered TVs of many docs (MultiTVRequest), such a local (transient) cache could avoid looking up the same terms over and over, and could potentially reduce the seeks performed per document and term. I ran into the same issue today ...
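To illustrate the idea, here is a minimal Python sketch of such a request-scoped cache. It is not how Elasticsearch implements anything: `TermStats`, `RequestScopedTermStats`, `stats_for_docs` and the toy `FAKE_INDEX` are all hypothetical names, and the dictionary lookup merely stands in for the expensive per-term index lookup.

```python
from typing import Callable, Dict, Iterable, List, NamedTuple


class TermStats(NamedTuple):
    doc_freq: int
    total_term_freq: int


class RequestScopedTermStats:
    """Memoizes term statistics for the lifetime of one multi-doc request."""

    def __init__(self, lookup: Callable[[str], TermStats]) -> None:
        self._lookup = lookup  # stand-in for the expensive index lookup
        self._cache: Dict[str, TermStats] = {}
        self.lookups = 0  # number of times the "index" was actually hit

    def get(self, term: str) -> TermStats:
        stats = self._cache.get(term)
        if stats is None:
            stats = self._lookup(term)
            self._cache[term] = stats
            self.lookups += 1
        return stats


def stats_for_docs(docs: Iterable[List[str]], cache: RequestScopedTermStats):
    """Collect statistics for each doc's unique terms, sharing one cache."""
    return [{term: cache.get(term) for term in set(terms)} for terms in docs]


# Toy stand-in for the index: term -> (doc_freq, total_term_freq).
FAKE_INDEX = {
    "the": TermStats(1000, 5000),
    "fish": TermStats(12, 30),
    "bicycle": TermStats(7, 9),
}

cache = RequestScopedTermStats(FAKE_INDEX.__getitem__)
result = stats_for_docs([["the", "fish", "the"], ["the", "bicycle"]], cache)
print(cache.lookups)  # "the" is looked up once, although both docs contain it
```

The cache lives only as long as the request, so it never has to be invalidated when the index changes.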

We do this sort of caching when running significant_terms aggregations. Could you achieve the results you need using significant_terms? I know that it currently relies on memory-hungry fielddata, but that's something I'm working on.

I read about significant_terms and I don't think it can help me. I need to fetch the TermVectors of multiple documents (say 100), and in order to reduce the returned payload size, I would like to return only the top-K tf-idf scoring terms (if docs have few 1000s of unique terms, and K=20,50,100, I expect to get a much smaller payload).

Perhaps I'm missing something about significant_terms, but it doesn't look like it addresses this requirement.
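For reference, the top-K selection described above can be sketched on the client side in Python. This only illustrates the scoring: it still needs the full term vector, so it does not reduce the payload. The smoothed idf formula is an assumption (one common tf-idf variant) and may differ from what the server-side filter computes; `top_k_terms` and the sample numbers are made up, while the `field_statistics` / `terms` layout follows the term vectors response for a single field.

```python
import heapq
import math
from typing import List, Tuple


def top_k_terms(field_tv: dict, k: int) -> List[Tuple[str, float]]:
    """Return the k highest tf-idf terms of one field's term vector.

    Expects a response produced with term_statistics=True and
    field_statistics=True, so doc_freq and doc_count are present.
    """
    doc_count = field_tv["field_statistics"]["doc_count"]
    scored = []
    for term, info in field_tv["terms"].items():
        # Assumed idf variant: log((N + 1) / (df + 1)) + 1
        idf = math.log((doc_count + 1) / (info["doc_freq"] + 1)) + 1.0
        scored.append((term, info["term_freq"] * idf))
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Made-up term vector for one field of one document.
sample = {
    "field_statistics": {"sum_doc_freq": 5000, "doc_count": 100, "sum_ttf": 9000},
    "terms": {
        "the": {"doc_freq": 100, "ttf": 900, "term_freq": 10},
        "elasticsearch": {"doc_freq": 4, "ttf": 12, "term_freq": 5},
        "kibana": {"doc_freq": 2, "ttf": 3, "term_freq": 1},
    },
}

top2 = top_k_terms(sample, 2)
print([term for term, _ in top2])
```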

Here's a query for example docs, referring to them directly by a unique ID and asking for significant terms:

GET signalmedia/_search
{
  "query": {
    "terms": {
      "my_id": [ ... ]
    }
  },
  "size": 0,
  "aggs": {
    "keywords": {
      "significant_terms": {
        "field": "content",
        "size": 20
      }
    }
  }
}

Here are the results (they happen to be docs about elasticsearch):

{
  "took": 84,
  ...
  "aggregations": {
    "keywords": {
      "doc_count": 9,
      "buckets": [
        {
          "key": "logstash",
          "doc_count": 3,
          "score": 27777.44444444444,
          "bg_count": 4
        },
        {
          "key": "elasticsearch",
          "doc_count": 8,
          "score": 22574.067019400354,
          "bg_count": 35
        },
        {
          "key": "kibana",
          "doc_count": 3,
          "score": 22221.888888888887,
          "bg_count": 5
        },
        ...
      ]
    }
  }
}

This requires fielddata:true and can be costly, which is why I'm busy working on a significant_text agg that retokenizes top matches on the fly from the stored source.

Thanks for the example, however this returns the top terms for all queried documents, and not the top ones per document (as it's an aggregation), which is what I need... any way to do that with that agg?

No. If I understand your question correctly, you see each doc as independent, and the keywords you want for doc 1 (which might be about fish) are not influenced in any way by your other choice of docs, which might be about something entirely different like bicycles or chocolate?
I.e. you might as well make separate requests for each doc, were it not for the added network costs. If so, maybe try the "more like this" query, and set size:1 and the explain:true setting to see what the MLT logic picked out as the interesting keywords in the example doc.
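A rough Python sketch of what such a request body might look like is below. `build_mlt_explain_body` is a hypothetical helper, and the parameter values are placeholders rather than recommendations; depending on the Elasticsearch version, the entry under "like" may also need a "_type".

```python
from typing import Iterable


def build_mlt_explain_body(index: str, doc_id: str, fields: Iterable[str],
                           max_query_terms: int = 25) -> dict:
    """Build a search body that runs more_like_this against one stored doc.

    size=1 keeps the response small; explain=True makes the hit carry an
    explanation listing the terms the MLT query selected and their weights.
    """
    return {
        "size": 1,
        "explain": True,
        "query": {
            "more_like_this": {
                "fields": list(fields),
                # The example document is referenced by id, not sent as text.
                "like": [{"_index": index, "_id": doc_id}],
                "min_term_freq": 1,
                "max_query_terms": max_query_terms,
            }
        },
    }


body = build_mlt_explain_body("signalmedia", "some-doc-id", ["content"])
# With a client instance `es` (an assumption here), this could be sent as:
# es.search(index="signalmedia", body=body)
print(body["query"]["more_like_this"]["like"])
```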

This issue is about how slow terms filtering is when retrieving the TVs of multiple documents. I think we agree that the implementation can be improved, right?

About your proposal, my current use case is this: I execute a query Q and retrieve N results. For each I would like to fetch the term vectors and do some post-processing at the client side. I use a multi-TV request, so I only have one additional round-trip to the server.

Due to network latency, fetching those TVs (think top 100 docs, each has few hundreds to thousands of unique terms) is slow (big response payload + network latency). So I thought to retrieve only the top-K terms of each document, in order to reduce the size of the TV response payload. However, due to the current implementation, the response time of terms filtering is actually much higher (and that's something I measured on my local laptop, i.e. no network latency...).
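For concreteness, the kind of multi-TV request with terms filtering I mean could be built like this from Python. `build_mtermvectors_body` is a hypothetical helper; the per-document keys mirror the parameters of the single-document term vectors call, and the default values are placeholders only.

```python
from typing import Iterable


def build_mtermvectors_body(doc_ids: Iterable[str], fields: Iterable[str],
                            max_num_terms: int = 20, min_term_freq: int = 1,
                            min_doc_freq: int = 1) -> dict:
    """Build one multi term vectors body covering all result documents."""
    field_list = list(fields)
    return {
        "docs": [
            {
                "_id": doc_id,
                "fields": field_list,
                # Term statistics are needed for tf-idf based filtering.
                "term_statistics": True,
                "field_statistics": True,
                "positions": False,
                "offsets": False,
                "filter": {
                    "max_num_terms": max_num_terms,
                    "min_term_freq": min_term_freq,
                    "min_doc_freq": min_doc_freq,
                },
            }
            for doc_id in doc_ids
        ]
    }


body = build_mtermvectors_body(["1", "2"], ["content"], max_num_terms=20)
# With a client instance `es` (an assumption here), one round-trip would be:
# es.mtermvectors(index=INDEX_NAME, body=body)
print(len(body["docs"]))
```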

I will consider the MLT approach you mentioned, but I don't have an example document and I don't want to issue a request per document.

Do you see any reason not to improve the implementation in ES, when serving a multi-TV request with terms filtering?

Ah yes. OK so even unrelated docs will share terms [and, of, the, if, when, I, you, .....] and you want to avoid looking those words up multiple times. Got it.

OK so there is a common theme to the docs - they all match the same query. If you replace the ids in my previous significant_terms example with your choice of query (e.g. "bird flu") then analyzing those docs should spot "h5n1".
You would need to do a follow-up query though for those top docs and a terms agg with an include clause listing the significant terms in order to discover which docs had those keywords.

The advantage of looking across the docs as a set rather than as individual hits is you can figure out that H5N1 is highly significant when it might be mentioned only once in a handful of docs (low TF).

Significant_text is intended to tackle this sort of thing and adds sequence de-duplication which I see as necessary for use on typical real-world text.

Thanks @Mark_Harwood. I intend to experiment with significant_terms more, as it looks interesting (it's more than the simple 'terms' aggregation I previously thought it was). So far, and without diving too deep into it, it's not that fast (4-5 seconds on my laptop, against a local index of 500K docs, with a one-word query), but I still need to experiment with it, so I don't mind the times too much yet. And I know you're working on improving it.

In parallel, though, the TV I fetch for each result document is taken as its profile, and by taking only the top-K terms I treat them as a truncated profile, which is still required by my application. Therefore I do wish the implementation of multi-TV with filters would be improved.

If I took a stab at it, do you think it's something that you would consider having in the code? I may not get to it right away, but if you think positively about this improvement, I'll try to allocate some time to it, as I do rely on that API.

So sampling with the sampler (or diversified_sampler) aggregation is not only beneficial to performance but also results quality.
Another key issue with most real-world text content is that of the various forms of content duplication that throw off statistical analysis. Check out the approach used in this new significant_text aggregation coming in 6.0 that deals with on-the-fly de-duplication on real-world examples: