How can I aggregate terms by their tf-idf score in Elasticsearch?

apanimesh061 · June 21, 2016, 5:58pm

Suppose I run a query that returns a total of 1000 documents, and I want to aggregate over the top 500 documents, with the resulting terms sorted by their tf-idf scores.

Is it possible to do that in Elasticsearch?

I am using v2.3.3.

polyfractal · June 21, 2016, 8:06pm

I'm not sure I understand what you mean by "terms sorted in order of their tf-idf score". Hits are already returned in relevance order.

Are you wanting a list of the top 500 terms from the matching 1000 documents?

Could you perhaps give an example of what you're looking for?

Mark_Harwood · June 21, 2016, 8:07pm

TF is a per-document score, so it doesn't make sense to have a unique list of terms, each with a single score that includes any notion of TF.
See the "explain" API instead: https://www.elastic.co/guide/en/elasticsearch/reference/2.3/search-request-explain.html
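
To illustrate the point above, here is a minimal sketch in plain Python, not Elasticsearch code. It assumes the textbook formulation tf * ln(N / df) rather than Lucene's actual similarity formula, and the toy corpus and function names are hypothetical. It shows that IDF is a single value per term, while TF (and therefore tf-idf) differs for every document that contains the term.

```python
import math
from collections import Counter

# Hypothetical toy corpus: document id -> analyzed tokens of the "title" field.
docs = {
    "doc1": ["big", "data", "data", "tools"],
    "doc2": ["data", "science"],
    "doc3": ["cooking", "tools"],
}

def idf(term, docs):
    """Inverse document frequency: one value per term for the whole corpus."""
    df = sum(1 for tokens in docs.values() if term in tokens)
    return math.log(len(docs) / df)

def tf_idf(term, docs):
    """Return {doc_id: tf * idf} for every document that contains the term."""
    term_idf = idf(term, docs)
    return {
        doc_id: Counter(tokens)[term] * term_idf
        for doc_id, tokens in docs.items()
        if term in tokens
    }

# "data" gets a different score in each document it appears in.
scores = tf_idf("data", docs)
for doc_id, score in sorted(scores.items()):
    print(doc_id, round(score, 3))
```

Because one term yields one score per document, ranking terms by a single tf-idf value would first require choosing how to combine those per-document scores (sum, mean, max, and so on).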

apanimesh061 · June 21, 2016, 8:20pm

Hi,
Thanks for the reply.

Yes, you are right: I want the top 50 terms from the matching documents.

I added this to the query I was running:

"aggregations": {
    "importantTerms": {
        "terms": {
            "size": 25,
            "field": "title"
        }
    }
}

I got the terms aggregated and sorted by doc_count.

I want the sorting to be done by the tf * idf value instead. Is it even possible to get tf and idf of a particular term this way?

I also tried significant_terms, but it is just too slow.

Just to add to this: the terms that I get are unigrams. Is there a way to get bigrams?

Mark_Harwood · June 22, 2016, 6:56am

Try wrapping it in the sampler aggregation to focus the inspection on only the top N docs rather than all.

apanimesh061 · June 22, 2016, 2:12pm

@Mark_Harwood
I am not sure if this aggregation is correct:

"aggs": {
    "sample": {
        "sampler": {
            "shard_size": 10,
            "field": "title"
        },
        "aggs": {
            "keywords": {
                "significant_terms": {
                    "field": "title"
                }
            }
        }
    }
}

When I run this query I get 0 keywords and a CircuitBreakingException.

Is there something I am not doing correctly?

Mark_Harwood · June 22, 2016, 2:16pm

Try this:

"aggs": {
    "sample": {
        "sampler": {
            "shard_size": 1000
        },
        "aggs": {
            "keywords": {
                "significant_terms": {
                    "field": "title"
                }
            }
        }
    }
}

You want a reasonable sample size to get any sensible stats (like a survey - you wouldn't sample just 10 people). You don't need the "field" property - that only gets used to control diversity in the sample.

apanimesh061 · June 22, 2016, 2:24pm

@Mark_Harwood

Thanks for the suggestion.

I got a result like:

{
  "took": 927,
  "timed_out": false,
  "_shards": {
    "total": 24,
    "successful": 24,
    "failed": 0
  },
  "hits": {
    "total": 79,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "sample": {
      "doc_count": 79,
      "keywords": {
        "doc_count": 79,
        "buckets": [
          {
            "key": "data",
            "doc_count": 3,
            "score": 234.19592738702784,
            "bg_count": 6916
          }
        ]
      }
    }
  }
}

I only get one term. I tried increasing the shard_size from 1000 to 10,000 and 20,000, but I got the same result. I also tried adding "size": 10 to the aggregation, but that did not help.

Is it right to be getting only one significant_term?

Mark_Harwood · June 22, 2016, 2:29pm

I don't know what your search is but it only matches 79 documents in total across 24 shards.
That could mean each shard is looking for statistically significant changes in a sample that may be as small as only 2 or 3 docs. That doesn't provide enough of a signal. You could lower the min_doc_count and shard_min_doc_count from the default settings but that likely won't help - we need a reasonable number of docs before the recommendations will be any good.
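
As a rough sketch of the two points above, the Python below works out the average number of matching docs per shard from the response shown earlier and builds the same sampler request body with the two thresholds lowered. The helper name `keywords_request` is hypothetical, the thresholds of 1 are only an example, and the function returns just the aggregation part, so the original query would still need to be added to the body.

```python
import json

def keywords_request(shard_size=1000, min_doc_count=None, shard_min_doc_count=None):
    """Build the sampler + significant_terms aggregation body as a dict."""
    significant_terms = {"field": "title"}
    # Only send the thresholds when explicitly lowered; otherwise the
    # Elasticsearch defaults apply.
    if min_doc_count is not None:
        significant_terms["min_doc_count"] = min_doc_count
    if shard_min_doc_count is not None:
        significant_terms["shard_min_doc_count"] = shard_min_doc_count
    return {
        "aggs": {
            "sample": {
                "sampler": {"shard_size": shard_size},
                "aggs": {"keywords": {"significant_terms": significant_terms}},
            }
        }
    }

# Figures taken from the response above: 79 hits spread over 24 shards.
total_hits, total_shards = 79, 24
avg_docs_per_shard = total_hits / total_shards
print(f"about {avg_docs_per_shard:.1f} matching docs per shard on average")

body = keywords_request(min_doc_count=1, shard_min_doc_count=1)
print(json.dumps(body, indent=2))
```

Lowering the thresholds may surface more buckets, but with only a handful of docs per shard the scores are unlikely to be meaningful, which is the caveat given above.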
