Suppose I run a query which returns a total of 1000 documents and want to aggregate the top 500 documents with terms sorted in order of their tf-idf scores.
Is it possible to do that in Elasticsearch?
I am using v2.3.3.
I'm not sure I understand what you mean by "terms sorted in order of their tf-idf score". Hits are already returned in relevance order.
Do you want a list of the top 500 terms from the matching 1000 documents?
Could you perhaps give an example of what you're looking for?
TF is a per-document score so it doesn't make sense to have a unique list of terms each with a single score that includes any notion of TF.
See the "explain" API instead: https://www.elastic.co/guide/en/elasticsearch/reference/2.3/search-request-explain.html
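To illustrate why a single per-term tf-idf value is ill-defined, here is a minimal Python sketch. It assumes a hypothetical three-document corpus and a simplified tf * idf formula (raw term count times log(N / df)), which is not Lucene's exact scoring; the names `docs`, `idf` and `tf_idf_per_doc` are made up for the example.

```python
import math
from collections import Counter

# Hypothetical toy corpus: document id -> tokenized title (unigrams).
docs = {
    "doc1": ["big", "data", "data", "tools"],
    "doc2": ["data", "science"],
    "doc3": ["search", "tools"],
}

def idf(term):
    # Simplified inverse document frequency: log(N / df).
    # This is an illustrative formula, not the one Lucene uses.
    df = sum(1 for tokens in docs.values() if term in tokens)
    return math.log(len(docs) / df)

def tf_idf_per_doc(term):
    # tf is counted inside each document, so every document that
    # contains the term produces its own tf * idf value.
    return {
        doc_id: Counter(tokens)[term] * idf(term)
        for doc_id, tokens in docs.items()
        if term in tokens
    }

scores = tf_idf_per_doc("data")
print(scores)
```

The term "data" gets a different value in doc1 and doc2 because its tf differs, so any single number per term would have to come from some extra aggregation step (a sum, a maximum, an average) that you choose yourself.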
Hi,
Thanks for the reply.
Yes, you are right: I want the top 50 terms from the matching documents.
I added this to the query I was running:
"aggregations": {
  "importantTerms": {
    "terms": {
      "size": 25,
      "field": "title"
    }
  }
}
I got the terms aggregated and sorted by doc_count.
I want the sorting to be done by the tf * idf value instead. Is it even possible to get tf and idf of a particular term this way?
I also tried significant_terms but it is just too slow.
Just to add to this, the terms that I get are unigrams. Is there a way to get bigrams?
Try wrapping it in the sampler aggregation to focus the inspection on only the top N docs rather than all.
@Mark_Harwood
I am not sure if this aggregation is correct:
"aggs": {
  "sample": {
    "sampler": {
      "shard_size": 10,
      "field": "title"
    },
    "aggs": {
      "keywords": {
        "significant_terms": {
          "field": "title"
        }
      }
    }
  }
}
When I run this query I get 0 keywords and a CircuitBreakingException.
Is there something I am not doing correctly?
Try this:
"aggs": {
  "sample": {
    "sampler": {
      "shard_size": 1000
    },
    "aggs": {
      "keywords": {
        "significant_terms": {
          "field": "title"
        }
      }
    }
  }
}
You want a reasonable sample size to get any sensible stats (like a survey - you wouldn't sample just 10 people). You don't need the "field" property - that only gets used to control diversity in the sample.
Thanks for the suggestion.
I got a result like:
{
  "took": 927,
  "timed_out": false,
  "_shards": {
    "total": 24,
    "successful": 24,
    "failed": 0
  },
  "hits": {
    "total": 79,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "sample": {
      "doc_count": 79,
      "keywords": {
        "doc_count": 79,
        "buckets": [
          {
            "key": "data",
            "doc_count": 3,
            "score": 234.19592738702784,
            "bg_count": 6916
          }
        ]
      }
    }
  }
}
I only get one term. I tried increasing the shard_size from 1000 to 10,000 and 20,000 but I got the same result. I also tried adding "size": 10 to the aggregations but it is not helping.
Is it right to be getting only one significant_term?
I don't know what your search is but it only matches 79 documents in total across 24 shards.
That could mean each shard is looking for statistically significant changes in a sample that may be as small as only 2 or 3 docs. That doesn't provide enough of a signal. You could lower the min_doc_count and shard_min_doc_count from the default settings but that likely won't help - we need a reasonable number of docs before the recommendations will be any good.
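As a rough illustration of the numbers involved, the Python sketch below works out the average number of matching docs per shard for the response above, and builds a request body with the two thresholds lowered. The helper names `docs_per_shard` and `build_aggs`, the threshold value of 1, and the top-level "size": 0 are assumptions for the example; the original query is not shown in the thread, so it is left out of the body.

```python
import json

def docs_per_shard(total_hits, shard_count):
    # Average matching docs per shard, assuming hits are spread evenly.
    return total_hits / shard_count

def build_aggs(field, shard_size=1000, min_doc_count=1, shard_min_doc_count=1):
    # Sampler wrapping significant_terms, with both doc-count thresholds
    # set explicitly. The value 1 is only an illustration, not a recommendation.
    return {
        "sample": {
            "sampler": {"shard_size": shard_size},
            "aggs": {
                "keywords": {
                    "significant_terms": {
                        "field": field,
                        "min_doc_count": min_doc_count,
                        "shard_min_doc_count": shard_min_doc_count,
                    }
                }
            },
        }
    }

# 79 hits over 24 shards, taken from the response posted above.
avg = docs_per_shard(79, 24)
print(round(avg, 1))

# Assumed request body: no hits returned, aggregations only; the query is omitted.
body = {"size": 0, "aggs": build_aggs("title")}
print(json.dumps(body, indent=2))
```

With only about three docs per shard on average, lowering the thresholds may surface more terms, but as noted above they are unlikely to be meaningful.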