Equivalent of 'Group By' along with significant_text aggregation

Hi folks. I'm looking for some equivalent of SQL's 'GROUP BY' or 'DISTINCT'. Basically, I have this query:

{
	"query": {
		"range": {
			"post_date": {
				"from": "2016-04-14 00:00:00",
				"to": "2016-04-15 00:00:00"
			}
		}
	},
	
	"aggregations": {
		"keywords": {
			"significant_text": {
				"field": "post_content",
				"size": 50,
				"background_filter": {
					"range": {
						"post_date": {
							"from": "2016-01-01 00:00:00",
							"to": "2016-04-13 00:00:00"
						}
					}
				}
			}
		}
	}
}

The problem is that if I have a lot of posts between April 14 and April 15, 2016 in the same category, it skews the results I'm looking for. For example, if I entered 15 posts about "Jack", I don't want Jack to show up as "doc_count": 15 when all of those entries are in the same category.

The category is available as term_id in the documents like this:

{
	"_index": "indexname",
	"_type": "_doc",
	"_id": "6149376",
	"_score": 1,
	"_source": {
		"post_id": 6149376,
		"ID": 6149376,
		"post_author": {
			"raw": "admin",
			"login": "admin",
			"display_name": "admin",
			"id": 1
		},
		"post_date": "2016-04-14 01:34:17",
		"post_date_gmt": "2016-04-14 01:34:17",
		"post_title": "title",
		"post_excerpt": "",
		"post_content_filtered": "content",
		"post_status": "publish",
		"post_name": "title",
		"post_modified": "2016-04-14 01:34:17",
		"post_modified_gmt": "2016-04-14 01:34:17",
		"post_parent": 0,
		"post_type": "post",
		"post_mime_type": "",
		"permalink": "https://example.com/title",
		"terms": {
			"category": [
				{
					"term_id": 1,
					"slug": "uncategorized",
					"name": "Uncategorized",
					"parent": 0,
					"term_taxonomy_id": 1,
					"term_order": 0,
					"facet": "{\"term_id\":1,\"slug\":\"uncategorized\",\"name\":\"Uncategorized\",\"parent\":0,\"term_taxonomy_id\":1,\"term_order\":0}"
				}
			]
		}
	}
}

Would appreciate pointers on how to modify the query to aggregate by the term_id. If this can be done for the background_filter as well, it may also be helpful.
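(For reference, the literal 'group by' shape would be a terms aggregation on terms.category.term_id with significant_text nested under it, which produces one keyword list per category rather than a single combined list — a rough sketch, reusing the fields from above; the "by_category" name is made up, and the background_filter could be added back in:)

```json
{
	"query": {
		"range": {
			"post_date": {
				"from": "2016-04-14 00:00:00",
				"to": "2016-04-15 00:00:00"
			}
		}
	},
	"aggregations": {
		"by_category": {
			"terms": {
				"field": "terms.category.term_id",
				"size": 20
			},
			"aggregations": {
				"keywords": {
					"significant_text": {
						"field": "post_content",
						"size": 50
					}
				}
			}
		}
	}
}
```

Whether per-category keyword lists or one de-skewed global list is preferable depends on how the results are consumed.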

Update: I found that the 'diversified_sampler' can help, e.g.:

{
	"size": 50,
	"query": {
		"range": {
			"post_date": {
				"from": "2016-05-01 00:00:00",
				"to": "2016-05-04 00:00:00"
			}
		}
	},

	"aggregations": {
		"my_unbiased_sample": {
			"diversified_sampler": {
				"shard_size": 1000,
				"field": "terms.category.term_id"
			},
			"aggregations": {
				"keywords": {
					"significant_text": {
						"field": "post_title",
						"size": 50,
						"background_filter": {
							"range": {
								"post_date": {
									"from": "2016-01-01 00:00:00",
									"to": "2016-05-30 00:00:00"
								}
							}
						}
					}
				}
			}
		}
	}
}

But I realized something: this picks just one doc per category (which is better than picking every doc in every category), but what I really want is to let each category contribute at most one doc to the bucket for a particular keyword.

So rather than the equivalent of SQL's 'GROUP BY(category)', what I'm really after is something like 'GROUP BY(category, keyword)', if that's possible.

You can pick the maximum number of docs per field value via the sampler's max_docs_per_value setting.
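For example, to cap each category at one doc in the sample (the values shown are illustrative):

```json
"diversified_sampler": {
	"shard_size": 1000,
	"field": "terms.category.term_id",
	"max_docs_per_value": 1
}
```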

Regardless, you may want to look at significant_text's filter_duplicate_text setting, which can avoid skew from copy-pasted sections of text found in multiple docs.

Thanks, both of those pointers (max docs per value and filter_duplicate_text) are helpful, so I'll use them together.

I also found the "exclude" setting, so now I'm using it to exclude date fragments like "18 06 2016" or "18 june" with a regex like:

			"significant_text": {
				"field": "post_content",
				"exclude": ".*(2016|16|06|18).*",
				"filter_duplicate_text": true,

Good to know.
Hopefully you have all your content in one index/shard rather than many, if possible.
Most users have time-based indices, e.g. one per day, and that is no use when trying to find things like trending topics on a given day: the shard with today's content has no history to compare it against, and all the other shards don't know what happened today, so none of them can produce results.