Equivalent of 'Group By' along with significant_text aggregation

Hi folks. I'm looking for some equivalent of SQL's 'GROUP BY' or 'DISTINCT'. Basically, I have this query:

{
	"query": {
		"range": {
			"post_date": {
				"from": "2016-04-14 00:00:00",
				"to": "2016-04-15 00:00:00"
			}
		}
	},
	
	"aggregations": {
		"keywords": {
			"significant_text": {
				"field": "post_content",
				"size": 50,
				"background_filter": {
					"range": {
						"post_date": {
							"from": "2016-01-01 00:00:00",
							"to": "2016-04-13 00:00:00"
						}
					}
				}
			}
		}
	}
}

The problem is that if I have a lot of posts between April 14 and April 15, 2016 in the same category, it skews the results I'm looking for. For example, if I entered 15 posts about "Jack", I don't want Jack to show up as "doc_count": 15 when all of those entries are in the same category.

The category is available as term_id in the documents like this:

{
	"_index": "indexname",
	"_type": "_doc",
	"_id": "6149376",
	"_score": 1,
	"_source": {
		"post_id": 6149376,
		"ID": 6149376,
		"post_author": {
			"raw": "admin",
			"login": "admin",
			"display_name": "admin",
			"id": 1
		},
		"post_date": "2016-04-14 01:34:17",
		"post_date_gmt": "2016-04-14 01:34:17",
		"post_title": "title",
		"post_excerpt": "",
		"post_content_filtered": "content",
		"post_status": "publish",
		"post_name": "title",
		"post_modified": "2016-04-14 01:34:17",
		"post_modified_gmt": "2016-04-14 01:34:17",
		"post_parent": 0,
		"post_type": "post",
		"post_mime_type": "",
		"permalink": "https://example.com/title",
		"terms": {
			"category": [
				{
					"term_id": 1,
					"slug": "uncategorized",
					"name": "Uncategorized",
					"parent": 0,
					"term_taxonomy_id": 1,
					"term_order": 0,
					"facet": "{\"term_id\":1,\"slug\":\"uncategorized\",\"name\":\"Uncategorized\",\"parent\":0,\"term_taxonomy_id\":1,\"term_order\":0}"
				}
			]
		}
	}
}

Would appreciate pointers on how to modify the query to aggregate by the term_id. If this can be done for the background_filter as well, it may also be helpful.
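(For reference, the literal 'group by' shape would be a terms aggregation on terms.category.term_id with significant_text nested under it, which produces one keyword list per category rather than a single combined list — a rough sketch, reusing the fields from above; the "by_category" name is made up, and the background_filter could be added back in:)

```json
{
	"query": {
		"range": {
			"post_date": {
				"from": "2016-04-14 00:00:00",
				"to": "2016-04-15 00:00:00"
			}
		}
	},
	"aggregations": {
		"by_category": {
			"terms": {
				"field": "terms.category.term_id",
				"size": 20
			},
			"aggregations": {
				"keywords": {
					"significant_text": {
						"field": "post_content",
						"size": 50
					}
				}
			}
		}
	}
}
```

Whether per-category keyword lists or one de-skewed global list is preferable depends on how the results are consumed.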

Update: I found that the 'diversified_sampler' can help, e.g.:

{
	"size": 50,
	"query": {
		"range": {
			"post_date": {
				"from": "2016-05-01 00:00:00",
				"to": "2016-05-04 00:00:00"
			}
		}
	},

	"aggregations": {
		"my_unbiased_sample": {
			"diversified_sampler": {
				"shard_size": 1000,
				"field": "terms.category.term_id"
			},
			"aggregations": {
				"keywords": {
					"significant_text": {
						"field": "post_title",
						"size": 50,
						"background_filter": {
							"range": {
								"post_date": {
									"from": "2016-01-01 00:00:00",
									"to": "2016-05-30 00:00:00"
								}
							}
						}
					}
				}
			}
		}
	}
}

But I realized something: this picks just one doc per category (which is better than picking every doc in every category), but what I really want is to let each category contribute at most one doc to the bucket for a particular keyword.

So rather than the equivalent of SQL's 'GROUP BY(category)', what I'm really after is something like 'GROUP BY(category, keyword)', if that's possible.

You can pick the maximum number of docs per field value via the sampler's max_docs_per_value setting.
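For example, to cap each category at one doc in the sample (the values shown are illustrative):

```json
"diversified_sampler": {
	"shard_size": 1000,
	"field": "terms.category.term_id",
	"max_docs_per_value": 1
}
```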

Regardless, you may want to look at significant_text's filter_duplicate_text setting, which can avoid skew from copy-pasted sections of text found in multiple docs.

Thanks, both of those pointers (max docs per value and filter_duplicate_text) are helpful, so I'll use them together.

I also found the "exclude" setting, so now I'm using it to exclude date fragments like "18 06 2016" or "18 june" with a regex like:

			"significant_text": {
				"field": "post_content",
				"exclude": ".*(2016|16|06|18).*",
				"filter_duplicate_text": true,

Good to know.
Hopefully you have all your content in one index/shard rather than many, if possible.
Most users have time-based indices, e.g. one per day, and that is no use when trying to find things like trending topics on a given day: the shard with today's content has no history to compare it against, and all the other shards don't know what happened today, so none of them can produce results.