Date_histogram to compare query to global aggregate


(Jason Michael Baumgartner) #1

I have an Elasticsearch index that contains all public Reddit comments (~3.3 billion comments). What I would like to do is a date_histogram for a specific query but also get the total number of comments within each bucket so that I end up with a percentage of comments that contain a query term compared to all comments within that date range.

As an example, my query term is "Donald Trump" and I am using this to create a date_histogram by month:

q['aggs']['created_utc']['date_histogram']['field'] = "created_utc"
q['aggs']['created_utc']['date_histogram']['interval'] = "month"
q['aggs']['created_utc']['date_histogram']['order']['_key'] = "asc"

What I would like is something like a bg_count within each monthly bucket that has the total number of comments for that month bucket along with the count of comments matching "Donald Trump." This is an example of a bucket I get with the above parameters:

            {
                "doc_count": 62617,
                "key": 1388534400000,
                "key_as_string": "1388534400"
            },

I am looking for a way to also have a total_doc_count that shows the total number of comments within that bucket range along with the doc_count that represents the number of comments matching my query. The goal is to get a percentage of comments containing the query term compared to the global for each bucket returned.

I can use "significant_terms" with other aggregations but I'm not sure how to do it with date_histogram. I hope my question makes sense.

This dataset is public and updates in real-time if you are interested in running queries against it. For example, here is the URL to demonstrate the number of Reddit comments containing Trump over each month since Reddit's inception:

https://elastic.pushshift.io/rc/comments/_search?source={"aggs":{"created_utc":{"date_histogram":{"order":{"_key":"asc"},"field":"created_utc","interval":"year"}}},%22size%22:0,%22query%22:{%22bool%22:{%22must%22:[{%22simple_query_string%22:{%22fields%22:[%22body%22],%22default_operator%22:%22and%22,%22query%22:%22Trump%22}}]}},%22sort%22:{%22created_utc%22:%22desc%22}}

(You may have to copy and paste the entire thing as the highlighted portion is not the full URL)

Also, you can use this curl example to do the search from the command line:

curl -XGET 'https://elastic.pushshift.io/rc/comments/_search?pretty' -d '{"aggs":{"created_utc":{"date_histogram":{"order":{"_key":"asc"},"field":"created_utc","interval":"year"}}},"size":0,"query":{"bool":{"must":[{"simple_query_string":{"fields":["body"],"default_operator":"and","query":"Trump"}}]}},"sort":{"created_utc":"desc"}}'

Thank you!


(system) #2

This topic was automatically closed 28 days after the last reply. New replies are no longer allowed.