Article for aggregation


(Amar Srivastava) #1

I am just getting started with Elasticsearch 5 and have a question about structuring data and writing a query.

Let’s say you have a movie subscription service with normal and premium memberships.

Here is a sample of data generated by user activity:

[
    {
        "eventType": "sessionInfo",
        "userType": "premium",
        "sessionGroupID": 1
    },
    {
        "eventType": "mediaPlay",
        "productSKU": "starwars",
        "sessionGroupID": 1,
        "elapsed": 200
    },
    {
        "eventType": "sessionInfo",
        "userType": "premium",
        "sessionGroupID": 2
    },
    {
        "eventType": "mediaPlay",
        "productSKU": "xmen",
        "sessionGroupID": 2,
        "elapsed": 500
    },
    {
        "eventType": "sessionInfo",
        "userType": "normal",
        "sessionGroupID": 3
    },
    {
        "eventType": "mediaPlay",
        "productSKU": "xmen",
        "sessionGroupID": 3,
        "elapsed": 10
    },
    {
        "eventType": "sessionInfo",
        "userType": "normal",
        "sessionGroupID": 4
    },
    {
        "eventType": "mediaPlay",
        "productSKU": "xmen",
        "sessionGroupID": 4,
        "elapsed": 100
    },
    {
        "eventType": "sessionInfo",
        "userType": "normal",
        "sessionGroupID": 5
    },
    {
        "eventType": "mediaPlay",
        "productSKU": "xmen",
        "sessionGroupID": 5,
        "elapsed": 5
    },
    {
        "eventType": "mediaPlay",
        "productSKU": "starwars",
        "sessionGroupID": 5,
        "elapsed": 25
    }
]

Question #1:

Given tens of millions of documents in total, how would you write a query that totals the elapsed viewing time of each movie, grouped by userType?

Desired query results:

premium users - total of "elapsed":
    xmen: 500
    starwars: 200

normal users - total of "elapsed":
    xmen: 115
    starwars: 25

Question #2:

If the data is not structured optimally for such a query, what would be the ideal structure?

  • For example, would it be better to put the “sessionInfo” documents in a separate Elasticsearch “index” or “type” from the user activity logs?

  • Would it be better to nest the “mediaPlay” events inside the sessionInfo documents?

Thanks for any and all guidance and advice!


(Makoto Nozawa) #2

Hi,

I would add the userType field to each mediaPlay document (that is, denormalize the data).
Then I would query like this:

GET INDEX_NAME/_search
{
  "query": {
    "match_all": {}
  },
  "size": 0,
  "aggs": {
    "user_type_bucket": {
      "terms": {
        "field": "userType"
      },
      "aggs": {
        "product_bucket": {
          "terms": {
            "field": "productSKU"
          },
          "aggs": {
            "total_elapsed": {
              "sum": {
                "field": "elapsed"
              }
            }
          }
        }
      }
    }
  }
}
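
To make the denormalization step concrete, here is a minimal Python sketch, assuming the events are available as plain dictionaries before indexing. The helper names denormalize and total_elapsed are hypothetical and are not part of any Elasticsearch client. The second helper only mimics locally what the nested aggregation above computes, so the expected totals can be checked against the sample data from the question.

```python
from collections import defaultdict


def denormalize(events):
    """Copy userType from each sessionInfo event onto the mediaPlay
    events that share its sessionGroupID (hypothetical helper)."""
    user_type_by_session = {
        e["sessionGroupID"]: e["userType"]
        for e in events
        if e["eventType"] == "sessionInfo"
    }
    return [
        dict(e, userType=user_type_by_session[e["sessionGroupID"]])
        for e in events
        if e["eventType"] == "mediaPlay"
    ]


def total_elapsed(play_events):
    """Sum elapsed per userType and productSKU, mirroring the
    terms -> terms -> sum aggregation (local check only)."""
    totals = defaultdict(lambda: defaultdict(int))
    for e in play_events:
        totals[e["userType"]][e["productSKU"]] += e["elapsed"]
    return {user_type: dict(by_sku) for user_type, by_sku in totals.items()}


# Sample data from the question.
events = [
    {"eventType": "sessionInfo", "userType": "premium", "sessionGroupID": 1},
    {"eventType": "mediaPlay", "productSKU": "starwars", "sessionGroupID": 1, "elapsed": 200},
    {"eventType": "sessionInfo", "userType": "premium", "sessionGroupID": 2},
    {"eventType": "mediaPlay", "productSKU": "xmen", "sessionGroupID": 2, "elapsed": 500},
    {"eventType": "sessionInfo", "userType": "normal", "sessionGroupID": 3},
    {"eventType": "mediaPlay", "productSKU": "xmen", "sessionGroupID": 3, "elapsed": 10},
    {"eventType": "sessionInfo", "userType": "normal", "sessionGroupID": 4},
    {"eventType": "mediaPlay", "productSKU": "xmen", "sessionGroupID": 4, "elapsed": 100},
    {"eventType": "sessionInfo", "userType": "normal", "sessionGroupID": 5},
    {"eventType": "mediaPlay", "productSKU": "xmen", "sessionGroupID": 5, "elapsed": 5},
    {"eventType": "mediaPlay", "productSKU": "starwars", "sessionGroupID": 5, "elapsed": 25},
]

plays = denormalize(events)    # mediaPlay documents, each now carrying userType
totals = total_elapsed(plays)  # nested dict: userType -> productSKU -> total elapsed
```

With the denormalized mediaPlay documents indexed, the sessionInfo documents are no longer needed for this particular report, because every document the aggregation touches already carries both userType and productSKU.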

In addition, this page may help you:
https://www.elastic.co/guide/en/elasticsearch/guide/current/relations.html
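
The aggregation query above returns its results as nested buckets, so a small helper can flatten them into the shape asked for in the question. The following Python sketch assumes the standard bucket layout of a terms aggregation with a sum sub-aggregation. The function name flatten_buckets is hypothetical, and sample_response is a hand-written illustration built from the totals in the question, not output captured from a real cluster.

```python
def flatten_buckets(response):
    """Turn the nested terms/terms/sum response into
    {userType: {productSKU: total_elapsed}} (hypothetical helper)."""
    result = {}
    for user_bucket in response["aggregations"]["user_type_bucket"]["buckets"]:
        result[user_bucket["key"]] = {
            product_bucket["key"]: product_bucket["total_elapsed"]["value"]
            for product_bucket in user_bucket["product_bucket"]["buckets"]
        }
    return result


# Hand-written illustration of the response shape (an assumption, trimmed
# to the aggregation part); the sum metric reports its value as a float.
sample_response = {
    "aggregations": {
        "user_type_bucket": {
            "buckets": [
                {
                    "key": "normal",
                    "doc_count": 4,
                    "product_bucket": {
                        "buckets": [
                            {"key": "xmen", "doc_count": 3, "total_elapsed": {"value": 115.0}},
                            {"key": "starwars", "doc_count": 1, "total_elapsed": {"value": 25.0}},
                        ]
                    },
                },
                {
                    "key": "premium",
                    "doc_count": 2,
                    "product_bucket": {
                        "buckets": [
                            {"key": "starwars", "doc_count": 1, "total_elapsed": {"value": 200.0}},
                            {"key": "xmen", "doc_count": 1, "total_elapsed": {"value": 500.0}},
                        ]
                    },
                },
            ]
        }
    }
}

flat = flatten_buckets(sample_response)
```

Two caveats that may apply: a terms aggregation returns only the top 10 buckets unless "size" is set inside it, and in Elasticsearch 5 dynamically mapped strings are indexed as text with a keyword sub-field, so the aggregation may need to target userType.keyword and productSKU.keyword instead of the bare field names.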

