I am just getting started with Elasticsearch 5 and have a question about structuring data and writing a query.
Let’s say you have a movie subscription service with normal and premium memberships.
Here is a sample of data generated by user activity:
[
  {
    "eventType": "sessionInfo",
    "userType": "premium",
    "sessionGroupID": 1
  },
  {
    "eventType": "mediaPlay",
    "productSKU": "starwars",
    "sessionGroupID": 1,
    "elapsed": 200
  },
  {
    "eventType": "sessionInfo",
    "userType": "premium",
    "sessionGroupID": 2
  },
  {
    "eventType": "mediaPlay",
    "productSKU": "xmen",
    "sessionGroupID": 2,
    "elapsed": 500
  },
  {
    "eventType": "sessionInfo",
    "userType": "normal",
    "sessionGroupID": 3
  },
  {
    "eventType": "mediaPlay",
    "productSKU": "xmen",
    "sessionGroupID": 3,
    "elapsed": 10
  },
  {
    "eventType": "sessionInfo",
    "userType": "normal",
    "sessionGroupID": 4
  },
  {
    "eventType": "mediaPlay",
    "productSKU": "xmen",
    "sessionGroupID": 4,
    "elapsed": 100
  },
  {
    "eventType": "sessionInfo",
    "userType": "normal",
    "sessionGroupID": 5
  },
  {
    "eventType": "mediaPlay",
    "productSKU": "xmen",
    "sessionGroupID": 5,
    "elapsed": 5
  },
  {
    "eventType": "mediaPlay",
    "productSKU": "starwars",
    "sessionGroupID": 5,
    "elapsed": 25
  }
]
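Question #1:
Given tens of millions of documents in total, how would you write a query that totals the elapsed viewing time of each movie, grouped by userType? Note that userType appears only on the sessionInfo documents, so each mediaPlay document has to be matched to its user type through sessionGroupID.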
Desired query results:
premium users - total of "elapsed":
  xmen: 500
  starwars: 200
normal users - total of "elapsed":
  xmen: 115
  starwars: 25
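To make the expected output unambiguous, here is a small in-memory Python sketch of the computation I am after. It is not an Elasticsearch query, only a reference for the result: it looks up each session's userType from its sessionInfo event and then sums "elapsed" per userType and productSKU. The function name total_elapsed_by_user_type is just something I made up for illustration.

```python
from collections import defaultdict

# The sample events from above, written compactly.
events = [
    {"eventType": "sessionInfo", "userType": "premium", "sessionGroupID": 1},
    {"eventType": "mediaPlay", "productSKU": "starwars", "sessionGroupID": 1, "elapsed": 200},
    {"eventType": "sessionInfo", "userType": "premium", "sessionGroupID": 2},
    {"eventType": "mediaPlay", "productSKU": "xmen", "sessionGroupID": 2, "elapsed": 500},
    {"eventType": "sessionInfo", "userType": "normal", "sessionGroupID": 3},
    {"eventType": "mediaPlay", "productSKU": "xmen", "sessionGroupID": 3, "elapsed": 10},
    {"eventType": "sessionInfo", "userType": "normal", "sessionGroupID": 4},
    {"eventType": "mediaPlay", "productSKU": "xmen", "sessionGroupID": 4, "elapsed": 100},
    {"eventType": "sessionInfo", "userType": "normal", "sessionGroupID": 5},
    {"eventType": "mediaPlay", "productSKU": "xmen", "sessionGroupID": 5, "elapsed": 5},
    {"eventType": "mediaPlay", "productSKU": "starwars", "sessionGroupID": 5, "elapsed": 25},
]


def total_elapsed_by_user_type(events):
    """Sum 'elapsed' per userType and productSKU (hypothetical helper)."""
    # Map each sessionGroupID to the userType recorded on its sessionInfo event.
    user_type_by_session = {
        e["sessionGroupID"]: e["userType"]
        for e in events
        if e["eventType"] == "sessionInfo"
    }
    totals = defaultdict(lambda: defaultdict(int))
    for e in events:
        if e["eventType"] == "mediaPlay":
            user_type = user_type_by_session[e["sessionGroupID"]]
            totals[user_type][e["productSKU"]] += e["elapsed"]
    # Convert nested defaultdicts to plain dicts for readable output.
    return {user_type: dict(skus) for user_type, skus in totals.items()}


result = total_elapsed_by_user_type(events)
print(result)
```

The session lookup in the first step is exactly the "join" that I do not know how to express efficiently in Elasticsearch with the data in its current shape.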
Question #2:
If the data is not structured optimally for such a query, what would be the ideal structure?
- For example, would it be better to put the “sessionInfo” documents in a separate Elasticsearch “index” or “type” from the user activity logs?
- Would it be better to nest the “mediaPlay” events inside the sessionInfo documents?
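To make the nesting option concrete, this is roughly the document shape I have in mind, sketched in Python as a reshaping of the flat events (using sessions 1 and 5 from the sample). The "mediaPlay" array field and the nest_plays_in_sessions helper are my own assumptions for illustration; I do not know whether this is the right structure or mapping, which is part of the question.

```python
# A subset of the sample events (sessions 1 and 5).
events = [
    {"eventType": "sessionInfo", "userType": "premium", "sessionGroupID": 1},
    {"eventType": "mediaPlay", "productSKU": "starwars", "sessionGroupID": 1, "elapsed": 200},
    {"eventType": "sessionInfo", "userType": "normal", "sessionGroupID": 5},
    {"eventType": "mediaPlay", "productSKU": "xmen", "sessionGroupID": 5, "elapsed": 5},
    {"eventType": "mediaPlay", "productSKU": "starwars", "sessionGroupID": 5, "elapsed": 25},
]


def nest_plays_in_sessions(events):
    """Build one document per session with its plays nested inside (hypothetical shape)."""
    sessions = {}
    # First pass: create one document per sessionInfo event.
    for e in events:
        if e["eventType"] == "sessionInfo":
            sessions[e["sessionGroupID"]] = {
                "sessionGroupID": e["sessionGroupID"],
                "userType": e["userType"],
                "mediaPlay": [],  # assumed field name for the nested events
            }
    # Second pass: attach each mediaPlay event to its session document.
    for e in events:
        if e["eventType"] == "mediaPlay":
            sessions[e["sessionGroupID"]]["mediaPlay"].append(
                {"productSKU": e["productSKU"], "elapsed": e["elapsed"]}
            )
    return list(sessions.values())


nested_docs = nest_plays_in_sessions(events)
for doc in nested_docs:
    print(doc)
```

With this shape, userType and the play events live in the same document, so no lookup by sessionGroupID would be needed at query time, but every new play would mean updating an existing session document.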
Thanks for any and all guidance and advice!