Dear all!
I am trying to perform a cohort analysis with Elasticsearch. For a quick
primer on what cohort analysis actually is, please take a look at this
Wikipedia article (which IMO does not carry much information, but it's
good enough to get an idea): http://en.wikipedia.org/wiki/Cohort_analysis
I have found a solution for tackling the problem; however, I am not
entirely happy with it. Let me describe my data and the DSL query I've
come up with:
Data:
We collect data on our website traffic, which results in about 50k to 100k
unique visits a day.
Cohort analysis:
Find the percentage of users who register on the website within a 24-hour
period and then actually go to our purchase page (i.e. calculate the
percentage of users who do so within the first, second, third, etc. hour
after registration).
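To illustrate with made-up numbers: if 1,000 users register within the window, the cohort result is just the per-hour purchase counts expressed as percentages of that base (all figures below are hypothetical, purely for illustration):

```python
# Hypothetical figures: of 1000 users who registered in the 24-hour
# window, how many first reached /buy in each hour after registration.
registered = 1000
buys_per_hour = [150, 80, 40]  # hours 1, 2, 3 after registration (made up)

percentages = [100.0 * buys / registered for buys in buys_per_hour]
print(percentages)  # [15.0, 8.0, 4.0]
```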
Two very abbreviated sample documents:
- sessionId: our unique identifier for performing counts
- url: the url for evaluating cohorts
- time: unix timestamp for event
{
  "sessionId": "some-random-id",
  "time": 1428238800000,  // unix timestamp: Apr 5th, 3:00 pm
  "url": "/register"
}
{
  "sessionId": "some-random-id",
  "time": 1428241500000,  // unix timestamp: Apr 5th, 3:45 pm
  "url": "/buy"
}
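To make the bucketing concrete: for the two sample documents above, the session's /buy event falls 45 minutes after its /register event, so it lands in the first hour after registration. A minimal sketch (plain Python, timestamps taken from the samples):

```python
# Timestamps from the two sample documents above (milliseconds).
register_ms = 1428238800000  # /register event
buy_ms = 1428241500000       # /buy event, 45 minutes later

# Hour offset after registration in which the purchase happened:
# 0 = within the first hour, 1 = within the second hour, etc.
hour_offset = (buy_ms - register_ms) // 3_600_000
print(hour_offset)  # 0
```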
The query I've come up with does the following:
- Basic query: restrict the data set by time range & the target urls
- Aggregations:
  - perform a terms agg on the unique identifier (sessionId)
  - for each resulting bucket, do:
    - a date_histogram on the "time" field for the requested period (hour)
    - for each resulting bucket, do:
      - perform a filter aggregation for the "start state" (url = /register)
      - perform a filter aggregation for the "target state" (url = /buy)
The result is then parsed, the users accumulated and the percentages
calculated programmatically.
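The accumulation step looks roughly like the sketch below. The `response` dict stands in for the parsed JSON body returned by Elasticsearch; its shape mirrors the aggregation tree in the query, but the concrete values (and the helper logic) are illustrative, not actual output:

```python
# Sketch of the client-side accumulation described above. `response`
# mimics the shape of the query's aggregation result (values made up).
response = {
    "aggregations": {
        "uniques": {
            "buckets": [
                {
                    "key": "some-random-id",
                    "bucket_hour_start_state": {"buckets": [
                        {"key": 1428238800000, "start_state": {"doc_count": 1}},
                    ]},
                    "bucket_hour_target_state": {"buckets": [
                        {"key": 1428238800000, "target_state": {"doc_count": 1}},
                    ]},
                },
            ]
        }
    }
}

HOUR_MS = 3_600_000
cohort = {}      # hour offset after registration -> sessions that bought
registered = 0   # sessions that registered at all

for session in response["aggregations"]["uniques"]["buckets"]:
    # Hour buckets in which this session registered / purchased.
    starts = [b["key"] for b in session["bucket_hour_start_state"]["buckets"]
              if b["start_state"]["doc_count"] > 0]
    targets = [b["key"] for b in session["bucket_hour_target_state"]["buckets"]
               if b["target_state"]["doc_count"] > 0]
    if not starts:
        continue
    registered += 1
    first_start = min(starts)
    later = [t for t in targets if t >= first_start]
    if later:
        offset = (min(later) - first_start) // HOUR_MS
        cohort[offset] = cohort.get(offset, 0) + 1

percentages = {h: 100.0 * n / registered for h, n in cohort.items()}
print(percentages)  # with the sample data above: {0: 100.0}
```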
Here's the query for the above pseudo code:
{
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "time": {
                  "gte": 1428192000000,  // Apr 5th, 0:00 am UTC
                  "lt": 1428278400000    // Apr 6th, 0:00 am UTC
                }
              }
            },
            {
              "terms": {
                "url": [
                  "/register",
                  "/buy"
                ]
              }
            }
          ],
          "must_not": [
            {
              "missing": {
                "field": "sessionId"
              }
            }
          ]
        }
      }
    }
  },
  "size": 0,
  "aggs": {
    "uniques": {
      "terms": {
        "field": "sessionId",
        "size": 10000
      },
      "aggs": {
        "bucket_hour_start_state": {
          "date_histogram": {
            "field": "time",
            "interval": "hour"
          },
          "aggs": {
            "start_state": {
              "filter": {
                "bool": {
                  "must": [
                    { "term": { "url": "/register" } }
                  ]
                }
              }
            }
          }
        },
        "bucket_hour_target_state": {
          "date_histogram": {
            "field": "time",
            "interval": "hour"
          },
          "aggs": {
            "target_state": {
              "filter": {
                "bool": {
                  "must": [
                    { "term": { "url": "/buy" } }
                  ]
                }
              }
            }
          }
        }
      }
    }
  }
}
The problem I have here is the terms aggregation, which, in our scenario,
will never perform well given the number of unique sessions we have to
evaluate (also: note the restriction to 10000 terms, when in reality the
data set is much larger!).
Do you have any suggestions on how to approach the problem domain
differently, i.e. without using "terms"?
For smaller periods of time I could still use terms (e.g. a time frame of
10 minutes, with the bucketing (date_histogram) done per minute).
What, though, if I want to do the same aggregation over a period of, say, 6
months and would like to compute cohorts for returning customers? The
data set would be too immense for "terms".
On a side note: I am also not interested in getting 100% accurate results;
an approximation would be sufficient for trend analysis.
I have the feeling that the problem domain should be split up into several
queries, with the results accumulated and the percentages calculated from
there. However, I would of course (and hence this post) prefer a
one-stop, one-query solution.
Cheerio & thanks in advance for any insights.