Cohort analysis using the query DSL

Dear all!

I am trying to perform a cohort analysis with Elasticsearch. For a quick
primer on what cohort analysis actually is, please take a look at this
wikipedia article (which IMO does not carry a lot of information, but it's
good enough to get an idea): http://en.wikipedia.org/wiki/Cohort_analysis

I have found a solution to tackle the problem; however, I am not entirely
happy with it. Let me describe my data and the DSL query that I've come up
with.

Data:

We collect data on our website traffic, which results in about 50k to 100k
unique visits a day.

Cohort analysis:

Find the percentage of users within a 24-hour period who register at the
website and then actually go to our purchasing page (i.e. calculate the
percentage of users who do this within the first, second, third, etc. hour
after registration).

Two very abbreviated sample documents:

  • sessionId: our unique identifier for performing counts
  • url: the url for evaluating cohorts
  • time: unix timestamp for event

{
  "sessionId": "some-random-id",
  "time": 1428238800000,  // unix ms timestamp: Apr 5th, 3:00 pm
  "url": "/register"
}

{
  "sessionId": "some-random-id",
  "time": 1428241500000,  // unix ms timestamp: Apr 5th, 3:45 pm
  "url": "/buy"
}
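As a quick sanity check on the annotated timestamps (assuming the "time" values are unix epochs in milliseconds, as their magnitude suggests, and that the "3:00 pm" annotation refers to a UTC+2 local time), they can be decoded like this:

```python
from datetime import datetime, timezone

def to_utc(millis):
    """Convert a unix millisecond timestamp to an aware UTC datetime."""
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)

register = to_utc(1428238800000)
buy = to_utc(1428241500000)

print(register.isoformat())  # 2015-04-05T13:00:00+00:00 (i.e. 3:00 pm in UTC+2)
print((buy - register).total_seconds() / 60)  # 45.0 minutes between the events
```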

The query I've come up with does the following:

  • Basic query: restrict data set by time range & the target urls

  • Aggregations:

    • perform a terms agg on the unique identifier (sessionId)
      For each resulting bucket, do:

      • A date_histogram on the "time" field for the requested period (hour)
        For each resulting bucket, do:

        • perform a filter aggregation for the "start state" (url = /register)
        • perform a filter aggregation for the "target state" (url = /buy)

The result is then parsed, the users accumulated and the percentages
calculated programmatically.

Here's the query for the above pseudo code:

{
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            {
              "range": {
                "time": {
                  "gte": 1428192000000,  // Apr 5th, 0:00 am UTC
                  "lt": 1428278400000    // Apr 6th, 0:00 am UTC
                }
              }
            },
            {
              "terms": {
                "url": ["/register", "/buy"]
              }
            }
          ],
          "must_not": [
            { "missing": { "field": "sessionId" } }
          ]
        }
      }
    }
  },
  "size": 0,
  "aggs": {
    "uniques": {
      "terms": {
        "field": "sessionId",
        "size": 10000
      },
      "aggs": {
        "bucket_hour_start_state": {
          "date_histogram": {
            "field": "time",
            "interval": "hour"
          },
          "aggs": {
            "start_state": {
              "filter": { "term": { "url": "/register" } }
            }
          }
        },
        "bucket_hour_target_state": {
          "date_histogram": {
            "field": "time",
            "interval": "hour"
          },
          "aggs": {
            "target_state": {
              "filter": { "term": { "url": "/buy" } }
            }
          }
        }
      }
    }
  }
}
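For completeness, the programmatic post-processing step might look roughly like the sketch below. It runs against a hand-built response fragment whose bucket shape mirrors the aggs above; the exact response layout (and the three sample sessions) are assumptions for illustration:

```python
from collections import Counter

HOUR = 3600 * 1000  # one histogram bucket, in milliseconds

# Hand-built fragment mimicking the assumed aggregation response shape:
# session-a buys in hour 0, session-b in hour 2, session-c never buys.
resp = {"aggregations": {"uniques": {"buckets": [
    {"key": "session-a",
     "bucket_hour_start_state": {"buckets": [
         {"key": 1428238800000, "start_state": {"doc_count": 1}}]},
     "bucket_hour_target_state": {"buckets": [
         {"key": 1428238800000, "target_state": {"doc_count": 1}}]}},
    {"key": "session-b",
     "bucket_hour_start_state": {"buckets": [
         {"key": 1428238800000, "start_state": {"doc_count": 1}}]},
     "bucket_hour_target_state": {"buckets": [
         {"key": 1428238800000 + 2 * HOUR, "target_state": {"doc_count": 1}}]}},
    {"key": "session-c",
     "bucket_hour_start_state": {"buckets": [
         {"key": 1428238800000, "start_state": {"doc_count": 1}}]},
     "bucket_hour_target_state": {"buckets": []}},
]}}}

def cohort_percentages(resp):
    """Per hour offset after registration: % of registered sessions that bought."""
    registered = 0
    offsets = Counter()
    for session in resp["aggregations"]["uniques"]["buckets"]:
        starts = [b["key"] for b in session["bucket_hour_start_state"]["buckets"]
                  if b["start_state"]["doc_count"] > 0]
        if not starts:
            continue  # session never registered in this window
        registered += 1
        reg_hour = min(starts)
        for b in session["bucket_hour_target_state"]["buckets"]:
            if b["target_state"]["doc_count"] > 0:
                offsets[(b["key"] - reg_hour) // HOUR] += 1
                break  # count the first purchase only
    return {off: 100.0 * n / registered for off, n in sorted(offsets.items())}

print(cohort_percentages(resp))  # one third of sessions buy in hour 0, one third in hour 2
```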

The problem I have here is the terms aggregation, which, in our scenario,
will never perform well given the number of unique sessions we have to
evaluate (also: note the restriction to 10,000 terms, when in reality the
data set is much larger!).

Do you have any suggestions on how to approach the problem domain
differently, i.e. without using "terms"?

For smaller periods of time I could still use terms (e.g. a time frame of
10 minutes, with date_histogram bucketing per minute).
But what if I want to do the same aggregation over a period of, say, 6
months and perform cohort analysis for returning customers? The data set
would be too immense for "terms".

On a side note: I don't need 100% accurate results; an approximation would
be sufficient for trend analysis.

I have the feeling that the problem should be split up into several
queries, with the results accumulated and the percentages calculated from
there. However, I would of course (hence this post) prefer a
one-stop-one-query solution :wink:

Cheerio & thanks in advance for any insights.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/1c050f24-f08f-4038-b10d-cdef7923ae22%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

I forgot to mention:

  • all document fields are mapped as "not_analyzed" (hence the filtered
    query with term filters, which match exact values)
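For reference, the mapping this implies might look like the sketch below (an assumption based on the note above, expressed as a Python dict as one would pass to the client; on 1.x, "not_analyzed" keeps url and sessionId as single exact terms, so term/terms filters match whole values):

```python
# Assumed 1.x-style mapping for the event documents (illustrative only).
mapping = {
    "event": {
        "properties": {
            "sessionId": {"type": "string", "index": "not_analyzed"},
            "url":       {"type": "string", "index": "not_analyzed"},
            "time":      {"type": "date"},  # date fields accept epoch millis
        }
    }
}
```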


Check out the talk I gave at Elasticon on "entity-centric indexing":

https://www.elastic.co/elasticon/2015/sf/building-entity-centric-indexes

The video is yet to be released but the slides are there.
Web session analysis is one example use case.
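The entity-centric idea boils down to folding the raw click events into one document per session ahead of time, so the cohort question becomes a plain histogram over a precomputed offset field. A minimal offline sketch of that fold (the field names registered_at, bought_at, and hours_to_buy are made up for illustration):

```python
from collections import defaultdict

HOUR = 3600 * 1000  # milliseconds per hour

def build_session_docs(events):
    """Fold raw click events into one entity-centric doc per session."""
    by_session = defaultdict(list)
    for e in events:
        by_session[e["sessionId"]].append(e)
    docs = []
    for sid, evs in by_session.items():
        reg = min((e["time"] for e in evs if e["url"] == "/register"), default=None)
        buy = min((e["time"] for e in evs if e["url"] == "/buy"), default=None)
        doc = {"sessionId": sid, "registered_at": reg, "bought_at": buy}
        # Precompute the hour offset so a plain histogram answers the cohort question.
        doc["hours_to_buy"] = ((buy - reg) // HOUR
                               if reg is not None and buy is not None else None)
        docs.append(doc)
    return docs

events = [
    {"sessionId": "a", "time": 1428238800000, "url": "/register"},
    {"sessionId": "a", "time": 1428241500000, "url": "/buy"},  # 45 min later
    {"sessionId": "b", "time": 1428238800000, "url": "/register"},
]
print(build_session_docs(events))
```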

Cheers
Mark


Thanks for this, Mark - I only got around to going through the slides today.

I will still have to find a solution with a more general approach that can
be applied to any of our indexed fields (we log data from several systems,
each with a different document layout, and more systems will be joining in
the near future).
I'll formulate a new approach using several queries and then bung the data
together programmatically. I'll fill you in on the solution when I get
around to it.

Cheers,

  • Chris
