Counting records missing from one month to another

Hello there!

I have monthly data in which I have multiple records, each with 3 values:

  • date
  • email
  • amount

I'd like to be able to count all emails that simply disappear from one month to the next (e.g. a customer leaving).

I'm already able to compute the delta each month via differences(unique_count(email)). However, in order to compute the churn rate, I need to count ONLY the emails missing from one month to the next.

And despite spending some time to find a solution, I couldn't figure it out. Any pointer would be very welcome!

edit: if there's something like counting new records, I could use it to find the missing records.
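For what it's worth, outside Elasticsearch what I'm after is just a per-month set difference; a tiny Python sketch of the idea (the emails and counts here are made up for illustration):

```python
# Conceptual sketch (plain Python, not Elasticsearch): churned emails are
# those present in one month but missing from the next.
prev_month = {"a@x.com", "b@x.com", "c@x.com"}
this_month = {"a@x.com", "c@x.com", "d@x.com"}

churned = prev_month - this_month        # emails that disappeared
new_emails = this_month - prev_month     # emails that appeared
churn_rate = len(churned) / len(prev_month)

print(sorted(churned))       # ['b@x.com']
print(sorted(new_emails))    # ['d@x.com']
print(round(churn_rate, 2))  # 0.33
```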

Hi @olivierl Welcome to the community!

Yes, figuring out the exact emails that churn or don't show up is a bit harder... BUT there is a pretty cool way to do it that might provide even more insight.

I suggest you look at the "Latest Transform" and here is a blog on it... and the docs

In essence the Latest Transform will create an index that tracks the latest date each email has been seen...

The resulting index will contain the emails with the latest dates they were seen / used, and old emails will "age". Then let the queries begin!
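As a sketch (the index names here are placeholders, not specific to your setup), a latest transform looks like:

```json
PUT _transform/latest-email-seen
{
  "source": { "index": "my-transactions-index" },
  "latest": {
    "unique_key": [ "email" ],
    "sort": "date"
  },
  "dest": { "index": "latest-email-seen" }
}
```

After creating it, you start it with POST _transform/latest-email-seen/_start.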

Then for instance if you want to see all the emails that did not show up in the last 30 days, you can just run that query... (all emails with date <= now - 30 days)

If you want to see what emails were last seen between now - 60 days and now - 30 days (i.e. what emails from last month did not show up this month), you can run that.
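For example, assuming the destination index is called latest-email-seen (a placeholder name), that "last seen between 60 and 30 days ago" query could be sketched as:

```json
GET latest-email-seen/_search
{
  "query": {
    "range": {
      "date": {
        "gte": "now-60d/d",
        "lt": "now-30d/d"
      }
    }
  }
}
```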

The latest transform will run continuously keeping that index up to date...

Give it a try...


Hello @stephenb and thank you for your detailed answer :slight_smile:

However, it seems that Transform doesn't like a date in epoch_millis, because as soon as I use it as "Sort field", it says "Invalid date" in the Transform Preview :thinking:

Also bonus question: how can I use this transform to track the churn rate in a historical fashion?

E.g. churn rate of January, then February, and so on? Because it will be "easy" to do it with a request for this month, but what about the evolution in the past?

Please provide samples of your data in JSON plus your mappings... you did create a mapping, right?

It looks like perhaps you did not set a date type for your date data.

How many total records do you have?

Let's get the transform working, then worry about the queries... The queries will depend on how much historical data you have; the transform will go back to the beginning if you have the data.

I just did the transform on the sample data and it worked fine ... we need to look at your data and mappings.

So here is the mapping:

{
  "mappings": {
    "properties": {
      "amount EUR": {
        "type": "float"
      },
      "amount USD": {
        "type": "float"
      },
      "date": {
        "type": "date",
        "format": "epoch_millis"
      },
      "email": {
        "type": "keyword"
      }
    }
  }
}

In total, there are 38,000 records in the DB since 2015 (roughly).
Here is a JSON record:

{
  "_index": "rr",
  "_id": "xxxID",
  "_version": 3,
  "_score": 1,
  "_source": {
    "date": "1658721600000",
    "email": "john.doe@example.com",
    "amount USD": "200",
    "amount EUR": "170.1722"
  },
  "fields": {
    "date": [
      "2022-07-25T04:00:00.000Z"
    ],
    "amount EUR": [
      170.1722
    ],
    "amount USD": [
      200
    ],
    "email": [
      "john.doe@example.com"
    ]
  }
}

Hi @olivierl

That appears to be just a bug in the table preview... I proceeded with the transform and it worked fine. Make sure you set the same mapping for the transform results index.

Go ahead and run it and you should see good results in Discover...

What you will get is an index with the latest date each email showed up...

Then you can start to do the queries...

And if you think about it ... say the data looks like this... (of course this will be down to the date and time)

April : 24 emails
May : 70 emails
June : 50 emails
July : 1300 emails

That means

24 emails in April never showed up again... so they churned between April and May.
70 Emails Churned between May and June
50 Emails Churned between June and July
and there are 1300 Active in July...

I think that is what you are asking....

You should be able to build a lens table with this data...

That works indeed :slight_smile: Now I have the count of unique records last seen in a month over time.

Now, I need:

  1. to count the customers lost over the next period (the next month), because the churn actually happens in the next period.
  2. and then, divide that number by the number of unique emails on the original index (not the transformed index).

What's the best way to achieve this? For 1., should I transform the latest date with a +30 days shift?

For 2, I suppose I can call a function on another index, right?

Thanks again for your advice, it's really interesting. I didn't know about transform/latest, it's very powerful :slight_smile:

edit: using "Time Shift" in Lens for the vertical axis seems to do the trick! Now to get the % amount if I can divide this count by total unique emails on the original index.

Formula

count(shift='1M')

Those should be the same number ... unique in original and unique in transformed are the same...

Although the trick is that you want to divide by the total emails, I think... need to think about that...

  1. Yeah thanks, I found the shift in the UI, very useful indeed :slight_smile: This point is solved!
  2. Hmm, I'm not sure I understand. In this new index, I only have the latest "seen" email and not all the rest. Maybe I was confused and I should have used the transform on the very same index (the original?). Because in the new one, I have only the churned emails, not all the rest (and the churn rate is "lost customers/total customers" in the period).

Ohh, I assumed you ran the transform on the original index... I thought you needed to figure out the churned emails, when they were last seen, etc... that was my whole approach, apologies... sorry, I completely misunderstood.

So now I am not really sure what data you have and don't have...

No, you cannot combine two different indices in a single calculation...

I'm fine to do that on the same index :slight_smile: I'm just a noob; I was afraid to break my existing index by applying a transform to it :smiley:

So if I do that on the same index, will it create a new field like "last seen" with the date inside? In that case, I can indeed compute it :slight_smile:

Hmm, if I try to use the same index as the destination for the transform, I get an error on creation:

{"id":"churn","error":{"name":"ResponseError","meta":{"body":{"error":{"root_cause":[{"type":"validation_exception","reason":"Validation Failed: 1: Destination index [rr] is included in source expression [rr*];"}],"type":"validation_exception","reason":"Validation Failed: 1: Destination index [rr] is included in source expression [rr*];"},"status":400},"statusCode":400,"headers":{"x-opaque-id":"2ca68399-7886-4ca4-a6bb-631880d31a66;kibana::management:","x-elastic-product":"Elasticsearch","content-type":"application/json;charset=utf-8","content-length":"293"},"meta":{"context":null,"request":{"params":{"method":"PUT","path":"/_transform/churn","body":"{\"source\":{\"index\":[\"rr*\"]},\"latest\":{\"unique_key\":[\"email\"],\"sort\":\"date\"},\"dest\":{\"index\":\"rr\"}}"

I'm not sure I understand what it means :thinking:

Apologies I am confused.

I do NOT recommend writing the transform back into the source index...

You will need to show me exact index names etc... I am lost... and need to get back to some other items... if you follow the blog, it sends the transform data to a new index.

If your source index my-source-ecommerce-index has 100K records with 13K unique emails in it, and you run a latest transform into latest-transform-my-ecommerce-index, the destination will have 13K rows, each with 1 unique email and the last time it was seen...

Therefore your source and destination transform index have the same count of unique emails.. that was the only point I was making above...
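Based on the transform body in your error message, a sketch with a destination index that is NOT matched by the source pattern would look like this (the destination name lastrr is just an example):

```json
PUT _transform/churn
{
  "source": { "index": [ "rr*" ] },
  "latest": {
    "unique_key": [ "email" ],
    "sort": "date"
  },
  "dest": { "index": "lastrr" }
}
```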

So I have the rr index, with the mapping:

{
  "mappings": {
    "properties": {
      "amount EUR": {
        "type": "float"
      },
      "amount USD": {
        "type": "float"
      },
      "date": {
        "type": "date",
        "format": "epoch_millis"
      },
      "email": {
        "type": "keyword"
      }
    }
  }
}

You have 38k records with various dates, "grouped" in months. One record looks like:

{
  "_index": "rr",
  "_id": "xxxID",
  "_version": 3,
  "_score": 1,
  "_source": {
    "date": "1658721600000",
    "email": "john.doe@example.com",
    "amount USD": "200",
    "amount EUR": "170.1722"
  },
  "fields": {
    "date": [
      "2022-07-25T04:00:00.000Z"
    ],
    "amount EUR": [
      170.1722
    ],
    "amount USD": [
      200
    ],
    "email": [
      "john.doe@example.com"
    ]
  }
}

To give you a simple example, you can have multiple records with the same email during the same month (multiple products), and the number of active customers is simple to compute (unique_count('email')). Monthly recurring revenue is also simple to compute (sum('amount USD')).

I can also compute how many new customers are added/removed each month with the help of differences.

But to get the churn, I needed to create the transform you helped me with: I created a new index with the same mapping as rr (called lastrr), containing ONLY the last record for each user from the original index.

This works perfectly, and I can compute how many customers I'm losing every month, and even the amount.

The last part is computing the churn, which is "lost customers" (from the lastrr index) divided by unique_count('email') from the rr index, since lastrr doesn't have all the email records from rr.

I hope that's clearer; sorry if it's not. The fact that I'm not an ES expert might make me use the wrong vocabulary to describe what I'm trying to achieve.

Thank you very much for your assistance!

Ahhhh, I think I see: say you churned 77 emails in April out of 11K total emails that were available in April; you want to compute 77/11K as a percent...

I will need to think about that I do not have a simple answer for that..

but that is the overall sum.. and you would only want the emails that were available during that month.

count(shift='1M') / overall_sum(count())

Elasticsearch / Kibana is not Tableau yet when it comes to all the analysis... :slight_smile:

That's correct: for example, I lost 5 customers in May (from the lastrr index: last seen in April, so with the time shift the real month of churn is May). However, I can see I have 683 unique emails in my original index (called rr) in this same month of May.

So my churn rate for May is 5/683 x 100 = 0.73%

However, I don't have the 683 unique count of emails in my lastrr index, since in the end it only contains the emails that churned.

I don't know how to do that computation :thinking:

Let me give you a quick overview:

  • in my transformed/latest index (lastrr) I have between 5 and 20 records per month. It's indeed only the churned emails/customers, so I have no total of the still-existing customers in there!
  • in my original index (rr) I have all my records, needed to compute the churn. But I don't have the "leavers" there, since they are transformed into the lastrr index.

Understood... now

I think you could probably run a Pivot Transform (for the totals) and a Latest Transform (for the churn), and write them into the same index; then you would have both sets of data... but I would need to think about that.
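For the totals side, a pivot transform could be sketched roughly like this (field names come from your mapping; the transform and destination names are placeholders, and I have not tested this against your data):

```json
PUT _transform/monthly-email-totals
{
  "source": { "index": "rr" },
  "pivot": {
    "group_by": {
      "month": {
        "date_histogram": { "field": "date", "calendar_interval": "1M" }
      }
    },
    "aggregations": {
      "unique_emails": { "cardinality": { "field": "email" } }
    }
  },
  "dest": { "index": "monthly-email-totals" }
}
```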

@ghudgins Any Idea?

Yeah that was my intuition with my very little knowledge on pivot/latest transforms:

  1. create a mapping containing the date for the "latest" transform (so the same mapping as the original index)
  2. also add a field for the pivot, maybe containing the number of unique emails (but I don't know how to create that data, since the relevant number I need is per month). Pivot is new to me, so I'm not sure I get it.

When I have those 2 things in my new index, I could then compute the churned emails and divide it by the right number computed in 2.

One thing you WILL need to do is set the mapping of the transform results index to have the correct type on the date... this you're just gonna need to trust me on :slight_smile:

PUT discuss-latest-data-transform-result
{
  "mappings": {
    "properties": {
      ...
      "date": {
        "type": "date",
        "format": "strict_date_optional_time||epoch_millis"  <--- THIS
      }
      ...
    }
  }
}