Graphing net churn using Redshift, DBT and Mode Analytics

  1. Charting upgrades, downgrades, new revenue and churn

    • What you’ll need to get started
  2. Synchronising your data to Redshift: the SQL
    • Producing a nice clean list of all invoices
    • Querying for new revenue per month
    • Querying for upgrades per month
  3. Bringing it all together
  4. Why use SQL at all?

Introduction

When it comes to basic data analytics, there are three key things that have changed over the last few years:

  1. It is easier than ever to set up a data warehouse with virtually no data limits, such as Amazon Redshift or Google BigQuery.
  2. It is easier than ever to synchronize data from SaaS products into such warehouses, with tools like StitchData and Segment.
  3. There are several easy-to-use, effective tools that make it trivial to query and chart the plethora of data in your warehouse in a sophisticated way, such as Mode Analytics and Looker.

At Vero, we’re big proponents of these advancements and how these tools work together. In most scenarios, the “data pipeline” outlined above is superior to tools like Mixpanel and Amplitude as it enables greater data accuracy and completeness.

In this guide, I’m going to introduce you to storing and analyzing data directly in Amazon Redshift using DBT and Mode Analytics. To get started, we’re going to chart the basic SaaS revenue movements: new, upgraded, downgraded and churned MRR.

I’ve included the actual SQL we use at Vero, and I’ve broken down how it works so you can use it, get inspired by it or build upon it.

The full repository of SQL can be found here.

You can apply the same approach I use here to chart virtually anything with this setup; it’s very powerful.

Charting upgrades, downgrades, new revenue and churn

Here’s an example of the chart we’ll be building (note that “news” means new MRR in this chart):

What you’ll need to get started

Here’s the setup you’ll need to get accurate and complete data and to chart it beautifully:

  • An AWS Redshift database to store all of your data.
  • StitchData to synchronise data from Stripe (or your invoicing platform) to Redshift.
  • DBT to turn the unorganised, raw data that is synchronised from Stripe into Redshift into clean database “views” you can query to create pretty charts.
  • Sinter to run DBT queries daily (or more often) so that your charts are up to date.
  • Mode Analytics to query the clean database views generated by DBT and Sinter to create beautiful charts.

All up, these components cost around $480 USD per month. They’re all scalable to virtually any limit, so you can rest assured they’ll scale with you as you grow.

Synchronising your data to Redshift
The SQL

To help you, we’ve pulled together a GitHub repository with all of the SQL needed to create these charts. As you become more familiar with DBT, you’ll observe that this repository is structured in line with DBT’s best practices and is a module that you should be able to import into another DBT project.

The SQL here ultimately outputs a table called revenue_movements. This table contains one row for each month and one column for each of the metrics new_revenue, churned_revenue, upgrade_revenue and downgrade_revenue.

The resulting table can easily be queried using a tool like Mode.
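As a preview of the end state, the final chart in Mode can be driven by a plain select over that view. This is a sketch only; the analytics schema name is an assumption, so use whichever schema your DBT profile writes to:

```sql
-- Sketch of the Mode query; "analytics" is an assumed schema name
select
  month,
  new_revenue,
  churned_revenue,
  upgrade_revenue,
  downgrade_revenue
from
  analytics.revenue_movements
order by
  month asc
```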

As mentioned above, I recommend you download the full SQL and follow along in a text editor like Sublime.

Below I’ll step through each of the SQL queries used to build up the revenue_movements view.

Producing a nice clean list of all invoices

All of the SQL we’ll be writing here is formatted in line with the structure outlined by DBT. Every SQL query is a SELECT and results in a table or view. This table or view can then be used by other DBT queries. This “stacking” helps create clean and reusable SQL. You can read more about DBT’s viewpoint / manifest here.
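To illustrate the stacking, here’s a minimal, hypothetical pair of models (the file names, tables and columns are made up for this example): each file is a SELECT, and downstream files reference upstream models with ref() rather than hard-coded table names.

```sql
-- models/invoice_totals.sql: compiles to a view named invoice_totals
select customer, total from stripe.invoices

-- models/monthly_totals.sql: builds on the model above via ref()
select customer, sum(total) as total
from {{ ref('invoice_totals') }}
group by customer
```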

Onto the SQL! The first step in the process of creating our chart is to generate a view called stripe_invoices that has a row for every single customer invoice.

When using DBT, I like to follow a standard structure for my SQL code inside the models folder:

  • I have a folder named after the source of the data, e.g. stripe.
  • Within that folder I always have a base folder. The SQL files inside the base folder are responsible for querying the underlying raw dataset and returning only the columns needed by our queries later on.
    • I will also do exclusions in these SQL files if I have any.
    • Files in this folder are always named using the format PROVIDER_MODEL.sql, for example stripe_invoices.sql.
  • I also have a transform folder. The SQL in this folder is responsible for taking the base output, e.g. stripe_invoices, and converting values into a more usable format, where useful.
    • These files are always named with the _xf suffix, as in stripe_invoices_xf.sql.
    • Critically, the base SQL should never be queried other than by its corresponding transform query. The point is that all queries from here on use the transformed models, and never directly call the base models. This gives everything a reliable structure with a lot of flexibility.
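Laid out on disk, the convention above looks something like this (grouping the _xf files under a transform folder is how I happen to organise things; only the base folder is a hard rule):

```
models/
└── stripe/
    ├── base/
    │   └── stripe_invoices.sql       -- raw columns + exclusions only
    └── transform/
        └── stripe_invoices_xf.sql    -- joins, derived columns, renames
```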

Here’s the SQL from the stripe_invoices base table:

select
  id                      as id,
  amount_due              as amount_due,
  customer                as customer,
  date                    as date,
  period_end              as period_end,
  forgiven                as forgiven,
  paid                    as paid,
  subscription            as subscription,
  total                   as total,
  starting_balance        as starting_balance
from
  {{ var('invoices_table') }}
where
  -- Remove customer IDs of users who will mess up MRR and other figures (e.g test accounts)
  customer not in ('cus_3Ma5IxIINayMgK')

As you can see, it’s pretty basic. The next step is to add in a few transformations.

In this particular example, I want to include customers’ email addresses, but this data is not available in the raw Stripe invoices table, so I need to join another table to get it. Similarly, I want to include an index of the invoice number for each customer (i.e. first invoice, last invoice, etc.) so I’ll add a query for this.

Here’s the transform SQL:

with 

stripe_invoices_with_line_items as (
  select 
    * 
  from 
    {{ ref('stripe_invoices_with_line_items_xf_aggregated') }}
), 

usage_based_recipients as (
  select
    *
  from
    {{ ref('usage_based_recipients') }}
)

select 
  stripe_invoices.id                              as id,
  stripe_invoices.customer                        as customer,
  stripe_customers_xf.email                       as customer_email,
  stripe_invoices.date                            as date,
  stripe_invoices.period_end                      as period_end,
  stripe_invoices.forgiven                        as forgiven,
  stripe_invoices.subscription                    as subscription,
  stripe_invoices.paid                            as paid,
  stripe_invoices_with_line_items.total           as total,
  row_number() over(
    partition by stripe_invoices.customer 
    order by stripe_invoices.date desc
  )                                               as last_payment,
  row_number() over(
    partition by stripe_invoices.customer 
    order by stripe_invoices.date asc
  )                                               as first_payment
from 
  {{ ref('stripe_invoices') }} stripe_invoices
left outer join
  stripe_invoices_with_line_items
on
  stripe_invoices.id = stripe_invoices_with_line_items.id
left outer join
  {{ ref('stripe_customers_xf') }} as stripe_customers_xf
on
  stripe_customers_xf.id = stripe_invoices.customer

As you can see, I’m leveraging another transformed dataset, the stripe_customers_xf table. The stripe_invoices_xf model resulting from the above query has all of the data I might want later.

Querying for new revenue per month

Now that we’ve done the hard work of creating a clean table of all of our customer invoices, it is relatively easy to query for revenue movements.

Let’s start with “new revenue per month”, as per the file mrr_new.sql:

select
  date_trunc('month', date) as month,
  sum(total)                as revenue
from
  {{ref('invoices_by_customer')}}
where
  -- Find customers who had their first invoice this month
  asc_row = 1 
group by
  date_trunc('month', date)

This SQL finds each customer’s first invoice, then groups and sums those invoices by month (e.g. January, February, etc.). You’ll note that the SQL leverages another model, invoices_by_customer (defined in invoices_by_customer.sql):

with 

invoices_by_customer as (
  select
    customer_id,
    date_trunc('month',date) as month,
    sum(total) as total
  from
    {{ ref('invoices') }}
  group by
    customer_id,
    date_trunc('month',date)
)

select 
  *,
  row_number() over (partition by customer_id order by month asc)   as asc_row,
  row_number() over (partition by customer_id order by month desc)  as desc_row
from 
  invoices_by_customer

This SQL groups the invoices by customer and month. This interim step is key as it is possible in Stripe that customers have two invoices in a single month. This aggregation ensures that for each month there is just one total “invoiced” count per customer. The asc_row thus accurately represents an index for each customer’s invoices starting in the first month they paid and counting upwards.
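To make the asc_row behaviour concrete, here’s a self-contained version of the same logic over three made-up invoices (the customer ID, dates and amounts are illustrative only): the two January invoices collapse into one row, so asc_row indexes the months a customer paid, not individual invoices.

```sql
with invoices as (
  select 'cus_A' as customer_id, date '2018-01-05' as date, 50 as total
  union all
  select 'cus_A', date '2018-01-20', 25
  union all
  select 'cus_A', date '2018-02-01', 75
)
select
  customer_id,
  date_trunc('month', date)  as month,
  sum(total)                 as total,
  row_number() over (
    partition by customer_id
    order by date_trunc('month', date) asc
  )                          as asc_row
from invoices
group by customer_id, date_trunc('month', date)
-- Returns two rows: (cus_A, 2018-01, 75, asc_row 1)
--                   (cus_A, 2018-02, 75, asc_row 2)
```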

Querying for upgrades per month

Querying for upgrades requires comparing two months and determining the difference between those two months. The SQL is located in mrr_upgrades.sql:

with

upgrades as (
  select 
    invoices.month                                                as month,
    invoices.asc_row                                              as asc_row,
    invoices.total                                                as current_month_total,
    coalesce(invoices_offset_one_month.total,0)                   as last_month_total,
    invoices.total - coalesce(invoices_offset_one_month.total,0)  as change
  from 
    {{ref('invoices_by_customer')}} invoices
  left outer join 
    {{ref('invoices_by_customer')}} invoices_offset_one_month
  on 
    add_months(invoices.month, -1) = invoices_offset_one_month.month
  and 
    invoices.customer_id = invoices_offset_one_month.customer_id
)

select
  month         as month,
  sum(change)   as revenue
from
  upgrades
where
  -- Only include customers whose current month total is more than last month (upgrades!)
  abs(upgrades.current_month_total) > 
  upgrades.last_month_total
and 
  -- Exclude customers who didn't move plans this month
  upgrades.change <> 0 and upgrades.change is not null
and
  -- Exclude customer's first invoices, as these are actually "new" customers, not upgrades
  upgrades.asc_row <> 1
group by
  upgrades.month

This query leverages the same invoices_by_customer table twice: once to query the total for “this month” and once to query the total for “last month”. It then calculates the difference between the two totals for each customer.

We also need to ensure we exclude:

  • Customers that have their first invoice this month, as those are already counted in the earlier mrr_new query. They represent customers who are new and, as they didn’t have an invoice prior to their starting, there will always be a change recorded in their first month.
  • Customers who have invoices with a total of 0 this month, as these are customers who have churned – we’ll query them separately.
  • Customers who have a negative change, as these are customers who have downgraded since last month.

We’re left with a clean table containing the upgrades per month across our invoice history.

Bringing it all together

The files mrr_churned and mrr_downgrades provide the SQL to chart churned and downgraded revenue. They are similar to the queries above.
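For reference, a churn query in the same spirit might look like the sketch below. This is my reconstruction, not the repository’s exact file: the idea is that a customer’s revenue is counted as churned in the month after their final invoice.

```sql
-- Sketch only: the repository's churn model may differ
select
  add_months(month, 1)  as month,
  sum(total)            as revenue
from
  {{ ref('invoices_by_customer') }}
where
  -- desc_row = 1 is each customer's most recent invoiced month
  desc_row = 1
  -- Don't count customers whose latest invoice is simply the most
  -- recent billing cycle; they haven't churned yet
  and month < date_trunc('month', add_months(current_date, -1))
group by
  add_months(month, 1)
```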

With the key metrics of new revenue, churned revenue, upgrades and downgrades accounted for, we can now bring these figures together. DBT separates models (like those above) from analysis SQL files. I always put the final SQL that I will copy and paste into Mode in the analysis folder, so that folder contains each SQL query used to build our dashboards.

The file revenue/mrr_movements.sql joins the mrr_new, mrr_upgrades, mrr_downgrades and mrr_churned tables discussed above:

select
  new_revenue.month,
  new_revenue.revenue         as new_revenue,
  churned_revenue.revenue     as churned_revenue,
  upgrade_revenue.revenue     as upgrade_revenue,
  downgrade_revenue.revenue   as downgrade_revenue
from
  {{ref('mrr_new')}} new_revenue
left outer join
  {{ref('mrr_churned')}} churned_revenue
on
  new_revenue.month = churned_revenue.month
left outer join
  {{ref('mrr_upgrades')}} upgrade_revenue
on
  new_revenue.month = upgrade_revenue.month
left outer join
  {{ref('mrr_downgrades')}} downgrade_revenue
on
  new_revenue.month = downgrade_revenue.month
where
  -- This works in Redshift, not in Postgres
  new_revenue.month < date_trunc('month',dateadd(months,-1,current_date))
order by
  new_revenue.month desc

There are two important things about this query:

  • Using left outer join ensures that if any of the joined tables are missing a value for a certain month, those months are included but with a null value.
  • We exclude the current month, as it is always incomplete and shows generally alarming figures until the month ends.
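If you’d rather chart zeros than gaps for those null months, you can wrap each joined metric in coalesce. This is a cosmetic tweak to the select list of mrr_movements.sql, not part of the original query:

```sql
select
  new_revenue.month,
  coalesce(new_revenue.revenue, 0)        as new_revenue,
  coalesce(churned_revenue.revenue, 0)    as churned_revenue,
  coalesce(upgrade_revenue.revenue, 0)    as upgrade_revenue,
  coalesce(downgrade_revenue.revenue, 0)  as downgrade_revenue
-- from, joins, where and order by exactly as in mrr_movements.sql above
```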

Why use SQL at all?

The thing I find most powerful about charting in this way is that the result is idempotent. This is a fancy engineering word that means “you can run it hundreds of times and get the same output”.

Thanks to Stitch, Redshift and DBT you can reload the raw data, re-run the queries and get the exact same output every time. One of the challenges I’ve always found with Amplitude, Mixpanel and others is that sending data via API makes it easy to load raw data twice or to have poor visibility on which data is present. I feel this approach solves that.

On top of that, the beauty of using SQL for this sort of charting is that you can:

  • Fully customise your charts.
  • Query across datasets and event databases, allowing you to do really sophisticated analyses.
  • Check your SQL code into GitHub or similar so you have a versioned history of every query and dashboard.

We’ve found running SQL queries allows us to produce consolidated reports which actually get looked at and actioned. It’s forced us to be selective about what we query and this may be one of the most valuable aspects of the whole thing!

Note: this script is slightly different from the one we use in production and has been pared down for this article, so if you see improvements or breaking code (though, it has been tested with our setup), please let me know and we’ll update it.

The post Graphing net churn using Redshift, DBT and Mode Analytics appeared first on Vero.
