RFM ANALYSIS
MySQL, Python, Jupyter
RFM ANALYSIS - MySQL & PYTHON¶
The objective of this analysis is to segment online customers using three metrics: Recency, Frequency, and Monetary value.
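As a quick illustration of the three metrics, here is a sketch on a tiny made-up order log (not the dataset analyzed below): for each customer we compute days since their last order (Recency), order count (Frequency), and total spend (Monetary).

```python
import pandas as pd

# Toy order log (made-up data, not the Brazilian e-commerce dataset used below)
orders = pd.DataFrame({
    "cid":    ["a", "a", "b", "c"],
    "date":   pd.to_datetime(["2018-01-10", "2018-03-01", "2018-02-15", "2018-01-20"]),
    "amount": [50.0, 30.0, 120.0, 75.0],
})

# Analysis date = most recent order overall, mirroring maxDate in the SQL below
snapshot = orders["date"].max()

rfm = orders.groupby("cid").agg(
    Recency=("date", lambda d: (snapshot - d.max()).days),  # days since last order
    Frequency=("date", "count"),                            # number of orders
    Monetary=("amount", "sum"),                             # total spend
).reset_index()
print(rfm)
```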
Import libraries and connection information
# Loading libraries
import pandas as pd
import time
import sqlalchemy
from sqlalchemy import create_engine
import pymysql
# Loading the connection information
%load_ext sql
%sql mysql+pymysql://***:***@localhost:3306/brazilecom
The sql extension is already loaded. To reload it, use: %reload_ext sql
'Connected: root@brazilecom'
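The `%sql` magic is one way to run queries; an equivalent without magics is `pd.read_sql` with an SQLAlchemy engine. Below is a sketch of that pattern using an in-memory SQLite database as a stand-in, since the MySQL credentials above are masked and this example should run anywhere.

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite stands in for the MySQL connection (credentials above are masked)
engine = create_engine("sqlite://")

# Seed a tiny stand-in table so the query below has something to read
pd.DataFrame({"order_id": ["o1"], "payment_value": [48.91]}).to_sql(
    "order_payments", engine, index=False)

result = pd.read_sql("SELECT order_id, payment_value FROM order_payments", engine)
print(result)
```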
STEP 1¶
This first query collects and joins the necessary data. For this, we need to access three tables: Orders, Customers, and Payments.
The critical fields we need are the Payment Amount, the Customer Unique ID, and the Order Date.
%%sql
SELECT
CAST(ord.order_purchase_timestamp AS DATE) AS orderDate,
opay.payment_value AS amount,
ord.order_id AS oid,
cur.customer_unique_id AS cid,
(SELECT
CAST(MAX(order_purchase_timestamp) AS DATE)
FROM
orders) AS maxDate
FROM
orders AS ord
JOIN
customers AS cur ON ord.customer_id = cur.customer_id
JOIN
order_payments AS opay ON ord.order_id = opay.order_id
GROUP BY oid
LIMIT 5
* mysql+pymysql://root:***@localhost:3306/brazilecom 5 rows affected.
| orderDate | amount | oid | cid | maxDate |
|---|---|---|---|---|
| 2018-01-14 | 48.91 | dc400373e624e9b0435847a79813c658 | d02cbb85434c84acabb8e81bbf064d3e | 2018-10-17 |
| 2018-05-23 | 20.75 | b616f638cd86d92c7cbd06cf7484ee34 | 23daf66dd86071f57d332fa5562afbe9 | 2018-10-17 |
| 2017-10-04 | 76.17 | 407408433ee2a9ed77984aa5a06e2b2e | 350e8172d9a1ed3cea31f50e09de726f | 2018-10-17 |
| 2018-07-03 | 159.07 | aa466a92af916f9f68e56fce7d3a21b1 | a4031c5f35399c228a1cd7ed7431fd02 | 2018-10-17 |
| 2017-05-12 | 45.69 | a41753c6a1d8accb89732e36243432d7 | dc83fe8f72fb4388f079a2f9b586240c | 2018-10-17 |
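For reference, the join above can be mirrored in pandas. This is a sketch with made-up stand-in rows; it assumes the three tables use the column names from the SQL schema above.

```python
import pandas as pd

# Stand-in tables with the same column names as the SQL schema (made-up rows)
orders = pd.DataFrame({
    "order_id": ["o1", "o2"],
    "customer_id": ["c1", "c2"],
    "order_purchase_timestamp": pd.to_datetime(["2018-01-14", "2018-05-23"]),
})
customers = pd.DataFrame({
    "customer_id": ["c1", "c2"],
    "customer_unique_id": ["u1", "u2"],
})
order_payments = pd.DataFrame({
    "order_id": ["o1", "o2"],
    "payment_value": [48.91, 20.75],
})

# Mirror the two inner joins from the SQL above
step1 = (orders
         .merge(customers, on="customer_id")
         .merge(order_payments, on="order_id")
         .rename(columns={"order_purchase_timestamp": "orderDate",
                          "payment_value": "amount",
                          "order_id": "oid",
                          "customer_unique_id": "cid"}))
step1["maxDate"] = step1["orderDate"].max()  # equivalent of the scalar subquery
print(step1[["orderDate", "amount", "oid", "cid", "maxDate"]])
```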
STEP 2¶
The next step is to create a calculated table that gives us the baseline Recency, Frequency, and Monetary values per customer. Note that only customers with a total spend above $100 are included.
%%sql
-- STEP 1
with temp as (
SELECT
CAST(ord.order_purchase_timestamp AS DATE) AS orderDate,
opay.payment_value AS amount,
ord.order_id AS oid,
cur.customer_unique_id AS cid,
(SELECT
CAST(MAX(order_purchase_timestamp) AS DATE)
FROM
orders) AS maxDate
FROM
orders AS ord
JOIN
customers AS cur ON ord.customer_id = cur.customer_id
JOIN
order_payments AS opay ON ord.order_id = opay.order_id
GROUP BY oid
)
-- STEP 2
-- This 'rfm' table will run baseline r-f-m calculations per customer.
-- Note: we only include customers with a spend above $100.
SELECT
datediff(maxDate, max(orderDate)) as Recency,
count(cid) as Frequency,
sum(amount) as Monetary,
cid
FROM temp
GROUP BY cid
HAVING monetary > 100
ORDER BY monetary desc
limit 5
* mysql+pymysql://root:***@localhost:3306/brazilecom 5 rows affected.
| Recency | Frequency | Monetary | cid |
|---|---|---|---|
| 383 | 1 | 13664.08 | 0a0a92112bd4c708ca5fde585afaa872 |
| 61 | 3 | 9553.02 | 46450c74a0d8c5ca9395da1daac6c120 |
| 564 | 2 | 7571.63 | da122df9eeddfedc1dc1f5349a1a690c |
| 94 | 1 | 7274.88 | 763c8b1c9c68a0229c42c9fc6f662b93 |
| 612 | 1 | 6929.31 | dc4802a71eae9be1dd28f5d788ceb526 |
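The `HAVING monetary > 100` clause above filters groups after aggregation (unlike `WHERE`, which filters rows before it). In pandas, the same effect is a boolean filter applied to the aggregated result; a sketch with made-up rows:

```python
import pandas as pd

# Made-up per-order rows for two customers
temp = pd.DataFrame({
    "cid": ["a", "a", "b"],
    "amount": [80.0, 40.0, 60.0],
})

# Aggregate first (GROUP BY), then filter the aggregates (HAVING)
rfm = (temp.groupby("cid", as_index=False)["amount"].sum()
           .rename(columns={"amount": "Monetary"}))
rfm = rfm[rfm["Monetary"] > 100]  # HAVING Monetary > 100
print(rfm)
```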
STEP 3¶
In the next step, we take the Recency, Frequency, and Monetary values, sort each in sequential order, then create a rank from 1 to 100, where a higher number is better/more valuable. We then sum those ranks into a single score, which we later use to segment the customers.
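SQL's `NTILE(n)` assigns each row, in sorted order, to one of `n` near-equal buckets. A pandas approximation is to bucket ordinal ranks with `pd.qcut` (edge bucket sizes may differ by one row from MySQL's `NTILE`); the sketch below uses NTILE(4) on made-up data so it stays readable, where the SQL above uses NTILE(100).

```python
import pandas as pd

def ntile(series, n, ascending=True):
    """Approximate SQL NTILE(n): bucket rows 1..n by their ordered position."""
    ranks = series.rank(method="first", ascending=ascending)
    return pd.qcut(ranks, n, labels=False) + 1

rfm = pd.DataFrame({
    "cid": list("abcdefgh"),
    "Recency": [10, 40, 5, 90, 30, 60, 20, 70],
    "Frequency": [3, 1, 5, 1, 2, 1, 4, 1],
    "Monetary": [500, 120, 900, 101, 300, 150, 700, 110],
})
N = 4  # NTILE(4) for this toy example
rfm["rRank"] = ntile(rfm["Recency"], N, ascending=False)  # low recency -> high rank
rfm["fRank"] = ntile(rfm["Frequency"], N)
rfm["mRank"] = ntile(rfm["Monetary"], N)
rfm["score"] = rfm["rRank"] + rfm["fRank"] + rfm["mRank"]
print(rfm.sort_values("score", ascending=False))
```

Customer "c" (most recent, most frequent, biggest spender) tops every tile, while "d" bottoms out all three.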
%%sql
-- STEP 1
with temp as (
SELECT
CAST(ord.order_purchase_timestamp AS DATE) AS orderDate,
opay.payment_value AS amount,
ord.order_id AS oid,
cur.customer_unique_id AS cid,
(SELECT
CAST(MAX(order_purchase_timestamp) AS DATE)
FROM
orders) AS maxDate
FROM
orders AS ord
JOIN
customers AS cur ON ord.customer_id = cur.customer_id
JOIN
order_payments AS opay ON ord.order_id = opay.order_id
GROUP BY oid
),
-- STEP 2
-- This 'rfm' table will run baseline r-f-m calculations per customer.
-- Note: we only include customers with a spend above $100.
rfm as (
SELECT
datediff(maxDate, max(orderDate)) as Recency,
count(cid) as Frequency,
round(sum(amount),2) as Monetary,
cid
FROM temp
GROUP BY cid
HAVING monetary > 100
ORDER BY monetary desc
)
-- STEP 3
-- This 'rfmRanking' table gives us an ordered percentile rank per evaluation metric (R-F-M).
-- This is one area that can be tweaked to optimize the groups (NTILE(10) for deciles, NTILE(20) for vigintiles, etc.).
SELECT *,
(rRank + fRank + mRank) AS score
FROM (
SELECT
Recency as R,
Frequency as F,
Monetary as M,
NTILE(100) over(ORDER BY Recency DESC) as rRank,
NTILE(100) over(ORDER BY Frequency) as fRank,
NTILE(100) over(ORDER BY Monetary) as mRank,
cid
FROM rfm
) as sub1
ORDER BY score DESC
LIMIT 5
* mysql+pymysql://root:***@localhost:3306/brazilecom 5 rows affected.
| R | F | M | rRank | fRank | mRank | cid | score |
|---|---|---|---|---|---|---|---|
| 55 | 6 | 1094.63 | 100 | 100 | 99 | dc813062e0fc23409cd255f7f53c7074 | 299 |
| 61 | 3 | 9553.02 | 99 | 100 | 100 | 46450c74a0d8c5ca9395da1daac6c120 | 299 |
| 65 | 4 | 1760.75 | 98 | 100 | 100 | 4facc2e6fbc2bffab2fea92d2b4aa7e4 | 298 |
| 58 | 17 | 927.63 | 99 | 100 | 98 | 8d50f5eadf50201ccdcedfb9e2ac8455 | 297 |
| 70 | 4 | 4655.91 | 97 | 100 | 100 | c8460e4251689ba205045f3ea17884a1 | 297 |
STEP 4¶
In this step, we take the consolidated scores from the last table's output and partition them into four segments. The top range of scores distinguishes our best customers, the VIPs, while the bottom range distinguishes our lowest-value customers.
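The `CASE` expression that labels the segments checks its thresholds top-down, first match wins. A sketch of the same logic in Python, applied to a few hypothetical `GroupRank` values:

```python
import pandas as pd

# Hypothetical GroupRank values (the SQL above assigns 1-100 via NTILE)
scores = pd.DataFrame({"GroupRank": [100, 92, 80, 60, 30]})

def segment(group_rank):
    """Mirror of the SQL CASE expression: thresholds checked top-down."""
    if group_rank >= 90:
        return "Vip"
    if group_rank >= 75:
        return "Loyal"
    if group_rank >= 50:
        return "Insider"
    return "Starter"

scores["Segment"] = scores["GroupRank"].map(segment)
print(scores)
```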
%%sql
-- STEP 1
with temp as (
SELECT
CAST(ord.order_purchase_timestamp AS DATE) AS orderDate,
opay.payment_value AS amount,
ord.order_id AS oid,
cur.customer_unique_id AS cid,
(SELECT
CAST(MAX(order_purchase_timestamp) AS DATE)
FROM
orders) AS maxDate
FROM
orders AS ord
JOIN
customers AS cur ON ord.customer_id = cur.customer_id
JOIN
order_payments AS opay ON ord.order_id = opay.order_id
GROUP BY oid
),
-- STEP 2
-- This 'rfm' table will run baseline r-f-m calculations per customer.
-- Note: we only include customers with a spend above $100.
rfm as (
SELECT
maxDate,
max(orderDate) as mostRecent,
datediff(maxDate, max(orderDate)) as Recency,
count(cid) as Frequency,
round(sum(amount),2) as Monetary,
cid
FROM temp
GROUP BY cid
HAVING monetary > 100
ORDER BY monetary desc
),
-- STEP 3
-- This 'rfmRanking' table gives us an ordered percentile rank per evaluation metric (R-F-M).
-- This is one area that can be tweaked to optimize the groups (NTILE(10) for deciles, NTILE(20) for vigintiles, etc.).
rfmRanking as (
SELECT *,
(rRank + fRank + mRank) AS score
FROM (
SELECT
Recency as R,
Frequency as F,
Monetary as M,
NTILE(100) over(ORDER BY Recency DESC) as rRank,
NTILE(100) over(ORDER BY Frequency) as fRank,
NTILE(100) over(ORDER BY Monetary) as mRank,
cid
FROM rfm
) as sub1
ORDER BY score DESC
)
-- STEP 4
-- This final query gives us the customer ranking/grouping by decreasing value and importance.
-- 'GroupRank' = 100 holds our most valuable customers, while 1 holds the least valuable.
-- This query gives us the segment label per customer, plus a list we can send to marketing.
-- The NTILE function allows us to tweak and optimize the groups (deciles, vigintiles, etc.),
-- so this query can be continually reassessed if needed. It's also possible to further segment
-- each 'GroupRank', e.g. splitting the top group to find the 'absolute' best customers.
select *,
CASE
WHEN GroupRank >= 90 THEN 'Vip'
WHEN GroupRank >= 75 THEN 'Loyal'
WHEN GroupRank >= 50 THEN 'Insider'
ELSE 'Starter'
END AS Segment
FROM (SELECT *,
NTILE(100) OVER(ORDER by score) as GroupRank
From rfmRanking) AS sub1
ORDER BY score DESC
limit 5
* mysql+pymysql://root:***@localhost:3306/brazilecom 5 rows affected.
| R | F | M | rRank | fRank | mRank | cid | score | GroupRank | Segment |
|---|---|---|---|---|---|---|---|---|---|
| 55 | 6 | 1094.63 | 100 | 100 | 99 | dc813062e0fc23409cd255f7f53c7074 | 299 | 100 | Vip |
| 61 | 3 | 9553.02 | 99 | 100 | 100 | 46450c74a0d8c5ca9395da1daac6c120 | 299 | 100 | Vip |
| 65 | 4 | 1760.75 | 98 | 100 | 100 | 4facc2e6fbc2bffab2fea92d2b4aa7e4 | 298 | 100 | Vip |
| 58 | 17 | 927.63 | 99 | 100 | 98 | 8d50f5eadf50201ccdcedfb9e2ac8455 | 297 | 100 | Vip |
| 50 | 2 | 895.36 | 100 | 98 | 98 | 4e1554d1a2dcb38dcf22770eb7390fe1 | 296 | 100 | Vip |
STEP 5 - Review¶
In this final table, we aggregate the values above to verify that our calculations are ordered correctly. Our VIP segment orders more recently on average, more frequently, and has a higher average Monetary spend than every other segment. These metrics are also consistent and monotonic across segments, which is exactly what we want.
%%sql
-- STEP 1
with temp as (
SELECT
CAST(ord.order_purchase_timestamp AS DATE) AS orderDate,
opay.payment_value AS amount,
ord.order_id AS oid,
cur.customer_unique_id AS cid,
(SELECT
CAST(MAX(order_purchase_timestamp) AS DATE)
FROM
orders) AS maxDate
FROM
orders AS ord
JOIN
customers AS cur ON ord.customer_id = cur.customer_id
JOIN
order_payments AS opay ON ord.order_id = opay.order_id
GROUP BY oid
),
-- STEP 2
-- This 'rfm' table will run baseline r-f-m calculations per customer.
-- Note: we only include customers with a spend above $100.
rfm as (
SELECT
maxDate,
max(orderDate) as mostRecent,
datediff(maxDate, max(orderDate)) as Recency,
count(cid) as Frequency,
sum(amount) as Monetary,
cid
FROM temp
GROUP BY cid
HAVING monetary > 100
ORDER BY monetary desc
),
-- STEP 3
-- This 'rfmRanking' table gives us an ordered percentile rank per evaluation metric (R-F-M).
-- This is one area that can be tweaked to optimize the groups (NTILE(10) for deciles, NTILE(20) for vigintiles, etc.).
rfmRanking as (
SELECT *,
(rRank + fRank + mRank) AS score
FROM (
SELECT
Recency as R,
Frequency as F,
Monetary as M,
NTILE(100) over(ORDER BY Recency DESC) as rRank,
NTILE(100) over(ORDER BY Frequency) as fRank,
NTILE(100) over(ORDER BY Monetary) as mRank,
cid
FROM rfm
) as sub1
ORDER BY score DESC
),
-- STEP 4
-- This final query gives us the customer ranking/grouping by decreasing value and importance.
-- 'GroupRank' = 100 holds our most valuable customers, while 1 holds the least valuable.
-- This query gives us the segment label per customer, plus a list we can send to marketing.
-- The NTILE function allows us to tweak and optimize the groups (deciles, vigintiles, etc.),
-- so this query can be continually reassessed if needed. It's also possible to further segment
-- each 'GroupRank', e.g. splitting the top group to find the 'absolute' best customers.
RFMFinalTable as (
select *,
CASE
WHEN GroupRank >= 90 THEN 'Vip'
WHEN GroupRank >= 75 THEN 'Loyal'
WHEN GroupRank >= 50 THEN 'Insider'
ELSE 'Starter'
END AS Segment
FROM (SELECT *,
NTILE(100) OVER(ORDER by score) as GroupRank
From rfmRanking) AS sub1
ORDER BY score DESC)
SELECT Segment,
round(avg(r)) as RecAvg,
round(avg(r) / sum(avg(r)) over()*100,1) as RecPerc,
round(avg(f), 2) as FreqAvg,
round(avg(f) / sum(avg(f)) over()*100,1) as FreqPerc,
round(avg(m)) as MonAvg,
round(avg(m) / sum(avg(m)) over()*100,1) as MonPerc,
count(cid) as CustCount,
round(count(cid) / sum(count(cid)) over()*100,1) as CustPerc
-- ,group_concat(cid SEPARATOR ", ") CustomerList
FROM RFMFinalTable
GROUP BY Segment
ORDER BY GroupRank DESC
* mysql+pymysql://root:***@localhost:3306/brazilecom 4 rows affected.
| Segment | RecAvg | RecPerc | FreqAvg | FreqPerc | MonAvg | MonPerc | CustCount | CustPerc |
|---|---|---|---|---|---|---|---|---|
| Vip | 231 | 21.4 | 1.39 | 31.2 | 591.0 | 42.4 | 5577 | 11.0 |
| Loyal | 258 | 23.8 | 1.05 | 23.4 | 414.0 | 29.7 | 7605 | 15.0 |
| Insider | 286 | 26.4 | 1.02 | 22.9 | 248.0 | 17.8 | 12683 | 25.0 |
| Starter | 307 | 28.4 | 1.01 | 22.5 | 139.0 | 10.0 | 24892 | 49.0 |
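The ordering claim can be checked mechanically. This sketch re-enters the averages from the summary table above and asserts each metric moves in one direction from Vip down to Starter:

```python
import pandas as pd

# Average values transcribed from the summary table above (Vip -> Starter)
summary = pd.DataFrame({
    "Segment": ["Vip", "Loyal", "Insider", "Starter"],
    "RecAvg": [231, 258, 286, 307],
    "FreqAvg": [1.39, 1.05, 1.02, 1.01],
    "MonAvg": [591.0, 414.0, 248.0, 139.0],
})

# From best to worst segment: recency should rise, frequency and spend should fall
assert summary["RecAvg"].is_monotonic_increasing
assert summary["FreqAvg"].is_monotonic_decreasing
assert summary["MonAvg"].is_monotonic_decreasing
print("All three metrics are ordered consistently across segments.")
```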
ALL SQL Code - Consolidated¶
%%sql
(
-- STEP 1
with temp as (
SELECT
CAST(ord.order_purchase_timestamp AS DATE) AS orderDate,
opay.payment_value AS amount,
ord.order_id AS oid,
cur.customer_unique_id AS cid,
(SELECT
CAST(MAX(order_purchase_timestamp) AS DATE)
FROM
orders) AS maxDate
FROM
orders AS ord
JOIN
customers AS cur ON ord.customer_id = cur.customer_id
JOIN
order_payments AS opay ON ord.order_id = opay.order_id
GROUP BY oid
),
-- STEP 2
-- This 'rfm' table will run baseline r-f-m calculations per customer.
-- Note: we only include customers with a spend above $100.
rfm as (
SELECT
maxDate,
max(orderDate) as mostRecent,
datediff(maxDate, max(orderDate)) as Recency,
count(cid) as Frequency,
sum(amount) as Monetary,
cid
FROM temp
GROUP BY cid
HAVING monetary > 100
ORDER BY monetary desc
),
-- STEP 3
-- This 'rfmRanking' table gives us an ordered percentile rank per evaluation metric (R-F-M).
-- This is one area that can be tweaked to optimize the groups (NTILE(10) for deciles, NTILE(20) for vigintiles, etc.).
rfmRanking as (
SELECT *,
(rRank + fRank + mRank) AS score
FROM (
SELECT
Recency as R,
Frequency as F,
Monetary as M,
NTILE(100) over(ORDER BY Recency DESC) as rRank,
NTILE(100) over(ORDER BY Frequency) as fRank,
NTILE(100) over(ORDER BY Monetary) as mRank,
cid
FROM rfm
) as sub1
ORDER BY score DESC
),
-- STEP 4
-- This final query gives us the customer ranking/grouping by decreasing value and importance.
-- 'GroupRank' = 100 holds our most valuable customers, while 1 holds the least valuable.
-- This query gives us the segment label per customer, plus a list we can send to marketing.
-- The NTILE function allows us to tweak and optimize the groups (deciles, vigintiles, etc.),
-- so this query can be continually reassessed if needed. It's also possible to further segment
-- each 'GroupRank', e.g. splitting the top group to find the 'absolute' best customers.
RFMFinalTable as (
select *,
CASE
WHEN GroupRank >= 90 THEN 'Vip'
WHEN GroupRank >= 75 THEN 'Loyal'
WHEN GroupRank >= 50 THEN 'Insider'
ELSE 'Starter'
END AS Segment
FROM (SELECT *,
NTILE(100) OVER(ORDER by score) as GroupRank
From rfmRanking) AS sub1
ORDER BY score DESC)
SELECT Segment,
round(avg(r)) as RecAvg,
round(avg(r) / sum(avg(r)) over()*100,1) as RecPerc,
round(avg(f), 2) as FreqAvg,
round(avg(f) / sum(avg(f)) over()*100,1) as FreqPerc,
round(avg(m)) as MonAvg,
round(avg(m) / sum(avg(m)) over()*100,1) as MonPerc,
count(cid) as CustCount,
round(count(cid) / sum(count(cid)) over()*100,1) as CustPerc
-- ,group_concat(cid SEPARATOR ", ") CustomerList
FROM RFMFinalTable
GROUP BY Segment
ORDER BY GroupRank DESC
)
Visualize Output¶
Reset Dataframe¶
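The `df` variable used below is not created in any cell shown here; presumably it was captured from the Step 5 `%sql` result (ipython-sql results can be converted with their `.DataFrame()` method). A minimal stand-in, transcribed from the summary table above, so the cells below are reproducible:

```python
import pandas as pd

# Stand-in for `df`, transcribed from the Step 5 summary output above.
# (In the original notebook this was presumably captured from the %sql result.)
df = pd.DataFrame({
    "Segment": ["Vip", "Loyal", "Insider", "Starter"],
    "RecPerc": [21.4, 23.8, 26.4, 28.4],
    "FreqPerc": [31.2, 23.4, 22.9, 22.5],
    "MonPerc": [42.4, 29.7, 17.8, 10.0],
})
```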
# Restructure dataframe for plotting
plotData = df.set_index('Segment').copy()
plotData = plotData[['RecPerc', 'FreqPerc', 'MonPerc']]
plotData.columns = ['Recency', 'Frequency', 'Monetary']
plotData
| Segment | Recency | Frequency | Monetary |
|---|---|---|---|
| Vip | 21.4 | 31.2 | 42.4 |
| Loyal | 23.8 | 23.4 | 29.7 |
| Insider | 26.4 | 22.9 | 17.8 |
| Starter | 28.4 | 22.5 | 10.0 |
Plot Data¶
Below is a visualization of each segment by our evaluation metrics. Although further refinement is possible, we can be happy with the consistency and monotonicity of the results: each group is clearly "better" than the next on every evaluation metric.
# Import libraries and plot
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')
sns.set_palette("Blues_d")
# Plot Visuals
fig, axes = plt.subplots(1,3, sharey=True, figsize=(12,8))
# fig.suptitle('Plot')
sns.barplot(x=plotData.index, y=plotData['Recency'], ax=axes[0])
axes[0].set_title('Recency - Lower is better')
sns.barplot(x=plotData.index, y=plotData['Frequency'], ax=axes[1])
axes[1].set_title('Frequency - Higher is better')
sns.barplot(x=plotData.index, y=plotData['Monetary'], ax=axes[2])
axes[2].set_title('Monetary - Higher is better')
plt.show()
