RFM ANALYSIS
MySQL, Python, Jupyter
RFM ANALYSIS - MySQL & PYTHON¶
The objective of this analysis is to segment online customers using three metrics: Recency, Frequency, and Monetary value.
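As a quick illustration of the three metrics, here is a sketch on a tiny made-up order log (not the dataset analyzed below): for each customer we compute days since their last order (Recency), order count (Frequency), and total spend (Monetary).

```python
import pandas as pd

# Toy order log (made-up data, not the Brazilian e-commerce dataset used below)
orders = pd.DataFrame({
    "cid":    ["a", "a", "b", "c"],
    "date":   pd.to_datetime(["2018-01-10", "2018-03-01", "2018-02-15", "2018-01-20"]),
    "amount": [50.0, 30.0, 120.0, 75.0],
})

# Analysis date = most recent order overall, mirroring maxDate in the SQL below
snapshot = orders["date"].max()

rfm = orders.groupby("cid").agg(
    Recency=("date", lambda d: (snapshot - d.max()).days),  # days since last order
    Frequency=("date", "count"),                            # number of orders
    Monetary=("amount", "sum"),                             # total spend
).reset_index()
print(rfm)
```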
Import libraries and connection information
# Loading libraries
import pandas as pd
import time
import sqlalchemy
from sqlalchemy import create_engine
import pymysql
# Loading the connection information
%load_ext sql
%sql mysql+pymysql://***:***@localhost:3306/brazilecom
The sql extension is already loaded. To reload it, use: %reload_ext sql
'Connected: root@brazilecom'
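The `%sql` magic is one way to run queries; an equivalent without magics is `pd.read_sql` with an SQLAlchemy engine. Below is a sketch of that pattern using an in-memory SQLite database as a stand-in, since the MySQL credentials above are masked and this example should run anywhere.

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite stands in for the MySQL connection (credentials above are masked)
engine = create_engine("sqlite://")

# Seed a tiny stand-in table so the query below has something to read
pd.DataFrame({"order_id": ["o1"], "payment_value": [48.91]}).to_sql(
    "order_payments", engine, index=False)

result = pd.read_sql("SELECT order_id, payment_value FROM order_payments", engine)
print(result)
```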
STEP 1¶
This first query collects and joins the necessary data. For this, we need to access three tables: Orders, Customers, and Payments.
The critical fields we need are the Payment Amount, the Customer Unique ID, and the Order Date.
%%sql
SELECT
CAST(ord.order_purchase_timestamp AS DATE) AS orderDate,
opay.payment_value AS amount,
ord.order_id AS oid,
cur.customer_unique_id AS cid,
(SELECT
CAST(MAX(order_purchase_timestamp) AS DATE)
FROM
orders) AS maxDate
FROM
orders AS ord
JOIN
customers AS cur ON ord.customer_id = cur.customer_id
JOIN
order_payments AS opay ON ord.order_id = opay.order_id
GROUP BY oid
LIMIT 5
* mysql+pymysql://root:***@localhost:3306/brazilecom 5 rows affected.
| orderDate | amount | oid | cid | maxDate |
|---|---|---|---|---|
| 2018-01-14 | 48.91 | dc400373e624e9b0435847a79813c658 | d02cbb85434c84acabb8e81bbf064d3e | 2018-10-17 |
| 2018-05-23 | 20.75 | b616f638cd86d92c7cbd06cf7484ee34 | 23daf66dd86071f57d332fa5562afbe9 | 2018-10-17 |
| 2017-10-04 | 76.17 | 407408433ee2a9ed77984aa5a06e2b2e | 350e8172d9a1ed3cea31f50e09de726f | 2018-10-17 |
| 2018-07-03 | 159.07 | aa466a92af916f9f68e56fce7d3a21b1 | a4031c5f35399c228a1cd7ed7431fd02 | 2018-10-17 |
| 2017-05-12 | 45.69 | a41753c6a1d8accb89732e36243432d7 | dc83fe8f72fb4388f079a2f9b586240c | 2018-10-17 |
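For reference, the join above can be mirrored in pandas. This is a sketch with made-up stand-in rows; it assumes the three tables use the column names from the SQL schema above.

```python
import pandas as pd

# Stand-in tables with the same column names as the SQL schema (made-up rows)
orders = pd.DataFrame({
    "order_id": ["o1", "o2"],
    "customer_id": ["c1", "c2"],
    "order_purchase_timestamp": pd.to_datetime(["2018-01-14", "2018-05-23"]),
})
customers = pd.DataFrame({
    "customer_id": ["c1", "c2"],
    "customer_unique_id": ["u1", "u2"],
})
order_payments = pd.DataFrame({
    "order_id": ["o1", "o2"],
    "payment_value": [48.91, 20.75],
})

# Mirror the two inner joins from the SQL above
step1 = (orders
         .merge(customers, on="customer_id")
         .merge(order_payments, on="order_id")
         .rename(columns={"order_purchase_timestamp": "orderDate",
                          "payment_value": "amount",
                          "order_id": "oid",
                          "customer_unique_id": "cid"}))
step1["maxDate"] = step1["orderDate"].max()  # equivalent of the scalar subquery
print(step1[["orderDate", "amount", "oid", "cid", "maxDate"]])
```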
STEP 2¶
The next step is to create a calculated table that gives us the baseline Recency, Frequency, and Monetary values per customer. Note that only customers with a total spend above $100 are included.
%%sql
-- STEP 1
with temp as (
SELECT
CAST(ord.order_purchase_timestamp AS DATE) AS orderDate,
opay.payment_value AS amount,
ord.order_id AS oid,
cur.customer_unique_id AS cid,
(SELECT
CAST(MAX(order_purchase_timestamp) AS DATE)
FROM
orders) AS maxDate
FROM
orders AS ord
JOIN
customers AS cur ON ord.customer_id = cur.customer_id
JOIN
order_payments AS opay ON ord.order_id = opay.order_id
GROUP BY oid
)
-- STEP 2
-- This 'rfm' table will run baseline r-f-m calculations per customer.
-- Note: we only include customers with a spend above $100.
SELECT
datediff(maxDate, max(orderDate)) as Recency,
count(cid) as Frequency,
sum(amount) as Monetary,
cid
FROM temp
GROUP BY cid
HAVING monetary > 100
ORDER BY monetary desc
limit 5
* mysql+pymysql://root:***@localhost:3306/brazilecom 5 rows affected.
| Recency | Frequency | Monetary | cid |
|---|---|---|---|
| 383 | 1 | 13664.08 | 0a0a92112bd4c708ca5fde585afaa872 |
| 61 | 3 | 9553.02 | 46450c74a0d8c5ca9395da1daac6c120 |
| 564 | 2 | 7571.63 | da122df9eeddfedc1dc1f5349a1a690c |
| 94 | 1 | 7274.88 | 763c8b1c9c68a0229c42c9fc6f662b93 |
| 612 | 1 | 6929.31 | dc4802a71eae9be1dd28f5d788ceb526 |
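The `HAVING monetary > 100` clause above filters groups after aggregation (unlike `WHERE`, which filters rows before it). In pandas, the same effect is a boolean filter applied to the aggregated result; a sketch with made-up rows:

```python
import pandas as pd

# Made-up per-order rows for two customers
temp = pd.DataFrame({
    "cid": ["a", "a", "b"],
    "amount": [80.0, 40.0, 60.0],
})

# Aggregate first (GROUP BY), then filter the aggregates (HAVING)
rfm = (temp.groupby("cid", as_index=False)["amount"].sum()
           .rename(columns={"amount": "Monetary"}))
rfm = rfm[rfm["Monetary"] > 100]  # HAVING Monetary > 100
print(rfm)
```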
STEP 3¶
In the next step, we take the Recency, Frequency, and Monetary values, sort each in sequential order, then create a rank from 1 to 100, where a higher number is better/more valuable. We then sum those ranks into a single score, which we later use to segment the customers.
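SQL's `NTILE(n)` assigns each row, in sorted order, to one of `n` near-equal buckets. A pandas approximation is to bucket ordinal ranks with `pd.qcut` (edge bucket sizes may differ by one row from MySQL's `NTILE`); the sketch below uses NTILE(4) on made-up data so it stays readable, where the SQL above uses NTILE(100).

```python
import pandas as pd

def ntile(series, n, ascending=True):
    """Approximate SQL NTILE(n): bucket rows 1..n by their ordered position."""
    ranks = series.rank(method="first", ascending=ascending)
    return pd.qcut(ranks, n, labels=False) + 1

rfm = pd.DataFrame({
    "cid": list("abcdefgh"),
    "Recency": [10, 40, 5, 90, 30, 60, 20, 70],
    "Frequency": [3, 1, 5, 1, 2, 1, 4, 1],
    "Monetary": [500, 120, 900, 101, 300, 150, 700, 110],
})
N = 4  # NTILE(4) for this toy example
rfm["rRank"] = ntile(rfm["Recency"], N, ascending=False)  # low recency -> high rank
rfm["fRank"] = ntile(rfm["Frequency"], N)
rfm["mRank"] = ntile(rfm["Monetary"], N)
rfm["score"] = rfm["rRank"] + rfm["fRank"] + rfm["mRank"]
print(rfm.sort_values("score", ascending=False))
```

Customer "c" (most recent, most frequent, biggest spender) tops every tile, while "d" bottoms out all three.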
%%sql
-- STEP 1
with temp as (
SELECT
CAST(ord.order_purchase_timestamp AS DATE) AS orderDate,
opay.payment_value AS amount,
ord.order_id AS oid,
cur.customer_unique_id AS cid,
(SELECT
CAST(MAX(order_purchase_timestamp) AS DATE)
FROM
orders) AS maxDate
FROM
orders AS ord
JOIN
customers AS cur ON ord.customer_id = cur.customer_id
JOIN
order_payments AS opay ON ord.order_id = opay.order_id
GROUP BY oid
),
-- STEP 2
-- This 'rfm' table will run baseline r-f-m calculations per customer.
-- Note: we only include customers with a spend above $100.
rfm as (
SELECT
datediff(maxDate, max(orderDate)) as Recency,
count(cid) as Frequency,
round(sum(amount),2) as Monetary,
cid
FROM temp
GROUP BY cid
HAVING monetary > 100
ORDER BY monetary desc
)
-- STEP 3
-- This 'rfmRanking' table gives us an ordered percentile rank per evaluation metric (R-F-M).
-- This is one area that can be tweaked to optimize the groups (NTILE(10) for deciles, NTILE(20) for vigintiles, etc.).
SELECT *,
(rRank + fRank + mRank) AS score
FROM (
SELECT
Recency as R,
Frequency as F,
Monetary as M,
NTILE(100) over(ORDER BY Recency DESC) as rRank,
NTILE(100) over(ORDER BY Frequency) as fRank,
NTILE(100) over(ORDER BY Monetary) as mRank,
cid
FROM rfm
) as sub1
ORDER BY score DESC
LIMIT 5
* mysql+pymysql://root:***@localhost:3306/brazilecom 5 rows affected.
| R | F | M | rRank | fRank | mRank | cid | score |
|---|---|---|---|---|---|---|---|
| 55 | 6 | 1094.63 | 100 | 100 | 99 | dc813062e0fc23409cd255f7f53c7074 | 299 |
| 61 | 3 | 9553.02 | 99 | 100 | 100 | 46450c74a0d8c5ca9395da1daac6c120 | 299 |
| 65 | 4 | 1760.75 | 98 | 100 | 100 | 4facc2e6fbc2bffab2fea92d2b4aa7e4 | 298 |
| 58 | 17 | 927.63 | 99 | 100 | 98 | 8d50f5eadf50201ccdcedfb9e2ac8455 | 297 |
| 70 | 4 | 4655.91 | 97 | 100 | 100 | c8460e4251689ba205045f3ea17884a1 | 297 |
STEP 4¶
In this step, we take the consolidated scores from the last table's output and partition them into four segments. The top range of scores distinguishes our best customers, the VIPs, while the bottom range distinguishes our lowest-value customers.
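The `CASE` expression that labels the segments checks its thresholds top-down, first match wins. A sketch of the same logic in Python, applied to a few hypothetical `GroupRank` values:

```python
import pandas as pd

# Hypothetical GroupRank values (the SQL above assigns 1-100 via NTILE)
scores = pd.DataFrame({"GroupRank": [100, 92, 80, 60, 30]})

def segment(group_rank):
    """Mirror of the SQL CASE expression: thresholds checked top-down."""
    if group_rank >= 90:
        return "Vip"
    if group_rank >= 75:
        return "Loyal"
    if group_rank >= 50:
        return "Insider"
    return "Starter"

scores["Segment"] = scores["GroupRank"].map(segment)
print(scores)
```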
%%sql
-- STEP 1
with temp as (
SELECT
CAST(ord.order_purchase_timestamp AS DATE) AS orderDate,
opay.payment_value AS amount,
ord.order_id AS oid,
cur.customer_unique_id AS cid,
(SELECT
CAST(MAX(order_purchase_timestamp) AS DATE)
FROM
orders) AS maxDate
FROM
orders AS ord
JOIN
customers AS cur ON ord.customer_id = cur.customer_id
JOIN
order_payments AS opay ON ord.order_id = opay.order_id
GROUP BY oid
),
-- STEP 2
-- This 'rfm' table will run baseline r-f-m calculations per customer.
-- Note: we only include customers with a spend above $100.
rfm as (
SELECT
maxDate,
max(orderDate) as mostRecent,
datediff(maxDate, max(orderDate)) as Recency,
count(cid) as Frequency,
round(sum(amount),2) as Monetary,
cid
FROM temp
GROUP BY cid
HAVING monetary > 100
ORDER BY monetary desc
),
-- STEP 3
-- This 'rfmRanking' table gives us an ordered percentile rank per evaluation metric (R-F-M).
-- This is one area that can be tweaked to optimize the groups (NTILE(10) for deciles, NTILE(20) for vigintiles, etc.).
rfmRanking as (
SELECT *,
(rRank + fRank + mRank) AS score
FROM (
SELECT
Recency as R,
Frequency as F,
Monetary as M,
NTILE(100) over(ORDER BY Recency DESC) as rRank,
NTILE(100) over(ORDER BY Frequency) as fRank,
NTILE(100) over(ORDER BY Monetary) as mRank,
cid
FROM rfm
) as sub1
ORDER BY score DESC
)
-- STEP 4
-- This final query gives us the customer ranking/grouping by decreasing value and importance.
-- 'GroupRank' = 100 holds our most valuable customers, while 1 holds the least valuable.
-- This query gives us the segment label per customer, plus a list we can send to marketing.
-- The NTILE function allows us to tweak and optimize the groups (deciles, vigintiles, etc.),
-- so this query can be continually reassessed if needed. It's also possible to further segment
-- each 'GroupRank', e.g. splitting the top group to find the 'absolute' best customers.
select *,
CASE
WHEN GroupRank >= 90 THEN 'Vip'
WHEN GroupRank >= 75 THEN 'Loyal'
WHEN GroupRank >= 50 THEN 'Insider'
ELSE 'Starter'
END AS Segment
FROM (SELECT *,
NTILE(100) OVER(ORDER by score) as GroupRank
From rfmRanking) AS sub1
ORDER BY score DESC
limit 5
* mysql+pymysql://root:***@localhost:3306/brazilecom 5 rows affected.
| R | F | M | rRank | fRank | mRank | cid | score | GroupRank | Segment |
|---|---|---|---|---|---|---|---|---|---|
| 55 | 6 | 1094.63 | 100 | 100 | 99 | dc813062e0fc23409cd255f7f53c7074 | 299 | 100 | Vip |
| 61 | 3 | 9553.02 | 99 | 100 | 100 | 46450c74a0d8c5ca9395da1daac6c120 | 299 | 100 | Vip |
| 65 | 4 | 1760.75 | 98 | 100 | 100 | 4facc2e6fbc2bffab2fea92d2b4aa7e4 | 298 | 100 | Vip |
| 58 | 17 | 927.63 | 99 | 100 | 98 | 8d50f5eadf50201ccdcedfb9e2ac8455 | 297 | 100 | Vip |
| 50 | 2 | 895.36 | 100 | 98 | 98 | 4e1554d1a2dcb38dcf22770eb7390fe1 | 296 | 100 | Vip |
STEP 5 - Review¶
In this final table, we aggregate the values above to verify that our calculations are ordered correctly. Our VIP segment orders more recently on average, more frequently, and has a higher average Monetary spend than every other segment. These metrics are also consistent and monotonic across segments, which is exactly what we want.
%%sql
-- STEP 1
with temp as (
SELECT
CAST(ord.order_purchase_timestamp AS DATE) AS orderDate,
opay.payment_value AS amount,
ord.order_id AS oid,
cur.customer_unique_id AS cid,
(SELECT
CAST(MAX(order_purchase_timestamp) AS DATE)
FROM
orders) AS maxDate
FROM
orders AS ord
JOIN
customers AS cur ON ord.customer_id = cur.customer_id
JOIN
order_payments AS opay ON ord.order_id = opay.order_id
GROUP BY oid
),
-- STEP 2
-- This 'rfm' table will run baseline r-f-m calculations per customer.
-- Note: we only include customers with a spend above $100.
rfm as (
SELECT
maxDate,
max(orderDate) as mostRecent,
datediff(maxDate, max(orderDate)) as Recency,
count(cid) as Frequency,
sum(amount) as Monetary,
cid
FROM temp
GROUP BY cid
HAVING monetary > 100
ORDER BY monetary desc
),
-- STEP 3
-- This 'rfmRanking' table gives us an ordered percentile rank per evaluation metric (R-F-M).
-- This is one area that can be tweaked to optimize the groups (NTILE(10) for deciles, NTILE(20) for vigintiles, etc.).
rfmRanking as (
SELECT *,
(rRank + fRank + mRank) AS score
FROM (
SELECT
Recency as R,
Frequency as F,
Monetary as M,
NTILE(100) over(ORDER BY Recency DESC) as rRank,
NTILE(100) over(ORDER BY Frequency) as fRank,
NTILE(100) over(ORDER BY Monetary) as mRank,
cid
FROM rfm
) as sub1
ORDER BY score DESC
),
-- STEP 4
-- This final query gives us the customer ranking/grouping by decreasing value and importance.
-- 'GroupRank' = 100 holds our most valuable customers, while 1 holds the least valuable.
-- This query gives us the segment label per customer, plus a list we can send to marketing.
-- The NTILE function allows us to tweak and optimize the groups (deciles, vigintiles, etc.),
-- so this query can be continually reassessed if needed. It's also possible to further segment
-- each 'GroupRank', e.g. splitting the top group to find the 'absolute' best customers.
RFMFinalTable as (
select *,
CASE
WHEN GroupRank >= 90 THEN 'Vip'
WHEN GroupRank >= 75 THEN 'Loyal'
WHEN GroupRank >= 50 THEN 'Insider'
ELSE 'Starter'
END AS Segment
FROM (SELECT *,
NTILE(100) OVER(ORDER by score) as GroupRank
From rfmRanking) AS sub1
ORDER BY score DESC)
SELECT Segment,
round(avg(r)) as RecAvg,
round(avg(r) / sum(avg(r)) over()*100,1) as RecPerc,
round(avg(f), 2) as FreqAvg,
round(avg(f) / sum(avg(f)) over()*100,1) as FreqPerc,
round(avg(m)) as MonAvg,
round(avg(m) / sum(avg(m)) over()*100,1) as MonPerc,
count(cid) as CustCount,
round(count(cid) / sum(count(cid)) over()*100,1) as CustPerc
-- ,group_concat(cid SEPARATOR ", ") CustomerList
FROM RFMFinalTable
GROUP BY Segment
ORDER BY GroupRank DESC
* mysql+pymysql://root:***@localhost:3306/brazilecom 4 rows affected.
| Segment | RecAvg | RecPerc | FreqAvg | FreqPerc | MonAvg | MonPerc | CustCount | CustPerc |
|---|---|---|---|---|---|---|---|---|
| Vip | 231 | 21.4 | 1.39 | 31.2 | 591.0 | 42.4 | 5577 | 11.0 |
| Loyal | 258 | 23.8 | 1.05 | 23.4 | 414.0 | 29.7 | 7605 | 15.0 |
| Insider | 286 | 26.4 | 1.02 | 22.9 | 248.0 | 17.8 | 12683 | 25.0 |
| Starter | 307 | 28.4 | 1.01 | 22.5 | 139.0 | 10.0 | 24892 | 49.0 |
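The ordering claim can be checked mechanically. This sketch re-enters the averages from the summary table above and asserts each metric moves in one direction from Vip down to Starter:

```python
import pandas as pd

# Average values transcribed from the summary table above (Vip -> Starter)
summary = pd.DataFrame({
    "Segment": ["Vip", "Loyal", "Insider", "Starter"],
    "RecAvg": [231, 258, 286, 307],
    "FreqAvg": [1.39, 1.05, 1.02, 1.01],
    "MonAvg": [591.0, 414.0, 248.0, 139.0],
})

# From best to worst segment: recency should rise, frequency and spend should fall
assert summary["RecAvg"].is_monotonic_increasing
assert summary["FreqAvg"].is_monotonic_decreasing
assert summary["MonAvg"].is_monotonic_decreasing
print("All three metrics are ordered consistently across segments.")
```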
ALL SQL Code - Consolidated¶
%%sql
(
-- STEP 1
with temp as (
SELECT
CAST(ord.order_purchase_timestamp AS DATE) AS orderDate,
opay.payment_value AS amount,
ord.order_id AS oid,
cur.customer_unique_id AS cid,
(SELECT
CAST(MAX(order_purchase_timestamp) AS DATE)
FROM
orders) AS maxDate
FROM
orders AS ord
JOIN
customers AS cur ON ord.customer_id = cur.customer_id
JOIN
order_payments AS opay ON ord.order_id = opay.order_id
GROUP BY oid
),
-- STEP 2
-- This 'rfm' table will run baseline r-f-m calculations per customer.
-- Note: we only include customers with a spend above $100.
rfm as (
SELECT
maxDate,
max(orderDate) as mostRecent,
datediff(maxDate, max(orderDate)) as Recency,
count(cid) as Frequency,
sum(amount) as Monetary,
cid
FROM temp
GROUP BY cid
HAVING monetary > 100
ORDER BY monetary desc
),
-- STEP 3
-- This 'rfmRanking' table gives us an ordered percentile rank per evaluation metric (R-F-M).
-- This is one area that can be tweaked to optimize the groups (NTILE(10) for deciles, NTILE(20) for vigintiles, etc.).
rfmRanking as (
SELECT *,
(rRank + fRank + mRank) AS score
FROM (
SELECT
Recency as R,
Frequency as F,
Monetary as M,
NTILE(100) over(ORDER BY Recency DESC) as rRank,
NTILE(100) over(ORDER BY Frequency) as fRank,
NTILE(100) over(ORDER BY Monetary) as mRank,
cid
FROM rfm
) as sub1
ORDER BY score DESC
),
-- STEP 4
-- This final query gives us the customer ranking/grouping by decreasing value and importance.
-- 'GroupRank' = 100 holds our most valuable customers, while 1 holds the least valuable.
-- This query gives us the segment label per customer, plus a list we can send to marketing.
-- The NTILE function allows us to tweak and optimize the groups (deciles, vigintiles, etc.),
-- so this query can be continually reassessed if needed. It's also possible to further segment
-- each 'GroupRank', e.g. splitting the top group to find the 'absolute' best customers.
RFMFinalTable as (
select *,
CASE
WHEN GroupRank >= 90 THEN 'Vip'
WHEN GroupRank >= 75 THEN 'Loyal'
WHEN GroupRank >= 50 THEN 'Insider'
ELSE 'Starter'
END AS Segment
FROM (SELECT *,
NTILE(100) OVER(ORDER by score) as GroupRank
From rfmRanking) AS sub1
ORDER BY score DESC)
SELECT Segment,
round(avg(r)) as RecAvg,
round(avg(r) / sum(avg(r)) over()*100,1) as RecPerc,
round(avg(f), 2) as FreqAvg,
round(avg(f) / sum(avg(f)) over()*100,1) as FreqPerc,
round(avg(m)) as MonAvg,
round(avg(m) / sum(avg(m)) over()*100,1) as MonPerc,
count(cid) as CustCount,
round(count(cid) / sum(count(cid)) over()*100,1) as CustPerc
-- ,group_concat(cid SEPARATOR ", ") CustomerList
FROM RFMFinalTable
GROUP BY Segment
ORDER BY GroupRank DESC
)
Visualize Output¶
Reset Dataframe¶
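The `df` variable used below is not created in any cell shown here; presumably it was captured from the Step 5 `%sql` result (ipython-sql results can be converted with their `.DataFrame()` method). A minimal stand-in, transcribed from the summary table above, so the cells below are reproducible:

```python
import pandas as pd

# Stand-in for `df`, transcribed from the Step 5 summary output above.
# (In the original notebook this was presumably captured from the %sql result.)
df = pd.DataFrame({
    "Segment": ["Vip", "Loyal", "Insider", "Starter"],
    "RecPerc": [21.4, 23.8, 26.4, 28.4],
    "FreqPerc": [31.2, 23.4, 22.9, 22.5],
    "MonPerc": [42.4, 29.7, 17.8, 10.0],
})
```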
# Restructure dataframe for plotting
plotData = df.set_index('Segment').copy()
plotData = plotData[['RecPerc', 'FreqPerc', 'MonPerc']]
plotData.columns = ['Recency', 'Frequency', 'Monetary']
plotData
| Segment | Recency | Frequency | Monetary |
|---|---|---|---|
| Vip | 21.4 | 31.2 | 42.4 |
| Loyal | 23.8 | 23.4 | 29.7 |
| Insider | 26.4 | 22.9 | 17.8 |
| Starter | 28.4 | 22.5 | 10.0 |
Plot Data¶
Below is a visualization of each segment by our evaluation metrics. Although further refinement is possible, we can be happy with the consistency and monotonicity of the results: each group is clearly "better" than the next on every evaluation metric.
# Import libraries and plot
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')
sns.set_palette("Blues_d")
# Plot Visuals
fig, axes = plt.subplots(1,3, sharey=True, figsize=(12,8))
# fig.suptitle('Plot')
sns.barplot(x=plotData.index, y=plotData['Recency'], ax=axes[0])
axes[0].set_title('Recency - Lower is better')
sns.barplot(x=plotData.index, y=plotData['Frequency'], ax=axes[1])
axes[1].set_title('Frequency - Higher is better')
sns.barplot(x=plotData.index, y=plotData['Monetary'], ax=axes[2])
axes[2].set_title('Monetary - Higher is better')
plt.show()
