Praxisbeispiel - Telko Kundenabgang-Vorhersage (Telco Customer Churn Prediction) bei Vertragskunden

Die Vorhersage der Kundenabwanderung (bei Vertragssituationen) sagt die Wahrscheinlichkeit voraus, dass Kunden die Produkte oder Dienstleistungen eines Unternehmens kündigen. In den meisten Fällen sind Unternehmen mit Stammkunden oder Kunden im Abonnement bestrebt, ihren Kundenstamm zu erhalten. Daher ist es wichtig, die Kunden zu verfolgen, die ihr Abonnement kündigen, und diejenigen, die den Dienst weiter nutzen. Dieser Ansatz setzt voraus, dass das Unternehmen das Verhalten seiner Kunden kennt und versteht und die Eigenschaften, die zu dem Risiko führen, dass der Kunde das Unternehmen verlässt.

Szenario

Annahme: Wir sind ein Telekommunikations-Unternehmen, das über historische Daten darüber verfügt, wie seine Kunden mit seinen Dienstleistungen interagiert haben. Das Unternehmen möchte wissen, wie hoch die Wahrscheinlichkeit ist, dass Kunden abwandern, damit es gezielte Marketingkampagnen starten kann.

Datensatz

Sage Kundenverhalten voraus, um die Kunden zu binden.

Ihr könnt alle relevanten Kundendaten analysieren und gezielte Kundenbindungsprogramme entwickeln.

https://www.kaggle.com/datasets/blastchar/telco-customer-churn

[1]:
import sys
# !{sys.executable} -m pip install kagglehub
# !{sys.executable} -m pip install @
# !{sys.executable} -m pip install missingno
[2]:
import kagglehub
[3]:
import pandas as pd
import numpy as np
import missingno as msno
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings

pd.set_option("display.max_columns", None)
warnings.filterwarnings("ignore")
[4]:
from sklearn.preprocessing import StandardScaler, LabelEncoder

from sklearn.model_selection import train_test_split

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression

from sklearn import metrics
from sklearn.metrics import roc_curve
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report

# from xgboost import XGBClassifier
# from catboost import CatBoostClassifier
[5]:
# Download data from Kaggle
path = kagglehub.dataset_download("blastchar/telco-customer-churn")

print("Path to files:", path)
Path to files: /Users/veit/.cache/kagglehub/datasets/blastchar/telco-customer-churn/versions/1
[6]:
path_file = "~/.cache/kagglehub/datasets/blastchar/telco-customer-churn/versions/1/WA_Fn-UseC_-Telco-Customer-Churn.csv"
print(path_file)
~/.cache/kagglehub/datasets/blastchar/telco-customer-churn/versions/1/WA_Fn-UseC_-Telco-Customer-Churn.csv

Datensatz

[7]:
df = pd.read_csv(path_file)
df
[7]:
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 7590-VHVEG Female 0 Yes No 1 No No phone service DSL No Yes No No No No Month-to-month Yes Electronic check 29.85 29.85 No
1 5575-GNVDE Male 0 No No 34 Yes No DSL Yes No Yes No No No One year No Mailed check 56.95 1889.5 No
2 3668-QPYBK Male 0 No No 2 Yes No DSL Yes Yes No No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
3 7795-CFOCW Male 0 No No 45 No No phone service DSL Yes No Yes Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No
4 9237-HQITU Female 0 No No 2 Yes No Fiber optic No No No No No No Month-to-month Yes Electronic check 70.70 151.65 Yes
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7038 6840-RESVB Male 0 Yes Yes 24 Yes Yes DSL Yes No Yes Yes Yes Yes One year Yes Mailed check 84.80 1990.5 No
7039 2234-XADUH Female 0 Yes Yes 72 Yes Yes Fiber optic No Yes Yes No Yes Yes One year Yes Credit card (automatic) 103.20 7362.9 No
7040 4801-JZAZL Female 0 Yes Yes 11 No No phone service DSL Yes No No No No No Month-to-month Yes Electronic check 29.60 346.45 No
7041 8361-LTMKD Male 1 Yes No 4 Yes Yes Fiber optic No No No No No No Month-to-month Yes Mailed check 74.40 306.6 Yes
7042 3186-AJIEK Male 0 No No 66 Yes No Fiber optic Yes No Yes Yes Yes Yes Two year Yes Bank transfer (automatic) 105.65 6844.5 No

7043 rows × 21 columns

Der Datensatz enthält Informationen über:

  • ob der Kunde innerhalb des letzten Monats gekündigt hat - Spalte „Churn“

  • Dienste, für die sich jeder Kunde angemeldet hat - „PhoneService“, „MultipleLines“, „InternetService“, „OnlineSecurity“, „OnlineBackup“, „DeviceProtection“, „TechSupport“ sowie „StreamingTV“ und „StreamingMovies“.

  • Informationen zum Kundenkonto - „tenure“ wie lange der Kunde bereits Kunde ist, „Contract“ (Vertrag), „PaperlessBilling“ (papierlose Abrechnung), „PaymentMethod“ (Zahlungsmethode), „MonthlyCharges“ (monatliche Gebühren) und „TotalCharges“ (Gesamtgebühren).

  • Demografische Informationen über Kunden - „gender“ (Geschlecht), „SeniorCitizen“ (Alter) und ob sie „Partner“ (Partner) und „Dependents“ (Familienangehörige) haben.

[8]:
print("shape: ", df.shape)
print()

print(df.info())
shape:  (7043, 21)

<class 'pandas.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   customerID        7043 non-null   str
 1   gender            7043 non-null   str
 2   SeniorCitizen     7043 non-null   int64
 3   Partner           7043 non-null   str
 4   Dependents        7043 non-null   str
 5   tenure            7043 non-null   int64
 6   PhoneService      7043 non-null   str
 7   MultipleLines     7043 non-null   str
 8   InternetService   7043 non-null   str
 9   OnlineSecurity    7043 non-null   str
 10  OnlineBackup      7043 non-null   str
 11  DeviceProtection  7043 non-null   str
 12  TechSupport       7043 non-null   str
 13  StreamingTV       7043 non-null   str
 14  StreamingMovies   7043 non-null   str
 15  Contract          7043 non-null   str
 16  PaperlessBilling  7043 non-null   str
 17  PaymentMethod     7043 non-null   str
 18  MonthlyCharges    7043 non-null   float64
 19  TotalCharges      7043 non-null   str
 20  Churn             7043 non-null   str
dtypes: float64(1), int64(2), str(18)
memory usage: 1.9 MB
None

ACHTUNG!!

Wir sehen, dass „TotalCharges“ hier als object auftaucht, erwarten hier aber einen numerischen Typ. Das muss geändert werden - wir zeigen euch aber erst, was passiert, wenn wir es nicht ändern.

[9]:
df.columns.values
[9]:
<ArrowStringArray>
[      'customerID',           'gender',    'SeniorCitizen',
          'Partner',       'Dependents',           'tenure',
     'PhoneService',    'MultipleLines',  'InternetService',
   'OnlineSecurity',     'OnlineBackup', 'DeviceProtection',
      'TechSupport',      'StreamingTV',  'StreamingMovies',
         'Contract', 'PaperlessBilling',    'PaymentMethod',
   'MonthlyCharges',     'TotalCharges',            'Churn']
Length: 21, dtype: str
[10]:
df.dtypes
[10]:
customerID              str
gender                  str
SeniorCitizen         int64
Partner                 str
Dependents              str
tenure                int64
PhoneService            str
MultipleLines           str
InternetService         str
OnlineSecurity          str
OnlineBackup            str
DeviceProtection        str
TechSupport             str
StreamingTV             str
StreamingMovies         str
Contract                str
PaperlessBilling        str
PaymentMethod           str
MonthlyCharges      float64
TotalCharges            str
Churn                   str
dtype: object

Ermittle fehlende Werte und visualisiere sie

[11]:
## wir kannten schon die isnull() Funktion in pandas

for col in df.columns:
    print(col, df[col].isnull().sum())
customerID 0
gender 0
SeniorCitizen 0
Partner 0
Dependents 0
tenure 0
PhoneService 0
MultipleLines 0
InternetService 0
OnlineSecurity 0
OnlineBackup 0
DeviceProtection 0
TechSupport 0
StreamingTV 0
StreamingMovies 0
Contract 0
PaperlessBilling 0
PaymentMethod 0
MonthlyCharges 0
TotalCharges 0
Churn 0
[12]:
# Zudem könnte man sich die fehlende Werte auch in einer Matrix anzeigen lassen
# (nur haben wir hier keine fehlenden Werte)

msno.matrix(df);
../_images/1intro_customer-churn_18_0.png

Daten Manipulation / Bereinigung

[13]:
df["TotalCharges"] = pd.to_numeric(df.TotalCharges, errors="coerce")
# print(df.dtypes)
print(df.info())
<class 'pandas.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   customerID        7043 non-null   str
 1   gender            7043 non-null   str
 2   SeniorCitizen     7043 non-null   int64
 3   Partner           7043 non-null   str
 4   Dependents        7043 non-null   str
 5   tenure            7043 non-null   int64
 6   PhoneService      7043 non-null   str
 7   MultipleLines     7043 non-null   str
 8   InternetService   7043 non-null   str
 9   OnlineSecurity    7043 non-null   str
 10  OnlineBackup      7043 non-null   str
 11  DeviceProtection  7043 non-null   str
 12  TechSupport       7043 non-null   str
 13  StreamingTV       7043 non-null   str
 14  StreamingMovies   7043 non-null   str
 15  Contract          7043 non-null   str
 16  PaperlessBilling  7043 non-null   str
 17  PaymentMethod     7043 non-null   str
 18  MonthlyCharges    7043 non-null   float64
 19  TotalCharges      7032 non-null   float64
 20  Churn             7043 non-null   str
dtypes: float64(2), int64(2), str(17)
memory usage: 1.8 MB
None
[14]:
print(df.isnull().sum())
customerID           0
gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
PhoneService         0
MultipleLines        0
InternetService      0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
PaperlessBilling     0
PaymentMethod        0
MonthlyCharges       0
TotalCharges        11
Churn                0
dtype: int64
[15]:
# Visualisierung

msno.matrix(df);
../_images/1intro_customer-churn_22_0.png
[16]:
df[np.isnan(df["TotalCharges"])]
[16]:
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
488 4472-LVYGI Female 0 Yes Yes 0 No No phone service DSL Yes No Yes Yes Yes No Two year Yes Bank transfer (automatic) 52.55 NaN No
753 3115-CZMZD Male 0 No Yes 0 Yes No No No internet service No internet service No internet service No internet service No internet service No internet service Two year No Mailed check 20.25 NaN No
936 5709-LVOEQ Female 0 Yes Yes 0 Yes No DSL Yes Yes Yes No Yes Yes Two year No Mailed check 80.85 NaN No
1082 4367-NUYAO Male 0 Yes Yes 0 Yes Yes No No internet service No internet service No internet service No internet service No internet service No internet service Two year No Mailed check 25.75 NaN No
1340 1371-DWPAZ Female 0 Yes Yes 0 No No phone service DSL Yes Yes Yes Yes Yes No Two year No Credit card (automatic) 56.05 NaN No
3331 7644-OMVMY Male 0 Yes Yes 0 Yes No No No internet service No internet service No internet service No internet service No internet service No internet service Two year No Mailed check 19.85 NaN No
3826 3213-VVOLG Male 0 Yes Yes 0 Yes Yes No No internet service No internet service No internet service No internet service No internet service No internet service Two year No Mailed check 25.35 NaN No
4380 2520-SGTTA Female 0 Yes Yes 0 Yes No No No internet service No internet service No internet service No internet service No internet service No internet service Two year No Mailed check 20.00 NaN No
5218 2923-ARZLG Male 0 Yes Yes 0 Yes No No No internet service No internet service No internet service No internet service No internet service No internet service One year Yes Mailed check 19.70 NaN No
6670 4075-WKNIU Female 0 Yes Yes 0 Yes Yes DSL No Yes Yes Yes Yes No Two year No Mailed check 73.35 NaN No
6754 2775-SEFEE Male 0 No Yes 0 Yes Yes DSL Yes Yes No Yes No No Two year Yes Bank transfer (automatic) 61.90 NaN No

Wir beobachten, dass die Spalte „tenure“ 0 ist bei diesen Einträgen, wo „TotalCharges“ NaN ist, obwohl die Spalte MonthlyCharges nicht leer ist.

Lass uns die Spalte „tenure“ näher anschauen und checken, ob es noch andere 0-Werte in diese Spalte gibt.

[17]:
df[df["tenure"] == 0].index
[17]:
Index([488, 753, 936, 1082, 1340, 3331, 3826, 4380, 5218, 6670, 6754], dtype='int64')

Feststellung und Entscheidung

Der Vergleich zeigt, dass es keine zusätzlichen fehlenden Werte in der „tenure“-Spalte gibt.

Wir löschen die Zeilen mit den fehlenden Werten in „tenure“ und „TotalCharges“, da es nur 11 von 7043 Zeilen sind und wir nicht von gravierenden Konsequenzen durch das Löschen dieser Zeilen ausgehen.

Wenn es nur die fehlenden Werte in „TotalCharges“ wäre, könnte man auch versuchen, die Werte darin durch Extrapolation aufzufüllen mit z.B. mean.

[18]:
# df.drop(df['tenure'] == 0, axis=0, inplace=False)

df_new = df[df["tenure"] != 0]

df_new[df_new['tenure'] == 0].index
[18]:
Index([], dtype='int64')
[19]:
df_new2 = df.drop(labels=df[df['tenure'] == 0].index, axis=0, inplace=False)
df_new2[df_new2['tenure'] == 0].index
[19]:
Index([], dtype='int64')
[20]:
df_new2
[20]:
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 7590-VHVEG Female 0 Yes No 1 No No phone service DSL No Yes No No No No Month-to-month Yes Electronic check 29.85 29.85 No
1 5575-GNVDE Male 0 No No 34 Yes No DSL Yes No Yes No No No One year No Mailed check 56.95 1889.50 No
2 3668-QPYBK Male 0 No No 2 Yes No DSL Yes Yes No No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
3 7795-CFOCW Male 0 No No 45 No No phone service DSL Yes No Yes Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No
4 9237-HQITU Female 0 No No 2 Yes No Fiber optic No No No No No No Month-to-month Yes Electronic check 70.70 151.65 Yes
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7038 6840-RESVB Male 0 Yes Yes 24 Yes Yes DSL Yes No Yes Yes Yes Yes One year Yes Mailed check 84.80 1990.50 No
7039 2234-XADUH Female 0 Yes Yes 72 Yes Yes Fiber optic No Yes Yes No Yes Yes One year Yes Credit card (automatic) 103.20 7362.90 No
7040 4801-JZAZL Female 0 Yes Yes 11 No No phone service DSL Yes No No No No No Month-to-month Yes Electronic check 29.60 346.45 No
7041 8361-LTMKD Male 1 Yes No 4 Yes Yes Fiber optic No No No No No No Month-to-month Yes Mailed check 74.40 306.60 Yes
7042 3186-AJIEK Male 0 No No 66 Yes No Fiber optic Yes No Yes Yes Yes Yes Two year Yes Bank transfer (automatic) 105.65 6844.50 No

7032 rows × 21 columns

[21]:
df[df["tenure"]==0].index
[21]:
Index([488, 753, 936, 1082, 1340, 3331, 3826, 4380, 5218, 6670, 6754], dtype='int64')
[22]:
print(df.shape)
df.drop(labels=df[df["tenure"] == 0].index, axis=0, inplace=True)
df
(7043, 21)
[22]:
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 7590-VHVEG Female 0 Yes No 1 No No phone service DSL No Yes No No No No Month-to-month Yes Electronic check 29.85 29.85 No
1 5575-GNVDE Male 0 No No 34 Yes No DSL Yes No Yes No No No One year No Mailed check 56.95 1889.50 No
2 3668-QPYBK Male 0 No No 2 Yes No DSL Yes Yes No No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
3 7795-CFOCW Male 0 No No 45 No No phone service DSL Yes No Yes Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No
4 9237-HQITU Female 0 No No 2 Yes No Fiber optic No No No No No No Month-to-month Yes Electronic check 70.70 151.65 Yes
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7038 6840-RESVB Male 0 Yes Yes 24 Yes Yes DSL Yes No Yes Yes Yes Yes One year Yes Mailed check 84.80 1990.50 No
7039 2234-XADUH Female 0 Yes Yes 72 Yes Yes Fiber optic No Yes Yes No Yes Yes One year Yes Credit card (automatic) 103.20 7362.90 No
7040 4801-JZAZL Female 0 Yes Yes 11 No No phone service DSL Yes No No No No No Month-to-month Yes Electronic check 29.60 346.45 No
7041 8361-LTMKD Male 1 Yes No 4 Yes Yes Fiber optic No No No No No No Month-to-month Yes Mailed check 74.40 306.60 Yes
7042 3186-AJIEK Male 0 No No 66 Yes No Fiber optic Yes No Yes Yes Yes Yes Two year Yes Bank transfer (automatic) 105.65 6844.50 No

7032 rows × 21 columns

[23]:
df.reset_index()
df
[23]:
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 7590-VHVEG Female 0 Yes No 1 No No phone service DSL No Yes No No No No Month-to-month Yes Electronic check 29.85 29.85 No
1 5575-GNVDE Male 0 No No 34 Yes No DSL Yes No Yes No No No One year No Mailed check 56.95 1889.50 No
2 3668-QPYBK Male 0 No No 2 Yes No DSL Yes Yes No No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
3 7795-CFOCW Male 0 No No 45 No No phone service DSL Yes No Yes Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No
4 9237-HQITU Female 0 No No 2 Yes No Fiber optic No No No No No No Month-to-month Yes Electronic check 70.70 151.65 Yes
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7038 6840-RESVB Male 0 Yes Yes 24 Yes Yes DSL Yes No Yes Yes Yes Yes One year Yes Mailed check 84.80 1990.50 No
7039 2234-XADUH Female 0 Yes Yes 72 Yes Yes Fiber optic No Yes Yes No Yes Yes One year Yes Credit card (automatic) 103.20 7362.90 No
7040 4801-JZAZL Female 0 Yes Yes 11 No No phone service DSL Yes No No No No No Month-to-month Yes Electronic check 29.60 346.45 No
7041 8361-LTMKD Male 1 Yes No 4 Yes Yes Fiber optic No No No No No No Month-to-month Yes Mailed check 74.40 306.60 Yes
7042 3186-AJIEK Male 0 No No 66 Yes No Fiber optic Yes No Yes Yes Yes Yes Two year Yes Bank transfer (automatic) 105.65 6844.50 No

7032 rows × 21 columns

[24]:
df.reset_index(inplace=True)
df
[24]:
index customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 0 7590-VHVEG Female 0 Yes No 1 No No phone service DSL No Yes No No No No Month-to-month Yes Electronic check 29.85 29.85 No
1 1 5575-GNVDE Male 0 No No 34 Yes No DSL Yes No Yes No No No One year No Mailed check 56.95 1889.50 No
2 2 3668-QPYBK Male 0 No No 2 Yes No DSL Yes Yes No No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
3 3 7795-CFOCW Male 0 No No 45 No No phone service DSL Yes No Yes Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No
4 4 9237-HQITU Female 0 No No 2 Yes No Fiber optic No No No No No No Month-to-month Yes Electronic check 70.70 151.65 Yes
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7027 7038 6840-RESVB Male 0 Yes Yes 24 Yes Yes DSL Yes No Yes Yes Yes Yes One year Yes Mailed check 84.80 1990.50 No
7028 7039 2234-XADUH Female 0 Yes Yes 72 Yes Yes Fiber optic No Yes Yes No Yes Yes One year Yes Credit card (automatic) 103.20 7362.90 No
7029 7040 4801-JZAZL Female 0 Yes Yes 11 No No phone service DSL Yes No No No No No Month-to-month Yes Electronic check 29.60 346.45 No
7030 7041 8361-LTMKD Male 1 Yes No 4 Yes Yes Fiber optic No No No No No No Month-to-month Yes Mailed check 74.40 306.60 Yes
7031 7042 3186-AJIEK Male 0 No No 66 Yes No Fiber optic Yes No Yes Yes Yes Yes Two year Yes Bank transfer (automatic) 105.65 6844.50 No

7032 rows × 22 columns

[25]:
if "index" in df:
    df.drop("index", axis=1, inplace=True)
df
[25]:
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 7590-VHVEG Female 0 Yes No 1 No No phone service DSL No Yes No No No No Month-to-month Yes Electronic check 29.85 29.85 No
1 5575-GNVDE Male 0 No No 34 Yes No DSL Yes No Yes No No No One year No Mailed check 56.95 1889.50 No
2 3668-QPYBK Male 0 No No 2 Yes No DSL Yes Yes No No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
3 7795-CFOCW Male 0 No No 45 No No phone service DSL Yes No Yes Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No
4 9237-HQITU Female 0 No No 2 Yes No Fiber optic No No No No No No Month-to-month Yes Electronic check 70.70 151.65 Yes
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7027 6840-RESVB Male 0 Yes Yes 24 Yes Yes DSL Yes No Yes Yes Yes Yes One year Yes Mailed check 84.80 1990.50 No
7028 2234-XADUH Female 0 Yes Yes 72 Yes Yes Fiber optic No Yes Yes No Yes Yes One year Yes Credit card (automatic) 103.20 7362.90 No
7029 4801-JZAZL Female 0 Yes Yes 11 No No phone service DSL Yes No No No No No Month-to-month Yes Electronic check 29.60 346.45 No
7030 8361-LTMKD Male 1 Yes No 4 Yes Yes Fiber optic No No No No No No Month-to-month Yes Mailed check 74.40 306.60 Yes
7031 3186-AJIEK Male 0 No No 66 Yes No Fiber optic Yes No Yes Yes Yes Yes Two year Yes Bank transfer (automatic) 105.65 6844.50 No

7032 rows × 21 columns

[26]:
df[df["TotalCharges"].isnull()]
[26]:
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
[27]:
df[~df["TotalCharges"].isnull()]
[27]:
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 7590-VHVEG Female 0 Yes No 1 No No phone service DSL No Yes No No No No Month-to-month Yes Electronic check 29.85 29.85 No
1 5575-GNVDE Male 0 No No 34 Yes No DSL Yes No Yes No No No One year No Mailed check 56.95 1889.50 No
2 3668-QPYBK Male 0 No No 2 Yes No DSL Yes Yes No No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
3 7795-CFOCW Male 0 No No 45 No No phone service DSL Yes No Yes Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No
4 9237-HQITU Female 0 No No 2 Yes No Fiber optic No No No No No No Month-to-month Yes Electronic check 70.70 151.65 Yes
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7027 6840-RESVB Male 0 Yes Yes 24 Yes Yes DSL Yes No Yes Yes Yes Yes One year Yes Mailed check 84.80 1990.50 No
7028 2234-XADUH Female 0 Yes Yes 72 Yes Yes Fiber optic No Yes Yes No Yes Yes One year Yes Credit card (automatic) 103.20 7362.90 No
7029 4801-JZAZL Female 0 Yes Yes 11 No No phone service DSL Yes No No No No No Month-to-month Yes Electronic check 29.60 346.45 No
7030 8361-LTMKD Male 1 Yes No 4 Yes Yes Fiber optic No No No No No No Month-to-month Yes Mailed check 74.40 306.60 Yes
7031 3186-AJIEK Male 0 No No 66 Yes No Fiber optic Yes No Yes Yes Yes Yes Two year Yes Bank transfer (automatic) 105.65 6844.50 No

7032 rows × 21 columns

[28]:
df.isnull().sum()
[28]:
customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64

EDA - Explorative Data Analysis (Daten Exploration)

[29]:
df["InternetService"].describe()
[29]:
count            7032
unique              3
top       Fiber optic
freq             3096
Name: InternetService, dtype: object
[30]:
df["InternetService"].unique()
[30]:
<ArrowStringArray>
['DSL', 'Fiber optic', 'No']
Length: 3, dtype: str
[31]:
df["InternetService"].value_counts()
[31]:
InternetService
Fiber optic    3096
DSL            2416
No             1520
Name: count, dtype: int64
[32]:
numerical_cols = ["tenure", "MonthlyCharges", "TotalCharges"]
df[numerical_cols].describe()
[32]:
tenure MonthlyCharges TotalCharges
count 7032.000000 7032.000000 7032.000000
mean 32.421786 64.798208 2283.300441
std 24.545260 30.085974 2266.771362
min 1.000000 18.250000 18.800000
25% 9.000000 35.587500 401.450000
50% 29.000000 70.350000 1397.475000
75% 55.000000 89.862500 3794.737500
max 72.000000 118.750000 8684.800000
[33]:
df["gender"].value_counts()
[33]:
gender
Male      3549
Female    3483
Name: count, dtype: int64
[34]:
df["Churn"].value_counts()
[34]:
Churn
No     5163
Yes    1869
Name: count, dtype: int64
[35]:
# Anzahl der gechurnten Kunden pro Geschlecht ermitteln
print("Churned Customers")
print(df["Churn"][df["Churn"]=="Yes"].groupby(by=df["gender"]).count())
print()
print("Not-yet-churned Customers")
df["Churn"][df["Churn"]=="No"].groupby(by=df["gender"]).count()
Churned Customers
gender
Female    939
Male      930
Name: Churn, dtype: int64

Not-yet-churned Customers
[35]:
gender
Female    2544
Male      2619
Name: Churn, dtype: int64
[36]:
## Visualizierung der Churn Verteilung vs. Geschlecht

plt.figure(figsize=(6, 6))
labels =["Churn: Yes","Churn:No"]
values = [1869,5163]
labels_gender = ["F","M","F","M"]
sizes_gender = [939,930 , 2544,2619]
colors = ["#ff6666", "#66b3ff"]
colors_gender = ["#c2c2f0", "#ffb3e6", "#c2c2f0", "#ffb3e6"]
explode = (0.3,0.3)
explode_gender = (0.1,0.1,0.1,0.1)
textprops = {"fontsize":15}
#Plot
plt.pie(values, labels=labels,autopct="%1.1f%%",pctdistance=1.08, labeldistance=0.8,colors=colors, startangle=90,frame=True, explode=explode,radius=10, textprops =textprops, counterclock = True, )
plt.pie(sizes_gender,labels=labels_gender,colors=colors_gender,startangle=90, explode=explode_gender,radius=7, textprops =textprops, counterclock = True, )
#Draw circle
centre_circle = plt.Circle((0,0),5,color="black", fc="white",linewidth=0)
fig = plt.gcf()
fig.gca().add_artist(centre_circle)

plt.title("Churn Distribution w.r.t Gender: Male(M), Female(F)", fontsize=15, y=1.1)

# show plot

plt.axis("equal")
plt.tight_layout()
plt.show()
../_images/1intro_customer-churn_46_0.png
[37]:
## Visualisierung Verteilung: Churn vs. Kundenvertrag

fig = px.histogram(df, x="Churn", color="Contract", barmode="group", title="<b>Verteilung: Churn vs. Kundenvertrag<b>")
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()
../_images/1intro_customer-churn_47_0.png
[38]:
# Verteilung: Churn vs. Kunden Zahlungsmethod distribution w.r.t. Churn

fig = px.histogram(df, x="Churn", color="PaymentMethod", title="<b>Verteilung: Churn vs. Kunden Zahlungsmethod distribution w.r.t. Churn</b>")
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()
../_images/1intro_customer-churn_48_0.png

Wir beobachten folgendes:

  • Die meisten Kunden, die abgewandert sind, hatten einen elektronischen Scheck als Zahlungsmethode.

  • Kunden, die sich für die automatische Überweisung per Kreditkarte oder Banküberweisung und den Versand von Schecks als Zahlungsmethode entschieden, sind seltener abgewandert.

[39]:
df["InternetService"].unique()
[39]:
<ArrowStringArray>
['DSL', 'Fiber optic', 'No']
Length: 3, dtype: str
[40]:
df[df["gender"]=="Male"][["InternetService", "Churn"]].value_counts()
[40]:
InternetService  Churn
DSL              No       992
Fiber optic      No       910
No               No       717
Fiber optic      Yes      633
DSL              Yes      240
No               Yes       57
Name: count, dtype: int64
[41]:
df[df["gender"]=="Female"][["InternetService", "Churn"]].value_counts()
[41]:
InternetService  Churn
DSL              No       965
Fiber optic      No       889
No               No       690
Fiber optic      Yes      664
DSL              Yes      219
No               Yes       56
Name: count, dtype: int64
[42]:
fig = go.Figure()

fig.add_trace(go.Bar(
  x = [["Churn:No", "Churn:No", "Churn:Yes", "Churn:Yes"],
       ["Female", "Male", "Female", "Male"]],
  y = [965, 992, 219, 240],
  name = 'DSL',
))

fig.add_trace(go.Bar(
  x = [["Churn:No", "Churn:No", "Churn:Yes", "Churn:Yes"],
       ["Female", "Male", "Female", "Male"]],
  y = [889, 910, 664, 633],
  name = "Fiber optic",
))

fig.add_trace(go.Bar(
  x = [["Churn:No", "Churn:No", "Churn:Yes", "Churn:Yes"],
       ["Female", "Male", "Female", "Male"]],
  y = [690, 717, 56, 57],
  name = "No Internet",
))

fig.update_layout(title_text="<b>Verteilung: Churn vs. InternetService und Geschlecht</b>")

fig.show()
../_images/1intro_customer-churn_53_0.png

Viele Kunden hatten sich für den Glasfaserdienst (Fiber optic) entschieden. Diese scheinen eine hohe Abwanderungsrate zu haben, was auf eine Unzufriedenheit mit dieser Art von Internetdienst hindeuten könnte.

Kunden, die einen DSL-Dienst nutzen, sind in der Mehrzahl und haben eine geringere Abwanderungsrate als Glasfaserkunden.

[43]:
color_map = {"Yes": "#FF97FF", "No": "#AB63FA"}
fig = px.histogram(df, x="Churn", color="Dependents", barmode="group", title="<b>Verteilung: Churn vs. Familienangehörige</b>", color_discrete_map=color_map)
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()
../_images/1intro_customer-churn_55_0.png

Es scheint, dass die Kunden ohne Familienangehörige („dependents“) wahrscheinlicher abwandern.

[44]:
color_map = {"Yes": '#FFA15A', "No": '#00CC96'}
fig = px.histogram(df, x="Churn", color="Partner", barmode="group", title="<b>Verteilung: Churn vs. Partners</b>", color_discrete_map=color_map)
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()
../_images/1intro_customer-churn_57_0.png

Kunden OHNE Partner scheinen eher dazu zu tendieren abzuwandern.

[45]:
color_map = {"Yes": "#00CC96", "No": "#B6E880"}
fig = px.histogram(df, x="Churn", color="SeniorCitizen", title="<b>Verteilung: Churn vs. Senior Citizen</b>", color_discrete_map=color_map)
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()
../_images/1intro_customer-churn_59_0.png
  • Wir stellen fest, dass generell der Anteil an „seniorigen“ Kunden recht niedrig ist.

  • Dazu kann man sagen, dass ein die seniorigen Kunden eher dazu tendieren abzuwandern.

[46]:
color_map = {"Yes": "#FF97FF", "No": "#AB63FA"}
fig = px.histogram(df, x="Churn", color="OnlineSecurity", barmode="group", title="<b>Verteilung: Churn vs. Online Security</b>", color_discrete_map=color_map)
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()
../_images/1intro_customer-churn_61_0.png

Kunden OHNE Online Security tendieren stark dazu abzuwandern.

[47]:
color_map = {"Yes": "#FFA15A", "No": "#00CC96"}
fig = px.histogram(df, x="Churn", color="PaperlessBilling",  title="<b>Verteilung: Churn vs. Papierlose Rechnung</b>", color_discrete_map=color_map)
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()
../_images/1intro_customer-churn_63_0.png

Kunden mit Papierlosen Zahlen scheinen dazu zu tendieren, abzuwandern.

[48]:
fig = px.histogram(df, x="Churn", color="TechSupport",barmode="group",  title="<b>Verteilung: Churn vs. TechSupport</b>")
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()
../_images/1intro_customer-churn_65_0.png

Kunden OHNE TechSupport scheinen stark dazu zu tendieren abzuwandern

[49]:
sns.set_context("paper",font_scale=1.1)
ax = sns.kdeplot(df.MonthlyCharges[(df["Churn"] == "No") ],
                color="Red", shade = True);
ax = sns.kdeplot(df.MonthlyCharges[(df["Churn"] == "Yes") ],
                ax =ax, color="Blue", shade= True);
ax.legend(["Not Churn","Churn"],loc='upper right');
ax.set_ylabel("Density");
ax.set_xlabel("Monthly Charges");
ax.set_title("Verteilung: Churn vs. Monthly Charges");
../_images/1intro_customer-churn_67_0.png

Kunden MIT höheren monatlichen Kosten tendieren eher dazu abzuwandern

[50]:
ax = sns.kdeplot(df.TotalCharges[(df["Churn"] == "No") ],
                color="Gold", shade = True);
ax = sns.kdeplot(df.TotalCharges[(df["Churn"] == "Yes") ],
                ax =ax, color="Green", shade= True);
ax.legend(["Not Chu0rn","Churn"],loc="upper right");
ax.set_ylabel("Density");
ax.set_xlabel("Total Charges");
ax.set_title("Verteilung: Churn vs. Total Charges");
../_images/1intro_customer-churn_69_0.png

Interessanterweise scheint der höhere Total Charge keine direkte Auswirkung auf erhöhte Abwanderungwahrscheinlichkeit zu haben.

[51]:
fig = px.box(df, x="Churn", y = "tenure")

# Update yaxis properties
fig.update_yaxes(title_text="Tenure (Months)", row=1, col=1)
# Update xaxis properties
fig.update_xaxes(title_text="Churn", row=1, col=1)

# Update size and title
fig.update_layout(autosize=True, width=750, height=600,
    title_font=dict(size=25, family="Courier"),
    title="<b>Tenure vs Churn</b>",
)

fig.show()
../_images/1intro_customer-churn_71_0.png

NEUE Kunden scheinen tendenziell eher abzuwandern.

[52]:
# Plotte den Correlation Heatmap (pandas nutzt Pearson Correlation by default)

plt.figure(figsize=(25, 10))

corr = df.apply(lambda x: pd.factorize(x)[0]).corr()

mask = np.triu(np.ones_like(corr, dtype=bool))

ax = sns.heatmap(corr, mask=mask, xticklabels=corr.columns, yticklabels=corr.columns, annot=True, linewidths=.2, cmap="coolwarm", vmin=-1, vmax=1)
../_images/1intro_customer-churn_73_0.png
[53]:
df.OnlineSecurity.value_counts()
[53]:
OnlineSecurity
No                     3497
Yes                    2015
No internet service    1520
Name: count, dtype: int64

Data Preprocessing

Zielvariable CHURN

Da es sich um ein binäres Klassifikationsproblem hanaelt, sollte die Target-Variable eine binäre Variable sein (numerisch ausgedrückt). Da sie aktuell noch nur String ist, wird nun die Churn-Spalte als Erstes in binary umgewandelt.

[54]:
df["Churn"] = df["Churn"].map({"Yes": 1, "No": 0})

df.head()
[54]:
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 7590-VHVEG Female 0 Yes No 1 No No phone service DSL No Yes No No No No Month-to-month Yes Electronic check 29.85 29.85 0
1 5575-GNVDE Male 0 No No 34 Yes No DSL Yes No Yes No No No One year No Mailed check 56.95 1889.50 0
2 3668-QPYBK Male 0 No No 2 Yes No DSL Yes Yes No No No No Month-to-month Yes Mailed check 53.85 108.15 1
3 7795-CFOCW Male 0 No No 45 No No phone service DSL Yes No Yes Yes No No One year No Bank transfer (automatic) 42.30 1840.75 0
4 9237-HQITU Female 0 No No 2 Yes No Fiber optic No No No No No No Month-to-month Yes Electronic check 70.70 151.65 1

Wir entfernen die customerID-Spalte.

[55]:
## Verwerfe ID-columns
if "customerID" in df:
    df.drop("customerID", axis=1, inplace=True)

df
[55]:
gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 Female 0 Yes No 1 No No phone service DSL No Yes No No No No Month-to-month Yes Electronic check 29.85 29.85 0
1 Male 0 No No 34 Yes No DSL Yes No Yes No No No One year No Mailed check 56.95 1889.50 0
2 Male 0 No No 2 Yes No DSL Yes Yes No No No No Month-to-month Yes Mailed check 53.85 108.15 1
3 Male 0 No No 45 No No phone service DSL Yes No Yes Yes No No One year No Bank transfer (automatic) 42.30 1840.75 0
4 Female 0 No No 2 Yes No Fiber optic No No No No No No Month-to-month Yes Electronic check 70.70 151.65 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7027 Male 0 Yes Yes 24 Yes Yes DSL Yes No Yes Yes Yes Yes One year Yes Mailed check 84.80 1990.50 0
7028 Female 0 Yes Yes 72 Yes Yes Fiber optic No Yes Yes No Yes Yes One year Yes Credit card (automatic) 103.20 7362.90 0
7029 Female 0 Yes Yes 11 No No phone service DSL Yes No No No No No Month-to-month Yes Electronic check 29.60 346.45 0
7030 Male 1 Yes No 4 Yes Yes Fiber optic No No No No No No Month-to-month Yes Mailed check 74.40 306.60 1
7031 Male 0 No No 66 Yes No Fiber optic Yes No Yes Yes Yes Yes Two year Yes Bank transfer (automatic) 105.65 6844.50 0

7032 rows × 20 columns

Nun haben wir uns schon sehr vieles angeschaut und viel getan …

Muessten wir noch etwas tun, bevor wir die Daten splitten? Können wir den Datensatz nun direkt dem Modell zuführen?

Kategorische Spalten umwandeln

Um die kategorischen Spalten richtig in numerische Spalten umzuwandeln, lass uns sie genauer anschauen und identifizieren, was sich hinter den Strings wirklich verbirgt.

[56]:
df
[56]:
gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 Female 0 Yes No 1 No No phone service DSL No Yes No No No No Month-to-month Yes Electronic check 29.85 29.85 0
1 Male 0 No No 34 Yes No DSL Yes No Yes No No No One year No Mailed check 56.95 1889.50 0
2 Male 0 No No 2 Yes No DSL Yes Yes No No No No Month-to-month Yes Mailed check 53.85 108.15 1
3 Male 0 No No 45 No No phone service DSL Yes No Yes Yes No No One year No Bank transfer (automatic) 42.30 1840.75 0
4 Female 0 No No 2 Yes No Fiber optic No No No No No No Month-to-month Yes Electronic check 70.70 151.65 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7027 Male 0 Yes Yes 24 Yes Yes DSL Yes No Yes Yes Yes Yes One year Yes Mailed check 84.80 1990.50 0
7028 Female 0 Yes Yes 72 Yes Yes Fiber optic No Yes Yes No Yes Yes One year Yes Credit card (automatic) 103.20 7362.90 0
7029 Female 0 Yes Yes 11 No No phone service DSL Yes No No No No No Month-to-month Yes Electronic check 29.60 346.45 0
7030 Male 1 Yes No 4 Yes Yes Fiber optic No No No No No No Month-to-month Yes Mailed check 74.40 306.60 1
7031 Male 0 No No 66 Yes No Fiber optic Yes No Yes Yes Yes Yes Two year Yes Bank transfer (automatic) 105.65 6844.50 0

7032 rows × 20 columns

DatenAnalyse:

Ist die Spalte:

  • binary?

  • boolean?

  • categorical?

  • numeric?

Datentypen und ihre Spalten:

  • binary: gender;

  • boolean: SeniorCitizen(*), Partner, Dependents, PhoneService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies, PaperlessBilling

  • categorical: MultipleLines, InternetService, Contract, PaymentMethod

  • numeric: tenure, MonthlyCharges, TotalCharges

(*)SeniorCitizen: In diesem Datensatz ist die Spalte bereits durch 0 und 1 ausgedrückt, d.h. rein rechnerisch bräuchte sie nicht nochmal extra als boolean aufgenommen werden. Allerdings sorgt es für mehr Robustheit und Stabilität in der Pipeline und somit Best Practice, doch alle Datentypen entsprechend zu identifizieren.

Recap:

  • OrdinalEncoder: gut für kategorische Werte, die eine innere Ordnung bzw Rang haben, z.B. „niedrig“, „medium“, „hoch“ in Umfrageantworte

  • OneHotEncoder: gut für kategorische Features ohne bestimmte innere Ordnung in einem „one-hot numerischen Array“.

    Achtung: OneHotEncoder kann die Dimension (Anzahl Spalten) der Feature-Matrix drastisch erhöhen und „spärliche Matrizen“ (scarce matrices) mit lauter „0“ in der Spalte kreieren, wenn die Originalspalte viele verschieden Werte hat (hohe Kardinalität), was wiederum schlechte Datenqualität bedeuten würde. Alternativen: Es gibt auch weitere Alternativen zum One-Hot-Encoding, z.B. kann man auf Frequency (Count) Encoding ausweichen:

    • hier werden Kategorien basierend auf der Frequency (Häufigkeit, in denen sie auftreten) im Datensatz. Das kann helfen, die Wichtigkeit dieser Kategorie im Datensatz direkt zu erfassen und mitzugeben. Hat aber auch Pitfalls, z.B. wenn das ein biased Datensatz ist.

Achtung!!

Unterteile die Columns in 3 Kategorien: Standardisierung, Ordinal-Encoding and One-Hot-Encoding

binary (direkte Umwandlung zu 0-1): gender

boolean (direkte Umwandlung zu 0-1): SeniorCitizen, Partner, Dependents, PhoneService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies, PaperlessBilling

categorical (One-Hot oder Frequency Encoding etc): MultipleLines, InternetService, Contract, PaymentMethod

numeric: tenure, MonthlyCharges, TotalCharges

[57]:
cols_binary = ["gender"]
cols_boolean = ["SeniorCitizen", "Partner", "Dependents", "PhoneService", "OnlineSecurity", "OnlineBackup",
               "DeviceProtection", "TechSupport", "StreamingTV", "StreamingMovies", "PaperlessBilling"]
cols_cat = ["MultipleLines", "InternetService", "Contract", "PaymentMethod"]
cols_numeric = ["tenure", "MonthlyCharges", "TotalCharges"]
[58]:
df[cols_boolean]
[58]:
SeniorCitizen Partner Dependents PhoneService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies PaperlessBilling
0 0 Yes No No No Yes No No No No Yes
1 0 No No Yes Yes No Yes No No No No
2 0 No No Yes Yes Yes No No No No Yes
3 0 No No No Yes No Yes Yes No No No
4 0 No No Yes No No No No No No Yes
... ... ... ... ... ... ... ... ... ... ... ...
7027 0 Yes Yes Yes Yes No Yes Yes Yes Yes Yes
7028 0 Yes Yes Yes No Yes Yes No Yes Yes Yes
7029 0 Yes Yes No Yes No No No No No Yes
7030 1 Yes No Yes No No No No No No Yes
7031 0 No No Yes Yes No Yes Yes Yes Yes Yes

7032 rows × 11 columns

[59]:
df[cols_cat]
[59]:
MultipleLines InternetService Contract PaymentMethod
0 No phone service DSL Month-to-month Electronic check
1 No DSL One year Mailed check
2 No DSL Month-to-month Mailed check
3 No phone service DSL One year Bank transfer (automatic)
4 No Fiber optic Month-to-month Electronic check
... ... ... ... ...
7027 Yes DSL One year Mailed check
7028 Yes Fiber optic One year Credit card (automatic)
7029 No phone service DSL Month-to-month Electronic check
7030 Yes Fiber optic Month-to-month Mailed check
7031 No Fiber optic Two year Bank transfer (automatic)

7032 rows × 4 columns

[60]:
# Mapping von "Female" zu 1 und "Male" zu 0

df["gender"] = df["gender"].map({"Female": 1, "Male": 0})
df["gender"]
[60]:
0       1
1       0
2       0
3       0
4       1
       ..
7027    0
7028    1
7029    1
7030    0
7031    0
Name: gender, Length: 7032, dtype: int64
[61]:
cols_boolean
[61]:
['SeniorCitizen',
 'Partner',
 'Dependents',
 'PhoneService',
 'OnlineSecurity',
 'OnlineBackup',
 'DeviceProtection',
 'TechSupport',
 'StreamingTV',
 'StreamingMovies',
 'PaperlessBilling']

One-Hot-Encoding Spalten mit Werten auffüllen.

Von den original-kartegorischen Spaltenwerten zu den neuen Werten in den neuen One-Hot-Encoded Spalten:

Wir erinnern uns: Ziel liegt darin, die die Spalten mit dem Textwert ‚yes‘ / ‚no‘ in numerische Binärwerte umzuwandeln

  • eq("yes"):

    • vergleicht elementweise: Ist der Wert in der Spalte gleich "yes"?

    • Ergebnis ist eine Boolean-Serie: True für "yes", False für "no".

    • falls Spalte bereits als boolean vorliegt, bräuchte man das ggf nicht, da direkt multipliziert werden kann.

  • mul(1)

    • elementweise Multiplikation und mappt Booleans zu Zahlen (True wird intern als 1 interpretiert und False als 0), daher wird True * 1 => 1 und False * 1 => 0. Somit wird aus ‚yes/no‘ effektiv 1/0.

=> d.h. man könnte auch df[col].map({"yes":1,"no":0}) nutzen anstatt df[col].eq("yes").mul(1).

[62]:
for col in cols_boolean:
    df[col] = df[col].eq("yes").mul(1)

df[cols_boolean]
[62]:
SeniorCitizen Partner Dependents PhoneService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies PaperlessBilling
0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ...
7027 0 0 0 0 0 0 0 0 0 0 0
7028 0 0 0 0 0 0 0 0 0 0 0
7029 0 0 0 0 0 0 0 0 0 0 0
7030 0 0 0 0 0 0 0 0 0 0 0
7031 0 0 0 0 0 0 0 0 0 0 0

7032 rows × 11 columns

[63]:
# One-Hot-Encoding für die drei kategorischen Spalten

df_onehot_encoded = pd.get_dummies(df, columns=cols_cat) # dtype=int

print(df_onehot_encoded.shape)
df_onehot_encoded
(7032, 29)
[63]:
gender SeniorCitizen Partner Dependents tenure PhoneService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies PaperlessBilling MonthlyCharges TotalCharges Churn MultipleLines_No MultipleLines_No phone service MultipleLines_Yes InternetService_DSL InternetService_Fiber optic InternetService_No Contract_Month-to-month Contract_One year Contract_Two year PaymentMethod_Bank transfer (automatic) PaymentMethod_Credit card (automatic) PaymentMethod_Electronic check PaymentMethod_Mailed check
0 1 0 0 0 1 0 0 0 0 0 0 0 0 29.85 29.85 0 False True False True False False True False False False False True False
1 0 0 0 0 34 0 0 0 0 0 0 0 0 56.95 1889.50 0 True False False True False False False True False False False False True
2 0 0 0 0 2 0 0 0 0 0 0 0 0 53.85 108.15 1 True False False True False False True False False False False False True
3 0 0 0 0 45 0 0 0 0 0 0 0 0 42.30 1840.75 0 False True False True False False False True False True False False False
4 1 0 0 0 2 0 0 0 0 0 0 0 0 70.70 151.65 1 True False False False True False True False False False False True False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7027 0 0 0 0 24 0 0 0 0 0 0 0 0 84.80 1990.50 0 False False True True False False False True False False False False True
7028 1 0 0 0 72 0 0 0 0 0 0 0 0 103.20 7362.90 0 False False True False True False False True False False True False False
7029 1 0 0 0 11 0 0 0 0 0 0 0 0 29.60 346.45 0 False True False True False False True False False False False True False
7030 0 0 0 0 4 0 0 0 0 0 0 0 0 74.40 306.60 1 False False True False True False True False False False False False True
7031 0 0 0 0 66 0 0 0 0 0 0 0 0 105.65 6844.50 0 True False False False True False False False True True False False False

7032 rows × 29 columns

[64]:
df_onehot_encoded.columns.tolist()
[64]:
['gender',
 'SeniorCitizen',
 'Partner',
 'Dependents',
 'tenure',
 'PhoneService',
 'OnlineSecurity',
 'OnlineBackup',
 'DeviceProtection',
 'TechSupport',
 'StreamingTV',
 'StreamingMovies',
 'PaperlessBilling',
 'MonthlyCharges',
 'TotalCharges',
 'Churn',
 'MultipleLines_No',
 'MultipleLines_No phone service',
 'MultipleLines_Yes',
 'InternetService_DSL',
 'InternetService_Fiber optic',
 'InternetService_No',
 'Contract_Month-to-month',
 'Contract_One year',
 'Contract_Two year',
 'PaymentMethod_Bank transfer (automatic)',
 'PaymentMethod_Credit card (automatic)',
 'PaymentMethod_Electronic check',
 'PaymentMethod_Mailed check']
[65]:
index_label = df_onehot_encoded.columns.tolist().index("Churn")
print(df.columns.tolist()[index_label])
cols_onehot_encoded = df_onehot_encoded.columns.tolist()[index_label+1:]
print(len(cols_onehot_encoded))
cols_onehot_encoded
PaperlessBilling
13
[65]:
['MultipleLines_No',
 'MultipleLines_No phone service',
 'MultipleLines_Yes',
 'InternetService_DSL',
 'InternetService_Fiber optic',
 'InternetService_No',
 'Contract_Month-to-month',
 'Contract_One year',
 'Contract_Two year',
 'PaymentMethod_Bank transfer (automatic)',
 'PaymentMethod_Credit card (automatic)',
 'PaymentMethod_Electronic check',
 'PaymentMethod_Mailed check']
[ ]:

One-Hot-Encoding Spalten mit Werten auffüllen. (genaue Erklärung siehe oben)

Die df[col].eq("yes").mul(1) transformiert lediglich von "yes" zu 1 und alles andere zu 0. D.h. man könnte auch df[col].map({"yes":1, "no":0}) nutzen anstatt df[col].eq("yes").mul(1).

[66]:
for col in cols_onehot_encoded:
    df_onehot_encoded[col] = df_onehot_encoded[col].eq('yes').mul(1)
[67]:
df_onehot_encoded
[67]:
gender SeniorCitizen Partner Dependents tenure PhoneService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies PaperlessBilling MonthlyCharges TotalCharges Churn MultipleLines_No MultipleLines_No phone service MultipleLines_Yes InternetService_DSL InternetService_Fiber optic InternetService_No Contract_Month-to-month Contract_One year Contract_Two year PaymentMethod_Bank transfer (automatic) PaymentMethod_Credit card (automatic) PaymentMethod_Electronic check PaymentMethod_Mailed check
0 1 0 0 0 1 0 0 0 0 0 0 0 0 29.85 29.85 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 34 0 0 0 0 0 0 0 0 56.95 1889.50 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 2 0 0 0 0 0 0 0 0 53.85 108.15 1 0 0 0 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 45 0 0 0 0 0 0 0 0 42.30 1840.75 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 1 0 0 0 2 0 0 0 0 0 0 0 0 70.70 151.65 1 0 0 0 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7027 0 0 0 0 24 0 0 0 0 0 0 0 0 84.80 1990.50 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7028 1 0 0 0 72 0 0 0 0 0 0 0 0 103.20 7362.90 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7029 1 0 0 0 11 0 0 0 0 0 0 0 0 29.60 346.45 0 0 0 0 0 0 0 0 0 0 0 0 0 0
7030 0 0 0 0 4 0 0 0 0 0 0 0 0 74.40 306.60 1 0 0 0 0 0 0 0 0 0 0 0 0 0
7031 0 0 0 0 66 0 0 0 0 0 0 0 0 105.65 6844.50 0 0 0 0 0 0 0 0 0 0 0 0 0 0

7032 rows × 29 columns

Split the Data to x_train, x_test, y_train, y_test

[68]:
# Split X and y
X = df_onehot_encoded.drop(columns = ["Churn"])
y = df_onehot_encoded["Churn"].values
[69]:
# Splitting the data into train and test sets

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.30, random_state = 42, stratify=y)
[70]:
def distplot(feature, frame, color="r"):
    plt.figure(figsize=(8,3))
    plt.title("Verteilung für {}".format(feature))
    ax = sns.distplot(frame[feature], color= color)

num_cols = ["tenure", "MonthlyCharges", "TotalCharges"]
for feat in num_cols: distplot(feat, df)
../_images/1intro_customer-churn_105_0.png
../_images/1intro_customer-churn_105_1.png
../_images/1intro_customer-churn_105_2.png

Überlegung: Welchen Scaler sollte ich nutzen?

Da die numerischen Features über verschiedene Wertebereiche (value ranges) verteilt sind, können wir hier den Standard Scalar nutzen, um sie alle auf den gleichen Bereich runterzuskalieren.

Normalisiere die numerischen Attribute

[71]:
df_std = pd.DataFrame(StandardScaler().fit_transform(df[num_cols].astype('float64')),
                       columns=num_cols)
for feat in numerical_cols: distplot(feat, df_std, color='c')
../_images/1intro_customer-churn_108_0.png
../_images/1intro_customer-churn_108_1.png
../_images/1intro_customer-churn_108_2.png

[72]:
scaler= StandardScaler()

X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])

Modellieren

Es handelt sich hierbei um einen Datensatz mit klar definierten Labels (Churn: Yes oder No), die eine Klassifikationsproblematik beschreiben.

Dazu werden wir nun mehrere Klassifikationsmodelle testen.

K-Nearest Neighbor Classifier

Der K-Nearest-Neighbor Classifier ist eine nichtparametrische Methode zur Schätzung von Wahrscheinlichkeitsdichtefunktionen, mehr Details: https://de.wikipedia.org/wiki/N%C3%A4chste-Nachbarn-Klassifikation

Hier nutzen wir die kNN-Implemtation von scikit-learn: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

[73]:
# Initiere das knn-Modell

knn_model = KNeighborsClassifier(n_neighbors = 11)
[74]:
# Trainiere das Modell mit unseren Trainingsdaten

knn_model.fit(X_train,y_train)
[74]:
KNeighborsClassifier(n_neighbors=11)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
[75]:
# die "score"-Funktion scoret direkt die predicted Resultate vom X_test gegen die wahren y-Werte und gibt die Accuracy zurück

accuracy_knn = knn_model.score(X_test,y_test)
print("KNN accuracy:",accuracy_knn)
print(knn_model.score(X_test,y_test))
KNN accuracy: 0.7611374407582938
0.7611374407582938
[76]:
# Alternativ kann man direkt mit dem Modell auf den X_test (Testdatensatz) vorhersagen und diese Vorhersagen gegen die wahren Werte des Testdatensatzes mit sklearn evaluieren

y_pred = knn_model.predict(X_test)
prec = precision_score(y_test, y_pred, average=None)
recall = recall_score(y_test, y_pred, average=None)
print("precision: ", prec)
print("recall: ", recall)
precision:  [0.81119714 0.56612529]
recall:  [0.87927695 0.43493761]
[77]:
# Zudem gibt es den "Classification Report", der direkt mehrere Metriken Resultate ausgibt

print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.81      0.88      0.84      1549
           1       0.57      0.43      0.49       561

    accuracy                           0.76      2110
   macro avg       0.69      0.66      0.67      2110
weighted avg       0.75      0.76      0.75      2110

[78]:
# Predict directly with the trained model

y_pred = knn_model.predict(X_test)
print(y_pred)
[0 1 0 ... 0 0 0]

Logistic Regression Classifier

Der Logistic Regression Classifier basiert auf Linear Regression, nur gibt sie am Ende eine

[79]:
lr_model = LogisticRegression()
lr_model.fit(X_train,y_train)
accuracy_lr = lr_model.score(X_test,y_test)
print("Logistic Regression accuracy is :",accuracy_lr)
Logistic Regression accuracy is : 0.776303317535545
[80]:
# Make predictions
lr_pred= lr_model.predict(X_test)
report = classification_report(y_test,lr_pred)
print(report)
              precision    recall  f1-score   support

           0       0.81      0.90      0.86      1549
           1       0.61      0.42      0.50       561

    accuracy                           0.78      2110
   macro avg       0.71      0.66      0.68      2110
weighted avg       0.76      0.78      0.76      2110

[81]:
plt.figure(figsize=(4,3))
sns.heatmap(confusion_matrix(y_test, lr_pred),
                annot=True,fmt = "d",linecolor="k",linewidths=3)

plt.title("LOGISTIC REGRESSION CONFUSION MATRIX",fontsize=14)
plt.show()
../_images/1intro_customer-churn_122_0.png

Decision Tree Classifier

[82]:
dt_model = DecisionTreeClassifier()
dt_model.fit(X_train,y_train)
predictdt_y = dt_model.predict(X_test)
accuracy_dt = dt_model.score(X_test,y_test)
print("Decision Tree accuracy is :",accuracy_dt)
Decision Tree accuracy is : 0.7018957345971564
[83]:
print(classification_report(y_test, predictdt_y))
              precision    recall  f1-score   support

           0       0.80      0.79      0.80      1549
           1       0.44      0.45      0.45       561

    accuracy                           0.70      2110
   macro avg       0.62      0.62      0.62      2110
weighted avg       0.70      0.70      0.70      2110

AdaBoost Classifier

[84]:
a_model = AdaBoostClassifier()
a_model.fit(X_train,y_train)
a_preds = a_model.predict(X_test)
print("AdaBoost Classifier accuracy")
metrics.accuracy_score(y_test, a_preds)
AdaBoost Classifier accuracy
[84]:
0.7838862559241706
[85]:
print(classification_report(y_test, a_preds))
              precision    recall  f1-score   support

           0       0.82      0.90      0.86      1549
           1       0.63      0.47      0.53       561

    accuracy                           0.78      2110
   macro avg       0.72      0.68      0.70      2110
weighted avg       0.77      0.78      0.77      2110

[86]:
plt.figure(figsize=(4,3))
sns.heatmap(confusion_matrix(y_test, a_preds),
                annot=True,fmt = "d",linecolor="k",linewidths=3)

plt.title("AdaBoost Classifier Confusion Matrix",fontsize=14)
plt.show()
../_images/1intro_customer-churn_129_0.png

Gradient Boosting Classifier

[87]:
gb = GradientBoostingClassifier()
gb.fit(X_train, y_train)
gb_pred = gb.predict(X_test)
print("Gradient Boosting Classifier", accuracy_score(y_test, gb_pred))
Gradient Boosting Classifier 0.7781990521327015
[88]:
print(classification_report(y_test, gb_pred))
              precision    recall  f1-score   support

           0       0.82      0.90      0.86      1549
           1       0.62      0.44      0.51       561

    accuracy                           0.78      2110
   macro avg       0.72      0.67      0.68      2110
weighted avg       0.76      0.78      0.76      2110

[89]:
plt.figure(figsize=(4,3))
sns.heatmap(confusion_matrix(y_test, gb_pred),
                annot=True,fmt = "d",linecolor="k",linewidths=3)

plt.title("Gradient Boosting Classifier Confusion Matrix",fontsize=14)
plt.show()
../_images/1intro_customer-churn_133_0.png

Voting Classifier

Der Voting Classifier ist ein Mixture Modell, der mehrere Modelle vereint, die alle ihre Predictions machen und dann über Majority voting entschieden wird, welche Prediction Output rausgegeben wird.

[90]:
from sklearn.ensemble import VotingClassifier
clf1 = GradientBoostingClassifier()
clf2 = LogisticRegression()
clf3 = AdaBoostClassifier()
eclf1 = VotingClassifier(estimators=[("gbc", clf1), ("lr", clf2), ("abc", clf3)], voting="soft")
eclf1.fit(X_train, y_train)
predictions = eclf1.predict(X_test)
print("Final Accuracy Score ")
print(accuracy_score(y_test, predictions))
Final Accuracy Score
0.7834123222748816
[91]:
print(classification_report(y_test, predictions))
              precision    recall  f1-score   support

           0       0.82      0.91      0.86      1549
           1       0.63      0.44      0.52       561

    accuracy                           0.78      2110
   macro avg       0.73      0.67      0.69      2110
weighted avg       0.77      0.78      0.77      2110

[92]:
plt.figure(figsize=(4,3))
sns.heatmap(confusion_matrix(y_test, predictions),
                annot=True,fmt = "d",linecolor="k",linewidths=3)

plt.title("FINAL CONFUSION MATRIX",fontsize=14)
plt.show()
../_images/1intro_customer-churn_137_0.png