Praxisbeispiel - Telko Vertragskundenabgang-Vorhersage MIT simulierten Kunden-Service-Daten

Die Vorhersage der Kundenabwanderung (bei Vertragssituationen) sagt die Wahrscheinlichkeit voraus, dass Kunden die Produkte oder Dienstleistungen eines Unternehmens kündigen. In den meisten Fällen sind Unternehmen mit Stammkunden oder Kunden im Abonnement bestrebt, ihren Kundenstamm zu erhalten. Daher ist es wichtig, die Kunden zu verfolgen, die ihr Abonnement kündigen, und diejenigen, die den Dienst weiter nutzen. Dieser Ansatz setzt voraus, dass das Unternehmen das Verhalten seiner Kunden kennt und versteht und die Eigenschaften, die zu dem Risiko führen, dass der Kunde das Unternehmen verlässt.

Szenario

Annahme: Wir sind ein Telekommunikations-Unternehmen, das über historische Daten darüber verfügt, wie seine Kunden mit seinen Dienstleistungen interagiert haben. Das Unternehmen möchte wissen, wie hoch die Wahrscheinlichkeit ist, dass Kunden abwandern, damit es gezielte Marketingkampagnen starten kann.

Datensatz

  1. Wir nutzen den Kaggle-Datensatz als die Baseline der Kundeninformationen: https://www.kaggle.com/datasets/blastchar/telco-customer-churn

  2. Wir nutzen einen selbst simulierten Datensatz bzgl. Kunden-Service, das vorher in diesem Folder abgespeichert wurde: kunden_service.csv

[1]:
import sys
# !{sys.executable} -m pip install kagglehub
# !{sys.executable} -m pip install @
# !{sys.executable} -m pip install missingno
[2]:
import kagglehub
[3]:
import pandas as pd
import numpy as np
# import missingno as msno
import matplotlib.pyplot as plt
import seaborn as sns
# import plotly.express as px
# import plotly.graph_objects as go
# from plotly.subplots import make_subplots
import warnings

pd.set_option("display.max_columns", None)
warnings.filterwarnings("ignore")
[4]:
from sklearn.preprocessing import StandardScaler, LabelEncoder

from sklearn.model_selection import train_test_split

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression

from sklearn import metrics
from sklearn.metrics import roc_curve
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report

# from xgboost import XGBClassifier
# from catboost import CatBoostClassifier
[5]:
# Download data from Kaggle
path = kagglehub.dataset_download("blastchar/telco-customer-churn")

print("Path to files:", path)
Path to files: /Users/veit/.cache/kagglehub/datasets/blastchar/telco-customer-churn/versions/1
[6]:
path_file = "~/.cache/kagglehub/datasets/blastchar/telco-customer-churn/versions/1/WA_Fn-UseC_-Telco-Customer-Churn.csv"
print(path_file)
~/.cache/kagglehub/datasets/blastchar/telco-customer-churn/versions/1/WA_Fn-UseC_-Telco-Customer-Churn.csv

Daten laden

[7]:
df_ori = pd.read_csv(path_file)
df_ori
[7]:
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 7590-VHVEG Female 0 Yes No 1 No No phone service DSL No Yes No No No No Month-to-month Yes Electronic check 29.85 29.85 No
1 5575-GNVDE Male 0 No No 34 Yes No DSL Yes No Yes No No No One year No Mailed check 56.95 1889.5 No
2 3668-QPYBK Male 0 No No 2 Yes No DSL Yes Yes No No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
3 7795-CFOCW Male 0 No No 45 No No phone service DSL Yes No Yes Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No
4 9237-HQITU Female 0 No No 2 Yes No Fiber optic No No No No No No Month-to-month Yes Electronic check 70.70 151.65 Yes
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7038 6840-RESVB Male 0 Yes Yes 24 Yes Yes DSL Yes No Yes Yes Yes Yes One year Yes Mailed check 84.80 1990.5 No
7039 2234-XADUH Female 0 Yes Yes 72 Yes Yes Fiber optic No Yes Yes No Yes Yes One year Yes Credit card (automatic) 103.20 7362.9 No
7040 4801-JZAZL Female 0 Yes Yes 11 No No phone service DSL Yes No No No No No Month-to-month Yes Electronic check 29.60 346.45 No
7041 8361-LTMKD Male 1 Yes No 4 Yes Yes Fiber optic No No No No No No Month-to-month Yes Mailed check 74.40 306.6 Yes
7042 3186-AJIEK Male 0 No No 66 Yes No Fiber optic Yes No Yes Yes Yes Yes Two year Yes Bank transfer (automatic) 105.65 6844.5 No

7043 rows × 21 columns

[8]:
path_kunden_service_log = "kunden_service.csv"
df_service = pd.read_csv(path_kunden_service_log)
df_service
[8]:
ticket_id customer_id time_request time_reply channel request reply solved original_request original_ticket_id
0 7590-VHVEG_T1 7590-VHVEG 2024-12-23 00:00:00 2024-12-23 14:28:40 On-site Internet funktioniert nicht Ticket geschlossen True True NaN
1 7590-VHVEG_T1 7590-VHVEG 2024-12-23 14:28:40 2024-12-23 20:20:22 Hotline Internet funktioniert nicht Anfrage wird bearbeitet False False 7590-VHVEG_T1
2 5575-GNVDE_T1 5575-GNVDE 2023-09-03 00:00:00 2023-09-03 12:48:24 Email Internet funktioniert nicht Problem gelöst True True NaN
3 5575-GNVDE_T1 5575-GNVDE 2023-09-04 12:48:24 2023-09-05 08:04:13 Social Media Frage zum Vertrag Weitere Informationen benötigt False False 5575-GNVDE_T1
4 5575-GNVDE_T2 5575-GNVDE 2024-12-16 00:00:00 2024-12-17 20:29:09 On-site Internet funktioniert nicht Weitere Informationen benötigt False True NaN
... ... ... ... ... ... ... ... ... ... ...
28153 6840-RESVB_T1 6840-RESVB 2024-03-18 20:03:20 2024-03-18 23:24:42 Email Beschwerde über Kundenservice Ticket geschlossen True False 6840-RESVB_T1
28154 4801-JZAZL_T1 4801-JZAZL 2024-08-11 00:00:00 2024-08-11 11:26:24 Hotline Problem mit Rechnung Ticket geschlossen True True NaN
28155 4801-JZAZL_T1 4801-JZAZL 2024-08-11 11:26:24 2024-08-12 13:54:33 Email Problem mit Rechnung Anfrage wird bearbeitet False False 4801-JZAZL_T1
28156 8361-LTMKD_T1 8361-LTMKD 2024-10-16 00:00:00 2024-10-16 17:35:16 On-site Problem mit Rechnung Problem gelöst True True NaN
28157 8361-LTMKD_T1 8361-LTMKD 2024-10-16 17:35:16 2024-10-16 20:38:36 Hotline Beschwerde über Kundenservice Weitere Informationen benötigt False False 8361-LTMKD_T1

28158 rows × 10 columns

Daten Manipulation / Bereinigung

Original Kunden-Datensatz

[9]:
df_ori["TotalCharges"] = pd.to_numeric(df_ori.TotalCharges, errors="coerce")
[10]:
df_ori.drop(labels=df_ori[df_ori["tenure"] == 0].index, axis=0, inplace=True)
df_ori.head()
[10]:
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 7590-VHVEG Female 0 Yes No 1 No No phone service DSL No Yes No No No No Month-to-month Yes Electronic check 29.85 29.85 No
1 5575-GNVDE Male 0 No No 34 Yes No DSL Yes No Yes No No No One year No Mailed check 56.95 1889.50 No
2 3668-QPYBK Male 0 No No 2 Yes No DSL Yes Yes No No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
3 7795-CFOCW Male 0 No No 45 No No phone service DSL Yes No Yes Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No
4 9237-HQITU Female 0 No No 2 Yes No Fiber optic No No No No No No Month-to-month Yes Electronic check 70.70 151.65 Yes
[11]:
df_ori.reset_index(inplace=True)
df_ori.head()
[11]:
index customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 0 7590-VHVEG Female 0 Yes No 1 No No phone service DSL No Yes No No No No Month-to-month Yes Electronic check 29.85 29.85 No
1 1 5575-GNVDE Male 0 No No 34 Yes No DSL Yes No Yes No No No One year No Mailed check 56.95 1889.50 No
2 2 3668-QPYBK Male 0 No No 2 Yes No DSL Yes Yes No No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
3 3 7795-CFOCW Male 0 No No 45 No No phone service DSL Yes No Yes Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No
4 4 9237-HQITU Female 0 No No 2 Yes No Fiber optic No No No No No No Month-to-month Yes Electronic check 70.70 151.65 Yes
[12]:
if "index" in df_ori:
    df_ori.drop("index", axis=1, inplace=True)
df_ori.head()
[12]:
customerID gender SeniorCitizen Partner Dependents tenure PhoneService MultipleLines InternetService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies Contract PaperlessBilling PaymentMethod MonthlyCharges TotalCharges Churn
0 7590-VHVEG Female 0 Yes No 1 No No phone service DSL No Yes No No No No Month-to-month Yes Electronic check 29.85 29.85 No
1 5575-GNVDE Male 0 No No 34 Yes No DSL Yes No Yes No No No One year No Mailed check 56.95 1889.50 No
2 3668-QPYBK Male 0 No No 2 Yes No DSL Yes Yes No No No No Month-to-month Yes Mailed check 53.85 108.15 Yes
3 7795-CFOCW Male 0 No No 45 No No phone service DSL Yes No Yes Yes No No One year No Bank transfer (automatic) 42.30 1840.75 No
4 9237-HQITU Female 0 No No 2 Yes No Fiber optic No No No No No No Month-to-month Yes Electronic check 70.70 151.65 Yes

Kunden-IDs?

Achtung! Die customerIDs können jetzt nicht so schnell verworfen werden! Die brauchen wir, um die beiden Tabellen miteinander zu verknüpfen!

Kategorische Spalten umwandeln

Um die kategorischen Spalten richtig in numerische Spalten umzuwandeln, lass uns sie genauer anschauen und identifizieren, was sich hinter den Strings wirklich verbirgt.

DatenAnalyse:

binary: gender,

boolean: SeniorCitizen, Partner, Dependents, PhoneService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies, PaperlessBilling

categorical: MultipleLines, InternetService, Contract, PaymentMethod

numeric: tenure, MonthlyCharges, TotalCharges

Unterteile die Columns in 3 Kategorien: Standardisierung, Ordinal-Encoding and One-Hot-Encoding

binary (direkte Umwandlung zu 0-1): gender

boolean (direkte Umwandlung zu 0-1): SeniorCitizen, Partner, Dependents, PhoneService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies, PaperlessBilling

categorical (One-Hot oder Frequency Encoding etc): MultipleLines, InternetService, Contract, PaymentMethod

numeric: tenure, MonthlyCharges, TotalCharges

[13]:
cols_ori_binary = ["gender"]
cols_ori_boolean = ["SeniorCitizen", "Partner", "Dependents", "PhoneService", "OnlineSecurity", "OnlineBackup",
               "DeviceProtection", "TechSupport", "StreamingTV", "StreamingMovies", "PaperlessBilling"]
cols_ori_cat = ["MultipleLines", "InternetService", "Contract", "PaymentMethod"]
cols_ori_numeric = ["tenure", "MonthlyCharges", "TotalCharges"]
[14]:
# Mapping von "Female" zu 1 und "Male" zu 0

df_ori["gender"] = df_ori["gender"].map({"Female": 1, "Male": 0})
df_ori["gender"]
[14]:
0       1
1       0
2       0
3       0
4       1
       ..
7027    0
7028    1
7029    1
7030    0
7031    0
Name: gender, Length: 7032, dtype: int64
[15]:
for col in cols_ori_boolean:
    df_ori[col] = df_ori[col].eq("Yes").mul(1)

df_ori[cols_ori_boolean]
[15]:
SeniorCitizen Partner Dependents PhoneService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies PaperlessBilling
0 0 1 0 0 0 1 0 0 0 0 1
1 0 0 0 1 1 0 1 0 0 0 0
2 0 0 0 1 1 1 0 0 0 0 1
3 0 0 0 0 1 0 1 1 0 0 0
4 0 0 0 1 0 0 0 0 0 0 1
... ... ... ... ... ... ... ... ... ... ... ...
7027 0 1 1 1 1 0 1 1 1 1 1
7028 0 1 1 1 0 1 1 0 1 1 1
7029 0 1 1 0 1 0 0 0 0 0 1
7030 0 1 0 1 0 0 0 0 0 0 1
7031 0 0 0 1 1 0 1 1 1 1 1

7032 rows × 11 columns

[16]:
# One-Hot-Encoding für die drei kategorischen Spalten
df_onehot_encoded = pd.get_dummies(df_ori, columns=cols_ori_cat) # dtype=int

print(df_onehot_encoded.shape)
df_onehot_encoded
(7032, 30)
[16]:
customerID gender SeniorCitizen Partner Dependents tenure PhoneService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies PaperlessBilling MonthlyCharges TotalCharges Churn MultipleLines_No MultipleLines_No phone service MultipleLines_Yes InternetService_DSL InternetService_Fiber optic InternetService_No Contract_Month-to-month Contract_One year Contract_Two year PaymentMethod_Bank transfer (automatic) PaymentMethod_Credit card (automatic) PaymentMethod_Electronic check PaymentMethod_Mailed check
0 7590-VHVEG 1 0 1 0 1 0 0 1 0 0 0 0 1 29.85 29.85 No False True False True False False True False False False False True False
1 5575-GNVDE 0 0 0 0 34 1 1 0 1 0 0 0 0 56.95 1889.50 No True False False True False False False True False False False False True
2 3668-QPYBK 0 0 0 0 2 1 1 1 0 0 0 0 1 53.85 108.15 Yes True False False True False False True False False False False False True
3 7795-CFOCW 0 0 0 0 45 0 1 0 1 1 0 0 0 42.30 1840.75 No False True False True False False False True False True False False False
4 9237-HQITU 1 0 0 0 2 1 0 0 0 0 0 0 1 70.70 151.65 Yes True False False False True False True False False False False True False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7027 6840-RESVB 0 0 1 1 24 1 1 0 1 1 1 1 1 84.80 1990.50 No False False True True False False False True False False False False True
7028 2234-XADUH 1 0 1 1 72 1 0 1 1 0 1 1 1 103.20 7362.90 No False False True False True False False True False False True False False
7029 4801-JZAZL 1 0 1 1 11 0 1 0 0 0 0 0 1 29.60 346.45 No False True False True False False True False False False False True False
7030 8361-LTMKD 0 0 1 0 4 1 0 0 0 0 0 0 1 74.40 306.60 Yes False False True False True False True False False False False False True
7031 3186-AJIEK 0 0 0 0 66 1 1 0 1 1 1 1 1 105.65 6844.50 No True False False False True False False False True True False False False

7032 rows × 30 columns

[17]:
df_onehot_encoded.columns.tolist()
[17]:
['customerID',
 'gender',
 'SeniorCitizen',
 'Partner',
 'Dependents',
 'tenure',
 'PhoneService',
 'OnlineSecurity',
 'OnlineBackup',
 'DeviceProtection',
 'TechSupport',
 'StreamingTV',
 'StreamingMovies',
 'PaperlessBilling',
 'MonthlyCharges',
 'TotalCharges',
 'Churn',
 'MultipleLines_No',
 'MultipleLines_No phone service',
 'MultipleLines_Yes',
 'InternetService_DSL',
 'InternetService_Fiber optic',
 'InternetService_No',
 'Contract_Month-to-month',
 'Contract_One year',
 'Contract_Two year',
 'PaymentMethod_Bank transfer (automatic)',
 'PaymentMethod_Credit card (automatic)',
 'PaymentMethod_Electronic check',
 'PaymentMethod_Mailed check']
[18]:
index_label = df_onehot_encoded.columns.tolist().index("Churn")
print(df_ori.columns.tolist()[index_label])

cols_onehot_encoded = df_onehot_encoded.columns.tolist()[index_label+1:]
print(len(cols_onehot_encoded))
cols_onehot_encoded
PaperlessBilling
13
[18]:
['MultipleLines_No',
 'MultipleLines_No phone service',
 'MultipleLines_Yes',
 'InternetService_DSL',
 'InternetService_Fiber optic',
 'InternetService_No',
 'Contract_Month-to-month',
 'Contract_One year',
 'Contract_Two year',
 'PaymentMethod_Bank transfer (automatic)',
 'PaymentMethod_Credit card (automatic)',
 'PaymentMethod_Electronic check',
 'PaymentMethod_Mailed check']
[19]:
for col in cols_onehot_encoded:
    df_onehot_encoded[col] = df_onehot_encoded[col].eq("Yes").mul(1)
[20]:
df_onehot_encoded
[20]:
customerID gender SeniorCitizen Partner Dependents tenure PhoneService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies PaperlessBilling MonthlyCharges TotalCharges Churn MultipleLines_No MultipleLines_No phone service MultipleLines_Yes InternetService_DSL InternetService_Fiber optic InternetService_No Contract_Month-to-month Contract_One year Contract_Two year PaymentMethod_Bank transfer (automatic) PaymentMethod_Credit card (automatic) PaymentMethod_Electronic check PaymentMethod_Mailed check
0 7590-VHVEG 1 0 1 0 1 0 0 1 0 0 0 0 1 29.85 29.85 No 0 0 0 0 0 0 0 0 0 0 0 0 0
1 5575-GNVDE 0 0 0 0 34 1 1 0 1 0 0 0 0 56.95 1889.50 No 0 0 0 0 0 0 0 0 0 0 0 0 0
2 3668-QPYBK 0 0 0 0 2 1 1 1 0 0 0 0 1 53.85 108.15 Yes 0 0 0 0 0 0 0 0 0 0 0 0 0
3 7795-CFOCW 0 0 0 0 45 0 1 0 1 1 0 0 0 42.30 1840.75 No 0 0 0 0 0 0 0 0 0 0 0 0 0
4 9237-HQITU 1 0 0 0 2 1 0 0 0 0 0 0 1 70.70 151.65 Yes 0 0 0 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7027 6840-RESVB 0 0 1 1 24 1 1 0 1 1 1 1 1 84.80 1990.50 No 0 0 0 0 0 0 0 0 0 0 0 0 0
7028 2234-XADUH 1 0 1 1 72 1 0 1 1 0 1 1 1 103.20 7362.90 No 0 0 0 0 0 0 0 0 0 0 0 0 0
7029 4801-JZAZL 1 0 1 1 11 0 1 0 0 0 0 0 1 29.60 346.45 No 0 0 0 0 0 0 0 0 0 0 0 0 0
7030 8361-LTMKD 0 0 1 0 4 1 0 0 0 0 0 0 1 74.40 306.60 Yes 0 0 0 0 0 0 0 0 0 0 0 0 0
7031 3186-AJIEK 0 0 0 0 66 1 1 0 1 1 1 1 1 105.65 6844.50 No 0 0 0 0 0 0 0 0 0 0 0 0 0

7032 rows × 30 columns

Kunden-Service Datensatz

Würden wir einen echten Kunden-Service Datensatz erhalten, müssten wir ihn genau so aufmerksam kontrollieren und checken, ob es missing fields gibt, etc.

[21]:
df_service.head()
[21]:
ticket_id customer_id time_request time_reply channel request reply solved original_request original_ticket_id
0 7590-VHVEG_T1 7590-VHVEG 2024-12-23 00:00:00 2024-12-23 14:28:40 On-site Internet funktioniert nicht Ticket geschlossen True True NaN
1 7590-VHVEG_T1 7590-VHVEG 2024-12-23 14:28:40 2024-12-23 20:20:22 Hotline Internet funktioniert nicht Anfrage wird bearbeitet False False 7590-VHVEG_T1
2 5575-GNVDE_T1 5575-GNVDE 2023-09-03 00:00:00 2023-09-03 12:48:24 Email Internet funktioniert nicht Problem gelöst True True NaN
3 5575-GNVDE_T1 5575-GNVDE 2023-09-04 12:48:24 2023-09-05 08:04:13 Social Media Frage zum Vertrag Weitere Informationen benötigt False False 5575-GNVDE_T1
4 5575-GNVDE_T2 5575-GNVDE 2024-12-16 00:00:00 2024-12-17 20:29:09 On-site Internet funktioniert nicht Weitere Informationen benötigt False True NaN
[22]:
df_service.dtypes
[22]:
ticket_id              str
customer_id            str
time_request           str
time_reply             str
channel                str
request                str
reply                  str
solved                bool
original_request      bool
original_ticket_id     str
dtype: object
[23]:
# Die datetime-Spalten müssen in deren Typ angepasst werden, alles andere kann in deren Original-Form bleiben (string)

df_service["time_request"] = pd.to_datetime(df_service["time_request"], errors="coerce")
df_service["time_reply"] = pd.to_datetime(df_service["time_reply"], errors="coerce")
df_service.dtypes
[23]:
ticket_id                        str
customer_id                      str
time_request          datetime64[us]
time_reply            datetime64[us]
channel                          str
request                          str
reply                            str
solved                          bool
original_request                bool
original_ticket_id               str
dtype: object
[24]:
df_service.isnull().sum()
[24]:
ticket_id                 0
customer_id               0
time_request              0
time_reply                0
channel                   0
request                   0
reply                     0
solved                    0
original_request          0
original_ticket_id    14140
dtype: int64

Die große Anzahl an null-Werten in der original_ticket_id-Spalte sind in diesem Datensatz kein Fehler, sondern einleuchtend in der Logik (das wissen wir - haben wir selbst kreiert).

Daher werden hier keine Daten gelöscht, sondern wir würden die leeren Datenfelder extrapolieren.

Da es sich in dieser Spalte um alphanumerischen IDs handelt, ist die naheliegendste Lösung einen leeren String zu machen. Es ist keine Option, sie bei null zu belassen, da die Funktionen in den follow-up Schritten eine vollständig gefüllte Spalte benötigen.

[25]:
df_service["original_ticket_id"].fillna("", inplace=True)
[25]:
0
1        7590-VHVEG_T1
2
3        5575-GNVDE_T1
4
             ...
28153    6840-RESVB_T1
28154
28155    4801-JZAZL_T1
28156
28157    8361-LTMKD_T1
Name: original_ticket_id, Length: 28158, dtype: str
[26]:
df_service.isnull().sum()
[26]:
ticket_id                 0
customer_id               0
time_request              0
time_reply                0
channel                   0
request                   0
reply                     0
solved                    0
original_request          0
original_ticket_id    14140
dtype: int64
[27]:
# Dann wollen wir checken, ob es irgendein Datum gibt, das nach dem Stichtag ist
df_service["time_request"].max()
[27]:
Timestamp('2024-12-31 23:58:29')
[28]:
df_service["time_reply"].max()
[28]:
Timestamp('2024-12-31 23:59:39')
[29]:
# Zudem wollen wir kontrollieren, ob es Fehler in der Logik gibt, z.b. time_reply vor time_request

df_service[df_service["time_reply"]<df_service["time_request"]]
[29]:
ticket_id customer_id time_request time_reply channel request reply solved original_request original_ticket_id
[30]:
df_service.head()
[30]:
ticket_id customer_id time_request time_reply channel request reply solved original_request original_ticket_id
0 7590-VHVEG_T1 7590-VHVEG 2024-12-23 00:00:00 2024-12-23 14:28:40 On-site Internet funktioniert nicht Ticket geschlossen True True NaN
1 7590-VHVEG_T1 7590-VHVEG 2024-12-23 14:28:40 2024-12-23 20:20:22 Hotline Internet funktioniert nicht Anfrage wird bearbeitet False False 7590-VHVEG_T1
2 5575-GNVDE_T1 5575-GNVDE 2023-09-03 00:00:00 2023-09-03 12:48:24 Email Internet funktioniert nicht Problem gelöst True True NaN
3 5575-GNVDE_T1 5575-GNVDE 2023-09-04 12:48:24 2023-09-05 08:04:13 Social Media Frage zum Vertrag Weitere Informationen benötigt False False 5575-GNVDE_T1
4 5575-GNVDE_T2 5575-GNVDE 2024-12-16 00:00:00 2024-12-17 20:29:09 On-site Internet funktioniert nicht Weitere Informationen benötigt False True NaN

EDA - Explorative Data Analysis (Daten Exploration)

Die EDA (Explorative Daten Analysis) des df_ori lassen wir in diesem Notebook aus, da wir es im letzten Notebook ausgiebig besprochen haben

[31]:
df_service.value_counts("channel", ascending=False)
[31]:
channel
On-site         5744
Chat            5645
Social Media    5625
Hotline         5593
Email           5551
Name: count, dtype: int64
[32]:
df_service.value_counts("request")
[32]:
request
Internet funktioniert nicht      5769
Problem mit Rechnung             5675
Frage zum Vertrag                5597
Beschwerde über Kundenservice    5560
Technische Störung               5557
Name: count, dtype: int64
[33]:
df_service[["request", "reply"]].groupby(["request"]).value_counts(ascending=False)
[33]:
request                        reply
Beschwerde über Kundenservice  Weitere Informationen benötigt    1401
                               Anfrage wird bearbeitet           1390
                               Ticket geschlossen                1387
                               Problem gelöst                    1382
Frage zum Vertrag              Ticket geschlossen                1429
                               Weitere Informationen benötigt    1420
                               Anfrage wird bearbeitet           1407
                               Problem gelöst                    1341
Internet funktioniert nicht    Problem gelöst                    1477
                               Anfrage wird bearbeitet           1476
                               Ticket geschlossen                1417
                               Weitere Informationen benötigt    1399
Problem mit Rechnung           Ticket geschlossen                1464
                               Weitere Informationen benötigt    1450
                               Problem gelöst                    1401
                               Anfrage wird bearbeitet           1360
Technische Störung             Weitere Informationen benötigt    1411
                               Anfrage wird bearbeitet           1409
                               Problem gelöst                    1409
                               Ticket geschlossen                1328
Name: count, dtype: int64
[34]:
df_service.value_counts("solved", ascending=False)
[34]:
solved
False    14123
True     14035
Name: count, dtype: int64
[35]:
df_service[["request","solved"]].groupby(["request"]).value_counts(ascending=False)
[35]:
request                        solved
Beschwerde über Kundenservice  False     2791
                               True      2769
Frage zum Vertrag              False     2827
                               True      2770
Internet funktioniert nicht    True      2894
                               False     2875
Problem mit Rechnung           True      2865
                               False     2810
Technische Störung             False     2820
                               True      2737
Name: count, dtype: int64
[36]:
df_service.groupby(df_service["time_request"].dt.month).count().plot(kind="bar")
[36]:
<Axes: xlabel='time_request'>
../_images/4gen_ai_customer-churn-2_53_1.png
[37]:
df_service.groupby(df_service["time_reply"].dt.month).count().plot(kind="bar")
[37]:
<Axes: xlabel='time_reply'>
../_images/4gen_ai_customer-churn-2_54_1.png

Feature Engineering

[38]:
# Definiere den Stichtag für die Features
stichtag = pd.to_datetime("2024-12-31")

# Wir werden viele verschiedene Features erstellen, diese werden erst gesammelt und später zum DataFrame umgewandelt
dict_features = {}

Feature-Gruppe 1: Anzahl der Requests im letzten Monat

[39]:
# Definiere den Cutoff für die letzten 1 Monat
cutoff_1m = stichtag - pd.DateOffset(months=1)

# Filtere die Anfragen, die ab dem Cutoff bis zum Stichtag erfolgen
recent_requests = df_service[df_service["time_request"] >= cutoff_1m]

# Zähle pro Kunde die Anzahl der Requests
num_request_last_1m = recent_requests.groupby("customer_id").size().rename("num_request_last_1_month")

# Ausgabe: Beispieldatensatz
print(num_request_last_1m.head())
customer_id
0013-EXCHZ     6
0015-UOCOJ     1
0021-IKXGC    13
0023-HGHWL     3
0030-FNXPP     3
Name: num_request_last_1_month, dtype: int64
[40]:
# Anzahl Requests in den letzten 6, 9 etc Monaten

months_interest = [1, 3, 6, 12]

for mon in months_interest:
    # Definiere den Cutoff für die letzten 3 Monate
    cutoff = stichtag - pd.DateOffset(months=mon)

    # Filtere die Anfragen, die ab dem Cutoff bis zum Stichtag erfolgen
    recent_requests = df_service[df_service["time_request"] >= cutoff]

    # Zähle pro Kunde die Anzahl der Requests
    col_new = "num_request_last_"+str(mon)+"_months"
    num_request_period = recent_requests.groupby("customer_id").size().rename(col_new)

    # Füge das Ergebnis dem features_df hinzu. Da die customer_ids als Index genutzt werden, eignet sich join()
    # df_features = df_features.join(num_request_period, how='left')
    dict_features[col_new] = num_request_period

print(dict_features.keys())
dict_features
dict_keys(['num_request_last_1_months', 'num_request_last_3_months', 'num_request_last_6_months', 'num_request_last_12_months'])
[40]:
{'num_request_last_1_months': customer_id
 0013-EXCHZ     6
 0015-UOCOJ     1
 0021-IKXGC    13
 0023-HGHWL     3
 0030-FNXPP     3
               ..
 9972-NKTFD     1
 9975-SKRNR     3
 9985-MWVIX     3
 9986-BONCE     3
 9992-UJOEL     1
 Name: num_request_last_1_months, Length: 1365, dtype: int64,
 'num_request_last_3_months': customer_id
 0003-MKNFE    3
 0004-TLHLJ    3
 0013-EXCHZ    9
 0015-UOCOJ    1
 0018-NYROU    1
              ..
 9986-BONCE    9
 9987-LUTYD    3
 9992-RRAMN    3
 9992-UJOEL    1
 9995-HOTOH    2
 Name: num_request_last_3_months, Length: 2338, dtype: int64,
 'num_request_last_6_months': customer_id
 0002-ORFBO     1
 0003-MKNFE     4
 0004-TLHLJ     3
 0013-EXCHZ     9
 0013-MHZWF     4
               ..
 9986-BONCE    11
 9987-LUTYD     8
 9992-RRAMN     6
 9992-UJOEL     1
 9995-HOTOH     2
 Name: num_request_last_6_months, Length: 3191, dtype: int64,
 'num_request_last_12_months': customer_id
 0002-ORFBO     3
 0003-MKNFE     4
 0004-TLHLJ     3
 0011-IGKFF     1
 0013-EXCHZ     9
               ..
 9986-BONCE    11
 9987-LUTYD     8
 9992-RRAMN     6
 9992-UJOEL     1
 9995-HOTOH     2
 Name: num_request_last_12_months, Length: 4198, dtype: int64}

Feature-Gruppe 2 & 3: Zeit bis Lösung und Anzahl der Thread-Iterationen pro Ticket

Hierfür müssen wir zuerst nach Ticket-ID gruppieren, um pro Ticket den Zeitraum von der ursprünglichen Anfrage bis zur ersten als gelöst markierten Antwort zu bestimmen.

Falls ein Ticket nie als gelöst markiert wurde, ignorieren wir es in der Berechnung.

[41]:
def compute_ticket_metrics(ticket):
    """
    Für ein Ticket (alle Interaktionen mit gleicher ticket_id):
      - Ermittle die Zeitdifferenz (in Tagen) zwischen der ersten Anfrage und der ersten Antwort, die das Ticket als gelöst markiert.
      - Bestimme die Anzahl der Interaktionen (Thread-Iterationen) bis zur Lösung.
    """
    ticket_sorted = ticket.sort_values("time_request")

    # Die erste Anfrage (wir gehen davon aus, dass die erste Zeile die originale Anfrage ist)
    orig_time = ticket_sorted.iloc[0]["time_request"]

    # Finde die erste Interaktion, bei der solved True ist
    solved_interactions = ticket_sorted[ticket_sorted["solved"] == True]

    if len(solved_interactions) > 0:
        solved_time = solved_interactions.iloc[0]["time_reply"]
        days_to_solution = (solved_time - orig_time).days
        # Ermittle die Position (Index im sortierten Ticket), beginnend bei 1
        thread_count = ticket_sorted.reset_index().index[ticket_sorted.reset_index()["solved"] == True][0] + 1
    else:
        days_to_solution = np.nan
        thread_count = np.nan
    return pd.Series({"days_to_solution": days_to_solution, "thread_count": thread_count})
[42]:
# Berechne Ticket-Metriken pro Ticket
ticket_metrics = df_service.groupby("ticket_id").apply(compute_ticket_metrics)
ticket_metrics.head()
[42]:
days_to_solution thread_count
ticket_id
0002-ORFBO_T1 0.0 1.0
0003-MKNFE_T1 NaN NaN
0003-MKNFE_T2 0.0 1.0
0004-TLHLJ_T1 0.0 1.0
0011-IGKFF_T1 NaN NaN
[43]:
df_service[df_service["ticket_id"]=="0003-MKNFE_T1"]
[43]:
ticket_id customer_id time_request time_reply channel request reply solved original_request original_ticket_id
13706 0003-MKNFE_T1 0003-MKNFE 2024-08-08 2024-08-09 11:28:34 On-site Beschwerde über Kundenservice Anfrage wird bearbeitet False True NaN
[44]:
# Füge die customer_id (die für alle Interaktionen in einem Ticket gleich ist) hinzu
ticket_customer = df_service.groupby("ticket_id")["customer_id"].first().to_frame()
ticket_metrics = ticket_metrics.merge(ticket_customer, left_index=True, right_index=True)
ticket_metrics
[44]:
days_to_solution thread_count customer_id
ticket_id
0002-ORFBO_T1 0.0 1.0 0002-ORFBO
0003-MKNFE_T1 NaN NaN 0003-MKNFE
0003-MKNFE_T2 0.0 1.0 0003-MKNFE
0004-TLHLJ_T1 0.0 1.0 0004-TLHLJ
0011-IGKFF_T1 NaN NaN 0011-IGKFF
... ... ... ...
9992-RRAMN_T3 1.0 1.0 9992-RRAMN
9992-UJOEL_T1 NaN NaN 9992-UJOEL
9993-LHIEB_T1 1.0 1.0 9993-LHIEB
9995-HOTOH_T1 1.0 2.0 9995-HOTOH
9995-HOTOH_T2 1.0 1.0 9995-HOTOH

14140 rows × 3 columns

[45]:
# Aggregiere pro Kunde (nur Tickets, die gelöst wurden)
avg_days_till_solution = ticket_metrics.groupby("customer_id")["days_to_solution"].mean().rename("avg_days_till_solution")
avg_thread_till_solution = ticket_metrics.groupby("customer_id")["thread_count"].mean().rename("avg_thread_till_solution")

print(avg_days_till_solution.head())
print(avg_thread_till_solution.head())
customer_id
0002-ORFBO    0.0
0003-MKNFE    0.0
0004-TLHLJ    0.0
0011-IGKFF    NaN
0013-EXCHZ    1.0
Name: avg_days_till_solution, dtype: float64
customer_id
0002-ORFBO    1.00
0003-MKNFE    1.00
0004-TLHLJ    1.00
0011-IGKFF     NaN
0013-EXCHZ    1.25
Name: avg_thread_till_solution, dtype: float64
[46]:
dict_features["avg_num_days_till_solution"] = avg_days_till_solution
dict_features["avg_num_thread_till_solution"] = avg_thread_till_solution
[47]:
dict_features.keys()
[47]:
dict_keys(['num_request_last_1_months', 'num_request_last_3_months', 'num_request_last_6_months', 'num_request_last_12_months', 'avg_num_days_till_solution', 'avg_num_thread_till_solution'])

Feature-Gruppe 4 & 5: Zuletzt benutzter Kanal und am häufigsten genutzter Kanal in den letzten 6 Monaten

[48]:
# Last channel used: Finde pro Kunde die letzte Interaktion (Sortierung nach time_request)
last_channel = df_service.sort_values("time_request").groupby("customer_id").last()["channel"].rename("last_channel_used")

# den meist genutzten Channel in letzten 6 Monaten
# zuerst den Zeitraum auf die letzten 6 Monate eingrenzen
cutoff_6m = stichtag - pd.DateOffset(months=6)
channels_last_6m = df_service[df_service["time_request"] >= cutoff_6m]

# Bestimme pro Kunde den am häufigsten auftretenden Kanal in diesem Zeitraum. Falls es mehrere Modi gibt, nimm den ersten.
channel_most_used_last6mo = channels_last_6m.groupby("customer_id")["channel"].agg(lambda x: x.mode().iloc[0] if not x.mode().empty else np.nan)\
                                   .rename("channel_used_most_last_6_months")

print(last_channel.head())
print(channel_most_used_last6mo.head())
customer_id
0002-ORFBO         Hotline
0003-MKNFE    Social Media
0004-TLHLJ    Social Media
0011-IGKFF    Social Media
0013-EXCHZ           Email
Name: last_channel_used, dtype: str
customer_id
0002-ORFBO    Hotline
0003-MKNFE      Email
0004-TLHLJ       Chat
0013-EXCHZ       Chat
0013-MHZWF    On-site
Name: channel_used_most_last_6_months, dtype: str
[49]:
dict_features["last_channel"] = last_channel
dict_features["channel_most_used_last6mo"] = channel_most_used_last6mo
[50]:
dict_features.keys()
[50]:
dict_keys(['num_request_last_1_months', 'num_request_last_3_months', 'num_request_last_6_months', 'num_request_last_12_months', 'avg_num_days_till_solution', 'avg_num_thread_till_solution', 'last_channel', 'channel_most_used_last6mo'])

Feature-Gruppe 5: Weitere Features – Gesamtzahl Tickets, Interaktionen, durchschnittliche Reply-Zeit und textbasierte Features

[51]:
# Gesamtzahl der Tickets pro Kunde
total_tickets = df_service.groupby("customer_id")["ticket_id"].nunique().rename("total_tickets")

# Gesamtzahl der Interaktionen pro Kunde
total_interactions = df_service.groupby("customer_id").size().rename("total_interactions")

# Durchschnittliche Reply-Verzögerung (in Stunden) pro Interaktion
df_service["reply_delay_hours"] = (df_service["time_reply"] - df_service["time_request"]).dt.total_seconds() / 3600
avg_reply_delay = df_service.groupby("customer_id")["reply_delay_hours"].mean().rename("avg_reply_delay_hours")

# Textbasierte Features:
# Zähle, wie oft im Request-Text Begriffe wie "Problem" oder "Beschwerde" auftauchen
df_service["is_complaint"] = df_service["request"].str.contains("Problem|Beschwerde", case=False, regex=True)
num_complaint_requests = df_service.groupby("customer_id")["is_complaint"].sum().rename("num_complaint_requests")

print(total_tickets.head())
print(total_interactions.head())
print(avg_reply_delay.head())
print(num_complaint_requests.head())
customer_id
0002-ORFBO    1
0003-MKNFE    2
0004-TLHLJ    1
0011-IGKFF    2
0013-EXCHZ    5
Name: total_tickets, dtype: int64
customer_id
0002-ORFBO    3
0003-MKNFE    4
0004-TLHLJ    3
0011-IGKFF    3
0013-EXCHZ    9
Name: total_interactions, dtype: int64
customer_id
0002-ORFBO    19.058148
0003-MKNFE    22.865833
0004-TLHLJ     7.966481
0011-IGKFF    11.924352
0013-EXCHZ    23.818210
Name: avg_reply_delay_hours, dtype: float64
customer_id
0002-ORFBO    1
0003-MKNFE    3
0004-TLHLJ    1
0011-IGKFF    2
0013-EXCHZ    1
Name: num_complaint_requests, dtype: int64
[52]:
dict_features["total_tickets"] = total_tickets
dict_features["total_interactions"] = total_interactions
dict_features["avg_reply_delay"] = avg_reply_delay
dict_features["num_complaint_requests"] = num_complaint_requests
[53]:
dict_features.keys()
[53]:
dict_keys(['num_request_last_1_months', 'num_request_last_3_months', 'num_request_last_6_months', 'num_request_last_12_months', 'avg_num_days_till_solution', 'avg_num_thread_till_solution', 'last_channel', 'channel_most_used_last6mo', 'total_tickets', 'total_interactions', 'avg_reply_delay', 'num_complaint_requests'])

Zusammenführen aller Features

[54]:
# Fasse alle aggregierten Features in einem DataFrame zusammen
df_features = pd.DataFrame(dict_features).reset_index()

print(df_features.shape)
df_features.head()
(6096, 13)
[54]:
customer_id num_request_last_1_months num_request_last_3_months num_request_last_6_months num_request_last_12_months avg_num_days_till_solution avg_num_thread_till_solution last_channel channel_most_used_last6mo total_tickets total_interactions avg_reply_delay num_complaint_requests
0 0002-ORFBO NaN NaN 1.0 3.0 0.0 1.00 Hotline Hotline 1 3 19.058148 1
1 0003-MKNFE NaN 3.0 4.0 4.0 0.0 1.00 Social Media Email 2 4 22.865833 3
2 0004-TLHLJ NaN 3.0 3.0 3.0 0.0 1.00 Social Media Chat 1 3 7.966481 1
3 0011-IGKFF NaN NaN NaN 1.0 NaN NaN Social Media NaN 2 3 11.924352 2
4 0013-EXCHZ 6.0 9.0 9.0 9.0 1.0 1.25 Email Chat 5 9 23.818210 1
[55]:
df_features.info()
<class 'pandas.DataFrame'>
RangeIndex: 6096 entries, 0 to 6095
Data columns (total 13 columns):
 #   Column                        Non-Null Count  Dtype
---  ------                        --------------  -----
 0   customer_id                   6096 non-null   str
 1   num_request_last_1_months     1365 non-null   float64
 2   num_request_last_3_months     2338 non-null   float64
 3   num_request_last_6_months     3191 non-null   float64
 4   num_request_last_12_months    4198 non-null   float64
 5   avg_num_days_till_solution    5384 non-null   float64
 6   avg_num_thread_till_solution  5384 non-null   float64
 7   last_channel                  6096 non-null   str
 8   channel_most_used_last6mo     3191 non-null   str
 9   total_tickets                 6096 non-null   int64
 10  total_interactions            6096 non-null   int64
 11  avg_reply_delay               6096 non-null   float64
 12  num_complaint_requests        6096 non-null   int64
dtypes: float64(7), int64(3), str(3)
memory usage: 741.0 KB
[56]:
df_features.isnull().sum()
[56]:
customer_id                        0
num_request_last_1_months       4731
num_request_last_3_months       3758
num_request_last_6_months       2905
num_request_last_12_months      1898
avg_num_days_till_solution       712
avg_num_thread_till_solution     712
last_channel                       0
channel_most_used_last6mo       2905
total_tickets                      0
total_interactions                 0
avg_reply_delay                    0
num_complaint_requests             0
dtype: int64

Null-Werte

Leere Werte auffüllen - hierbei darauf achten, dass zwischen verschiedenen Datentypen wieder unterschieden werden muss.

In diesem Fall muss man nicht manuell durch alle Spalten durchgehen, sondern kann sich direkt die dtypes zu Nutze machen.

Lösung durch Schleife (ineffizient) vs. pythonic Lösung mit select_dtypes (effizient)

Straight-forward way kann eine Schleife die Lösung liefern:

for col in df_features.columns: if df_features[col].dtype == 'object': df_features[col].fillna("", inplace=True) else: df_features[col].fillna(0, inplace=True)

Aber effizienter (und „Pythonic“-way) würde man select_dtypes nutzen

select_dtypes basiert auf Vektoroperationen, ist somit meist schneller und übersichtlicher als eine explizite Schleife über alle Spalten

[57]:
# Für numerische Spalten
cols_feature_numeric = df_features.select_dtypes(include=[np.number]).columns
df_features[cols_feature_numeric] = df_features[cols_feature_numeric].fillna(0)

# Für string-Spalten
cols_feature_object = df_features.select_dtypes(include=["object"]).columns
df_features[cols_feature_object] = df_features[cols_feature_object].fillna("")

# Ausgabe des Feature-DataFrames
df_features.head()
[57]:
customer_id num_request_last_1_months num_request_last_3_months num_request_last_6_months num_request_last_12_months avg_num_days_till_solution avg_num_thread_till_solution last_channel channel_most_used_last6mo total_tickets total_interactions avg_reply_delay num_complaint_requests
0 0002-ORFBO 0.0 0.0 1.0 3.0 0.0 1.00 Hotline Hotline 1 3 19.058148 1
1 0003-MKNFE 0.0 3.0 4.0 4.0 0.0 1.00 Social Media Email 2 4 22.865833 3
2 0004-TLHLJ 0.0 3.0 3.0 3.0 0.0 1.00 Social Media Chat 1 3 7.966481 1
3 0011-IGKFF 0.0 0.0 0.0 1.0 0.0 0.00 Social Media 2 3 11.924352 2
4 0013-EXCHZ 6.0 9.0 9.0 9.0 1.0 1.25 Email Chat 5 9 23.818210 1

Merge die Features mit der originalen Kunden-Tabelle

Beim Merge achten wir darauf, dass der Schlüssel in der Kunden-Tabelle „customerID“ heißt und in features_df „customer_id“

Zudem wollen wir hier einen left-merge, sodass die Originale Kunden-Tabelle auf jeden Fall bestehen bleibt und die Feature-Tabelle „rangemergt“ wird.

[58]:
df_merge = df_onehot_encoded.merge(df_features, left_on="customerID", right_on="customer_id", how="left")
df_merge
[58]:
customerID gender SeniorCitizen Partner Dependents tenure PhoneService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies PaperlessBilling MonthlyCharges TotalCharges Churn MultipleLines_No MultipleLines_No phone service MultipleLines_Yes InternetService_DSL InternetService_Fiber optic InternetService_No Contract_Month-to-month Contract_One year Contract_Two year PaymentMethod_Bank transfer (automatic) PaymentMethod_Credit card (automatic) PaymentMethod_Electronic check PaymentMethod_Mailed check customer_id num_request_last_1_months num_request_last_3_months num_request_last_6_months num_request_last_12_months avg_num_days_till_solution avg_num_thread_till_solution last_channel channel_most_used_last6mo total_tickets total_interactions avg_reply_delay num_complaint_requests
0 7590-VHVEG 1 0 1 0 1 0 0 1 0 0 0 0 1 29.85 29.85 No 0 0 0 0 0 0 0 0 0 0 0 0 0 7590-VHVEG 2.0 2.0 2.0 2.0 0.0 1.0 Hotline Hotline 1.0 2.0 10.169722 0.0
1 5575-GNVDE 0 0 0 0 34 1 1 0 1 0 0 0 0 56.95 1889.50 No 0 0 0 0 0 0 0 0 0 0 0 0 0 5575-GNVDE 3.0 3.0 3.0 3.0 0.5 1.0 Hotline Hotline 3.0 8.0 21.380208 1.0
2 3668-QPYBK 0 0 0 0 2 1 1 1 0 0 0 0 1 53.85 108.15 Yes 0 0 0 0 0 0 0 0 0 0 0 0 0 3668-QPYBK 3.0 6.0 6.0 6.0 1.5 1.5 Chat Hotline 3.0 6.0 19.322130 1.0
3 7795-CFOCW 0 0 0 0 45 0 1 0 1 1 0 0 0 42.30 1840.75 No 0 0 0 0 0 0 0 0 0 0 0 0 0 7795-CFOCW 0.0 0.0 0.0 0.0 0.0 1.0 Email 1.0 2.0 11.750556 1.0
4 9237-HQITU 1 0 0 0 2 1 0 0 0 0 0 0 1 70.70 151.65 Yes 0 0 0 0 0 0 0 0 0 0 0 0 0 9237-HQITU 2.0 5.0 5.0 5.0 2.0 2.0 Hotline Email 2.0 5.0 28.044167 3.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7027 6840-RESVB 0 0 1 1 24 1 1 0 1 1 1 1 1 84.80 1990.50 No 0 0 0 0 0 0 0 0 0 0 0 0 0 6840-RESVB 0.0 0.0 0.0 3.0 1.0 1.0 Email 1.0 3.0 23.803889 2.0
7028 2234-XADUH 1 0 1 1 72 1 0 1 1 0 1 1 1 103.20 7362.90 No 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7029 4801-JZAZL 1 0 1 1 11 0 1 0 0 0 0 0 1 29.60 346.45 No 0 0 0 0 0 0 0 0 0 0 0 0 0 4801-JZAZL 0.0 0.0 2.0 2.0 0.0 1.0 Email Email 1.0 2.0 18.954583 2.0
7030 8361-LTMKD 0 0 1 0 4 1 0 0 0 0 0 0 1 74.40 306.60 Yes 0 0 0 0 0 0 0 0 0 0 0 0 0 8361-LTMKD 0.0 2.0 2.0 2.0 0.0 1.0 Hotline Hotline 1.0 2.0 10.321667 2.0
7031 3186-AJIEK 0 0 0 0 66 1 1 0 1 1 1 1 1 105.65 6844.50 No 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

7032 rows × 43 columns

[59]:
# Entferne den doppelten Spaltennamen (customer_id) oder ordne die Spalten neu
df_merge.drop(columns="customer_id", inplace=True)

# Ausgabe der erweiterten Kunden-Tabelle
df_merge.head()
[59]:
customerID gender SeniorCitizen Partner Dependents tenure PhoneService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies PaperlessBilling MonthlyCharges TotalCharges Churn MultipleLines_No MultipleLines_No phone service MultipleLines_Yes InternetService_DSL InternetService_Fiber optic InternetService_No Contract_Month-to-month Contract_One year Contract_Two year PaymentMethod_Bank transfer (automatic) PaymentMethod_Credit card (automatic) PaymentMethod_Electronic check PaymentMethod_Mailed check num_request_last_1_months num_request_last_3_months num_request_last_6_months num_request_last_12_months avg_num_days_till_solution avg_num_thread_till_solution last_channel channel_most_used_last6mo total_tickets total_interactions avg_reply_delay num_complaint_requests
0 7590-VHVEG 1 0 1 0 1 0 0 1 0 0 0 0 1 29.85 29.85 No 0 0 0 0 0 0 0 0 0 0 0 0 0 2.0 2.0 2.0 2.0 0.0 1.0 Hotline Hotline 1.0 2.0 10.169722 0.0
1 5575-GNVDE 0 0 0 0 34 1 1 0 1 0 0 0 0 56.95 1889.50 No 0 0 0 0 0 0 0 0 0 0 0 0 0 3.0 3.0 3.0 3.0 0.5 1.0 Hotline Hotline 3.0 8.0 21.380208 1.0
2 3668-QPYBK 0 0 0 0 2 1 1 1 0 0 0 0 1 53.85 108.15 Yes 0 0 0 0 0 0 0 0 0 0 0 0 0 3.0 6.0 6.0 6.0 1.5 1.5 Chat Hotline 3.0 6.0 19.322130 1.0
3 7795-CFOCW 0 0 0 0 45 0 1 0 1 1 0 0 0 42.30 1840.75 No 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0 Email 1.0 2.0 11.750556 1.0
4 9237-HQITU 1 0 0 0 2 1 0 0 0 0 0 0 1 70.70 151.65 Yes 0 0 0 0 0 0 0 0 0 0 0 0 0 2.0 5.0 5.0 5.0 2.0 2.0 Hotline Email 2.0 5.0 28.044167 3.0
[60]:
df_merge.info()
<class 'pandas.DataFrame'>
RangeIndex: 7032 entries, 0 to 7031
Data columns (total 42 columns):
 #   Column                                   Non-Null Count  Dtype
---  ------                                   --------------  -----
 0   customerID                               7032 non-null   str
 1   gender                                   7032 non-null   int64
 2   SeniorCitizen                            7032 non-null   int64
 3   Partner                                  7032 non-null   int64
 4   Dependents                               7032 non-null   int64
 5   tenure                                   7032 non-null   int64
 6   PhoneService                             7032 non-null   int64
 7   OnlineSecurity                           7032 non-null   int64
 8   OnlineBackup                             7032 non-null   int64
 9   DeviceProtection                         7032 non-null   int64
 10  TechSupport                              7032 non-null   int64
 11  StreamingTV                              7032 non-null   int64
 12  StreamingMovies                          7032 non-null   int64
 13  PaperlessBilling                         7032 non-null   int64
 14  MonthlyCharges                           7032 non-null   float64
 15  TotalCharges                             7032 non-null   float64
 16  Churn                                    7032 non-null   str
 17  MultipleLines_No                         7032 non-null   int64
 18  MultipleLines_No phone service           7032 non-null   int64
 19  MultipleLines_Yes                        7032 non-null   int64
 20  InternetService_DSL                      7032 non-null   int64
 21  InternetService_Fiber optic              7032 non-null   int64
 22  InternetService_No                       7032 non-null   int64
 23  Contract_Month-to-month                  7032 non-null   int64
 24  Contract_One year                        7032 non-null   int64
 25  Contract_Two year                        7032 non-null   int64
 26  PaymentMethod_Bank transfer (automatic)  7032 non-null   int64
 27  PaymentMethod_Credit card (automatic)    7032 non-null   int64
 28  PaymentMethod_Electronic check           7032 non-null   int64
 29  PaymentMethod_Mailed check               7032 non-null   int64
 30  num_request_last_1_months                6087 non-null   float64
 31  num_request_last_3_months                6087 non-null   float64
 32  num_request_last_6_months                6087 non-null   float64
 33  num_request_last_12_months               6087 non-null   float64
 34  avg_num_days_till_solution               6087 non-null   float64
 35  avg_num_thread_till_solution             6087 non-null   float64
 36  last_channel                             6087 non-null   str
 37  channel_most_used_last6mo                6087 non-null   str
 38  total_tickets                            6087 non-null   float64
 39  total_interactions                       6087 non-null   float64
 40  avg_reply_delay                          6087 non-null   float64
 41  num_complaint_requests                   6087 non-null   float64
dtypes: float64(12), int64(26), str(4)
memory usage: 2.4 MB
[61]:
df_merge.isnull().sum()
[61]:
customerID                                   0
gender                                       0
SeniorCitizen                                0
Partner                                      0
Dependents                                   0
tenure                                       0
PhoneService                                 0
OnlineSecurity                               0
OnlineBackup                                 0
DeviceProtection                             0
TechSupport                                  0
StreamingTV                                  0
StreamingMovies                              0
PaperlessBilling                             0
MonthlyCharges                               0
TotalCharges                                 0
Churn                                        0
MultipleLines_No                             0
MultipleLines_No phone service               0
MultipleLines_Yes                            0
InternetService_DSL                          0
InternetService_Fiber optic                  0
InternetService_No                           0
Contract_Month-to-month                      0
Contract_One year                            0
Contract_Two year                            0
PaymentMethod_Bank transfer (automatic)      0
PaymentMethod_Credit card (automatic)        0
PaymentMethod_Electronic check               0
PaymentMethod_Mailed check                   0
num_request_last_1_months                  945
num_request_last_3_months                  945
num_request_last_6_months                  945
num_request_last_12_months                 945
avg_num_days_till_solution                 945
avg_num_thread_till_solution               945
last_channel                               945
channel_most_used_last6mo                  945
total_tickets                              945
total_interactions                         945
avg_reply_delay                            945
num_complaint_requests                     945
dtype: int64
[62]:
df_merge[df_merge["num_request_last_6_months"].isnull()]
[62]:
customerID gender SeniorCitizen Partner Dependents tenure PhoneService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies PaperlessBilling MonthlyCharges TotalCharges Churn MultipleLines_No MultipleLines_No phone service MultipleLines_Yes InternetService_DSL InternetService_Fiber optic InternetService_No Contract_Month-to-month Contract_One year Contract_Two year PaymentMethod_Bank transfer (automatic) PaymentMethod_Credit card (automatic) PaymentMethod_Electronic check PaymentMethod_Mailed check num_request_last_1_months num_request_last_3_months num_request_last_6_months num_request_last_12_months avg_num_days_till_solution avg_num_thread_till_solution last_channel channel_most_used_last6mo total_tickets total_interactions avg_reply_delay num_complaint_requests
10 9763-GRSKD 0 0 1 1 13 1 1 0 0 0 0 0 1 49.95 587.45 No 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
30 3841-NFECX 1 0 1 0 71 1 1 1 1 1 0 0 1 96.35 6766.95 No 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
31 4929-XIHVW 0 0 1 0 2 1 0 0 1 0 1 1 1 95.50 181.65 No 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
38 5380-WJKOV 0 0 0 0 34 1 0 1 1 0 1 1 1 106.35 3549.25 Yes 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
42 9867-JCZSP 1 0 1 1 17 1 0 0 0 0 0 0 0 20.75 418.25 No 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7018 2235-DWLJU 1 0 0 0 6 0 0 0 0 0 1 1 1 44.40 263.05 No 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7019 0871-OPBXW 1 0 0 0 2 1 0 0 0 0 0 0 1 20.05 39.25 No 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7025 7750-EYXWZ 1 0 0 0 12 0 0 1 1 1 1 1 0 60.65 743.30 No 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7028 2234-XADUH 1 0 1 1 72 1 0 1 1 0 1 1 1 103.20 7362.90 No 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
7031 3186-AJIEK 0 0 0 0 66 1 1 0 1 1 1 1 1 105.65 6844.50 No 0 0 0 0 0 0 0 0 0 0 0 0 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

945 rows × 42 columns

[63]:
df_service[df_service["customer_id"]=="9763-GRSKD"]
[63]:
ticket_id customer_id time_request time_reply channel request reply solved original_request original_ticket_id reply_delay_hours is_complaint

Den finalen DataFrame auffüllen

[64]:
# Für numerische Spalten
numeric_cols = df_merge.select_dtypes(include=[np.number]).columns
df_merge[numeric_cols] = df_merge[numeric_cols].fillna(0)

# Für string-Spalten
object_cols = df_merge.select_dtypes(include=["object"]).columns
df_merge[object_cols] = df_merge[object_cols].fillna("")
[65]:
df_merge.isnull().sum()
[65]:
customerID                                 0
gender                                     0
SeniorCitizen                              0
Partner                                    0
Dependents                                 0
tenure                                     0
PhoneService                               0
OnlineSecurity                             0
OnlineBackup                               0
DeviceProtection                           0
TechSupport                                0
StreamingTV                                0
StreamingMovies                            0
PaperlessBilling                           0
MonthlyCharges                             0
TotalCharges                               0
Churn                                      0
MultipleLines_No                           0
MultipleLines_No phone service             0
MultipleLines_Yes                          0
InternetService_DSL                        0
InternetService_Fiber optic                0
InternetService_No                         0
Contract_Month-to-month                    0
Contract_One year                          0
Contract_Two year                          0
PaymentMethod_Bank transfer (automatic)    0
PaymentMethod_Credit card (automatic)      0
PaymentMethod_Electronic check             0
PaymentMethod_Mailed check                 0
num_request_last_1_months                  0
num_request_last_3_months                  0
num_request_last_6_months                  0
num_request_last_12_months                 0
avg_num_days_till_solution                 0
avg_num_thread_till_solution               0
last_channel                               0
channel_most_used_last6mo                  0
total_tickets                              0
total_interactions                         0
avg_reply_delay                            0
num_complaint_requests                     0
dtype: int64
[66]:
df_merge.head()
[66]:
customerID gender SeniorCitizen Partner Dependents tenure PhoneService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies PaperlessBilling MonthlyCharges TotalCharges Churn MultipleLines_No MultipleLines_No phone service MultipleLines_Yes InternetService_DSL InternetService_Fiber optic InternetService_No Contract_Month-to-month Contract_One year Contract_Two year PaymentMethod_Bank transfer (automatic) PaymentMethod_Credit card (automatic) PaymentMethod_Electronic check PaymentMethod_Mailed check num_request_last_1_months num_request_last_3_months num_request_last_6_months num_request_last_12_months avg_num_days_till_solution avg_num_thread_till_solution last_channel channel_most_used_last6mo total_tickets total_interactions avg_reply_delay num_complaint_requests
0 7590-VHVEG 1 0 1 0 1 0 0 1 0 0 0 0 1 29.85 29.85 No 0 0 0 0 0 0 0 0 0 0 0 0 0 2.0 2.0 2.0 2.0 0.0 1.0 Hotline Hotline 1.0 2.0 10.169722 0.0
1 5575-GNVDE 0 0 0 0 34 1 1 0 1 0 0 0 0 56.95 1889.50 No 0 0 0 0 0 0 0 0 0 0 0 0 0 3.0 3.0 3.0 3.0 0.5 1.0 Hotline Hotline 3.0 8.0 21.380208 1.0
2 3668-QPYBK 0 0 0 0 2 1 1 1 0 0 0 0 1 53.85 108.15 Yes 0 0 0 0 0 0 0 0 0 0 0 0 0 3.0 6.0 6.0 6.0 1.5 1.5 Chat Hotline 3.0 6.0 19.322130 1.0
3 7795-CFOCW 0 0 0 0 45 0 1 0 1 1 0 0 0 42.30 1840.75 No 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0 Email 1.0 2.0 11.750556 1.0
4 9237-HQITU 1 0 0 0 2 1 0 0 0 0 0 0 1 70.70 151.65 Yes 0 0 0 0 0 0 0 0 0 0 0 0 0 2.0 5.0 5.0 5.0 2.0 2.0 Hotline Email 2.0 5.0 28.044167 3.0
[67]:
df_merge.dtypes
[67]:
customerID                                     str
gender                                       int64
SeniorCitizen                                int64
Partner                                      int64
Dependents                                   int64
tenure                                       int64
PhoneService                                 int64
OnlineSecurity                               int64
OnlineBackup                                 int64
DeviceProtection                             int64
TechSupport                                  int64
StreamingTV                                  int64
StreamingMovies                              int64
PaperlessBilling                             int64
MonthlyCharges                             float64
TotalCharges                               float64
Churn                                          str
MultipleLines_No                             int64
MultipleLines_No phone service               int64
MultipleLines_Yes                            int64
InternetService_DSL                          int64
InternetService_Fiber optic                  int64
InternetService_No                           int64
Contract_Month-to-month                      int64
Contract_One year                            int64
Contract_Two year                            int64
PaymentMethod_Bank transfer (automatic)      int64
PaymentMethod_Credit card (automatic)        int64
PaymentMethod_Electronic check               int64
PaymentMethod_Mailed check                   int64
num_request_last_1_months                  float64
num_request_last_3_months                  float64
num_request_last_6_months                  float64
num_request_last_12_months                 float64
avg_num_days_till_solution                 float64
avg_num_thread_till_solution               float64
last_channel                                   str
channel_most_used_last6mo                      str
total_tickets                              float64
total_interactions                         float64
avg_reply_delay                            float64
num_complaint_requests                     float64
dtype: object

Transform categorical columns

[68]:
cols_merge_object = ["last_channel", "channel_most_used_last6mo"]
df_final = pd.get_dummies(df_merge, columns=cols_merge_object, dtype=int)

print(df_final.shape)
df_final
(7032, 52)
[68]:
customerID gender SeniorCitizen Partner Dependents tenure PhoneService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies PaperlessBilling MonthlyCharges TotalCharges Churn MultipleLines_No MultipleLines_No phone service MultipleLines_Yes InternetService_DSL InternetService_Fiber optic InternetService_No Contract_Month-to-month Contract_One year Contract_Two year PaymentMethod_Bank transfer (automatic) PaymentMethod_Credit card (automatic) PaymentMethod_Electronic check PaymentMethod_Mailed check num_request_last_1_months num_request_last_3_months num_request_last_6_months num_request_last_12_months avg_num_days_till_solution avg_num_thread_till_solution total_tickets total_interactions avg_reply_delay num_complaint_requests last_channel_ last_channel_Chat last_channel_Email last_channel_Hotline last_channel_On-site last_channel_Social Media channel_most_used_last6mo_ channel_most_used_last6mo_Chat channel_most_used_last6mo_Email channel_most_used_last6mo_Hotline channel_most_used_last6mo_On-site channel_most_used_last6mo_Social Media
0 7590-VHVEG 1 0 1 0 1 0 0 1 0 0 0 0 1 29.85 29.85 No 0 0 0 0 0 0 0 0 0 0 0 0 0 2.0 2.0 2.0 2.0 0.0 1.0 1.0 2.0 10.169722 0.0 0 0 0 1 0 0 0 0 0 1 0 0
1 5575-GNVDE 0 0 0 0 34 1 1 0 1 0 0 0 0 56.95 1889.50 No 0 0 0 0 0 0 0 0 0 0 0 0 0 3.0 3.0 3.0 3.0 0.5 1.0 3.0 8.0 21.380208 1.0 0 0 0 1 0 0 0 0 0 1 0 0
2 3668-QPYBK 0 0 0 0 2 1 1 1 0 0 0 0 1 53.85 108.15 Yes 0 0 0 0 0 0 0 0 0 0 0 0 0 3.0 6.0 6.0 6.0 1.5 1.5 3.0 6.0 19.322130 1.0 0 1 0 0 0 0 0 0 0 1 0 0
3 7795-CFOCW 0 0 0 0 45 0 1 0 1 1 0 0 0 42.30 1840.75 No 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 2.0 11.750556 1.0 0 0 1 0 0 0 1 0 0 0 0 0
4 9237-HQITU 1 0 0 0 2 1 0 0 0 0 0 0 1 70.70 151.65 Yes 0 0 0 0 0 0 0 0 0 0 0 0 0 2.0 5.0 5.0 5.0 2.0 2.0 2.0 5.0 28.044167 3.0 0 0 0 1 0 0 0 0 1 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7027 6840-RESVB 0 0 1 1 24 1 1 0 1 1 1 1 1 84.80 1990.50 No 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 3.0 1.0 1.0 1.0 3.0 23.803889 2.0 0 0 1 0 0 0 1 0 0 0 0 0
7028 2234-XADUH 1 0 1 1 72 1 0 1 1 0 1 1 1 103.20 7362.90 No 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 1 0 0 0 0 0 1 0 0 0 0 0
7029 4801-JZAZL 1 0 1 1 11 0 1 0 0 0 0 0 1 29.60 346.45 No 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 2.0 2.0 0.0 1.0 1.0 2.0 18.954583 2.0 0 0 1 0 0 0 0 0 1 0 0 0
7030 8361-LTMKD 0 0 1 0 4 1 0 0 0 0 0 0 1 74.40 306.60 Yes 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 2.0 2.0 2.0 0.0 1.0 1.0 2.0 10.321667 2.0 0 0 0 1 0 0 0 0 0 1 0 0
7031 3186-AJIEK 0 0 0 0 66 1 1 0 1 1 1 1 1 105.65 6844.50 No 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 1 0 0 0 0 0 1 0 0 0 0 0

7032 rows × 52 columns

Don’t forget the Target Variable!

[69]:
df_final['Churn'] = df_final['Churn'].map({"Yes": 1, "No": 0})
df_final.head()
[69]:
customerID gender SeniorCitizen Partner Dependents tenure PhoneService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies PaperlessBilling MonthlyCharges TotalCharges Churn MultipleLines_No MultipleLines_No phone service MultipleLines_Yes InternetService_DSL InternetService_Fiber optic InternetService_No Contract_Month-to-month Contract_One year Contract_Two year PaymentMethod_Bank transfer (automatic) PaymentMethod_Credit card (automatic) PaymentMethod_Electronic check PaymentMethod_Mailed check num_request_last_1_months num_request_last_3_months num_request_last_6_months num_request_last_12_months avg_num_days_till_solution avg_num_thread_till_solution total_tickets total_interactions avg_reply_delay num_complaint_requests last_channel_ last_channel_Chat last_channel_Email last_channel_Hotline last_channel_On-site last_channel_Social Media channel_most_used_last6mo_ channel_most_used_last6mo_Chat channel_most_used_last6mo_Email channel_most_used_last6mo_Hotline channel_most_used_last6mo_On-site channel_most_used_last6mo_Social Media
0 7590-VHVEG 1 0 1 0 1 0 0 1 0 0 0 0 1 29.85 29.85 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2.0 2.0 2.0 2.0 0.0 1.0 1.0 2.0 10.169722 0.0 0 0 0 1 0 0 0 0 0 1 0 0
1 5575-GNVDE 0 0 0 0 34 1 1 0 1 0 0 0 0 56.95 1889.50 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3.0 3.0 3.0 3.0 0.5 1.0 3.0 8.0 21.380208 1.0 0 0 0 1 0 0 0 0 0 1 0 0
2 3668-QPYBK 0 0 0 0 2 1 1 1 0 0 0 0 1 53.85 108.15 1 0 0 0 0 0 0 0 0 0 0 0 0 0 3.0 6.0 6.0 6.0 1.5 1.5 3.0 6.0 19.322130 1.0 0 1 0 0 0 0 0 0 0 1 0 0
3 7795-CFOCW 0 0 0 0 45 0 1 0 1 1 0 0 0 42.30 1840.75 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 2.0 11.750556 1.0 0 0 1 0 0 0 1 0 0 0 0 0
4 9237-HQITU 1 0 0 0 2 1 0 0 0 0 0 0 1 70.70 151.65 1 0 0 0 0 0 0 0 0 0 0 0 0 0 2.0 5.0 5.0 5.0 2.0 2.0 2.0 5.0 28.044167 3.0 0 0 0 1 0 0 0 0 1 0 0 0
[70]:
## Verwerfe ID-columns
if "customerID" in df_final:
    df_final.drop("customerID", axis=1, inplace=True)

df_final
[70]:
gender SeniorCitizen Partner Dependents tenure PhoneService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies PaperlessBilling MonthlyCharges TotalCharges Churn MultipleLines_No MultipleLines_No phone service MultipleLines_Yes InternetService_DSL InternetService_Fiber optic InternetService_No Contract_Month-to-month Contract_One year Contract_Two year PaymentMethod_Bank transfer (automatic) PaymentMethod_Credit card (automatic) PaymentMethod_Electronic check PaymentMethod_Mailed check num_request_last_1_months num_request_last_3_months num_request_last_6_months num_request_last_12_months avg_num_days_till_solution avg_num_thread_till_solution total_tickets total_interactions avg_reply_delay num_complaint_requests last_channel_ last_channel_Chat last_channel_Email last_channel_Hotline last_channel_On-site last_channel_Social Media channel_most_used_last6mo_ channel_most_used_last6mo_Chat channel_most_used_last6mo_Email channel_most_used_last6mo_Hotline channel_most_used_last6mo_On-site channel_most_used_last6mo_Social Media
0 1 0 1 0 1 0 0 1 0 0 0 0 1 29.85 29.85 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2.0 2.0 2.0 2.0 0.0 1.0 1.0 2.0 10.169722 0.0 0 0 0 1 0 0 0 0 0 1 0 0
1 0 0 0 0 34 1 1 0 1 0 0 0 0 56.95 1889.50 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3.0 3.0 3.0 3.0 0.5 1.0 3.0 8.0 21.380208 1.0 0 0 0 1 0 0 0 0 0 1 0 0
2 0 0 0 0 2 1 1 1 0 0 0 0 1 53.85 108.15 1 0 0 0 0 0 0 0 0 0 0 0 0 0 3.0 6.0 6.0 6.0 1.5 1.5 3.0 6.0 19.322130 1.0 0 1 0 0 0 0 0 0 0 1 0 0
3 0 0 0 0 45 0 1 0 1 1 0 0 0 42.30 1840.75 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 2.0 11.750556 1.0 0 0 1 0 0 0 1 0 0 0 0 0
4 1 0 0 0 2 1 0 0 0 0 0 0 1 70.70 151.65 1 0 0 0 0 0 0 0 0 0 0 0 0 0 2.0 5.0 5.0 5.0 2.0 2.0 2.0 5.0 28.044167 3.0 0 0 0 1 0 0 0 0 1 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7027 0 0 1 1 24 1 1 0 1 1 1 1 1 84.80 1990.50 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 3.0 1.0 1.0 1.0 3.0 23.803889 2.0 0 0 1 0 0 0 1 0 0 0 0 0
7028 1 0 1 1 72 1 0 1 1 0 1 1 1 103.20 7362.90 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 1 0 0 0 0 0 1 0 0 0 0 0
7029 1 0 1 1 11 0 1 0 0 0 0 0 1 29.60 346.45 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 2.0 2.0 0.0 1.0 1.0 2.0 18.954583 2.0 0 0 1 0 0 0 0 0 1 0 0 0
7030 0 0 1 0 4 1 0 0 0 0 0 0 1 74.40 306.60 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 2.0 2.0 2.0 0.0 1.0 1.0 2.0 10.321667 2.0 0 0 0 1 0 0 0 0 0 1 0 0
7031 0 0 0 0 66 1 1 0 1 1 1 1 1 105.65 6844.50 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 1 0 0 0 0 0 1 0 0 0 0 0

7032 rows × 51 columns

Kunden-IDs:

Jetzt, wo die Tabellen gemergt sind, können wir als letzten Schritt vor dem Split des Datensatzes die einzig uebrig gebliebene Kunden-ID Spalte entfernen

[71]:
if "customerID" in df_final:
    df_final.drop("customerID", axis=1, inplace=True)

df_final
[71]:
gender SeniorCitizen Partner Dependents tenure PhoneService OnlineSecurity OnlineBackup DeviceProtection TechSupport StreamingTV StreamingMovies PaperlessBilling MonthlyCharges TotalCharges Churn MultipleLines_No MultipleLines_No phone service MultipleLines_Yes InternetService_DSL InternetService_Fiber optic InternetService_No Contract_Month-to-month Contract_One year Contract_Two year PaymentMethod_Bank transfer (automatic) PaymentMethod_Credit card (automatic) PaymentMethod_Electronic check PaymentMethod_Mailed check num_request_last_1_months num_request_last_3_months num_request_last_6_months num_request_last_12_months avg_num_days_till_solution avg_num_thread_till_solution total_tickets total_interactions avg_reply_delay num_complaint_requests last_channel_ last_channel_Chat last_channel_Email last_channel_Hotline last_channel_On-site last_channel_Social Media channel_most_used_last6mo_ channel_most_used_last6mo_Chat channel_most_used_last6mo_Email channel_most_used_last6mo_Hotline channel_most_used_last6mo_On-site channel_most_used_last6mo_Social Media
0 1 0 1 0 1 0 0 1 0 0 0 0 1 29.85 29.85 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2.0 2.0 2.0 2.0 0.0 1.0 1.0 2.0 10.169722 0.0 0 0 0 1 0 0 0 0 0 1 0 0
1 0 0 0 0 34 1 1 0 1 0 0 0 0 56.95 1889.50 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3.0 3.0 3.0 3.0 0.5 1.0 3.0 8.0 21.380208 1.0 0 0 0 1 0 0 0 0 0 1 0 0
2 0 0 0 0 2 1 1 1 0 0 0 0 1 53.85 108.15 1 0 0 0 0 0 0 0 0 0 0 0 0 0 3.0 6.0 6.0 6.0 1.5 1.5 3.0 6.0 19.322130 1.0 0 1 0 0 0 0 0 0 0 1 0 0
3 0 0 0 0 45 0 1 0 1 1 0 0 0 42.30 1840.75 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 2.0 11.750556 1.0 0 0 1 0 0 0 1 0 0 0 0 0
4 1 0 0 0 2 1 0 0 0 0 0 0 1 70.70 151.65 1 0 0 0 0 0 0 0 0 0 0 0 0 0 2.0 5.0 5.0 5.0 2.0 2.0 2.0 5.0 28.044167 3.0 0 0 0 1 0 0 0 0 1 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7027 0 0 1 1 24 1 1 0 1 1 1 1 1 84.80 1990.50 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 3.0 1.0 1.0 1.0 3.0 23.803889 2.0 0 0 1 0 0 0 1 0 0 0 0 0
7028 1 0 1 1 72 1 0 1 1 0 1 1 1 103.20 7362.90 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 1 0 0 0 0 0 1 0 0 0 0 0
7029 1 0 1 1 11 0 1 0 0 0 0 0 1 29.60 346.45 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 2.0 2.0 0.0 1.0 1.0 2.0 18.954583 2.0 0 0 1 0 0 0 0 0 1 0 0 0
7030 0 0 1 0 4 1 0 0 0 0 0 0 1 74.40 306.60 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 2.0 2.0 2.0 0.0 1.0 1.0 2.0 10.321667 2.0 0 0 0 1 0 0 0 0 0 1 0 0
7031 0 0 0 0 66 1 1 0 1 1 1 1 1 105.65 6844.50 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 1 0 0 0 0 0 1 0 0 0 0 0

7032 rows × 51 columns

Split the Data to x_train, x_test, y_train, y_test

[72]:
# Split X and y
X = df_final.drop(columns = ['Churn'])
y = df_final['Churn'].values
[73]:
# Splitting the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, random_state = 42, stratify=y)

Normalization

How to scale/normalize the numerical columns?

[74]:
cols_ori_numeric
[74]:
['tenure', 'MonthlyCharges', 'TotalCharges']
[75]:
cols_feature_numeric
[75]:
Index(['num_request_last_1_months', 'num_request_last_3_months',
       'num_request_last_6_months', 'num_request_last_12_months',
       'avg_num_days_till_solution', 'avg_num_thread_till_solution',
       'total_tickets', 'total_interactions', 'avg_reply_delay',
       'num_complaint_requests'],
      dtype='str')
[76]:
numeric_final_cols = cols_ori_numeric + cols_feature_numeric.tolist()
numeric_final_cols
[76]:
['tenure',
 'MonthlyCharges',
 'TotalCharges',
 'num_request_last_1_months',
 'num_request_last_3_months',
 'num_request_last_6_months',
 'num_request_last_12_months',
 'avg_num_days_till_solution',
 'avg_num_thread_till_solution',
 'total_tickets',
 'total_interactions',
 'avg_reply_delay',
 'num_complaint_requests']
[77]:
def distplot(feature, frame, color="r"):
    plt.figure(figsize=(8,3))
    plt.title("Verteilung für {}".format(feature))
    ax = sns.distplot(frame[feature], color= color)
[78]:
for feat in numeric_final_cols:
    distplot(feat, df_final)
../_images/4gen_ai_customer-churn-2_112_0.png
../_images/4gen_ai_customer-churn-2_112_1.png
../_images/4gen_ai_customer-churn-2_112_2.png
../_images/4gen_ai_customer-churn-2_112_3.png
../_images/4gen_ai_customer-churn-2_112_4.png
../_images/4gen_ai_customer-churn-2_112_5.png
../_images/4gen_ai_customer-churn-2_112_6.png
../_images/4gen_ai_customer-churn-2_112_7.png
../_images/4gen_ai_customer-churn-2_112_8.png
../_images/4gen_ai_customer-churn-2_112_9.png
../_images/4gen_ai_customer-churn-2_112_10.png
../_images/4gen_ai_customer-churn-2_112_11.png
../_images/4gen_ai_customer-churn-2_112_12.png

Überlegung: Welchen Scaler sollte ich nutzen?

Da die numerischen Features über verschiedene Wertebereiche (value ranges) verteilt sind, nutzen wir hier den Standard Scalar, um sie alle auf den gleichen Bereich runterzuskalieren.

Normalisiere die numerischen Attribute

[79]:
df_std = pd.DataFrame(StandardScaler().fit_transform(df_final[numeric_final_cols].astype("float64")),
                       columns=numeric_final_cols)

for feat in numeric_final_cols:
    distplot(feat, df_std, color="c")
../_images/4gen_ai_customer-churn-2_115_0.png
../_images/4gen_ai_customer-churn-2_115_1.png
../_images/4gen_ai_customer-churn-2_115_2.png
../_images/4gen_ai_customer-churn-2_115_3.png
../_images/4gen_ai_customer-churn-2_115_4.png
../_images/4gen_ai_customer-churn-2_115_5.png
../_images/4gen_ai_customer-churn-2_115_6.png
../_images/4gen_ai_customer-churn-2_115_7.png
../_images/4gen_ai_customer-churn-2_115_8.png
../_images/4gen_ai_customer-churn-2_115_9.png
../_images/4gen_ai_customer-churn-2_115_10.png
../_images/4gen_ai_customer-churn-2_115_11.png
../_images/4gen_ai_customer-churn-2_115_12.png
[80]:
scaler= StandardScaler()

X_train[numeric_final_cols] = scaler.fit_transform(X_train[numeric_final_cols])
X_test[numeric_final_cols] = scaler.transform(X_test[numeric_final_cols])

Modellieren

Es handelt sich hierbei um einen Datensatz mit klar definierten Labels (Churn: Yes oder No), die eine Klassifikationsproblematik beschreiben.

Dazu werden wir nun mehrere Klassifikationsmodelle testen.

K-Nearest Neighbor Classifier

Der K-Nearest-Neighbor Classifier ist eine nichtparametrische Methode zur Schätzung von Wahrscheinlichkeitsdichtefunktionen, mehr Details: https://de.wikipedia.org/wiki/N%C3%A4chste-Nachbarn-Klassifikation

Hier nutzen wir die kNN-Implemtation von scikit-learn: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

[81]:
# Initiere das knn-Modell
knn_model = KNeighborsClassifier(n_neighbors = 11)
[82]:
# Trainiere das Modell mit unseren Trainingsdaten
knn_model.fit(X_train,y_train)
[82]:
KNeighborsClassifier(n_neighbors=11)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
[83]:
# die "score"-Funktion scoret direkt die predicted Resultate vom X_test gegen die wahren y-Werte und gibt die Accuracy zurück
accuracy_knn = knn_model.score(X_test, y_test)
print("KNN accuracy:",accuracy_knn)
print(knn_model.score(X_test,y_test))
KNN accuracy: 0.7663507109004739
0.7663507109004739
[84]:
# Alternativ kann man direkt mit dem Modell auf den X_test (Testdatensatz) vorhersagen und diese Vorhersagen gegen die wahren Werte des Testdatensatzes mit sklearn evaluieren
y_pred = knn_model.predict(X_test)

prec = precision_score(y_test, y_pred, average=None)
recall = recall_score(y_test, y_pred, average=None)

print(f"Precision for class 0: {prec[0]}, precision for class 1: {prec[1]}")
print(f"Recall for class 0: {recall[0]}, recall for class 1: {recall[1]}")
Precision for class 0: 0.8102232667450059, precision for class 1: 0.5833333333333334
Recall for class 0: 0.8902517753389283, recall for class 1: 0.42424242424242425
[85]:
prec = precision_score(y_test, y_pred, pos_label=1)
print(prec)
recall = recall_score(y_test, y_pred, pos_label=1)
print(recall)
0.5833333333333334
0.42424242424242425
[86]:
# Zudem gibt es den "Classification Report", der direkt mehrere Metriken Resultate ausgibt

print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.81      0.89      0.85      1549
           1       0.58      0.42      0.49       561

    accuracy                           0.77      2110
   macro avg       0.70      0.66      0.67      2110
weighted avg       0.75      0.77      0.75      2110

Logistic Regression Classifier

Der Logistic Regression Classifier basiert auf Linear Regression, nur gibt sie am Ende eine

[87]:
lr_model = LogisticRegression()
lr_model.fit(X_train,y_train)
accuracy_lr = lr_model.score(X_test,y_test)
print("Logistic Regression accuracy is :",accuracy_lr)
Logistic Regression accuracy is : 0.7943127962085308
[88]:
# Make predictions
lr_pred= lr_model.predict(X_test)
report = classification_report(y_test,lr_pred)
print(report)
              precision    recall  f1-score   support

           0       0.83      0.90      0.86      1549
           1       0.64      0.51      0.57       561

    accuracy                           0.79      2110
   macro avg       0.74      0.70      0.72      2110
weighted avg       0.78      0.79      0.79      2110

[89]:
plt.figure(figsize=(4,3))
sns.heatmap(confusion_matrix(y_test, lr_pred),
                annot=True,fmt = "d",linecolor="k",linewidths=3)

plt.title("LOGISTIC REGRESSION CONFUSION MATRIX",fontsize=14)
plt.show()
../_images/4gen_ai_customer-churn-2_128_0.png

Decision Tree Classifier

[90]:
dt_model = DecisionTreeClassifier()
dt_model.fit(X_train,y_train)
y_pred = dt_model.predict(X_test)
accuracy_dt = dt_model.score(X_test, y_test)
print("Decision Tree accuracy is :",accuracy_dt)
Decision Tree accuracy is : 0.7161137440758294
[91]:
print(classification_report(y_test, y_pred))
              precision    recall  f1-score   support

           0       0.81      0.80      0.80      1549
           1       0.47      0.50      0.48       561

    accuracy                           0.72      2110
   macro avg       0.64      0.65      0.64      2110
weighted avg       0.72      0.72      0.72      2110

AdaBoost Classifier

[92]:
a_model = AdaBoostClassifier()
a_model.fit(X_train,y_train)
a_preds = a_model.predict(X_test)
print("AdaBoost Classifier accuracy")
metrics.accuracy_score(y_test, a_preds)
AdaBoost Classifier accuracy
[92]:
0.7924170616113744
[93]:
print(classification_report(y_test, a_preds))
              precision    recall  f1-score   support

           0       0.83      0.90      0.86      1549
           1       0.64      0.49      0.56       561

    accuracy                           0.79      2110
   macro avg       0.74      0.70      0.71      2110
weighted avg       0.78      0.79      0.78      2110

[94]:
plt.figure(figsize=(4,3))
sns.heatmap(confusion_matrix(y_test, a_preds),
                annot=True,fmt = "d",linecolor="k",linewidths=3)

plt.title("AdaBoost Classifier Confusion Matrix",fontsize=14)
plt.show()
../_images/4gen_ai_customer-churn-2_135_0.png

Gradient Boosting Classifier

[95]:
gb = GradientBoostingClassifier()
gb.fit(X_train, y_train)
gb_pred = gb.predict(X_test)
print("Gradient Boosting Classifier", accuracy_score(y_test, gb_pred))
Gradient Boosting Classifier 0.790521327014218
[96]:
print(classification_report(y_test, gb_pred))
              precision    recall  f1-score   support

           0       0.83      0.90      0.86      1549
           1       0.64      0.49      0.56       561

    accuracy                           0.79      2110
   macro avg       0.73      0.70      0.71      2110
weighted avg       0.78      0.79      0.78      2110

[97]:
plt.figure(figsize=(4,3))
sns.heatmap(confusion_matrix(y_test, gb_pred),
                annot=True,fmt = "d",linecolor="k",linewidths=3)

plt.title("Gradient Boosting Classifier Confusion Matrix",fontsize=14)
plt.show()
../_images/4gen_ai_customer-churn-2_139_0.png

Voting Classifier

Der Voting Classifier ist ein Mixture Modell, der mehrere Modelle vereint, die alle ihre Predictions machen und dann über Majority voting entschieden wird, welche Prediction Output rausgegeben wird.

[98]:
from sklearn.ensemble import VotingClassifier
clf1 = GradientBoostingClassifier()
clf2 = LogisticRegression()
clf3 = AdaBoostClassifier()
eclf1 = VotingClassifier(estimators=[("gbc", clf1), ("lr", clf2), ("abc", clf3)], voting="soft")
eclf1.fit(X_train, y_train)
predictions = eclf1.predict(X_test)
print("Final Accuracy Score ")
print(accuracy_score(y_test, predictions))
Final Accuracy Score
0.7933649289099526
[99]:
print(classification_report(y_test, predictions))
              precision    recall  f1-score   support

           0       0.83      0.90      0.86      1549
           1       0.64      0.50      0.56       561

    accuracy                           0.79      2110
   macro avg       0.74      0.70      0.71      2110
weighted avg       0.78      0.79      0.78      2110

[100]:
plt.figure(figsize=(4,3))
sns.heatmap(confusion_matrix(y_test, predictions),
                annot=True,fmt = "d",linecolor="k",linewidths=3)

plt.title("FINAL CONFUSION MATRIX",fontsize=14)
plt.show()
../_images/4gen_ai_customer-churn-2_143_0.png

Cross Validation Example!

Anhand eines einfachen Decision Trees

[101]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import KFold, cross_validate
import numpy as np
[102]:
# Annahme: X und y beinhalten jeweils deine Feature-Matrix und Zielvariable (Churn als 0/1)

dt_model = DecisionTreeClassifier()
dt_model
[102]:
DecisionTreeClassifier()
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

Wähle k=10 für 10-fold Cross-Validation

[103]:
k = 10
kf = KFold(n_splits=k, shuffle=True, random_state=42)
kf
[103]:
KFold(n_splits=10, random_state=42, shuffle=True)

Verwende cross_validate, um sowohl Trainings- als auch Validierungsscores zu erhalten

[104]:
cv_results = cross_validate(dt_model, X, y, cv=kf, return_train_score=True)
cv_results
[104]:
{'fit_time': array([0.03874516, 0.04276395, 0.03619814, 0.03640604, 0.0356729 ,
        0.03642106, 0.03842902, 0.03649712, 0.036309  , 0.03726411]),
 'score_time': array([0.00107479, 0.00114584, 0.00121593, 0.00115991, 0.00108218,
        0.00104594, 0.00105691, 0.00105095, 0.0015471 , 0.00119972]),
 'test_score': array([0.73579545, 0.68892045, 0.7254623 , 0.74253201, 0.70981508,
        0.70981508, 0.7254623 , 0.68421053, 0.69274538, 0.7254623 ]),
 'train_score': array([0.99968394, 0.99984197, 0.999842  , 0.99968399, 0.999842  ,
        0.99968399, 0.99968399, 0.99968399, 0.99968399, 0.99968399])}

Logge die Ergebnisse jeder einzelnen Fold

[105]:
print("Trainings-Scores pro Fold:")
print(cv_results["train_score"])
Trainings-Scores pro Fold:
[0.99968394 0.99984197 0.999842   0.99968399 0.999842   0.99968399
 0.99968399 0.99968399 0.99968399 0.99968399]
[106]:
print("Validierungs-Scores pro Fold:")
print(cv_results["test_score"])
Validierungs-Scores pro Fold:
[0.73579545 0.68892045 0.7254623  0.74253201 0.70981508 0.70981508
 0.7254623  0.68421053 0.69274538 0.7254623 ]
[107]:
print("Durchschnittlicher Validierungs-Score:")
print(np.mean(cv_results["test_score"]))
Durchschnittlicher Validierungs-Score:
0.7140220887753783

Next Step

Falls es sich hierbei um ein sehr einfaches Modell handelt ohne jegliche Parameter (z.B. linear Regression oder logistic Regression), könnte man nun das Modell mit dem besten Validierungs-Score extrahieren.

Cross-Validation

teilt die Daten auf und trainiert/evaluiert das Modell mehrfach (k-fach).

Komplexere Modelle?

Wenn es sich um ein komplexeres Modell mit Hyperparameter handelt, kann sie helfen für das jeweilge Set an Hyperparameter eine robuste Performance-Metrik zu erhalten – sie ändert aber nicht selbst die Hyperparameter.

Hyperparameter-Tuning (z. B. mit GridSearchCV)

Hyperparameter-Tuning (z. B. mit GridSearchCV) verwendet CV (Cross-Validation), um verschiedene Hyperparameter-Kombinationen zu vergleichen, und wählt so die optimale Konfiguration.

[108]:
# Beispiel anhand Decision Tree

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
[109]:
# Definiere den Parameterbereich
param_grid = {
    "max_depth": [None, 3, 5, 10, 20], # default=None
    "min_samples_split": [2, 5, 10]  # default=2, notwendige Anzahl an data points in einem Knoten, fürs weitere Teilen dieses Knoten
}
param_grid
[109]:
{'max_depth': [None, 3, 5, 10, 20], 'min_samples_split': [2, 5, 10]}
[110]:
# GridSearchCV kombiniert CV mit dem Hyperparameter-Tuning
grid_search = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=10, return_train_score=True)
grid_search.fit(X, y)
[110]:
GridSearchCV(cv=10, estimator=DecisionTreeClassifier(),
             param_grid={'max_depth': [None, 3, 5, 10, 20],
                         'min_samples_split': [2, 5, 10]},
             return_train_score=True)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
[111]:
# Beste Parameter und entsprechender Score
print("Beste Parameter:", grid_search.best_params_)
print("Bester CV-Score:", grid_search.best_score_)
Beste Parameter: {'max_depth': 5, 'min_samples_split': 10}
Bester CV-Score: 0.7912374321091427
[112]:
# Das finale Modell (trainiert auf den gesamten Daten mit den besten Parametern)
best_model = grid_search.best_estimator_
[113]:
y_pred_best = best_model.predict(X_test)


acc_best = accuracy_score(y_test, y_pred_best)
print(acc_best)

prec_best = precision_score(y_test, y_pred_best, pos_label=1)
print(prec_best)

recall_best = recall_score(y_test, y_pred_best, pos_label=1)
print(recall_best)
0.7341232227488151
0.0
0.0
[114]:
1 in y_test
[114]:
True
[115]:
1 in y_pred
[115]:
True
[116]:
y_test
[116]:
array([0, 0, 0, ..., 0, 0, 0], shape=(2110,))
[117]:
y_pred
[117]:
array([0, 0, 0, ..., 0, 0, 0], shape=(2110,))

Accuracy wurde genutzt - was wenn wir precision nutzen wollen?

[118]:
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {"max_depth": [None, 3, 5, 10, 20],
              "min_samples_split": [2, 5, 10]}

grid_search = GridSearchCV(DecisionTreeClassifier(), param_grid,
                           cv=10, scoring="precision", refit=True)
grid_search.fit(X, y)

print("Beste Parameter:", grid_search.best_params_)
print("Bester CV-Precision-Score:", grid_search.best_score_)
best_model = grid_search.best_estimator_
Beste Parameter: {'max_depth': 5, 'min_samples_split': 10}
Bester CV-Precision-Score: 0.6664281647925435
[119]:
y_pred_best = best_model.predict(X_test)


acc_best = accuracy_score(y_test, y_pred_best)
print(acc_best)

prec_best = precision_score(y_test, y_pred_best, pos_label=1)
print(prec_best)

recall_best = recall_score(y_test, y_pred_best, pos_label=1)
print(recall_best)
0.7341232227488151
0.0
0.0

Können wir mehrere Metriken gleichzeitig überwachen?

[120]:
from sklearn.metrics import make_scorer, precision_score, recall_score
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

param_grid = {"max_depth": [None, 3, 5, 10, 20],
              "min_samples_split": [2, 5, 10]}

scoring = {
    "precision": make_scorer(precision_score, pos_label=1),
    "recall": make_scorer(recall_score, pos_label=1)
}

grid_search = GridSearchCV(DecisionTreeClassifier(), param_grid,
                           cv=10, scoring=scoring, refit="precision")  # refit='precision' lässt das finale Modell auf den besten Precision-Score ausgewählen.
grid_search.fit(X, y)

# Es werden alle Scores zurückgegeben:
cv_results = grid_search.cv_results_
print("Beste Parameter:", grid_search.best_params_)
print("Bester CV-Precision-Score:", grid_search.best_score_)

# Die Ergebnisse aller Folds (z.B. Precision und Recall) kannst du so ausgeben:
for mean_score, params in zip(cv_results["mean_test_precision"], cv_results["params"]):
    print(f"Params: {params} - Mean Precision: {mean_score:.4f}")

best_model = grid_search.best_estimator_
Beste Parameter: {'max_depth': 5, 'min_samples_split': 10}
Bester CV-Precision-Score: 0.6670314466458254
Params: {'max_depth': None, 'min_samples_split': 2} - Mean Precision: 0.4795
Params: {'max_depth': None, 'min_samples_split': 5} - Mean Precision: 0.4939
Params: {'max_depth': None, 'min_samples_split': 10} - Mean Precision: 0.5079
Params: {'max_depth': 3, 'min_samples_split': 2} - Mean Precision: 0.6475
Params: {'max_depth': 3, 'min_samples_split': 5} - Mean Precision: 0.6475
Params: {'max_depth': 3, 'min_samples_split': 10} - Mean Precision: 0.6475
Params: {'max_depth': 5, 'min_samples_split': 2} - Mean Precision: 0.6660
Params: {'max_depth': 5, 'min_samples_split': 5} - Mean Precision: 0.6656
Params: {'max_depth': 5, 'min_samples_split': 10} - Mean Precision: 0.6670
Params: {'max_depth': 10, 'min_samples_split': 2} - Mean Precision: 0.5477
Params: {'max_depth': 10, 'min_samples_split': 5} - Mean Precision: 0.5497
Params: {'max_depth': 10, 'min_samples_split': 10} - Mean Precision: 0.5505
Params: {'max_depth': 20, 'min_samples_split': 2} - Mean Precision: 0.4877
Params: {'max_depth': 20, 'min_samples_split': 5} - Mean Precision: 0.4937
Params: {'max_depth': 20, 'min_samples_split': 10} - Mean Precision: 0.5072
[121]:
y_pred_best = best_model.predict(X_test)


acc_best = accuracy_score(y_test, y_pred_best)
print(acc_best)

prec_best = precision_score(y_test, y_pred_best, pos_label=1)
print(prec_best)

recall_best = recall_score(y_test, y_pred_best, pos_label=1)
print(recall_best)
0.7341232227488151
0.0
0.0

Vorhergesagte Wahrscheinlichkeit (anstatt der zugeteilten Klasse)

Viele Klassifikationsmodelle in scikit-learn bieten die Methode predict_proba(), mit der wir die Wahrscheinlichkeit für jeden Datenpunkt und jede Klasse abrufen können.

Bei einem binären Klassifikationsmodell liefert predict_proba() ein Array der Form (n_samples, 2), wobei die zweite Spalte (Index 1) die Wahrscheinlichkeit für Klasse 1 enthält.

[122]:
# Hole die Wahrscheinlichkeiten für den Testdatensatz
y_prob = best_model.predict_proba(X_test)
y_prob
[122]:
array([[0.65269461, 0.34730539],
       [0.65269461, 0.34730539],
       [0.65269461, 0.34730539],
       ...,
       [0.65269461, 0.34730539],
       [0.65269461, 0.34730539],
       [0.65269461, 0.34730539]], shape=(2110, 2))
[123]:
# Extrahiere die Wahrscheinlichkeit für Klasse 1
y_prob_class1 = y_prob[:, 1]
y_prob_class1
[123]:
array([0.34730539, 0.34730539, 0.34730539, ..., 0.34730539, 0.34730539,
       0.34730539], shape=(2110,))
[124]:
# Definiere einen eigenen Schwellenwert (Threshold)
threshold = 0.3  # Beispiel: 30% statt des üblichen 0.5

# Wende den Threshold an: Werte >= threshold werden als 1 klassifiziert, ansonsten als 0.
y_pred_custom = (y_prob_class1 >= threshold).astype(int)
[125]:
# Evaluation der angepassten Vorhersagen:
from sklearn.metrics import accuracy_score, precision_score, recall_score

print("Accuracy:", accuracy_score(y_test, y_pred_custom))
print("Precision:", precision_score(y_test, y_pred_custom, pos_label=1))
print("Recall:", recall_score(y_test, y_pred_custom, pos_label=1))
Accuracy: 0.26587677725118486
Precision: 0.26587677725118486
Recall: 1.0