Praxisbeispiel - Telko Kundenabgang-Vorhersage (Telco Customer Churn Prediction) bei Vertragskunden¶

Die Vorhersage der Kundenabwanderung (bei Vertragssituationen) sagt die Wahrscheinlichkeit voraus, dass Kunden die Produkte oder Dienstleistungen eines Unternehmens kündigen. In den meisten Fällen sind Unternehmen mit Stammkunden oder Kunden im Abonnement bestrebt, ihren Kundenstamm zu erhalten. Daher ist es wichtig, die Kunden zu verfolgen, die ihr Abonnement kündigen, und diejenigen, die den Dienst weiter nutzen. Dieser Ansatz setzt voraus, dass das Unternehmen das Verhalten seiner Kunden kennt und versteht und die Eigenschaften, die zu dem Risiko führen, dass der Kunde das Unternehmen verlässt.

Szenario¶

Annahme: Wir sind ein Telekommunikations-Unternehmen, das über historische Daten darüber verfügt, wie seine Kunden mit seinen Dienstleistungen interagiert haben. Das Unternehmen möchte wissen, wie hoch die Wahrscheinlichkeit ist, dass Kunden abwandern, damit es gezielte Marketingkampagnen starten kann.

Datensatz¶

Sage Kundenverhalten voraus, um die Kunden zu binden.

Ihr könnt alle relevanten Kundendaten analysieren und gezielte Kundenbindungsprogramme entwickeln.

https://www.kaggle.com/datasets/blastchar/telco-customer-churn

[1]:

import sys
# !{sys.executable} -m pip install kagglehub
# !{sys.executable} -m pip install @
# !{sys.executable} -m pip install missingno

[2]:

import kagglehub

[3]:

import pandas as pd
import numpy as np
import missingno as msno
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings

pd.set_option("display.max_columns", None)
warnings.filterwarnings("ignore")

[4]:

from sklearn.preprocessing import StandardScaler, LabelEncoder

from sklearn.model_selection import train_test_split

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression

from sklearn import metrics
from sklearn.metrics import roc_curve
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report

# from xgboost import XGBClassifier
# from catboost import CatBoostClassifier

[5]:

# Download data from Kaggle
path = kagglehub.dataset_download("blastchar/telco-customer-churn")

print("Path to files:", path)

Path to files: /Users/veit/.cache/kagglehub/datasets/blastchar/telco-customer-churn/versions/1

[6]:

path_file = "~/.cache/kagglehub/datasets/blastchar/telco-customer-churn/versions/1/WA_Fn-UseC_-Telco-Customer-Churn.csv"
print(path_file)

~/.cache/kagglehub/datasets/blastchar/telco-customer-churn/versions/1/WA_Fn-UseC_-Telco-Customer-Churn.csv

Datensatz¶

[7]:

df = pd.read_csv(path_file)
df

[7]:

	customerID	gender	SeniorCitizen	Partner	Dependents	tenure	PhoneService	MultipleLines	InternetService	OnlineSecurity	OnlineBackup	DeviceProtection	TechSupport	StreamingTV	StreamingMovies	Contract	PaperlessBilling	PaymentMethod	MonthlyCharges	TotalCharges	Churn
0	7590-VHVEG	Female	0	Yes	No	1	No	No phone service	DSL	No	Yes	No	No	No	No	Month-to-month	Yes	Electronic check	29.85	29.85	No
1	5575-GNVDE	Male	0	No	No	34	Yes	No	DSL	Yes	No	Yes	No	No	No	One year	No	Mailed check	56.95	1889.5	No
2	3668-QPYBK	Male	0	No	No	2	Yes	No	DSL	Yes	Yes	No	No	No	No	Month-to-month	Yes	Mailed check	53.85	108.15	Yes
3	7795-CFOCW	Male	0	No	No	45	No	No phone service	DSL	Yes	No	Yes	Yes	No	No	One year	No	Bank transfer (automatic)	42.30	1840.75	No
4	9237-HQITU	Female	0	No	No	2	Yes	No	Fiber optic	No	No	No	No	No	No	Month-to-month	Yes	Electronic check	70.70	151.65	Yes
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
7038	6840-RESVB	Male	0	Yes	Yes	24	Yes	Yes	DSL	Yes	No	Yes	Yes	Yes	Yes	One year	Yes	Mailed check	84.80	1990.5	No
7039	2234-XADUH	Female	0	Yes	Yes	72	Yes	Yes	Fiber optic	No	Yes	Yes	No	Yes	Yes	One year	Yes	Credit card (automatic)	103.20	7362.9	No
7040	4801-JZAZL	Female	0	Yes	Yes	11	No	No phone service	DSL	Yes	No	No	No	No	No	Month-to-month	Yes	Electronic check	29.60	346.45	No
7041	8361-LTMKD	Male	1	Yes	No	4	Yes	Yes	Fiber optic	No	No	No	No	No	No	Month-to-month	Yes	Mailed check	74.40	306.6	Yes
7042	3186-AJIEK	Male	0	No	No	66	Yes	No	Fiber optic	Yes	No	Yes	Yes	Yes	Yes	Two year	Yes	Bank transfer (automatic)	105.65	6844.5	No

7043 rows × 21 columns

Der Datensatz enthält Informationen über:

ob der Kunde innerhalb des letzten Monats gekündigt hat - Spalte „Churn“
Dienste, für die sich jeder Kunde angemeldet hat - „PhoneService“, „MultipleLines“, „InternetService“, „OnlineSecurity“, „OnlineBackup“, „DeviceProtection“, „TechSupport“ sowie „StreamingTV“ und „StreamingMovies“.
Informationen zum Kundenkonto - „tenure“ wie lange der Kunde bereits Kunde ist, „Contract“ (Vertrag), „PaperlessBilling“ (papierlose Abrechnung), „PaymentMethod“ (Zahlungsmethode), „MonthlyCharges“ (monatliche Gebühren) und „TotalCharges“ (Gesamtgebühren).
Demografische Informationen über Kunden - „gender“ (Geschlecht), „SeniorCitizen“ (Alter) und ob sie „Partner“ (Partner) und „Dependents“ (Familienangehörige) haben.

[8]:

print("shape: ", df.shape)
print()

print(df.info())

shape:  (7043, 21)

<class 'pandas.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   customerID        7043 non-null   str
 1   gender            7043 non-null   str
 2   SeniorCitizen     7043 non-null   int64
 3   Partner           7043 non-null   str
 4   Dependents        7043 non-null   str
 5   tenure            7043 non-null   int64
 6   PhoneService      7043 non-null   str
 7   MultipleLines     7043 non-null   str
 8   InternetService   7043 non-null   str
 9   OnlineSecurity    7043 non-null   str
 10  OnlineBackup      7043 non-null   str
 11  DeviceProtection  7043 non-null   str
 12  TechSupport       7043 non-null   str
 13  StreamingTV       7043 non-null   str
 14  StreamingMovies   7043 non-null   str
 15  Contract          7043 non-null   str
 16  PaperlessBilling  7043 non-null   str
 17  PaymentMethod     7043 non-null   str
 18  MonthlyCharges    7043 non-null   float64
 19  TotalCharges      7043 non-null   str
 20  Churn             7043 non-null   str
dtypes: float64(1), int64(2), str(18)
memory usage: 1.9 MB
None

ACHTUNG!!¶

Wir sehen, dass „TotalCharges“ hier als object auftaucht, erwarten hier aber einen numerischen Typ. Das muss geändert werden - wir zeigen euch aber erst, was passiert, wenn wir es nicht ändern.

[9]:

df.columns.values

[9]:

<ArrowStringArray>
[      'customerID',           'gender',    'SeniorCitizen',
          'Partner',       'Dependents',           'tenure',
     'PhoneService',    'MultipleLines',  'InternetService',
   'OnlineSecurity',     'OnlineBackup', 'DeviceProtection',
      'TechSupport',      'StreamingTV',  'StreamingMovies',
         'Contract', 'PaperlessBilling',    'PaymentMethod',
   'MonthlyCharges',     'TotalCharges',            'Churn']
Length: 21, dtype: str

[10]:

df.dtypes

[10]:

customerID              str
gender                  str
SeniorCitizen         int64
Partner                 str
Dependents              str
tenure                int64
PhoneService            str
MultipleLines           str
InternetService         str
OnlineSecurity          str
OnlineBackup            str
DeviceProtection        str
TechSupport             str
StreamingTV             str
StreamingMovies         str
Contract                str
PaperlessBilling        str
PaymentMethod           str
MonthlyCharges      float64
TotalCharges            str
Churn                   str
dtype: object

Ermittle fehlende Werte und visualisiere sie¶

[11]:

## wir kannten schon die isnull() Funktion in pandas

for col in df.columns:
    print(col, df[col].isnull().sum())

customerID 0
gender 0
SeniorCitizen 0
Partner 0
Dependents 0
tenure 0
PhoneService 0
MultipleLines 0
InternetService 0
OnlineSecurity 0
OnlineBackup 0
DeviceProtection 0
TechSupport 0
StreamingTV 0
StreamingMovies 0
Contract 0
PaperlessBilling 0
PaymentMethod 0
MonthlyCharges 0
TotalCharges 0
Churn 0

[12]:

# Zudem könnte man sich die fehlende Werte auch in einer Matrix anzeigen lassen
# (nur haben wir hier keine fehlenden Werte)

msno.matrix(df);

../_images/1intro_customer-churn_18_0.png

Daten Manipulation / Bereinigung¶

[13]:

df["TotalCharges"] = pd.to_numeric(df.TotalCharges, errors="coerce")
# print(df.dtypes)
print(df.info())

<class 'pandas.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   customerID        7043 non-null   str
 1   gender            7043 non-null   str
 2   SeniorCitizen     7043 non-null   int64
 3   Partner           7043 non-null   str
 4   Dependents        7043 non-null   str
 5   tenure            7043 non-null   int64
 6   PhoneService      7043 non-null   str
 7   MultipleLines     7043 non-null   str
 8   InternetService   7043 non-null   str
 9   OnlineSecurity    7043 non-null   str
 10  OnlineBackup      7043 non-null   str
 11  DeviceProtection  7043 non-null   str
 12  TechSupport       7043 non-null   str
 13  StreamingTV       7043 non-null   str
 14  StreamingMovies   7043 non-null   str
 15  Contract          7043 non-null   str
 16  PaperlessBilling  7043 non-null   str
 17  PaymentMethod     7043 non-null   str
 18  MonthlyCharges    7043 non-null   float64
 19  TotalCharges      7032 non-null   float64
 20  Churn             7043 non-null   str
dtypes: float64(2), int64(2), str(17)
memory usage: 1.8 MB
None

[14]:

print(df.isnull().sum())

customerID           0
gender               0
SeniorCitizen        0
Partner              0
Dependents           0
tenure               0
PhoneService         0
MultipleLines        0
InternetService      0
OnlineSecurity       0
OnlineBackup         0
DeviceProtection     0
TechSupport          0
StreamingTV          0
StreamingMovies      0
Contract             0
PaperlessBilling     0
PaymentMethod        0
MonthlyCharges       0
TotalCharges        11
Churn                0
dtype: int64

[15]:

# Visualisierung

msno.matrix(df);

../_images/1intro_customer-churn_22_0.png

[16]:

df[np.isnan(df["TotalCharges"])]

[16]:

	customerID	gender	Partner	Dependents	PhoneService	MultipleLines	InternetService	OnlineSecurity	OnlineBackup	DeviceProtection	TechSupport	StreamingTV	StreamingMovies	Contract	PaperlessBilling	PaymentMethod	MonthlyCharges	TotalCharges	Churn
488	4472-LVYGI	Female	Yes	Yes	No	No phone service	DSL	Yes	No	Yes	Yes	Yes	No	Two year	Yes	Bank transfer (automatic)	52.55	NaN	No
753	3115-CZMZD	Male	No	Yes	Yes	No	No	No internet service	No internet service	No internet service	No internet service	No internet service	No internet service	Two year	No	Mailed check	20.25	NaN	No
936	5709-LVOEQ	Female	Yes	Yes	Yes	No	DSL	Yes	Yes	Yes	No	Yes	Yes	Two year	No	Mailed check	80.85	NaN	No
1082	4367-NUYAO	Male	Yes	Yes	Yes	Yes	No	No internet service	No internet service	No internet service	No internet service	No internet service	No internet service	Two year	No	Mailed check	25.75	NaN	No
1340	1371-DWPAZ	Female	Yes	Yes	No	No phone service	DSL	Yes	Yes	Yes	Yes	Yes	No	Two year	No	Credit card (automatic)	56.05	NaN	No
3331	7644-OMVMY	Male	Yes	Yes	Yes	No	No	No internet service	No internet service	No internet service	No internet service	No internet service	No internet service	Two year	No	Mailed check	19.85	NaN	No
3826	3213-VVOLG	Male	Yes	Yes	Yes	Yes	No	No internet service	No internet service	No internet service	No internet service	No internet service	No internet service	Two year	No	Mailed check	25.35	NaN	No
4380	2520-SGTTA	Female	Yes	Yes	Yes	No	No	No internet service	No internet service	No internet service	No internet service	No internet service	No internet service	Two year	No	Mailed check	20.00	NaN	No
5218	2923-ARZLG	Male	Yes	Yes	Yes	No	No	No internet service	No internet service	No internet service	No internet service	No internet service	No internet service	One year	Yes	Mailed check	19.70	NaN	No
6670	4075-WKNIU	Female	Yes	Yes	Yes	Yes	DSL	No	Yes	Yes	Yes	Yes	No	Two year	No	Mailed check	73.35	NaN	No
6754	2775-SEFEE	Male	No	Yes	Yes	Yes	DSL	Yes	Yes	No	Yes	No	No	Two year	Yes	Bank transfer (automatic)	61.90	NaN	No

Wir beobachten, dass die Spalte „tenure“ 0 ist bei diesen Einträgen, wo „TotalCharges“ NaN ist, obwohl die Spalte MonthlyCharges nicht leer ist.

Lass uns die Spalte „tenure“ näher anschauen und checken, ob es noch andere 0-Werte in diese Spalte gibt.

[17]:

df[df["tenure"] == 0].index

[17]:

Index([488, 753, 936, 1082, 1340, 3331, 3826, 4380, 5218, 6670, 6754], dtype='int64')

Feststellung und Entscheidung¶

Der Vergleich zeigt, dass es keine zusätzlichen fehlenden Werte in der „tenure“-Spalte gibt.

Wir löschen die Zeilen mit den fehlenden Werten in „tenure“ und „TotalCharges“, da es nur 11 von 7043 Zeilen sind und wir nicht von gravierenden Konsequenzen durch das Löschen dieser Zeilen ausgehen.

Wenn es nur die fehlenden Werte in „TotalCharges“ wäre, könnte man auch versuchen, die Werte darin durch Extrapolation aufzufüllen mit z.B. mean.

[18]:

# df.drop(df['tenure'] == 0, axis=0, inplace=False)

df_new = df[df["tenure"] != 0]

df_new[df_new['tenure'] == 0].index

[18]:

Index([], dtype='int64')

[19]:

df_new2 = df.drop(labels=df[df['tenure'] == 0].index, axis=0, inplace=False)
df_new2[df_new2['tenure'] == 0].index

[19]:

Index([], dtype='int64')

[20]:

df_new2

[20]:

	customerID	gender	SeniorCitizen	Partner	Dependents	tenure	PhoneService	MultipleLines	InternetService	OnlineSecurity	OnlineBackup	DeviceProtection	TechSupport	StreamingTV	StreamingMovies	Contract	PaperlessBilling	PaymentMethod	MonthlyCharges	TotalCharges	Churn
0	7590-VHVEG	Female	0	Yes	No	1	No	No phone service	DSL	No	Yes	No	No	No	No	Month-to-month	Yes	Electronic check	29.85	29.85	No
1	5575-GNVDE	Male	0	No	No	34	Yes	No	DSL	Yes	No	Yes	No	No	No	One year	No	Mailed check	56.95	1889.50	No
2	3668-QPYBK	Male	0	No	No	2	Yes	No	DSL	Yes	Yes	No	No	No	No	Month-to-month	Yes	Mailed check	53.85	108.15	Yes
3	7795-CFOCW	Male	0	No	No	45	No	No phone service	DSL	Yes	No	Yes	Yes	No	No	One year	No	Bank transfer (automatic)	42.30	1840.75	No
4	9237-HQITU	Female	0	No	No	2	Yes	No	Fiber optic	No	No	No	No	No	No	Month-to-month	Yes	Electronic check	70.70	151.65	Yes
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
7038	6840-RESVB	Male	0	Yes	Yes	24	Yes	Yes	DSL	Yes	No	Yes	Yes	Yes	Yes	One year	Yes	Mailed check	84.80	1990.50	No
7039	2234-XADUH	Female	0	Yes	Yes	72	Yes	Yes	Fiber optic	No	Yes	Yes	No	Yes	Yes	One year	Yes	Credit card (automatic)	103.20	7362.90	No
7040	4801-JZAZL	Female	0	Yes	Yes	11	No	No phone service	DSL	Yes	No	No	No	No	No	Month-to-month	Yes	Electronic check	29.60	346.45	No
7041	8361-LTMKD	Male	1	Yes	No	4	Yes	Yes	Fiber optic	No	No	No	No	No	No	Month-to-month	Yes	Mailed check	74.40	306.60	Yes
7042	3186-AJIEK	Male	0	No	No	66	Yes	No	Fiber optic	Yes	No	Yes	Yes	Yes	Yes	Two year	Yes	Bank transfer (automatic)	105.65	6844.50	No

7032 rows × 21 columns

[21]:

df[df["tenure"]==0].index

[21]:

Index([488, 753, 936, 1082, 1340, 3331, 3826, 4380, 5218, 6670, 6754], dtype='int64')

[22]:

print(df.shape)
df.drop(labels=df[df["tenure"] == 0].index, axis=0, inplace=True)
df

(7043, 21)

[22]:

	customerID	gender	SeniorCitizen	Partner	Dependents	tenure	PhoneService	MultipleLines	InternetService	OnlineSecurity	OnlineBackup	DeviceProtection	TechSupport	StreamingTV	StreamingMovies	Contract	PaperlessBilling	PaymentMethod	MonthlyCharges	TotalCharges	Churn
0	7590-VHVEG	Female	0	Yes	No	1	No	No phone service	DSL	No	Yes	No	No	No	No	Month-to-month	Yes	Electronic check	29.85	29.85	No
1	5575-GNVDE	Male	0	No	No	34	Yes	No	DSL	Yes	No	Yes	No	No	No	One year	No	Mailed check	56.95	1889.50	No
2	3668-QPYBK	Male	0	No	No	2	Yes	No	DSL	Yes	Yes	No	No	No	No	Month-to-month	Yes	Mailed check	53.85	108.15	Yes
3	7795-CFOCW	Male	0	No	No	45	No	No phone service	DSL	Yes	No	Yes	Yes	No	No	One year	No	Bank transfer (automatic)	42.30	1840.75	No
4	9237-HQITU	Female	0	No	No	2	Yes	No	Fiber optic	No	No	No	No	No	No	Month-to-month	Yes	Electronic check	70.70	151.65	Yes
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
7038	6840-RESVB	Male	0	Yes	Yes	24	Yes	Yes	DSL	Yes	No	Yes	Yes	Yes	Yes	One year	Yes	Mailed check	84.80	1990.50	No
7039	2234-XADUH	Female	0	Yes	Yes	72	Yes	Yes	Fiber optic	No	Yes	Yes	No	Yes	Yes	One year	Yes	Credit card (automatic)	103.20	7362.90	No
7040	4801-JZAZL	Female	0	Yes	Yes	11	No	No phone service	DSL	Yes	No	No	No	No	No	Month-to-month	Yes	Electronic check	29.60	346.45	No
7041	8361-LTMKD	Male	1	Yes	No	4	Yes	Yes	Fiber optic	No	No	No	No	No	No	Month-to-month	Yes	Mailed check	74.40	306.60	Yes
7042	3186-AJIEK	Male	0	No	No	66	Yes	No	Fiber optic	Yes	No	Yes	Yes	Yes	Yes	Two year	Yes	Bank transfer (automatic)	105.65	6844.50	No

7032 rows × 21 columns

[23]:

df.reset_index()
df

[23]:

	customerID	gender	SeniorCitizen	Partner	Dependents	tenure	PhoneService	MultipleLines	InternetService	OnlineSecurity	OnlineBackup	DeviceProtection	TechSupport	StreamingTV	StreamingMovies	Contract	PaperlessBilling	PaymentMethod	MonthlyCharges	TotalCharges	Churn
0	7590-VHVEG	Female	0	Yes	No	1	No	No phone service	DSL	No	Yes	No	No	No	No	Month-to-month	Yes	Electronic check	29.85	29.85	No
1	5575-GNVDE	Male	0	No	No	34	Yes	No	DSL	Yes	No	Yes	No	No	No	One year	No	Mailed check	56.95	1889.50	No
2	3668-QPYBK	Male	0	No	No	2	Yes	No	DSL	Yes	Yes	No	No	No	No	Month-to-month	Yes	Mailed check	53.85	108.15	Yes
3	7795-CFOCW	Male	0	No	No	45	No	No phone service	DSL	Yes	No	Yes	Yes	No	No	One year	No	Bank transfer (automatic)	42.30	1840.75	No
4	9237-HQITU	Female	0	No	No	2	Yes	No	Fiber optic	No	No	No	No	No	No	Month-to-month	Yes	Electronic check	70.70	151.65	Yes
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
7038	6840-RESVB	Male	0	Yes	Yes	24	Yes	Yes	DSL	Yes	No	Yes	Yes	Yes	Yes	One year	Yes	Mailed check	84.80	1990.50	No
7039	2234-XADUH	Female	0	Yes	Yes	72	Yes	Yes	Fiber optic	No	Yes	Yes	No	Yes	Yes	One year	Yes	Credit card (automatic)	103.20	7362.90	No
7040	4801-JZAZL	Female	0	Yes	Yes	11	No	No phone service	DSL	Yes	No	No	No	No	No	Month-to-month	Yes	Electronic check	29.60	346.45	No
7041	8361-LTMKD	Male	1	Yes	No	4	Yes	Yes	Fiber optic	No	No	No	No	No	No	Month-to-month	Yes	Mailed check	74.40	306.60	Yes
7042	3186-AJIEK	Male	0	No	No	66	Yes	No	Fiber optic	Yes	No	Yes	Yes	Yes	Yes	Two year	Yes	Bank transfer (automatic)	105.65	6844.50	No

7032 rows × 21 columns

[24]:

df.reset_index(inplace=True)
df

[24]:

	index	customerID	gender	SeniorCitizen	Partner	Dependents	tenure	PhoneService	MultipleLines	InternetService	OnlineSecurity	OnlineBackup	DeviceProtection	TechSupport	StreamingTV	StreamingMovies	Contract	PaperlessBilling	PaymentMethod	MonthlyCharges	TotalCharges	Churn
0	0	7590-VHVEG	Female	0	Yes	No	1	No	No phone service	DSL	No	Yes	No	No	No	No	Month-to-month	Yes	Electronic check	29.85	29.85	No
1	1	5575-GNVDE	Male	0	No	No	34	Yes	No	DSL	Yes	No	Yes	No	No	No	One year	No	Mailed check	56.95	1889.50	No
2	2	3668-QPYBK	Male	0	No	No	2	Yes	No	DSL	Yes	Yes	No	No	No	No	Month-to-month	Yes	Mailed check	53.85	108.15	Yes
3	3	7795-CFOCW	Male	0	No	No	45	No	No phone service	DSL	Yes	No	Yes	Yes	No	No	One year	No	Bank transfer (automatic)	42.30	1840.75	No
4	4	9237-HQITU	Female	0	No	No	2	Yes	No	Fiber optic	No	No	No	No	No	No	Month-to-month	Yes	Electronic check	70.70	151.65	Yes
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
7027	7038	6840-RESVB	Male	0	Yes	Yes	24	Yes	Yes	DSL	Yes	No	Yes	Yes	Yes	Yes	One year	Yes	Mailed check	84.80	1990.50	No
7028	7039	2234-XADUH	Female	0	Yes	Yes	72	Yes	Yes	Fiber optic	No	Yes	Yes	No	Yes	Yes	One year	Yes	Credit card (automatic)	103.20	7362.90	No
7029	7040	4801-JZAZL	Female	0	Yes	Yes	11	No	No phone service	DSL	Yes	No	No	No	No	No	Month-to-month	Yes	Electronic check	29.60	346.45	No
7030	7041	8361-LTMKD	Male	1	Yes	No	4	Yes	Yes	Fiber optic	No	No	No	No	No	No	Month-to-month	Yes	Mailed check	74.40	306.60	Yes
7031	7042	3186-AJIEK	Male	0	No	No	66	Yes	No	Fiber optic	Yes	No	Yes	Yes	Yes	Yes	Two year	Yes	Bank transfer (automatic)	105.65	6844.50	No

7032 rows × 22 columns

[25]:

if "index" in df:
    df.drop("index", axis=1, inplace=True)
df

[25]:

	customerID	gender	SeniorCitizen	Partner	Dependents	tenure	PhoneService	MultipleLines	InternetService	OnlineSecurity	OnlineBackup	DeviceProtection	TechSupport	StreamingTV	StreamingMovies	Contract	PaperlessBilling	PaymentMethod	MonthlyCharges	TotalCharges	Churn
0	7590-VHVEG	Female	0	Yes	No	1	No	No phone service	DSL	No	Yes	No	No	No	No	Month-to-month	Yes	Electronic check	29.85	29.85	No
1	5575-GNVDE	Male	0	No	No	34	Yes	No	DSL	Yes	No	Yes	No	No	No	One year	No	Mailed check	56.95	1889.50	No
2	3668-QPYBK	Male	0	No	No	2	Yes	No	DSL	Yes	Yes	No	No	No	No	Month-to-month	Yes	Mailed check	53.85	108.15	Yes
3	7795-CFOCW	Male	0	No	No	45	No	No phone service	DSL	Yes	No	Yes	Yes	No	No	One year	No	Bank transfer (automatic)	42.30	1840.75	No
4	9237-HQITU	Female	0	No	No	2	Yes	No	Fiber optic	No	No	No	No	No	No	Month-to-month	Yes	Electronic check	70.70	151.65	Yes
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
7027	6840-RESVB	Male	0	Yes	Yes	24	Yes	Yes	DSL	Yes	No	Yes	Yes	Yes	Yes	One year	Yes	Mailed check	84.80	1990.50	No
7028	2234-XADUH	Female	0	Yes	Yes	72	Yes	Yes	Fiber optic	No	Yes	Yes	No	Yes	Yes	One year	Yes	Credit card (automatic)	103.20	7362.90	No
7029	4801-JZAZL	Female	0	Yes	Yes	11	No	No phone service	DSL	Yes	No	No	No	No	No	Month-to-month	Yes	Electronic check	29.60	346.45	No
7030	8361-LTMKD	Male	1	Yes	No	4	Yes	Yes	Fiber optic	No	No	No	No	No	No	Month-to-month	Yes	Mailed check	74.40	306.60	Yes
7031	3186-AJIEK	Male	0	No	No	66	Yes	No	Fiber optic	Yes	No	Yes	Yes	Yes	Yes	Two year	Yes	Bank transfer (automatic)	105.65	6844.50	No

7032 rows × 21 columns

[26]:

df[df["TotalCharges"].isnull()]

[26]:

	customerID	gender	SeniorCitizen	Partner	Dependents	tenure	PhoneService	MultipleLines	InternetService	OnlineSecurity	OnlineBackup	DeviceProtection	TechSupport	StreamingTV	StreamingMovies	Contract	PaperlessBilling	PaymentMethod	MonthlyCharges	TotalCharges	Churn

[27]:

df[~df["TotalCharges"].isnull()]

[27]:

	customerID	gender	SeniorCitizen	Partner	Dependents	tenure	PhoneService	MultipleLines	InternetService	OnlineSecurity	OnlineBackup	DeviceProtection	TechSupport	StreamingTV	StreamingMovies	Contract	PaperlessBilling	PaymentMethod	MonthlyCharges	TotalCharges	Churn
0	7590-VHVEG	Female	0	Yes	No	1	No	No phone service	DSL	No	Yes	No	No	No	No	Month-to-month	Yes	Electronic check	29.85	29.85	No
1	5575-GNVDE	Male	0	No	No	34	Yes	No	DSL	Yes	No	Yes	No	No	No	One year	No	Mailed check	56.95	1889.50	No
2	3668-QPYBK	Male	0	No	No	2	Yes	No	DSL	Yes	Yes	No	No	No	No	Month-to-month	Yes	Mailed check	53.85	108.15	Yes
3	7795-CFOCW	Male	0	No	No	45	No	No phone service	DSL	Yes	No	Yes	Yes	No	No	One year	No	Bank transfer (automatic)	42.30	1840.75	No
4	9237-HQITU	Female	0	No	No	2	Yes	No	Fiber optic	No	No	No	No	No	No	Month-to-month	Yes	Electronic check	70.70	151.65	Yes
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
7027	6840-RESVB	Male	0	Yes	Yes	24	Yes	Yes	DSL	Yes	No	Yes	Yes	Yes	Yes	One year	Yes	Mailed check	84.80	1990.50	No
7028	2234-XADUH	Female	0	Yes	Yes	72	Yes	Yes	Fiber optic	No	Yes	Yes	No	Yes	Yes	One year	Yes	Credit card (automatic)	103.20	7362.90	No
7029	4801-JZAZL	Female	0	Yes	Yes	11	No	No phone service	DSL	Yes	No	No	No	No	No	Month-to-month	Yes	Electronic check	29.60	346.45	No
7030	8361-LTMKD	Male	1	Yes	No	4	Yes	Yes	Fiber optic	No	No	No	No	No	No	Month-to-month	Yes	Mailed check	74.40	306.60	Yes
7031	3186-AJIEK	Male	0	No	No	66	Yes	No	Fiber optic	Yes	No	Yes	Yes	Yes	Yes	Two year	Yes	Bank transfer (automatic)	105.65	6844.50	No

7032 rows × 21 columns

[28]:

df.isnull().sum()

[28]:

customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64

EDA - Explorative Data Analysis (Daten Exploration)¶

[29]:

df["InternetService"].describe()

[29]:

count            7032
unique              3
top       Fiber optic
freq             3096
Name: InternetService, dtype: object

[30]:

df["InternetService"].unique()

[30]:

<ArrowStringArray>
['DSL', 'Fiber optic', 'No']
Length: 3, dtype: str

[31]:

df["InternetService"].value_counts()

[31]:

InternetService
Fiber optic    3096
DSL            2416
No             1520
Name: count, dtype: int64

[32]:

numerical_cols = ["tenure", "MonthlyCharges", "TotalCharges"]
df[numerical_cols].describe()

[32]:

	tenure	MonthlyCharges	TotalCharges
count	7032.000000	7032.000000	7032.000000
mean	32.421786	64.798208	2283.300441
std	24.545260	30.085974	2266.771362
min	1.000000	18.250000	18.800000
25%	9.000000	35.587500	401.450000
50%	29.000000	70.350000	1397.475000
75%	55.000000	89.862500	3794.737500
max	72.000000	118.750000	8684.800000

[33]:

df["gender"].value_counts()

[33]:

gender
Male      3549
Female    3483
Name: count, dtype: int64

[34]:

df["Churn"].value_counts()

[34]:

Churn
No     5163
Yes    1869
Name: count, dtype: int64

[35]:

# Anzahl der gechurnten Kunden pro Geschlecht ermitteln
print("Churned Customers")
print(df["Churn"][df["Churn"]=="Yes"].groupby(by=df["gender"]).count())
print()
print("Not-yet-churned Customers")
df["Churn"][df["Churn"]=="No"].groupby(by=df["gender"]).count()

Churned Customers
gender
Female    939
Male      930
Name: Churn, dtype: int64

Not-yet-churned Customers

[35]:

gender
Female    2544
Male      2619
Name: Churn, dtype: int64

[36]:

## Visualizierung der Churn Verteilung vs. Geschlecht

plt.figure(figsize=(6, 6))
labels =["Churn: Yes","Churn:No"]
values = [1869,5163]
labels_gender = ["F","M","F","M"]
sizes_gender = [939,930 , 2544,2619]
colors = ["#ff6666", "#66b3ff"]
colors_gender = ["#c2c2f0", "#ffb3e6", "#c2c2f0", "#ffb3e6"]
explode = (0.3,0.3)
explode_gender = (0.1,0.1,0.1,0.1)
textprops = {"fontsize":15}
#Plot
plt.pie(values, labels=labels,autopct="%1.1f%%",pctdistance=1.08, labeldistance=0.8,colors=colors, startangle=90,frame=True, explode=explode,radius=10, textprops =textprops, counterclock = True, )
plt.pie(sizes_gender,labels=labels_gender,colors=colors_gender,startangle=90, explode=explode_gender,radius=7, textprops =textprops, counterclock = True, )
#Draw circle
centre_circle = plt.Circle((0,0),5,color="black", fc="white",linewidth=0)
fig = plt.gcf()
fig.gca().add_artist(centre_circle)

plt.title("Churn Distribution w.r.t Gender: Male(M), Female(F)", fontsize=15, y=1.1)

# show plot

plt.axis("equal")
plt.tight_layout()
plt.show()

../_images/1intro_customer-churn_46_0.png

[37]:

## Visualisierung Verteilung: Churn vs. Kundenvertrag

fig = px.histogram(df, x="Churn", color="Contract", barmode="group", title="<b>Verteilung: Churn vs. Kundenvertrag<b>")
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()

../_images/1intro_customer-churn_47_0.png

[38]:

# Verteilung: Churn vs. Kunden Zahlungsmethod distribution w.r.t. Churn

fig = px.histogram(df, x="Churn", color="PaymentMethod", title="<b>Verteilung: Churn vs. Kunden Zahlungsmethod distribution w.r.t. Churn</b>")
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()

../_images/1intro_customer-churn_48_0.png

Wir beobachten folgendes:

Die meisten Kunden, die abgewandert sind, hatten einen elektronischen Scheck als Zahlungsmethode.
Kunden, die sich für die automatische Überweisung per Kreditkarte oder Banküberweisung und den Versand von Schecks als Zahlungsmethode entschieden, sind seltener abgewandert.

[39]:

df["InternetService"].unique()

[39]:

<ArrowStringArray>
['DSL', 'Fiber optic', 'No']
Length: 3, dtype: str

[40]:

df[df["gender"]=="Male"][["InternetService", "Churn"]].value_counts()

[40]:

InternetService  Churn
DSL              No       992
Fiber optic      No       910
No               No       717
Fiber optic      Yes      633
DSL              Yes      240
No               Yes       57
Name: count, dtype: int64

[41]:

df[df["gender"]=="Female"][["InternetService", "Churn"]].value_counts()

[41]:

InternetService  Churn
DSL              No       965
Fiber optic      No       889
No               No       690
Fiber optic      Yes      664
DSL              Yes      219
No               Yes       56
Name: count, dtype: int64

[42]:

fig = go.Figure()

fig.add_trace(go.Bar(
  x = [["Churn:No", "Churn:No", "Churn:Yes", "Churn:Yes"],
       ["Female", "Male", "Female", "Male"]],
  y = [965, 992, 219, 240],
  name = 'DSL',
))

fig.add_trace(go.Bar(
  x = [["Churn:No", "Churn:No", "Churn:Yes", "Churn:Yes"],
       ["Female", "Male", "Female", "Male"]],
  y = [889, 910, 664, 633],
  name = "Fiber optic",
))

fig.add_trace(go.Bar(
  x = [["Churn:No", "Churn:No", "Churn:Yes", "Churn:Yes"],
       ["Female", "Male", "Female", "Male"]],
  y = [690, 717, 56, 57],
  name = "No Internet",
))

fig.update_layout(title_text="<b>Verteilung: Churn vs. InternetService und Geschlecht</b>")

fig.show()

../_images/1intro_customer-churn_53_0.png

Viele Kunden hatten sich für den Glasfaserdienst (Fiber optic) entschieden. Diese scheinen eine hohe Abwanderungsrate zu haben, was auf eine Unzufriedenheit mit dieser Art von Internetdienst hindeuten könnte.

Kunden, die einen DSL-Dienst nutzen, sind in der Mehrzahl und haben eine geringere Abwanderungsrate als Glasfaserkunden.

[43]:

color_map = {"Yes": "#FF97FF", "No": "#AB63FA"}
fig = px.histogram(df, x="Churn", color="Dependents", barmode="group", title="<b>Verteilung: Churn vs. Familienangehörige</b>", color_discrete_map=color_map)
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()

../_images/1intro_customer-churn_55_0.png

Es scheint, dass die Kunden ohne Familienangehörige („dependents“) wahrscheinlicher abwandern.

[44]:

color_map = {"Yes": '#FFA15A', "No": '#00CC96'}
fig = px.histogram(df, x="Churn", color="Partner", barmode="group", title="<b>Verteilung: Churn vs. Partners</b>", color_discrete_map=color_map)
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()

../_images/1intro_customer-churn_57_0.png

Kunden OHNE Partner scheinen eher dazu zu tendieren abzuwandern.

[45]:

color_map = {"Yes": "#00CC96", "No": "#B6E880"}
fig = px.histogram(df, x="Churn", color="SeniorCitizen", title="<b>Verteilung: Churn vs. Senior Citizen</b>", color_discrete_map=color_map)
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()

../_images/1intro_customer-churn_59_0.png

Wir stellen fest, dass generell der Anteil an „seniorigen“ Kunden recht niedrig ist.
Dazu kann man sagen, dass ein die seniorigen Kunden eher dazu tendieren abzuwandern.

[46]:

color_map = {"Yes": "#FF97FF", "No": "#AB63FA"}
fig = px.histogram(df, x="Churn", color="OnlineSecurity", barmode="group", title="<b>Verteilung: Churn vs. Online Security</b>", color_discrete_map=color_map)
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()

../_images/1intro_customer-churn_61_0.png

Kunden OHNE Online Security tendieren stark dazu abzuwandern.

[47]:

color_map = {"Yes": "#FFA15A", "No": "#00CC96"}
fig = px.histogram(df, x="Churn", color="PaperlessBilling",  title="<b>Verteilung: Churn vs. Papierlose Rechnung</b>", color_discrete_map=color_map)
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()

../_images/1intro_customer-churn_63_0.png

Kunden mit Papierlosen Zahlen scheinen dazu zu tendieren, abzuwandern.

[48]:

fig = px.histogram(df, x="Churn", color="TechSupport",barmode="group",  title="<b>Verteilung: Churn vs. TechSupport</b>")
fig.update_layout(width=700, height=500, bargap=0.1)
fig.show()

../_images/1intro_customer-churn_65_0.png

Kunden OHNE TechSupport scheinen stark dazu zu tendieren abzuwandern

[49]:

sns.set_context("paper",font_scale=1.1)
ax = sns.kdeplot(df.MonthlyCharges[(df["Churn"] == "No") ],
                color="Red", shade = True);
ax = sns.kdeplot(df.MonthlyCharges[(df["Churn"] == "Yes") ],
                ax =ax, color="Blue", shade= True);
ax.legend(["Not Churn","Churn"],loc='upper right');
ax.set_ylabel("Density");
ax.set_xlabel("Monthly Charges");
ax.set_title("Verteilung: Churn vs. Monthly Charges");

../_images/1intro_customer-churn_67_0.png

Kunden MIT höheren monatlichen Kosten tendieren eher dazu abzuwandern

[50]:

ax = sns.kdeplot(df.TotalCharges[(df["Churn"] == "No") ],
                color="Gold", shade = True);
ax = sns.kdeplot(df.TotalCharges[(df["Churn"] == "Yes") ],
                ax =ax, color="Green", shade= True);
ax.legend(["Not Chu0rn","Churn"],loc="upper right");
ax.set_ylabel("Density");
ax.set_xlabel("Total Charges");
ax.set_title("Verteilung: Churn vs. Total Charges");

../_images/1intro_customer-churn_69_0.png

Interessanterweise scheint der höhere Total Charge keine direkte Auswirkung auf erhöhte Abwanderungwahrscheinlichkeit zu haben.

[51]:

fig = px.box(df, x="Churn", y = "tenure")

# Update yaxis properties
fig.update_yaxes(title_text="Tenure (Months)", row=1, col=1)
# Update xaxis properties
fig.update_xaxes(title_text="Churn", row=1, col=1)

# Update size and title
fig.update_layout(autosize=True, width=750, height=600,
    title_font=dict(size=25, family="Courier"),
    title="<b>Tenure vs Churn</b>",
)

fig.show()

../_images/1intro_customer-churn_71_0.png

NEUE Kunden scheinen tendenziell eher abzuwandern.

[52]:

# Plotte den Correlation Heatmap (pandas nutzt Pearson Correlation by default)

plt.figure(figsize=(25, 10))

corr = df.apply(lambda x: pd.factorize(x)[0]).corr()

mask = np.triu(np.ones_like(corr, dtype=bool))

ax = sns.heatmap(corr, mask=mask, xticklabels=corr.columns, yticklabels=corr.columns, annot=True, linewidths=.2, cmap="coolwarm", vmin=-1, vmax=1)

../_images/1intro_customer-churn_73_0.png

[53]:

df.OnlineSecurity.value_counts()

[53]:

OnlineSecurity
No                     3497
Yes                    2015
No internet service    1520
Name: count, dtype: int64

Data Preprocessing¶

Zielvariable CHURN¶

Da es sich um ein binäres Klassifikationsproblem hanaelt, sollte die Target-Variable eine binäre Variable sein (numerisch ausgedrückt). Da sie aktuell noch nur String ist, wird nun die Churn-Spalte als Erstes in binary umgewandelt.

[54]:

df["Churn"] = df["Churn"].map({"Yes": 1, "No": 0})

df.head()

[54]:

	customerID	gender	Partner	Dependents	tenure	PhoneService	MultipleLines	InternetService	OnlineSecurity	OnlineBackup	DeviceProtection	TechSupport	StreamingTV	StreamingMovies	Contract	PaperlessBilling	PaymentMethod	MonthlyCharges	TotalCharges	Churn
0	7590-VHVEG	Female	Yes	No	1	No	No phone service	DSL	No	Yes	No	No	No	No	Month-to-month	Yes	Electronic check	29.85	29.85	0
1	5575-GNVDE	Male	No	No	34	Yes	No	DSL	Yes	No	Yes	No	No	No	One year	No	Mailed check	56.95	1889.50	0
2	3668-QPYBK	Male	No	No	2	Yes	No	DSL	Yes	Yes	No	No	No	No	Month-to-month	Yes	Mailed check	53.85	108.15	1
3	7795-CFOCW	Male	No	No	45	No	No phone service	DSL	Yes	No	Yes	Yes	No	No	One year	No	Bank transfer (automatic)	42.30	1840.75	0
4	9237-HQITU	Female	No	No	2	Yes	No	Fiber optic	No	No	No	No	No	No	Month-to-month	Yes	Electronic check	70.70	151.65	1

Wir entfernen die customerID-Spalte.

[55]:

## Verwerfe ID-columns
if "customerID" in df:
    df.drop("customerID", axis=1, inplace=True)

df

[55]:

	gender	SeniorCitizen	Partner	Dependents	tenure	PhoneService	MultipleLines	InternetService	OnlineSecurity	OnlineBackup	DeviceProtection	TechSupport	StreamingTV	StreamingMovies	Contract	PaperlessBilling	PaymentMethod	MonthlyCharges	TotalCharges	Churn
0	Female	0	Yes	No	1	No	No phone service	DSL	No	Yes	No	No	No	No	Month-to-month	Yes	Electronic check	29.85	29.85	0
1	Male	0	No	No	34	Yes	No	DSL	Yes	No	Yes	No	No	No	One year	No	Mailed check	56.95	1889.50	0
2	Male	0	No	No	2	Yes	No	DSL	Yes	Yes	No	No	No	No	Month-to-month	Yes	Mailed check	53.85	108.15	1
3	Male	0	No	No	45	No	No phone service	DSL	Yes	No	Yes	Yes	No	No	One year	No	Bank transfer (automatic)	42.30	1840.75	0
4	Female	0	No	No	2	Yes	No	Fiber optic	No	No	No	No	No	No	Month-to-month	Yes	Electronic check	70.70	151.65	1
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
7027	Male	0	Yes	Yes	24	Yes	Yes	DSL	Yes	No	Yes	Yes	Yes	Yes	One year	Yes	Mailed check	84.80	1990.50	0
7028	Female	0	Yes	Yes	72	Yes	Yes	Fiber optic	No	Yes	Yes	No	Yes	Yes	One year	Yes	Credit card (automatic)	103.20	7362.90	0
7029	Female	0	Yes	Yes	11	No	No phone service	DSL	Yes	No	No	No	No	No	Month-to-month	Yes	Electronic check	29.60	346.45	0
7030	Male	1	Yes	No	4	Yes	Yes	Fiber optic	No	No	No	No	No	No	Month-to-month	Yes	Mailed check	74.40	306.60	1
7031	Male	0	No	No	66	Yes	No	Fiber optic	Yes	No	Yes	Yes	Yes	Yes	Two year	Yes	Bank transfer (automatic)	105.65	6844.50	0

7032 rows × 20 columns

Nun haben wir uns schon sehr vieles angeschaut und viel getan …¶

Muessten wir noch etwas tun, bevor wir die Daten splitten? Können wir den Datensatz nun direkt dem Modell zuführen?¶

Kategorische Spalten umwandeln¶

Um die kategorischen Spalten richtig in numerische Spalten umzuwandeln, lass uns sie genauer anschauen und identifizieren, was sich hinter den Strings wirklich verbirgt.

[56]:

df

[56]:

	gender	SeniorCitizen	Partner	Dependents	tenure	PhoneService	MultipleLines	InternetService	OnlineSecurity	OnlineBackup	DeviceProtection	TechSupport	StreamingTV	StreamingMovies	Contract	PaperlessBilling	PaymentMethod	MonthlyCharges	TotalCharges	Churn
0	Female	0	Yes	No	1	No	No phone service	DSL	No	Yes	No	No	No	No	Month-to-month	Yes	Electronic check	29.85	29.85	0
1	Male	0	No	No	34	Yes	No	DSL	Yes	No	Yes	No	No	No	One year	No	Mailed check	56.95	1889.50	0
2	Male	0	No	No	2	Yes	No	DSL	Yes	Yes	No	No	No	No	Month-to-month	Yes	Mailed check	53.85	108.15	1
3	Male	0	No	No	45	No	No phone service	DSL	Yes	No	Yes	Yes	No	No	One year	No	Bank transfer (automatic)	42.30	1840.75	0
4	Female	0	No	No	2	Yes	No	Fiber optic	No	No	No	No	No	No	Month-to-month	Yes	Electronic check	70.70	151.65	1
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
7027	Male	0	Yes	Yes	24	Yes	Yes	DSL	Yes	No	Yes	Yes	Yes	Yes	One year	Yes	Mailed check	84.80	1990.50	0
7028	Female	0	Yes	Yes	72	Yes	Yes	Fiber optic	No	Yes	Yes	No	Yes	Yes	One year	Yes	Credit card (automatic)	103.20	7362.90	0
7029	Female	0	Yes	Yes	11	No	No phone service	DSL	Yes	No	No	No	No	No	Month-to-month	Yes	Electronic check	29.60	346.45	0
7030	Male	1	Yes	No	4	Yes	Yes	Fiber optic	No	No	No	No	No	No	Month-to-month	Yes	Mailed check	74.40	306.60	1
7031	Male	0	No	No	66	Yes	No	Fiber optic	Yes	No	Yes	Yes	Yes	Yes	Two year	Yes	Bank transfer (automatic)	105.65	6844.50	0

7032 rows × 20 columns

DatenAnalyse:¶

Ist die Spalte:

binary?
boolean?
categorical?
numeric?

Datentypen und ihre Spalten:¶

binary: gender;
boolean: SeniorCitizen(*), Partner, Dependents, PhoneService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies, PaperlessBilling
categorical: MultipleLines, InternetService, Contract, PaymentMethod
numeric: tenure, MonthlyCharges, TotalCharges

(*)SeniorCitizen: In diesem Datensatz ist die Spalte bereits durch 0 und 1 ausgedrückt, d.h. rein rechnerisch bräuchte sie nicht nochmal extra als boolean aufgenommen werden. Allerdings sorgt es für mehr Robustheit und Stabilität in der Pipeline und somit Best Practice, doch alle Datentypen entsprechend zu identifizieren.

Recap:¶

OrdinalEncoder: gut für kategorische Werte, die eine innere Ordnung bzw Rang haben, z.B. „niedrig“, „medium“, „hoch“ in Umfrageantworte
OneHotEncoder: gut für kategorische Features ohne bestimmte innere Ordnung in einem „one-hot numerischen Array“.

Achtung: OneHotEncoder kann die Dimension (Anzahl Spalten) der Feature-Matrix drastisch erhöhen und „spärliche Matrizen“ (scarce matrices) mit lauter „0“ in der Spalte kreieren, wenn die Originalspalte viele verschieden Werte hat (hohe Kardinalität), was wiederum schlechte Datenqualität bedeuten würde. Alternativen: Es gibt auch weitere Alternativen zum One-Hot-Encoding, z.B. kann man auf Frequency (Count) Encoding ausweichen:
- hier werden Kategorien basierend auf der Frequency (Häufigkeit, in denen sie auftreten) im Datensatz. Das kann helfen, die Wichtigkeit dieser Kategorie im Datensatz direkt zu erfassen und mitzugeben. Hat aber auch Pitfalls, z.B. wenn das ein biased Datensatz ist.

Achtung!!

Label-Encoder (in scikit-learn: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html):

transformiert die ZIELVARIABLE in integers von 0 bis n-1 (n=Anzahl Klassen), gut geeignet bei „gleichwertigen“ Klassen

Unterteile die Columns in 3 Kategorien: Standardisierung, Ordinal-Encoding and One-Hot-Encoding¶

binary (direkte Umwandlung zu 0-1): gender

boolean (direkte Umwandlung zu 0-1): SeniorCitizen, Partner, Dependents, PhoneService, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, StreamingMovies, PaperlessBilling

categorical (One-Hot oder Frequency Encoding etc): MultipleLines, InternetService, Contract, PaymentMethod

numeric: tenure, MonthlyCharges, TotalCharges

[57]:

cols_binary = ["gender"]
cols_boolean = ["SeniorCitizen", "Partner", "Dependents", "PhoneService", "OnlineSecurity", "OnlineBackup",
               "DeviceProtection", "TechSupport", "StreamingTV", "StreamingMovies", "PaperlessBilling"]
cols_cat = ["MultipleLines", "InternetService", "Contract", "PaymentMethod"]
cols_numeric = ["tenure", "MonthlyCharges", "TotalCharges"]

[58]:

df[cols_boolean]

[58]:

	SeniorCitizen	Partner	Dependents	PhoneService	OnlineSecurity	OnlineBackup	DeviceProtection	TechSupport	StreamingTV	StreamingMovies	PaperlessBilling
0	0	Yes	No	No	No	Yes	No	No	No	No	Yes
1	0	No	No	Yes	Yes	No	Yes	No	No	No	No
2	0	No	No	Yes	Yes	Yes	No	No	No	No	Yes
3	0	No	No	No	Yes	No	Yes	Yes	No	No	No
4	0	No	No	Yes	No	No	No	No	No	No	Yes
...	...	...	...	...	...	...	...	...	...	...	...
7027	0	Yes	Yes	Yes	Yes	No	Yes	Yes	Yes	Yes	Yes
7028	0	Yes	Yes	Yes	No	Yes	Yes	No	Yes	Yes	Yes
7029	0	Yes	Yes	No	Yes	No	No	No	No	No	Yes
7030	1	Yes	No	Yes	No	No	No	No	No	No	Yes
7031	0	No	No	Yes	Yes	No	Yes	Yes	Yes	Yes	Yes

7032 rows × 11 columns

[59]:

df[cols_cat]

[59]:

	MultipleLines	InternetService	Contract	PaymentMethod
0	No phone service	DSL	Month-to-month	Electronic check
1	No	DSL	One year	Mailed check
2	No	DSL	Month-to-month	Mailed check
3	No phone service	DSL	One year	Bank transfer (automatic)
4	No	Fiber optic	Month-to-month	Electronic check
...	...	...	...	...
7027	Yes	DSL	One year	Mailed check
7028	Yes	Fiber optic	One year	Credit card (automatic)
7029	No phone service	DSL	Month-to-month	Electronic check
7030	Yes	Fiber optic	Month-to-month	Mailed check
7031	No	Fiber optic	Two year	Bank transfer (automatic)

7032 rows × 4 columns

[60]:

# Mapping von "Female" zu 1 und "Male" zu 0

df["gender"] = df["gender"].map({"Female": 1, "Male": 0})
df["gender"]

[60]:

0       1
1       0
2       0
3       0
4       1
       ..
7027    0
7028    1
7029    1
7030    0
7031    0
Name: gender, Length: 7032, dtype: int64

[61]:

cols_boolean

[61]:

['SeniorCitizen',
 'Partner',
 'Dependents',
 'PhoneService',
 'OnlineSecurity',
 'OnlineBackup',
 'DeviceProtection',
 'TechSupport',
 'StreamingTV',
 'StreamingMovies',
 'PaperlessBilling']

One-Hot-Encoding Spalten mit Werten auffüllen.¶

Von den original-kartegorischen Spaltenwerten zu den neuen Werten in den neuen One-Hot-Encoded Spalten:

Wir erinnern uns: Ziel liegt darin, die die Spalten mit dem Textwert ‚yes‘ / ‚no‘ in numerische Binärwerte umzuwandeln

eq("yes"):
- vergleicht elementweise: Ist der Wert in der Spalte gleich "yes"?
- Ergebnis ist eine Boolean-Serie: True für "yes", False für "no".
- falls Spalte bereits als boolean vorliegt, bräuchte man das ggf nicht, da direkt multipliziert werden kann.
mul(1)
- elementweise Multiplikation und mappt Booleans zu Zahlen (True wird intern als 1 interpretiert und False als 0), daher wird True * 1 => 1 und False * 1 => 0. Somit wird aus ‚yes/no‘ effektiv 1/0.

=> d.h. man könnte auch df[col].map({"yes":1,"no":0}) nutzen anstatt df[col].eq("yes").mul(1).

[62]:

for col in cols_boolean:
    df[col] = df[col].eq("yes").mul(1)

df[cols_boolean]

[62]:

	SeniorCitizen	Partner	Dependents	PhoneService	OnlineSecurity	OnlineBackup	DeviceProtection	TechSupport	StreamingTV	StreamingMovies	PaperlessBilling
0	0	0	0	0	0	0	0	0	0	0	0
1	0	0	0	0	0	0	0	0	0	0	0
2	0	0	0	0	0	0	0	0	0	0	0
3	0	0	0	0	0	0	0	0	0	0	0
4	0	0	0	0	0	0	0	0	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...
7027	0	0	0	0	0	0	0	0	0	0	0
7028	0	0	0	0	0	0	0	0	0	0	0
7029	0	0	0	0	0	0	0	0	0	0	0
7030	0	0	0	0	0	0	0	0	0	0	0
7031	0	0	0	0	0	0	0	0	0	0	0

7032 rows × 11 columns

[63]:

# One-Hot-Encoding für die drei kategorischen Spalten

df_onehot_encoded = pd.get_dummies(df, columns=cols_cat) # dtype=int

print(df_onehot_encoded.shape)
df_onehot_encoded

(7032, 29)

[63]:

	gender	SeniorCitizen	Partner	Dependents	tenure	PhoneService	OnlineSecurity	OnlineBackup	DeviceProtection	TechSupport	StreamingTV	StreamingMovies	PaperlessBilling	MonthlyCharges	TotalCharges	Churn	MultipleLines_No	MultipleLines_No phone service	MultipleLines_Yes	InternetService_DSL	InternetService_Fiber optic	InternetService_No	Contract_Month-to-month	Contract_One year	Contract_Two year	PaymentMethod_Bank transfer (automatic)	PaymentMethod_Credit card (automatic)	PaymentMethod_Electronic check	PaymentMethod_Mailed check
0	1	0	0	0	1	0	0	0	0	0	0	0	0	29.85	29.85	0	False	True	False	True	False	False	True	False	False	False	False	True	False
1	0	0	0	0	34	0	0	0	0	0	0	0	0	56.95	1889.50	0	True	False	False	True	False	False	False	True	False	False	False	False	True
2	0	0	0	0	2	0	0	0	0	0	0	0	0	53.85	108.15	1	True	False	False	True	False	False	True	False	False	False	False	False	True
3	0	0	0	0	45	0	0	0	0	0	0	0	0	42.30	1840.75	0	False	True	False	True	False	False	False	True	False	True	False	False	False
4	1	0	0	0	2	0	0	0	0	0	0	0	0	70.70	151.65	1	True	False	False	False	True	False	True	False	False	False	False	True	False
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
7027	0	0	0	0	24	0	0	0	0	0	0	0	0	84.80	1990.50	0	False	False	True	True	False	False	False	True	False	False	False	False	True
7028	1	0	0	0	72	0	0	0	0	0	0	0	0	103.20	7362.90	0	False	False	True	False	True	False	False	True	False	False	True	False	False
7029	1	0	0	0	11	0	0	0	0	0	0	0	0	29.60	346.45	0	False	True	False	True	False	False	True	False	False	False	False	True	False
7030	0	0	0	0	4	0	0	0	0	0	0	0	0	74.40	306.60	1	False	False	True	False	True	False	True	False	False	False	False	False	True
7031	0	0	0	0	66	0	0	0	0	0	0	0	0	105.65	6844.50	0	True	False	False	False	True	False	False	False	True	True	False	False	False

7032 rows × 29 columns

[64]:

df_onehot_encoded.columns.tolist()

[64]:

['gender',
 'SeniorCitizen',
 'Partner',
 'Dependents',
 'tenure',
 'PhoneService',
 'OnlineSecurity',
 'OnlineBackup',
 'DeviceProtection',
 'TechSupport',
 'StreamingTV',
 'StreamingMovies',
 'PaperlessBilling',
 'MonthlyCharges',
 'TotalCharges',
 'Churn',
 'MultipleLines_No',
 'MultipleLines_No phone service',
 'MultipleLines_Yes',
 'InternetService_DSL',
 'InternetService_Fiber optic',
 'InternetService_No',
 'Contract_Month-to-month',
 'Contract_One year',
 'Contract_Two year',
 'PaymentMethod_Bank transfer (automatic)',
 'PaymentMethod_Credit card (automatic)',
 'PaymentMethod_Electronic check',
 'PaymentMethod_Mailed check']

[65]:

index_label = df_onehot_encoded.columns.tolist().index("Churn")
print(df.columns.tolist()[index_label])
cols_onehot_encoded = df_onehot_encoded.columns.tolist()[index_label+1:]
print(len(cols_onehot_encoded))
cols_onehot_encoded

PaperlessBilling
13

[65]:

['MultipleLines_No',
 'MultipleLines_No phone service',
 'MultipleLines_Yes',
 'InternetService_DSL',
 'InternetService_Fiber optic',
 'InternetService_No',
 'Contract_Month-to-month',
 'Contract_One year',
 'Contract_Two year',
 'PaymentMethod_Bank transfer (automatic)',
 'PaymentMethod_Credit card (automatic)',
 'PaymentMethod_Electronic check',
 'PaymentMethod_Mailed check']

[ ]:

One-Hot-Encoding Spalten mit Werten auffüllen. (genaue Erklärung siehe oben)¶

Die df[col].eq("yes").mul(1) transformiert lediglich von "yes" zu 1 und alles andere zu 0. D.h. man könnte auch df[col].map({"yes":1, "no":0}) nutzen anstatt df[col].eq("yes").mul(1).

[66]:

for col in cols_onehot_encoded:
    df_onehot_encoded[col] = df_onehot_encoded[col].eq('yes').mul(1)

[67]:

df_onehot_encoded

[67]:

	gender	SeniorCitizen	Partner	Dependents	tenure	PhoneService	OnlineSecurity	OnlineBackup	DeviceProtection	TechSupport	StreamingTV	StreamingMovies	PaperlessBilling	MonthlyCharges	TotalCharges	Churn	MultipleLines_No	MultipleLines_No phone service	MultipleLines_Yes	InternetService_DSL	InternetService_Fiber optic	InternetService_No	Contract_Month-to-month	Contract_One year	Contract_Two year	PaymentMethod_Bank transfer (automatic)	PaymentMethod_Credit card (automatic)	PaymentMethod_Electronic check	PaymentMethod_Mailed check
0	1	0	0	0	1	0	0	0	0	0	0	0	0	29.85	29.85	0	0	0	0	0	0	0	0	0	0	0	0	0	0
1	0	0	0	0	34	0	0	0	0	0	0	0	0	56.95	1889.50	0	0	0	0	0	0	0	0	0	0	0	0	0	0
2	0	0	0	0	2	0	0	0	0	0	0	0	0	53.85	108.15	1	0	0	0	0	0	0	0	0	0	0	0	0	0
3	0	0	0	0	45	0	0	0	0	0	0	0	0	42.30	1840.75	0	0	0	0	0	0	0	0	0	0	0	0	0	0
4	1	0	0	0	2	0	0	0	0	0	0	0	0	70.70	151.65	1	0	0	0	0	0	0	0	0	0	0	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
7027	0	0	0	0	24	0	0	0	0	0	0	0	0	84.80	1990.50	0	0	0	0	0	0	0	0	0	0	0	0	0	0
7028	1	0	0	0	72	0	0	0	0	0	0	0	0	103.20	7362.90	0	0	0	0	0	0	0	0	0	0	0	0	0	0
7029	1	0	0	0	11	0	0	0	0	0	0	0	0	29.60	346.45	0	0	0	0	0	0	0	0	0	0	0	0	0	0
7030	0	0	0	0	4	0	0	0	0	0	0	0	0	74.40	306.60	1	0	0	0	0	0	0	0	0	0	0	0	0	0
7031	0	0	0	0	66	0	0	0	0	0	0	0	0	105.65	6844.50	0	0	0	0	0	0	0	0	0	0	0	0	0	0

7032 rows × 29 columns

Split the Data to x_train, x_test, y_train, y_test¶

[68]:

# Split X and y
X = df_onehot_encoded.drop(columns = ["Churn"])
y = df_onehot_encoded["Churn"].values

[69]:

# Splitting the data into train and test sets

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.30, random_state = 42, stratify=y)

[70]:

def distplot(feature, frame, color="r"):
    plt.figure(figsize=(8,3))
    plt.title("Verteilung für {}".format(feature))
    ax = sns.distplot(frame[feature], color= color)

num_cols = ["tenure", "MonthlyCharges", "TotalCharges"]
for feat in num_cols: distplot(feat, df)

../_images/1intro_customer-churn_105_0.png

../_images/1intro_customer-churn_105_1.png

../_images/1intro_customer-churn_105_2.png

Überlegung: Welchen Scaler sollte ich nutzen?¶

Da die numerischen Features über verschiedene Wertebereiche (value ranges) verteilt sind, können wir hier den Standard Scalar nutzen, um sie alle auf den gleichen Bereich runterzuskalieren.

Normalisiere die numerischen Attribute¶

[71]:

df_std = pd.DataFrame(StandardScaler().fit_transform(df[num_cols].astype('float64')),
                       columns=num_cols)
for feat in numerical_cols: distplot(feat, df_std, color='c')

../_images/1intro_customer-churn_108_0.png

../_images/1intro_customer-churn_108_1.png

../_images/1intro_customer-churn_108_2.png

…

[72]:

scaler= StandardScaler()

X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])

Modellieren¶

Es handelt sich hierbei um einen Datensatz mit klar definierten Labels (Churn: Yes oder No), die eine Klassifikationsproblematik beschreiben.

Dazu werden wir nun mehrere Klassifikationsmodelle testen.

K-Nearest Neighbor Classifier¶

Der K-Nearest-Neighbor Classifier ist eine nichtparametrische Methode zur Schätzung von Wahrscheinlichkeitsdichtefunktionen, mehr Details: https://de.wikipedia.org/wiki/N%C3%A4chste-Nachbarn-Klassifikation

Hier nutzen wir die kNN-Implemtation von scikit-learn: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

[73]:

# Initiere das knn-Modell

knn_model = KNeighborsClassifier(n_neighbors = 11)

[74]:

# Trainiere das Modell mit unseren Trainingsdaten

knn_model.fit(X_train,y_train)

[74]:

KNeighborsClassifier(n_neighbors=11)

In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

[75]:

# die "score"-Funktion scoret direkt die predicted Resultate vom X_test gegen die wahren y-Werte und gibt die Accuracy zurück

accuracy_knn = knn_model.score(X_test,y_test)
print("KNN accuracy:",accuracy_knn)
print(knn_model.score(X_test,y_test))

KNN accuracy: 0.7611374407582938
0.7611374407582938

[76]:

# Alternativ kann man direkt mit dem Modell auf den X_test (Testdatensatz) vorhersagen und diese Vorhersagen gegen die wahren Werte des Testdatensatzes mit sklearn evaluieren

y_pred = knn_model.predict(X_test)
prec = precision_score(y_test, y_pred, average=None)
recall = recall_score(y_test, y_pred, average=None)
print("precision: ", prec)
print("recall: ", recall)

precision:  [0.81119714 0.56612529]
recall:  [0.87927695 0.43493761]

[77]:

# Zudem gibt es den "Classification Report", der direkt mehrere Metriken Resultate ausgibt

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.81      0.88      0.84      1549
           1       0.57      0.43      0.49       561

    accuracy                           0.76      2110
   macro avg       0.69      0.66      0.67      2110
weighted avg       0.75      0.76      0.75      2110

[78]:

# Predict directly with the trained model

y_pred = knn_model.predict(X_test)
print(y_pred)

[0 1 0 ... 0 0 0]

Logistic Regression Classifier¶

Der Logistic Regression Classifier basiert auf Linear Regression, nur gibt sie am Ende eine

[79]:

lr_model = LogisticRegression()
lr_model.fit(X_train,y_train)
accuracy_lr = lr_model.score(X_test,y_test)
print("Logistic Regression accuracy is :",accuracy_lr)

Logistic Regression accuracy is : 0.776303317535545

[80]:

# Make predictions
lr_pred= lr_model.predict(X_test)
report = classification_report(y_test,lr_pred)
print(report)

              precision    recall  f1-score   support

           0       0.81      0.90      0.86      1549
           1       0.61      0.42      0.50       561

    accuracy                           0.78      2110
   macro avg       0.71      0.66      0.68      2110
weighted avg       0.76      0.78      0.76      2110

[81]:

plt.figure(figsize=(4,3))
sns.heatmap(confusion_matrix(y_test, lr_pred),
                annot=True,fmt = "d",linecolor="k",linewidths=3)

plt.title("LOGISTIC REGRESSION CONFUSION MATRIX",fontsize=14)
plt.show()

../_images/1intro_customer-churn_122_0.png

Decision Tree Classifier¶

[82]:

dt_model = DecisionTreeClassifier()
dt_model.fit(X_train,y_train)
predictdt_y = dt_model.predict(X_test)
accuracy_dt = dt_model.score(X_test,y_test)
print("Decision Tree accuracy is :",accuracy_dt)

Decision Tree accuracy is : 0.7018957345971564

[83]:

print(classification_report(y_test, predictdt_y))

              precision    recall  f1-score   support

           0       0.80      0.79      0.80      1549
           1       0.44      0.45      0.45       561

    accuracy                           0.70      2110
   macro avg       0.62      0.62      0.62      2110
weighted avg       0.70      0.70      0.70      2110

AdaBoost Classifier¶

[84]:

a_model = AdaBoostClassifier()
a_model.fit(X_train,y_train)
a_preds = a_model.predict(X_test)
print("AdaBoost Classifier accuracy")
metrics.accuracy_score(y_test, a_preds)

AdaBoost Classifier accuracy

[84]:

0.7838862559241706

[85]:

print(classification_report(y_test, a_preds))

              precision    recall  f1-score   support

           0       0.82      0.90      0.86      1549
           1       0.63      0.47      0.53       561

    accuracy                           0.78      2110
   macro avg       0.72      0.68      0.70      2110
weighted avg       0.77      0.78      0.77      2110

[86]:

plt.figure(figsize=(4,3))
sns.heatmap(confusion_matrix(y_test, a_preds),
                annot=True,fmt = "d",linecolor="k",linewidths=3)

plt.title("AdaBoost Classifier Confusion Matrix",fontsize=14)
plt.show()

../_images/1intro_customer-churn_129_0.png

Gradient Boosting Classifier¶

[87]:

gb = GradientBoostingClassifier()
gb.fit(X_train, y_train)
gb_pred = gb.predict(X_test)
print("Gradient Boosting Classifier", accuracy_score(y_test, gb_pred))

Gradient Boosting Classifier 0.7781990521327015

[88]:

print(classification_report(y_test, gb_pred))

              precision    recall  f1-score   support

           0       0.82      0.90      0.86      1549
           1       0.62      0.44      0.51       561

    accuracy                           0.78      2110
   macro avg       0.72      0.67      0.68      2110
weighted avg       0.76      0.78      0.76      2110

[89]:

plt.figure(figsize=(4,3))
sns.heatmap(confusion_matrix(y_test, gb_pred),
                annot=True,fmt = "d",linecolor="k",linewidths=3)

plt.title("Gradient Boosting Classifier Confusion Matrix",fontsize=14)
plt.show()

../_images/1intro_customer-churn_133_0.png

Voting Classifier¶

Der Voting Classifier ist ein Mixture Modell, der mehrere Modelle vereint, die alle ihre Predictions machen und dann über Majority voting entschieden wird, welche Prediction Output rausgegeben wird.

[90]:

from sklearn.ensemble import VotingClassifier
clf1 = GradientBoostingClassifier()
clf2 = LogisticRegression()
clf3 = AdaBoostClassifier()
eclf1 = VotingClassifier(estimators=[("gbc", clf1), ("lr", clf2), ("abc", clf3)], voting="soft")
eclf1.fit(X_train, y_train)
predictions = eclf1.predict(X_test)
print("Final Accuracy Score ")
print(accuracy_score(y_test, predictions))

Final Accuracy Score
0.7834123222748816

[91]:

print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

           0       0.82      0.91      0.86      1549
           1       0.63      0.44      0.52       561

    accuracy                           0.78      2110
   macro avg       0.73      0.67      0.69      2110
weighted avg       0.77      0.78      0.77      2110

[92]:

plt.figure(figsize=(4,3))
sns.heatmap(confusion_matrix(y_test, predictions),
                annot=True,fmt = "d",linecolor="k",linewidths=3)

plt.title("FINAL CONFUSION MATRIX",fontsize=14)
plt.show()

../_images/1intro_customer-churn_137_0.png

	n_neighbors n_neighbors: int, default=5 Number of neighbors to use by default for :meth:`kneighbors` queries.	11
	weights weights: {'uniform', 'distance'}, callable or None, default='uniform' Weight function used in prediction. Possible values: - 'uniform' : uniform weights. All points in each neighborhood are weighted equally. - 'distance' : weight points by the inverse of their distance. in this case, closer neighbors of a query point will have a greater influence than neighbors which are further away. - [callable] : a user-defined function which accepts an array of distances, and returns an array of the same shape containing the weights. Refer to the example entitled :ref:`sphx_glr_auto_examples_neighbors_plot_classification.py` showing the impact of the `weights` parameter on the decision boundary.	'uniform'
	algorithm algorithm: {'auto', 'ball_tree', 'kd_tree', 'brute'}, default='auto' Algorithm used to compute the nearest neighbors: - 'ball_tree' will use :class:`BallTree` - 'kd_tree' will use :class:`KDTree` - 'brute' will use a brute-force search. - 'auto' will attempt to decide the most appropriate algorithm based on the values passed to :meth:`fit` method. Note: fitting on sparse input will override the setting of this parameter, using brute force.	'auto'
	leaf_size leaf_size: int, default=30 Leaf size passed to BallTree or KDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.	30
	p p: float, default=2 Power parameter for the Minkowski metric. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used. This parameter is expected to be positive.	2
	metric metric: str or callable, default='minkowski' Metric to use for distance computation. Default is "minkowski", which results in the standard Euclidean distance when p = 2. See the documentation of `scipy.spatial.distance `_ and the metrics listed in :class:`~sklearn.metrics.pairwise.distance_metrics` for valid metric values. If metric is "precomputed", X is assumed to be a distance matrix and must be square during fit. X may be a :term:`sparse graph`, in which case only "nonzero" elements may be considered neighbors. If metric is a callable function, it takes two arrays representing 1D vectors as inputs and must return one value indicating the distance between those vectors. This works for Scipy's metrics, but is less efficient than passing the metric name as a string.	'minkowski'
	metric_params metric_params: dict, default=None Additional keyword arguments for the metric function.	None
	n_jobs n_jobs: int, default=None The number of parallel jobs to run for neighbors search. ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context. ``-1`` means using all processors. See :term:`Glossary ` for more details. Doesn't affect :meth:`fit` method.	None