{ "cells": [ { "cell_type": "markdown", "id": "da945f8f-ac13-4a5d-85e2-76618c5861f3", "metadata": {}, "source": [ "# Practical example - Data processing - Deduplication\n", "\n", "## 1. Data\n", "## 1.1 Load the sample data" ] }, { "cell_type": "code", "execution_count": 1, "id": "b40d877a-7429-463b-9443-1f423d36121a", "metadata": { "execution": { "iopub.execute_input": "2026-03-24T17:28:28.567920Z", "iopub.status.busy": "2026-03-24T17:28:28.567338Z", "iopub.status.idle": "2026-03-24T17:28:28.850366Z", "shell.execute_reply": "2026-03-24T17:28:28.849835Z", "shell.execute_reply.started": "2026-03-24T17:28:28.567873Z" } }, "outputs": [], "source": [ "import pandas as pd" ] }, { "cell_type": "code", "execution_count": 2, "id": "c5fca3df-d2c0-47a5-81e7-1ffa864bb90a", "metadata": { "execution": { "iopub.execute_input": "2026-03-24T17:28:29.141034Z", "iopub.status.busy": "2026-03-24T17:28:29.140642Z", "iopub.status.idle": "2026-03-24T17:28:29.590721Z", "shell.execute_reply": "2026-03-24T17:28:29.589713Z", "shell.execute_reply.started": "2026-03-24T17:28:29.141011Z" } }, "outputs": [], "source": [ "customers = pd.read_csv(\n", "    \"https://raw.githubusercontent.com/kjam/data-cleaning-101/master/data/customer_data_duped.csv\",\n", "    encoding=\"utf-8\",\n", ")" ] }, { "cell_type": "markdown", "id": "67b62ec8-4c31-4e77-b46d-32364b01a79b", "metadata": {}, "source": [ "## 1.2 Understand the data\n", "### 1.2.1. Display the data" ] }, { "cell_type": "code", "execution_count": 3, "id": "c7cf08e8-bebd-47f7-824d-60f16c86c85b", "metadata": { "execution": { "iopub.execute_input": "2026-03-24T17:28:30.271326Z", "iopub.status.busy": "2026-03-24T17:28:30.271039Z", "iopub.status.idle": "2026-03-24T17:28:30.288444Z", "shell.execute_reply": "2026-03-24T17:28:30.287551Z", "shell.execute_reply.started": "2026-03-24T17:28:30.271307Z" }, "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " name job \\\n", "0 Patricia Schaefer Programmer, systems \n", "1 Olivie Dubois Ingénieur recherche et développement en agroal... \n", "2 Mary Davies-Kirk Public affairs consultant \n", "3 Miroslawa Eckbauer Dispensing optician \n", "4 Richard Bauer Accountant, chartered certified \n", "... ... ... \n", "2075 Maurice Stey Systems developer \n", "2076 Linda Alexander Commrcil horiculuri \n", "2077 Diane Bailly Pharmacien \n", "2078 Jorge Riba Cerdán Hotel manager \n", "2079 Ryan Thompson Brewing technologist \n", "\n", " company street_address \\\n", "0 Estrada-Best 398 Paul Drive \n", "1 Moreno rue Lucas Benard \n", "2 Baker Ltd Flat 3\\nPugh mews \n", "3 Ladeck GmbH Mijo-Lübs-Straße 12 \n", "4 Hoffman-Rocha 6541 Rodriguez Wall \n", "... ... ... \n", "2075 Linke Margraf GmbH & Co. OHG Laila-Scheibe-Allee 2/0 \n", "2076 Webb, Ballald and Vasquel 5594 Persn Ciff \n", "2077 Voisin 527, rue Dijoux \n", "2078 Amador-Diego Rambla de Adriana Barceló 854 Puerta 3 \n", "2079 Smith-Sullivan 136 Rodriguez Point \n", "\n", " city state email \\\n", "0 Christianview Delaware lambdavid@gmail.com \n", "1 Saint Anastasie-les-Bains AR berthelotjacqueline@mahe.fr \n", "2 Stanleyfurt ZA middletonconor@hotmail.com \n", "3 Neubrandenburg Berlin sophia01@yahoo.de \n", "4 Carlosmouth Texas tross@jensen-ware.org \n", "... ... ... ... \n", "2075 Luckenwalde Hamburg gutknechtevelyn@niemeier.com \n", "2076 Mooneybury Maryland ahleythoa@ail.co \n", "2077 Duval-les-Bains CH aruiz@reynaud.fr \n", "2078 Huesca Asturias manuelamosquera@yahoo.com \n", "2079 Bradfordborough North Dakota lcruz@gmail.com \n", "\n", " user_name \n", "0 ndavidson \n", "1 manonallain \n", "2 colemanmichael \n", "3 romanjunitz \n", "4 adam78 \n", "... ... 
\n", "2075 dkreusel \n", "2076 kennethrchn \n", "2077 dorothee41 \n", "2078 eugenia17 \n", "2079 cnewton \n", "\n", "[2080 rows x 8 columns]" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "customers" ] }, { "cell_type": "markdown", "id": "c4c0955c-a8cc-4fbe-b304-cecc97f41bea", "metadata": {}, "source": [ "### 1.2.2. Display the data types\n", "For this we use pandas.DataFrame.dtypes:" ] }, { "cell_type": "code", "execution_count": 4, "id": "578db81c-f9ad-47c1-bf1f-c4183e6498f4", "metadata": { "execution": { "iopub.execute_input": "2026-03-24T17:28:31.950705Z", "iopub.status.busy": "2026-03-24T17:28:31.950424Z", "iopub.status.idle": "2026-03-24T17:28:31.955728Z", "shell.execute_reply": "2026-03-24T17:28:31.954998Z", "shell.execute_reply.started": "2026-03-24T17:28:31.950683Z" } }, "outputs": [ { "data": { "text/plain": [ "name str\n", "job str\n", "company str\n", "street_address str\n", "city str\n", "state str\n", "email str\n", "user_name str\n", "dtype: object" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "customers.dtypes" ] }, { "cell_type": "markdown", "id": "c58db531-b981-43a2-ac1b-319eb321c767", "metadata": {}, "source": [ "### 1.2.3. 
Identify missing values\n", "pandas.isnull indicates, for an array-like object, whether values are missing:\n", "\n", "- NaN in numeric arrays\n", "\n", "- None or NaN in object arrays\n", "\n", "- NaT in datetime-like arrays\n", "\n" ] }, { "cell_type": "code", "execution_count": 5, "id": "0540e3f9-4db0-4b91-9f7f-878f53dc1a44", "metadata": { "execution": { "iopub.execute_input": "2026-03-24T17:28:33.189746Z", "iopub.status.busy": "2026-03-24T17:28:33.189443Z", "iopub.status.idle": "2026-03-24T17:28:33.194748Z", "shell.execute_reply": "2026-03-24T17:28:33.194120Z", "shell.execute_reply.started": "2026-03-24T17:28:33.189724Z" }, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "name 0\n", "job 0\n", "company 0\n", "street_address 0\n", "city 0\n", "state 0\n", "email 0\n", "user_name 0\n" ] } ], "source": [ "for col in customers.columns:\n", "    print(col, customers[col].isnull().sum())" ] }, { "cell_type": "markdown", "id": "799d56c7-a0e3-46a3-943c-befaacc5a74d", "metadata": {}, "source": [ "## 2. Find and remove duplicate records\n", "### 2.1. With pandas" ] }, { "cell_type": "markdown", "id": "a53120ae-2f8a-4bb5-b7a3-4cb3e96f55c0", "metadata": {}, "source": [ "#### 2.1.1 Find duplicate records" ] }, { "cell_type": "code", "execution_count": 6, "id": "1ae2365b-d980-409f-9f36-e2cee7525a35", "metadata": { "execution": { "iopub.execute_input": "2026-03-24T17:28:35.210563Z", "iopub.status.busy": "2026-03-24T17:28:35.210052Z", "iopub.status.idle": "2026-03-24T17:28:35.220555Z", "shell.execute_reply": "2026-03-24T17:28:35.219884Z", "shell.execute_reply.started": "2026-03-24T17:28:35.210519Z" } }, "outputs": [ { "data": { "text/plain": [ "0 False\n", "1 False\n", "2 False\n", "3 False\n", "4 False\n", " ... 
\n", "2075 False\n", "2076 False\n", "2077 False\n", "2078 False\n", "2079 False\n", "Length: 2080, dtype: bool" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "customers.duplicated()" ] }, { "cell_type": "markdown", "id": "30a1a1c0-839d-458c-8cce-853895cff9a8", "metadata": {}, "source": [ "customers.duplicated() returns one boolean value per row, and its truncated output does not yet tell us whether there are any duplicate records. Next, we display all records for which True is returned:\n" ] }, { "cell_type": "code", "execution_count": 7, "id": "8857b4f6-ec55-4c24-8adc-1645eca3c1c1", "metadata": { "execution": { "iopub.execute_input": "2026-03-24T17:28:36.594110Z", "iopub.status.busy": "2026-03-24T17:28:36.593605Z", "iopub.status.idle": "2026-03-24T17:28:36.605758Z", "shell.execute_reply": "2026-03-24T17:28:36.604862Z", "shell.execute_reply.started": "2026-03-24T17:28:36.594067Z" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "Empty DataFrame\n", "Columns: [name, job, company, street_address, city, state, email, user_name]\n", "Index: []" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "customers[customers.duplicated()]" ] }, { "cell_type": "markdown", "id": "5b07a4fd-a06c-4d6e-8b31-9600ff5b5726", "metadata": {}, "source": [ "Apparently there are no records that are identical in every column.\n", "\n" ] }, { "cell_type": "markdown", "id": "bd1c85e5-526b-4831-abba-77cab5d0d095", "metadata": {}, "source": [ "#### 2.1.2. Deduplicate\n", "\n", "Since pandas found no exact duplicates, removing duplicate records with drop_duplicates changes nothing and leaves the number of records at 2080.\n", "\n" ] }, { "cell_type": "code", "execution_count": 8, "id": "757fa2fb-381c-4a92-b001-6ddf0ea51d7e", "metadata": { "execution": { "iopub.execute_input": "2026-03-24T17:28:39.022869Z", "iopub.status.busy": "2026-03-24T17:28:39.022605Z", "iopub.status.idle": "2026-03-24T17:28:39.029103Z", "shell.execute_reply": "2026-03-24T17:28:39.028433Z", "shell.execute_reply.started": "2026-03-24T17:28:39.022851Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(2080, 8)\n", "(2080, 8)\n" ] } ], "source": [ "print(customers.shape)\n", "customers.drop_duplicates(inplace=True)\n", "print(customers.shape)" ] }, { "cell_type": "markdown", "id": "349b31db-dbe8-4fd1-b7a1-a6dea4192c04", "metadata": {}, "source": [ "#### 2.1.3. 
Finding and removing duplicates by user_name" ] }, { "cell_type": "markdown", "id": "a219298d-9998-4540-9b9e-a7af85b4d47f", "metadata": {}, "source": [ "We go one step further and display the records whose user_name already occurs in an earlier row, regardless of the other columns:" ] }, { "cell_type": "code", "execution_count": 9, "id": "950221bf-7983-4bc7-bb80-8f250b6903fe", "metadata": { "execution": { "iopub.execute_input": "2026-03-24T17:28:42.452186Z", "iopub.status.busy": "2026-03-24T17:28:42.451871Z", "iopub.status.idle": "2026-03-24T17:28:42.470302Z", "shell.execute_reply": "2026-03-24T17:28:42.469639Z", "shell.execute_reply.started": "2026-03-24T17:28:42.452165Z" }, "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " name job \\\n", "337 Aysel Binner Reccig officer \n", "377 Jolanta Rogge Accommodation managr \n", "506 Mrs. Frances Peters Fuiue desie \n", "545 Gerhart Krebs MBA. Surgeon \n", "592 Folkert Gnatz Meteorologist \n", "633 Manon Jacquot Ingénieur en aéronautique \n", "658 Austin Waller Insurance risk surveyor \n", "723 Wanda Moran Solicitor, Scotland \n", "762 Charles Russell Scientist, research (physical sciences) \n", "772 Waltrud Wohlgemut Designer, fashion/clothing \n", "783 Caroline Mata Engineer, elecrical \n", "889 Ricardo Ripoll Lucena Teevisi camera peratr \n", "928 Sophie Letellier du Carpentier Cnucteu e ét \n", "979 Irene Roda Dávila Eitor, maazine featres \n", "995 Abigail Hernandez Mechanical engineer \n", "1015 Mr. Paul Newton Government soa researh offer \n", "1043 Anna Adams Programmer alcatons \n", "1052 Aurélie Vidal Magistrat \n", "1062 Regina Schacht-Kusch Herbalist \n", "1120 Jeffrey Benjamin Publ house manager \n", "1170 Julio Agustín Amaya Tax aviser \n", "1339 Ing. Andrew Schleich B.A. Ln \n", "1360 Frédérique Lejeune-Daniel Tecce cse \n", "1384 Kenneth Moore Magazine journalist \n", "1423 Thomas Coulon Collecteur de fonds \n", "1433 Jerry Barnes Tour mner \n", "1452 Karen Weeks Psychotherapist, child \n", "1489 Herr Johann Eigenwillig Immigration officer \n", "1544 Pasquale Schwital Trade mark attorney \n", "1557 Stephanie Young Herpetologist \n", "1567 Carolina Reguera Sanz Fam manae \n", "1616 Sonia Amores Senir tax prfessina/tax inspectr \n", "1647 Juan Carlos Iker Boix Ros Pre phtgrapher \n", "1652 Jörg Henschel Chaity office \n", "1703 Marc Tate Ship broker \n", "1707 Joseph Hines Pyhiatri nre \n", "1722 Julie Baldwin Set deigner \n", "1759 Sarah Hoffman Exhibitin designe \n", "1796 Valentine Devaux-Roger Direceur d'ôial \n", "1809 Slavica Seidel Psychotherapist, child \n", "1820 Wenke Schweitzer Enginr, automoti \n", "1829 Dr. 
Thomas Hein Copy \n", "1837 Andrew Hart Engineer, civil (contracting) \n", "1914 Shelby Fowler Air traffic controller \n", "1938 Susan Aubry Directeur d'agence bancaire \n", "1948 Richard Karge-Kobelt Junalist maaine \n", "1960 Anna de Lobato Medcl techcl ocer \n", "1968 Zoltan Wähner B.A. Professor Emerits \n", "1995 Kenneth Dunn Programmer, systems \n", "2010 Gertraude Schomber Insurance risk surveyor \n", "2075 Maurice Stey Systems developer \n", "\n", " company \\\n", "337 Kuhl Kalleww Swifwunw & Co. KGaA \n", "377 Scholl e.V. \n", "506 Rsgers, Lawrence and Richards \n", "545 Roskoth \n", "592 Bolnbach \n", "633 Jacob \n", "658 Sexton Group \n", "723 Estes PLC \n", "762 Preston-Wilson \n", "772 Nerger AG \n", "783 Grimes Grrur \n", "889 Luzq Estraqa anq Galinqq \n", "928 Valle7 SARL \n", "979 Daza Inc \n", "995 Smith Ltd \n", "1015 LemnardmWatsmn \n", "1043 Jones Gjoup \n", "1052 Martins \n", "1062 Hartung GmbH & Co. KGaA \n", "1120 Chcn Inc \n", "1170 Piñolk Belmonke and Codina \n", "1339 Holt Putz GnR \n", "1360 Sctmitt \n", "1384 Cross, Bfll anf Diaz \n", "1423 Levy \n", "1433 Col-Wllllams \n", "1452 Rodriguez, Brady and Jackson \n", "1489 Süßebier Hänel GmbH \n", "1544 Finke \n", "1557 Bryant and Sons \n", "1567 Cami77, C7aparr7 a7d N7gu7ra \n", "1616 J5an-Núñez \n", "1647 Pont, P44om4r4s 4nd Arjon4 \n", "1652 Schicke AG \n", "1703 Wagner, Mitchell and Grimes \n", "1707 Cr4ig, G4rci4 4nd Rich4rds \n", "1722 W5ll55mson-G5rz5 \n", "1759 Hensont Wiley and Ryan \n", "1796 Leiris \n", "1809 Wulff Hande KG \n", "1820 Wesa4k KG \n", "1829 Geisel \n", "1837 Barnett LLC \n", "1914 Fields-Sanchez \n", "1938 Payet Georges S.A.S. \n", "1948 Abberb Keubeb AG \n", "1960 Maciag PLC \n", "1968 Th8e8 \n", "1995 Leonard Inc \n", "2010 Bruder \n", "2075 Linke Margraf GmbH & Co. 
OHG \n", "\n", " street_address city \\\n", "337 Batix-Kanz-Staß 5/4 Fulda \n", "377 Lrchplz 4/6 Mettmnn \n", "506 Studio \\nCarpntr kys Wes Simn \n", "545 Kühnertweg 863 Stade \n", "592 Heinfried-Austermühle-Ring 05 Eilenburg \n", "633 8, chemin Éléonore Evrard Marechal-les-Bains \n", "658 11097 Hansen Field Davidmouth \n", "723 08011 Hernandez Streets Apt. 149 Natalieshire \n", "762 6709 Ashley Circle Apt. 309 Danielberg \n", "772 Elmar-Ullmann-Allee 6 Schlüchtern \n", "783 80157 Whte Alley Sute 79 Soh Mark \n", "889 Caejón Rosario Viapana 16 Palencia \n", "928 3, boulvard Jan Augr Saint Daviddan \n", "979 Roda Carla Miró 5 Viy \n", "995 766 Adrian Ranch Ellismouth \n", "1015 Studi 86\\nKaty ill West Jue \n", "1043 22 Kateen ova Noth Joa \n", "1052 88, rue Stéphanie Letellier Rouxnec \n", "1062 Wenke-Hörle-Ring 36 Eggenfelden \n", "1120 27 Rodgrs Rdgs Apt. 269 Suth Jeffererg \n", "1170 Calleón de Gregorio Bustamante 28 Piso 7 La Pala \n", "1339 Hugasse 8/8 Hainichn \n", "1360 chemin Denise Ferrand Saint ChalotteVille \n", "1384 753 Lindsey Pine Thompsonshe \n", "1423 91, rue Laetitia Collet Dias-sur-Normand \n", "1433 30 Mpy Ovepass Jeiferview \n", "1452 233 Kevin Street Larryside \n", "1489 Langernplatz 0 Stadtsteinach \n", "1544 Detlef-Binner-Platz 0/1 Burg \n", "1557 5163 Rebecca Creek Suite 421 North Theresaberg \n", "1567 Vil e Imel Oorio 25 Madd \n", "1616 Avnida d Grgorio Manón 344 Prta 8 Ponevedr \n", "1647 Pasadzo de Josep Bentez Pso Las Palmas \n", "1652 HennyLorchRng 484 Hohensein-Ensh \n", "1703 721 Christopher View Suite 840 Watsonmouth \n", "1707 85663 Savage Gles Mcgeeon \n", "1722 58513 Paricia Res Suie 45 So Me \n", "1759 9490 Curts Spur Sute 82 Jseptwn \n", "1796 57, enue de Gros BenadBou \n", "1809 Preißgasse 0/4 Soest \n", "1820 Eies. 
7 Ba Lnwra \n", "1829 Ladeckgasse 11 Rockenhausen \n", "1837 258 Day Hollow Suite 410 Kimberlyhaven \n", "1914 533 Fitzpatrick Bypass Francesberg \n", "1938 67, rue Inès Valentin Nicolas \n", "1948 Mitschkeee 8 Mß \n", "1960 Calleón de Dolore Parea 21 At 7 Palncia \n", "1968 Stotr. 1 Saulgau \n", "1995 5361 Patterson Mission Suite 504 Villaburgh \n", "2010 Christa-Ullrich-Allee 0/1 Schwäbisch Hall \n", "2075 Laila-Scheibe-Allee 2/0 Luckenwalde \n", "\n", " state email user_name \n", "337 Berli frncoise@wgnerco christinefinke \n", "377 Thüringen inrharff@yah.d walentinabeier \n", "506 BO halenewilliams@wilson-sandes.og amy17 \n", "545 Bayern olav44@bolander.de bettyhahn \n", "592 Thüringen jaentschbirgitt@boerner.org francesco44 \n", "633 AR ilemaitre@voila.fr astrid58 \n", "658 Texas christina74@doyle-baker.biz olynn \n", "723 Oregon howardreginald@gmail.com dana91 \n", "762 South Dakota nancyescobar@brown.net ruben71 \n", "772 Rheinland-Pfalz auch-schlauchindietlind@gmx.de zitakuhl \n", "783 Iw jared52@aoo.com thomasthompson \n", "889 Lgo ev0@oo.com colomerenrique \n", "928 BS rdorm@dbmi.com anne28 \n", "979 La Rioa sldrpére@ps.cm ipeñalver \n", "995 Colorado jordan60@gmail.com mendozajody \n", "1015 VE em@mil.cm bbennett \n", "1043 KZ asleig65@aisay.co lloydann \n", "1052 SE boutineric@blin.fr iwagner \n", "1062 Sachsen-Anhalt oluebs@troest.de xklotz \n", "1120 Iinois stepanie90@rogers.co lori67 \n", "1170 Salamanca usolana@jáuregui-pedraza.om gloriaolmo \n", "1339 Neersachsen jun@putz.com jesselmaja \n", "1360 IE jchretien@costacom joseph60 \n", "1384 Colorao ashey28@rice.co todd72 \n", "1423 SC deschampsgabriel@guyot.fr michelepetit \n", "1433 Utah insnashl@gas-hais.cm christopher62 \n", "1452 Indiana gregg39@hernandez-gomez.com knapprobert \n", "1489 Thüringen haasemarieluise@noack.com istoll \n", "1544 Niedersachsen hanne-lore98@gmx.de thomas14 \n", "1557 Alaska stephenwilliams@summers.com ahawkins \n", "1567 Vicaya mordóñ@cámara.info eva16 \n", "1616 Lugo 
icent4@montenero-brroso.info sanmartínguillermo \n", "1647 Mia srgio24@gail.co luis-miguel23 \n", "1652 BadenWürtteberg huerhes@hmal.de anne-katrin51 \n", "1703 Connecticut chenjessica@hotmail.com patricia34 \n", "1707 Als bcaldern@htmail.cm emilytorres \n", "1722 Alaska diuez@uess. cmoss \n", "1759 Arizona ncole@yahoo.com csmith \n", "1796 AL rogrlro@munoz.om xherve \n", "1809 Rheinland-Pfalz tloos@krause.net abien \n", "1820 Thürige rsthveriue@mies.rg kwernecke \n", "1829 Nordrhein-Westfalen grein-grotharnim@kallert.de siegmar08 \n", "1837 Colorado brandy00@yahoo.com amy30 \n", "1914 Michigan terrystephen@anderson.org gcain \n", "1938 FI milletedith@sfr.fr tthierry \n", "1948 SachsnAnhalt nrejwgner@gmx.e muehlehenni \n", "1960 Cantaria vázqzlornzo@al.om daniel70 \n", "1968 Shlsg-Holst arlenpruschke@salz.or kklemm \n", "1995 Rhode Island kristen54@gmail.com jkent \n", "2010 Hessen gumprichalice@schmidt.de fruppert \n", "2075 Hamburg gutknechtevelyn@niemeier.com dkreusel " ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "customers[customers.duplicated([\"user_name\"])]" ] }, { "cell_type": "markdown", "id": "f268d35d-797f-49f3-9d24-63a83f62fc59", "metadata": {}, "source": [ "Filtering the data for one of these user_name values, for example christinefinke, looks like this:\n" ] }, { "cell_type": "code", "execution_count": 10, "id": "46e76915-de2b-4227-aba6-0d53c43b651b", "metadata": { "execution": { "iopub.execute_input": "2026-03-24T17:28:44.054506Z", "iopub.status.busy": "2026-03-24T17:28:44.053928Z", "iopub.status.idle": "2026-03-24T17:28:44.067363Z", "shell.execute_reply": "2026-03-24T17:28:44.066668Z", "shell.execute_reply.started": "2026-03-24T17:28:44.054454Z" } }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " name job company \\\n", "236 Aysel Binner Recycling officer Kuhl Kallert Stiftung & Co. KGaA \n", "337 Aysel Binner Reccig officer Kuhl Kalleww Swifwunw & Co. KGaA \n", "\n", " street_address city state email \\\n", "236 Beatrix-Kranz-Straße 5/4 Fulda Berlin francoise22@wagner.com \n", "337 Batix-Kanz-Staß 5/4 Fulda Berli frncoise@wgnerco \n", "\n", " user_name \n", "236 christinefinke \n", "337 christinefinke " ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "customers[customers[\"user_name\"] == \"christinefinke\"]\n" ] }, { "cell_type": "markdown", "id": "cd8ef668-a32a-41d7-840d-268baa03b2f7", "metadata": {}, "source": [ "Now we can remove the records whose user_name is a duplicate:" ] }, { "cell_type": "code", "execution_count": 11, "id": "fe3c078d-e6cc-403d-9443-9415a798327d", "metadata": { "execution": { "iopub.execute_input": "2026-03-24T17:28:45.333540Z", "iopub.status.busy": "2026-03-24T17:28:45.333277Z", "iopub.status.idle": "2026-03-24T17:28:45.339020Z", "shell.execute_reply": "2026-03-24T17:28:45.338309Z", "shell.execute_reply.started": "2026-03-24T17:28:45.333520Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(2080, 8)\n", "(2029, 8)\n" ] } ], "source": [ "print(customers.shape)\n", "customers.drop_duplicates([\"user_name\"], inplace=True)\n", "print(customers.shape)" ] }, { "cell_type": "markdown", "id": "65a4323b-c93d-447a-82fd-45b0c4d85532", "metadata": {}, "source": [ "This removed 2080 - 2029 = 51 records. By default, drop_duplicates keeps the first occurrence of each user_name and drops the later ones." ] }, { "cell_type": "markdown", "id": "78474f25-57ce-42e0-bed0-7f10fd6d61cc", "metadata": {}, "source": [ "### 2.2. 
Deduplicate with fuzzywuzzy\n", "\n", "As an alternative, and for comparison, we look at the fuzzywuzzy library for identifying and removing duplicates:\n", "\n", "https://www.python4data.science/de/latest/clean-prep/string-matching.html\n" ] }, { "cell_type": "markdown", "id": "31089ed7-9186-4c8d-86ff-5542cd444b1f", "metadata": {}, "source": [ "### 2.3. Deduplicate with Dedupe\n", "\n", "Alternatively, we can detect the duplicated data with the Dedupe library (https://docs.dedupe.io/en/latest/), which uses a shallow neural network to learn from a small amount of training data.\n", "\n", "The corresponding notebook can be found here:\n", "\n", "https://www.python4data.science/de/latest/clean-prep/deduplicate.html#3.-Dedupe" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.0" }, "widgets": { "application/vnd.jupyter.widget-state+json": { "state": {}, "version_major": 2, "version_minor": 0 } } }, "nbformat": 4, "nbformat_minor": 5 }