{ "cells": [ { "cell_type": "markdown", "id": "11bb4456-5dc2-440c-9320-4af43e032aeb", "metadata": {}, "source": [ "# Hands-on Example: LLM (2)" ] }, { "cell_type": "markdown", "id": "47a74c0f-6038-4871-9038-df59a4be02a1", "metadata": {}, "source": [ "### 2. Fine-tuning - Adapting to Legal Texts\n", "##### --- Legal questions for a fine-tuned model (local LLM)\n", "\n", "In this section we fine-tune the model `dbmdz/german-gpt2` and ask it the same two legal questions about the AI Act as in the baseline notebook.\n", "\n", "The goal is for the fine-tuned model (llm-2) to give better-grounded and more accurate answers than the baseline.\n" ] }, { "cell_type": "code", "execution_count": 1, "id": "8f0f45fe-361d-4965-a821-e69419fced67", "metadata": { "execution": { "iopub.execute_input": "2026-03-24T18:13:06.354720Z", "iopub.status.busy": "2026-03-24T18:13:06.354568Z", "iopub.status.idle": "2026-03-24T18:13:06.359355Z", "shell.execute_reply": "2026-03-24T18:13:06.357651Z", "shell.execute_reply.started": "2026-03-24T18:13:06.354702Z" } }, "outputs": [], "source": [ "# Uncomment if the packages are not installed yet\n", "\n", "import sys\n", "# !{sys.executable} -m pip install transformers datasets\n", "# !{sys.executable} -m pip install 'accelerate>=1.10.0'" ] }, { "cell_type": "code", "execution_count": 2, "id": "7e552b09-89c0-4168-8e6c-c59babe09ea5", "metadata": { "execution": { "iopub.execute_input": "2026-03-24T18:13:06.360112Z", "iopub.status.busy": "2026-03-24T18:13:06.359959Z", "iopub.status.idle": "2026-03-24T18:13:10.070226Z", "shell.execute_reply": "2026-03-24T18:13:10.069715Z", "shell.execute_reply.started": "2026-03-24T18:13:06.360096Z" } }, "outputs": [], "source": [ "import torch\n", "from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments, DataCollatorForLanguageModeling\n", "from datasets import Dataset" ] }, { "cell_type": "code", "execution_count": 3, "id": "6b807925-978d-4d97-8765-ad2beaf99097", "metadata": { "execution": { 
"iopub.execute_input": "2026-03-24T18:13:10.070728Z", "iopub.status.busy": "2026-03-24T18:13:10.070543Z", "iopub.status.idle": "2026-03-24T18:13:12.602873Z", "shell.execute_reply": "2026-03-24T18:13:12.602349Z", "shell.execute_reply.started": "2026-03-24T18:13:10.070719Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.\n" ] }, { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "00edbe7a58d546d586443954281f376b", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Loading weights: 0%| | 0/148 [00:00, ?it/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stderr", "output_type": "stream", "text": [ "\u001b[1mGPT2LMHeadModel LOAD REPORT\u001b[0m from: dbmdz/german-gpt2\n", "Key | Status | | \n", "----------------------------------------+------------+--+-\n", "transformer.h.{0...11}.attn.masked_bias | UNEXPECTED | | \n", "\n", "\u001b[3mNotes:\n", "- UNEXPECTED\u001b[3m\t:can be ignored when loading from different task/architecture; not ok if you expect identical arch.\u001b[0m\n" ] } ], "source": [ "# Lade Modell und Tokenizer (das Basis-Modell bleibt identisch)\n", "model_name = \"dbmdz/german-gpt2\" \n", "tokenizer = AutoTokenizer.from_pretrained(model_name)\n", "model = AutoModelForCausalLM.from_pretrained(model_name)\n", "model = model.bfloat16()\n", "\n", "# Da GPT-2-Modelle oft keinen expliziten Padding-Token besitzen, setzen wir hier den EOS-Token als Padding-Token.\n", "tokenizer.pad_token = tokenizer.eos_token\n", "\n", "# Konfiguriere pad_token_id im Modell \n", "# (braucht man, wenn das Modell noch nicht standardmäßig für den Umgang mit dem Padding-Token eingestellt ist)\n", "model.config.pad_token_id = tokenizer.eos_token_id\n", "model.generation_config.pad_token_id = tokenizer.pad_token_id" ] }, { "cell_type": "markdown", "id": 
"b9d3dcaf-e281-4c42-a60c-069a8812b65c", "metadata": {}, "source": [ "## Domänenspezifischer Datensatz: Auszüge aus dem AI Act\n", "Wir extrahieren zwei wichtige Absätze aus dem AI Act, die juristische Fachtermini und Anforderungen beinhalten. \n", "\n", "Hinweis: Die folgenden Textabschnitte sind exemplarisch und basieren auf öffentlich zugänglichen Informationen zum AI Act, z.B.: \"https://eur-lex.europa.eu/legal-content/DE/TXT/?uri=CELEX:32024R1689\"" ] }, { "cell_type": "code", "execution_count": 4, "id": "035f7eae-df3e-4b98-8f5d-4bec1fcc71e1", "metadata": { "execution": { "iopub.execute_input": "2026-03-24T18:13:12.603307Z", "iopub.status.busy": "2026-03-24T18:13:12.603223Z", "iopub.status.idle": "2026-03-24T18:13:12.605570Z", "shell.execute_reply": "2026-03-24T18:13:12.605113Z", "shell.execute_reply.started": "2026-03-24T18:13:12.603297Z" } }, "outputs": [], "source": [ "# Ausgewählte Absätze aus dem AI Act (Beispiele)\n", "ai_act_texts = [\n", " \"\"\"Artikel 1 – Anwendungsbereich: \n", " Diese Verordnung gilt für KI-Systeme, die in der Europäischen Union in Verkehr gebracht oder in Betrieb genommen werden, und legt die grundlegenden Anforderungen an Sicherheit, Transparenz und Verantwortlichkeit fest.\"\"\",\n", " \"\"\"Artikel 2 – Risikoklassifizierung: \n", " KI-Systeme werden in Abhängigkeit von ihrem potenziellen Risiko in verschiedene Kategorien eingeteilt. 
Hochrisiko-KI-Systeme unterliegen strengen Anforderungen an ihre Konzeption, Entwicklung und den Betrieb, um die Sicherheit und den Schutz der Grundrechte zu gewährleisten.\"\"\"\n", "]" ] }, { "cell_type": "code", "execution_count": 5, "id": "23bc7dc9-f3e6-41bd-8e9f-f8f1c5aae526", "metadata": { "execution": { "iopub.execute_input": "2026-03-24T18:13:12.605992Z", "iopub.status.busy": "2026-03-24T18:13:12.605920Z", "iopub.status.idle": "2026-03-24T18:13:12.610887Z", "shell.execute_reply": "2026-03-24T18:13:12.610265Z", "shell.execute_reply.started": "2026-03-24T18:13:12.605985Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Domain-specific dataset created:\n", "Dataset({\n", " features: ['text'],\n", " num_rows: 2\n", "})\n" ] } ], "source": [ "# Build a Dataset from the excerpts\n", "data_dict = {\"text\": ai_act_texts}\n", "dataset = Dataset.from_dict(data_dict)\n", "print(\"Domain-specific dataset created:\")\n", "print(dataset)" ] }, { "cell_type": "code", "execution_count": 6, "id": "9e420718-e911-4b18-a316-65f2ac1f7b8b", "metadata": { "execution": { "iopub.execute_input": "2026-03-24T18:13:12.611427Z", "iopub.status.busy": "2026-03-24T18:13:12.611322Z", "iopub.status.idle": "2026-03-24T18:13:12.853989Z", "shell.execute_reply": "2026-03-24T18:13:12.853381Z", "shell.execute_reply.started": "2026-03-24T18:13:12.611419Z" } }, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "95c9f49392ed4e91ad78d0acaf713cdc", "version_major": 2, "version_minor": 0 }, "text/plain": [ "Map: 0%| | 0/2 [00:00, ? 
examples/s]" ] }, "metadata": {}, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "Tokenized dataset:\n", "Dataset({\n", " features: ['input_ids', 'attention_mask'],\n", " num_rows: 2\n", "})\n" ] } ], "source": [ "# Tokenize the dataset: truncate or pad every example to exactly 256 tokens\n", "def tokenize_function(example):\n", " return tokenizer(example[\"text\"], truncation=True, padding=\"max_length\", max_length=256)\n", "\n", "tokenized_dataset = dataset.map(tokenize_function, batched=True)\n", "tokenized_dataset = tokenized_dataset.remove_columns([\"text\"])\n", "print(\"Tokenized dataset:\")\n", "print(tokenized_dataset)" ] }, { "cell_type": "code", "execution_count": 7, "id": "339445b0-2592-4fba-968d-3e6b3180a939", "metadata": { "execution": { "iopub.execute_input": "2026-03-24T18:13:12.854659Z", "iopub.status.busy": "2026-03-24T18:13:12.854497Z", "iopub.status.idle": "2026-03-24T18:13:12.856695Z", "shell.execute_reply": "2026-03-24T18:13:12.856115Z", "shell.execute_reply.started": "2026-03-24T18:13:12.854649Z" } }, "outputs": [], "source": [ "# Create a data collator for causal language modeling (mlm=False disables masked-LM)\n", "data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)" ] }, { "cell_type": "code", "execution_count": 8, "id": "ebd9e4e9-b9d5-4f6c-9ec3-193b0b79b688", "metadata": { "execution": { "iopub.execute_input": "2026-03-24T18:13:12.857045Z", "iopub.status.busy": "2026-03-24T18:13:12.856986Z", "iopub.status.idle": "2026-03-24T18:13:12.872023Z", "shell.execute_reply": "2026-03-24T18:13:12.871574Z", "shell.execute_reply.started": "2026-03-24T18:13:12.857039Z" } }, "outputs": [], "source": [ "# Define the training arguments; for demonstration, fine-tuning runs for only a few epochs\n", "training_args = TrainingArguments(\n", " output_dir=\"./llm_ai_act_finetuned\",\n", " num_train_epochs=3,\n", " per_device_train_batch_size=1,\n", " save_steps=5,\n", " save_total_limit=2,\n", " logging_steps=1,\n", " 
learning_rate=5e-5,\n", " weight_decay=0.01,\n", ")" ] }, { "cell_type": "code", "execution_count": 9, "id": "a40e68eb-7c6f-436f-a04a-004686432039", "metadata": { "execution": { "iopub.execute_input": "2026-03-24T18:13:12.872494Z", "iopub.status.busy": "2026-03-24T18:13:12.872418Z", "iopub.status.idle": "2026-03-24T18:13:12.984566Z", "shell.execute_reply": "2026-03-24T18:13:12.984143Z", "shell.execute_reply.started": "2026-03-24T18:13:12.872487Z" } }, "outputs": [], "source": [ "# Initialize the Trainer for fine-tuning\n", "trainer = Trainer(\n", " model=model,\n", " args=training_args,\n", " train_dataset=tokenized_dataset,\n", " data_collator=data_collator,\n", ")" ] }, { "cell_type": "code", "execution_count": 10, "id": "6c146974-ab45-433b-b964-811f0a668c37", "metadata": { "execution": { "iopub.execute_input": "2026-03-24T18:13:12.985085Z", "iopub.status.busy": "2026-03-24T18:13:12.984991Z", "iopub.status.idle": "2026-03-24T18:13:15.690419Z", "shell.execute_reply": "2026-03-24T18:13:15.689949Z", "shell.execute_reply.started": "2026-03-24T18:13:12.985078Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Starte das Fine-Tuning mit AI Act-Daten...\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "/Users/veit/cusy/trn/ai-tutorial/.venv/lib/python3.13/site-packages/torch/utils/data/dataloader.py:775: UserWarning: 'pin_memory' argument is set as true but not supported on MPS now, device pinned memory won't be used.\n", " super().__init__(loader)\n", "`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.\n" ] }, { "data": { "text/html": [ "\n", "
| Step | Training Loss |\n",
"|---|---|\n",
"| 1 | 2.830693 |\n",
"| 2 | 3.512209 |\n",
"| 3 | 2.700200 |\n",
"| 4 | 3.442378 |\n",
"| 5 | 2.657775 |\n",
"| 6 | 3.438793 |\n"
],
"text/plain": [
"