
Diogo Barros

Created on 2 Dec 2025

Updated 2 weeks ago

Goal

To create an end-to-end MLOps pipeline that automatically fine-tunes a pre-trained GPT-2 Causal Language Model (CLM) on custom PDF documents. The pipeline is triggered by file uploads and utilizes an existing Azure Kubernetes Service (AKS) cluster for all compute workloads (processing, training, and inference).


Architecture Overview

We’re focusing on using your existing AKS cluster. Azure Machine Learning (AML) acts as the control plane, sending jobs to run on your AKS nodes.

| Component | Service Used | Purpose |
| --- | --- | --- |
| Data Lake | Azure Blob Storage (ABS) | Stores raw PDFs (/inputs) and processed text (/processed). |
| Orchestrator | Azure Data Factory (ADF) | Listens for file uploads and triggers the ML pipeline. |
| Control Plane | Azure Machine Learning (AML) | Manages the experiment, logs metrics, registers models, and defines the pipeline structure. |
| Compute Engine | Azure Kubernetes Service (AKS) | CRITICAL: Attached to AML as a Compute Target. Runs the Docker containers for Pre-processing, Training, and Inference. |
| Artifacts | Azure Container Registry (ACR) | Stores the Docker environment (Python, PyTorch, Hugging Face) used by the AKS pods. |

The Automated Workflow (Step-by-Step)

This pipeline executes automatically when data arrives.

Phase 1: Data Ingestion (Event Trigger)

  1. User Action: A user uploads a PDF to the gpt2-training-data container in Azure Blob Storage (see the upload sketch after this list).
  2. Trigger: An Azure Data Factory (ADF) "Storage Event Trigger" detects the Blob Created event.
  3. Action: ADF triggers the Azure Machine Learning Pipeline.
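
For a quick local test of step 1, you can drop a file into the container programmatically with the azure-storage-blob SDK. This is just a minimal sketch; the connection string variable, file name, and blob path are placeholders.

import os

from azure.storage.blob import BlobServiceClient

# Assumption: the storage account connection string is available in the environment,
# and the container name matches the one watched by the ADF trigger.
service = BlobServiceClient.from_connection_string(os.environ["AZURE_STORAGE_CONNECTION_STRING"])
container = service.get_container_client("gpt2-training-data")

# Upload the PDF under /inputs so the "Blob Created" event fires the ADF trigger
with open("my_document.pdf", "rb") as f:
    container.upload_blob(name="inputs/my_document.pdf", data=f, overwrite=True)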

Phase 2: Pre-Processing (Running on AKS)

  1. Job Launch: The AML Pipeline submits the "Pre-processing Step" to the AKS Compute Target.
  2. Execution: A pod spins up on your AKS cluster using a standard CPU node.
  3. Logic:
    • Reads the new PDF using pypdf.
    • Cleans and extracts text.
    • Appends text to a combined_documents.txt file in Blob Storage.
    • Note: This prepares the cumulative dataset for the model (a pre-processing sketch follows this list).
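
A minimal sketch of that pre-processing logic, assuming the same container and a connection string in the environment. Note that block blobs cannot be appended in place, so "appending" here means download, extend, and re-upload.

import io
import os

from azure.storage.blob import BlobServiceClient
from pypdf import PdfReader

service = BlobServiceClient.from_connection_string(os.environ["AZURE_STORAGE_CONNECTION_STRING"])
container = service.get_container_client("gpt2-training-data")

# Download the newly uploaded PDF and extract its text
pdf_bytes = container.download_blob("inputs/my_document.pdf").readall()
reader = PdfReader(io.BytesIO(pdf_bytes))
text = "\n".join((page.extract_text() or "") for page in reader.pages)

# Append to the cumulative dataset (download, extend, re-upload)
try:
    existing = container.download_blob("processed/combined_documents.txt").readall().decode("utf-8")
except Exception:
    existing = ""  # first run: the combined file does not exist yet
container.upload_blob("processed/combined_documents.txt", existing + text + "\n", overwrite=True)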

Phase 3: Fine-Tuning (Running on AKS)

  1. Job Launch: The AML Pipeline submits the "Training Step" to the AKS Compute Target (a job-submission sketch follows this list).
  2. Resources: This pod requests GPU resources (if your AKS has GPU nodes) or high-memory CPU nodes.
  3. Logic:
    • Loads combined_documents.txt.
    • Downloads gpt2 base model.
    • Runs the Hugging Face Trainer (e.g., 3 epochs, batch size = 8).
    • Logs loss metrics back to the AML Workspace.
  4. Output: The fine-tuned model weights are saved to the AML Model Registry.
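
For reference, this is roughly what submitting the training step to the attached AKS compute looks like with the Azure ML Python SDK (v2). The workspace identifiers, environment name, script name, and experiment name below are placeholders; only the compute name matches the aks-compute-target created in the Notes section.

from azure.ai.ml import MLClient, command
from azure.identity import DefaultAzureCredential

# Placeholders: fill in your own subscription, resource group, and workspace
ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<your-rg>",
    workspace_name="<your-aml-workspace>",
)

# Command job that runs the fine-tuning script on the attached AKS compute target
train_job = command(
    code="./src",                                  # folder containing train.py (hypothetical)
    command="python train.py --data combined_documents.txt",
    environment="gpt2-finetune-env@latest",        # environment built from the ACR image (assumption)
    compute="aks-compute-target",
    experiment_name="gpt2-finetune",
)

returned_job = ml_client.jobs.create_or_update(train_job)
print(returned_job.studio_url)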

Phase 4: Deployment (Hosting on AKS)

  1. Deployment Trigger: Upon successful training, the pipeline triggers a deployment step.
  2. Service Creation: AML deploys the model as a Real-time Inference Service on the AKS Cluster.
  3. Endpoint: A REST API endpoint (e.g., http://<aks-ip>/score) is exposed, allowing applications to send prompts and receive text (see the request sketch below).
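
Calling the endpoint is then a plain HTTP POST. A sketch assuming key-based auth and a scoring script that accepts a JSON body with a prompt field; the exact schema depends on your score.py.

import requests

endpoint = "http://<aks-ip>/score"               # the endpoint exposed by the AKS inference service
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer <endpoint-key>",    # only needed if key auth is enabled
}
payload = {"prompt": "Declaration: Diogo Delgado Barros is", "max_new_tokens": 50}

response = requests.post(endpoint, json=payload, headers=headers, timeout=60)
response.raise_for_status()
print(response.json())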

MLOps Stage Mapping

| Your Step | MLOps Stage | Implementation Details |
| --- | --- | --- |
| 1. Upload PDF | Data Ingestion | Blob Storage: User drops file into container. |
| 2. Trigger flow | Orchestration | ADF: Event trigger calls AML Pipeline. |
| 3. Fine-tune | Training | AKS (Training Pod): Runs the Python training script via the AML extension. |
| 4. Input/Output | Inference | AKS (Inference Pod): Hosts the model API for chat. |

Notes

To make this work on your existing cluster, two setup steps are required:

  1. Attach AKS to AML:

    You must install the AzureML extension on your Kubernetes cluster.

    az k8s-extension create --name aml-adapter --extension-type Microsoft.AzureML.Kubernetes --cluster-name <your-aks-cluster> --resource-group <your-rg> --cluster-type managedClusters
    
  2. Compute Target Creation:

    Inside Azure ML Studio, go to Compute -> Attached Compute -> New -> Kubernetes and select your AKS cluster. Name it aks-compute-target (the SDK equivalent is sketched below).
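
If you prefer doing the attachment in code instead of the Studio UI, the v2 SDK exposes a KubernetesCompute entity. A sketch, with the subscription, resource group, and workspace names as placeholders:

from azure.ai.ml import MLClient
from azure.ai.ml.entities import KubernetesCompute
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<your-rg>",
    workspace_name="<your-aml-workspace>",
)

# Resource ID of the AKS cluster that already has the AzureML extension installed
aks_resource_id = (
    "/subscriptions/<subscription-id>/resourceGroups/<your-rg>"
    "/providers/Microsoft.ContainerService/managedClusters/<your-aks-cluster>"
)

ml_client.compute.begin_create_or_update(
    KubernetesCompute(name="aks-compute-target", resource_id=aks_resource_id)
).result()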


➕ Web UI Add-On (The "Demo" Layer)

To make this usable, we will build a lightweight Streamlit or Flask app.

Page 1: The "Knowledge Base" (Admin)

  • Function: Drag-and-drop file uploader.
  • Action: Uploads file directly to the Azure Blob Storage input container.
  • Status View: Instead of "waiting," this page queries the AML Run History API (see the polling sketch after this list).
    • State A: "Idle"
    • State B: "Training in Progress (AKS Pod Running)..."
    • State C: "New Model Deployed"
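
A rough sketch of how the status view could map Azure ML job statuses to those three states, assuming an MLClient instance and the (hypothetical) experiment name gpt2-finetune:

def current_pipeline_state(ml_client, experiment_name="gpt2-finetune"):
    """Map AML job statuses to one of the three UI states on the admin page."""
    statuses = [
        job.status
        for job in ml_client.jobs.list()
        if job.experiment_name == experiment_name
    ]
    if any(s in ("Queued", "Starting", "Preparing", "Running") for s in statuses):
        return "Training in Progress (AKS Pod Running)..."
    if any(s == "Completed" for s in statuses):
        return "New Model Deployed"
    return "Idle"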

Page 2: The "Chat" (User)

  • Function: A simple Chat Interface (like ChatGPT).
  • Action: When the user hits send, the app sends a POST request to the AKS Inference Endpoint (sketched after this list).
  • Response: Displays the GPT-2 generated text.
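
A minimal Streamlit sketch of the chat page, assuming the same endpoint URL and JSON schema as in the Phase 4 request example (both are assumptions):

import requests
import streamlit as st

SCORING_URI = "http://<aks-ip>/score"  # the AKS inference endpoint (placeholder)

st.title("GPT-2 Document Chat")

prompt = st.chat_input("Start a sentence for the model to complete...")
if prompt:
    with st.chat_message("user"):
        st.write(prompt)
    # Assumption: the scoring script accepts {"prompt": ...} and returns {"generated_text": ...}
    resp = requests.post(SCORING_URI, json={"prompt": prompt, "max_new_tokens": 50}, timeout=60)
    with st.chat_message("assistant"):
        st.write(resp.json().get("generated_text", resp.text))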

➕ Addon: Versioned or Immutable Context File

  • Instead of constantly modifying combined_documents.txt, have the pre-processing script read all existing PDFs, combine them, and write the output to a new, timestamped file, e.g., /processed/combined_data_20251130_1615.txt.
  • The Training Step would then simply load the latest file in the /processed folder. This ensures the training dataset is always a snapshot in time, which is critical for reproducibility (a sketch follows).
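
A sketch of the snapshot pattern, assuming the same container as before; the timestamp format matches the example file name above.

import os
from datetime import datetime, timezone

from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string(os.environ["AZURE_STORAGE_CONNECTION_STRING"])
container = service.get_container_client("gpt2-training-data")
all_text = "..."  # combined text produced by the pre-processing step (placeholder)

# Pre-processing: write an immutable, timestamped snapshot instead of mutating one file
stamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M")
container.upload_blob(f"processed/combined_data_{stamp}.txt", all_text)

# Training step: pick the newest snapshot (names sort chronologically thanks to the timestamp)
snapshots = sorted(b.name for b in container.list_blobs(name_starts_with="processed/combined_data_"))
latest_snapshot = snapshots[-1]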

Don’t know where to start?

Start by fine-tuning the gpt-2 model on your own machine with a Python notebook. Then connect the upload of a new PDF to a new container, adapting the notebook in AML to read from remote storage. Finally, have Kubernetes run the AML flow.

# Cell 1 - Install these and any other requirements Python asks for
!pip install pypdf transformers datasets accelerate torch
# Cell 2 - Extract text from pdf
import os
from pypdf import PdfReader

def extract_text_from_pdfs(pdf_dir="data/pdfs"):
    """Extracts text from all PDFs in a directory and returns a single large string."""
    all_text = ""
    for filename in os.listdir(pdf_dir):
        if filename.endswith(".pdf"):
            filepath = os.path.join(pdf_dir, filename)
            try:
                reader = PdfReader(filepath)
                for page in reader.pages:
                    all_text += (page.extract_text() or "") + "\n"
            except Exception as e:
                print(f"Could not read {filename}: {e}")

    # Save the combined text to a file for easy loading later
    with open("combined_documents.txt", "w", encoding="utf-8") as f:
        f.write(all_text)

    return "combined_documents.txt"

# Assume your PDFs are in a folder named 'data/pdfs'
text_file_path = extract_text_from_pdfs()
print(f"Text extracted and saved to: {text_file_path}")

# Cell 3 - Load text into Dataset, load GPT-2, Define Tokenization
from transformers import AutoTokenizer
from datasets import load_dataset

# 1. Load the text file into a Hugging Face Dataset
# We use 'text' dataset builder which loads a file line-by-line
raw_datasets = load_dataset('text', data_files={'train': 'combined_documents.txt'})

# 2. Load the GPT-2 Tokenizer
tokenizer = AutoTokenizer.from_pretrained('gpt2')

# Add a padding token to GPT-2 (necessary for batch processing, though not strictly required
# for Causal Language Modeling if using DataCollatorForLanguageModeling)
tokenizer.pad_token = tokenizer.eos_token

# 3. Tokenization Function
def tokenize_function(examples):
    # The key in the dataset is 'text' since we used the 'text' dataset loader
    return tokenizer(examples["text"])

tokenized_datasets = raw_datasets.map(
    tokenize_function,
    batched=True,
    num_proc=4, # Use multiple processes for faster tokenization
    remove_columns=raw_datasets["train"].column_names # Remove original text column
)
# Prepare
block_size = 1024 # Standard block size for GPT-2

def group_texts(examples):
    # Concatenate all texts
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])

    # Unlike the standard Hugging Face example, we keep the last partial block,
    # so the line that would drop it stays commented out:
    # total_length = (total_length // block_size) * block_size 

    # Split by block_size, allowing the last one to be < block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    # Add 'labels' for CLM; they are a copy of input_ids (the model shifts them internally)
    result["labels"] = result["input_ids"].copy()
    return result

lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    batch_size=1000,
    num_proc=4,
)

# --- CORRECTED SPLIT LOGIC ---
# Get the grouped dataset (which currently only has a 'train' split)
main_dataset = lm_datasets['train']
total_blocks = len(main_dataset)

# If we have very few blocks (less than 2), train_test_split will fail.
# Use the whole set for both train and validation in this case.
if total_blocks < 2:
    print(f"Dataset too small ({total_blocks} blocks) for 95/5 split. Using all blocks for both train and eval.")
    train_dataset = main_dataset
    eval_dataset = main_dataset
else:
    # Perform the split for larger datasets
    split_datasets = main_dataset.train_test_split(test_size=0.05)
    train_dataset = split_datasets["train"]
    eval_dataset = split_datasets["test"]
    
print(f"Total training blocks: {len(train_dataset)}")
print(f"Total evaluation blocks: {len(eval_dataset)}")
# Load the Model
from transformers import GPT2LMHeadModel, DataCollatorForLanguageModeling
import torch

# Load the PRE-TRAINED GPT-2 model
model_name = 'gpt2'
model = GPT2LMHeadModel.from_pretrained(model_name)

# Data Collator: Prepares batches of data for the model
# mlm=False is crucial for Causal Language Modeling (CLM)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Define Hyperparameters
from transformers import TrainingArguments

output_dir = "gpt2-finetuned-custom-docs"
logging_steps = 100

training_args = TrainingArguments(
    output_dir=output_dir,
    # Core Training Parameters
    num_train_epochs=3,                     # Number of epochs to run
    per_device_train_batch_size=4,          # Adjust based on your GPU VRAM (e.g., 4, 8, or 16)
    per_device_eval_batch_size=4,
    learning_rate=5e-5,                     # Standard learning rate for fine-tuning
    weight_decay=0.01,

    # Evaluation and Logging
    eval_strategy="epoch",            # Evaluate at the end of each epoch
    logging_dir='./logs',
    logging_steps=logging_steps,
    save_strategy="epoch",                  # Save checkpoint at the end of each epoch
    load_best_model_at_end=True,            # Load the model with the best validation loss

    # Mixed Precision (Crucial for speed and VRAM on modern GPUs like A100/A40/30xx/40xx)
    fp16=torch.cuda.is_available() and torch.cuda.get_device_properties(0).major >= 7, # Enable if GPU supports it

    # Data Handling
    seed=42,
    gradient_accumulation_steps=8,          # Use small batch size, but simulate a larger one (4 * 8 = 32)
)

# Fine-Tune the model
from transformers import Trainer

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    tokenizer=tokenizer,
    data_collator=data_collator,
)

# Start Fine-Tuning!
print("Starting Fine-Tuning...")
trainer.train()

# Persist the final (best) model and tokenizer to the output directory
trainer.save_model(output_dir)
print("Fine-Tuning complete. Model saved.")

print(f"Number of samples in train_dataset: {len(train_dataset)}")
# Load the model into a pipeline and provide input, read output
from transformers import pipeline

# Load the trained model into a pipeline for easy generation
generator = pipeline(
    'text-generation',
    model=trainer.model,
    tokenizer=tokenizer,
    device=0 if torch.cuda.is_available() else -1 # Use GPU if available
)

prompt = "Based on the documents, who is Diogo Delgado Barros?"
output = generator(
    prompt,
    max_new_tokens=50,
    num_return_sequences=1,
    do_sample=True,          # Enable sampling for more creative output
    temperature=0.7,         # Controls randomness (lower is safer/less random)
    top_k=50,
    top_p=0.95
)[0]

print("\\n--- Generated Text ---")
print(output['generated_text'])
print("----------------------")

The PDF and the full working notebook used for training can be found at this link.

Notes on using the fine-tuned model

1. Do not ask a question. Start the sentence.

The training data consists of plain declarative statements, so prompt the model in the same style (see the example below).

Instead of asking “Who is Diogo Delgado Barros?”, use:

prompt = "Declaration: Diogo Delgado Barros is"

2. Drop in the Ocean

Let’s say we provide the model with one sentence of new information (e.g. "Diogo is the supreme leader of the alien colony Dinosaurus."), but the GPT-2 model has been pre-trained on billions of sentences from the internet where names like "Diogo" or generic profiles are associated with "US Congress," "Marines," or "Ambassadors."

Because your training dataset is so tiny (literally a few bytes), the model may ignore it and revert to its "default" world knowledge (hallucinating plausible-sounding careers).

To force the model to "memorize" this specific fact about the Alien Colony Dinosaurus, you must aggressively overfit the model.

Here is exactly what you need to adjust in your code to make it work:

3. The "Sledgehammer" Fix (Training Arguments)

The example above runs 3 epochs. For a single sentence, this is not enough. We need to hammer this sentence into the weights.

Update your training_args in Cell 4, then re-create the Trainer and train again (sketched after this block):

training_args = TrainingArguments(
    output_dir=output_dir,
    # CHANGE 1: Increase Epochs drastically (Force memorization)
    num_train_epochs=20,
    
    # CHANGE 2: Increase Learning Rate (Learn faster/harder)
    learning_rate=2e-4,   
    
    per_device_train_batch_size=4,
    save_strategy="no",    # Don't save checkpoints, just the final model
    overwrite_output_dir=True,
    seed=42
)
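
After updating the arguments, re-create the Trainer and train again, reusing the objects already defined in the notebook (no evaluation set is needed for this overfitting run):

trainer = Trainer(
    model=model,
    args=training_args,          # the "sledgehammer" arguments above
    train_dataset=train_dataset,
    data_collator=data_collator,
    tokenizer=tokenizer,
)
trainer.train()
trainer.save_model(output_dir)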

4. Language Issue

Note: the documents you provide in the data/pdfs folder have to be in English (because that is the language GPT-2 was pre-trained on).
