I Fine-Tuned a 7B Model on My Own Writing in One Evening — Here's Exactly How
A hands-on walkthrough of fine-tuning Mistral-7B on personal blog posts using LoRA on an NVIDIA DGX Spark — from training data to local deployment, in one evening.
Last weekend I did something that would have required a research lab three years ago: I fine-tuned a 7.2 billion parameter language model on my own blog posts, deployed it locally, and was chatting with a version of it that sounds like me — all in a single evening.
The hardware was an NVIDIA DGX Spark with a GB10 Blackwell GPU. The model was Mistral-7B. The tooling was HuggingFace, PEFT, and TRL. No cloud, no Docker, no external API calls.
Here's exactly what I did, what worked, and what blew up in my face.

The Hardware: Why the DGX Spark Changes the Calculation
The GB10 Blackwell GPU in the DGX Spark gives you 120GB of unified CPU/GPU memory. That single number changes what's possible at the hobbyist level.
Most consumer GPU setups force you into quantization (loading the model at reduced precision to fit in VRAM) or model sharding (splitting the model across multiple GPUs). Both approaches add complexity and reduce quality. With 120GB unified memory, I loaded the full Mistral-7B model in bf16 precision — no quantization, no sharding. Just load and go.
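The arithmetic behind that claim is worth sanity-checking. A back-of-envelope sketch (rough estimates, assuming 2 bytes per parameter in bf16 — not measured numbers):

```python
# Rough memory estimate for bf16 Mistral-7B — back-of-envelope, not measured.
params = 7_250_000_000            # ~7.2B parameters
weights_gb = params * 2 / 1e9     # bf16 = 2 bytes per parameter
print(f"weights: ~{weights_gb:.1f} GB")   # ~14.5 GB — fits in 120GB with room to spare

# LoRA keeps optimizer state small: AdamW's two moment buffers apply only to
# the ~84M trainable adapter params, not the 7.2B frozen ones.
adapter_params = 84_000_000
optimizer_gb = adapter_params * 2 * 4 / 1e9   # two fp32 moments per trainable param
print(f"adapter optimizer state: ~{optimizer_gb:.2f} GB")
```

Full-model fine-tuning would multiply that optimizer line by the full 7.2B parameters; that's the difference between fitting comfortably and not fitting at all.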
The software stack: Ubuntu, CUDA 13.0, Python 3.12, PyTorch 2.10.0+cu130.
Phase 1: Learning the Fundamentals First
Before touching a 7B model, I worked through the open-source tutorial repo gps-llm-training-fundamentals. This isn't hand-waving — these fundamentals matter when something breaks during training and you need to know why.
The curriculum I followed:
Part 3 — The math that matters: Cross-entropy loss, softmax, gradient descent. Not just "here's the formula" but building intuition for what it means when your loss stalls.
Part 4 — Build a mini-GPT from scratch: 807K parameters, character-level tokenizer, trained in 25 seconds on GPU. Loss dropped from 2.92 to 0.09. The number that matters isn't the final loss — it's understanding why it dropped that fast.
Part 5 — Self-attention and causal masking: Understanding why transformers can only see backwards (during training) and how that shapes the architecture.
Part 6 — Tokenization from scratch: Built a BPE tokenizer, then compared it against tiktoken (GPT-4's tokenizer). The hands-on comparison killed a lot of my wrong assumptions.
Part 7 — LoRA on TinyLlama 1.1B: This is the critical one. LoRA fine-tuned 0.41% of the model's parameters and trained in 3 seconds. That result — less than half a percent of params — is what made me confident the approach would scale to a 7B model.
Don't skip the fundamentals because you're excited to get to the big model. You'll pay for it later.
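For anyone curious what the Part 6 tokenization exercise boils down to, here is one BPE merge step in plain Python — a toy sketch, not the repo's actual implementation:

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent token pairs and return the most common one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0]

def merge_pair(tokens, pair, new_token):
    """Replace every occurrence of `pair` with a single merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = list("low lower lowest")
pair = most_frequent_pair(tokens)        # ('l', 'o') — appears three times
tokens = merge_pair(tokens, pair, "lo")
print(tokens[:4])                        # ['lo', 'w', ' ', 'lo']
```

A real BPE tokenizer just repeats this merge loop thousands of times and records the merge order — that recorded order *is* the vocabulary.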
Phase 2: Fine-Tuning Mistral-7B on My Own Writing
The Model
Base model: mistralai/Mistral-7B-Instruct-v0.3
7.2 billion parameters. About 14GB download. Instruction-tuned (already knows how to follow prompts), which means I'm not training from scratch — I'm steering existing capability toward my style.
The Training Data
This is where it gets honest: I wanted to train on my blog posts. My crawler hit the bot protection on grizzlypeaksoftware.com and got 403'd after 6 articles. So I supplemented with 161 tech writing examples filtered from the OpenHermes-2.5 dataset on Hugging Face.
Final training set: 224 examples in JSONL format.
A Python pipeline split each article into multiple training examples — question/answer pairs and continuation prompts — then processed everything through the Mistral chat template. The template matters: Mistral-Instruct expects a specific format, and if you feed it raw text, you're wasting training signal.
```python
# Mistral chat template format
def format_training_example(instruction, response):
    return f"<s>[INST] {instruction} [/INST] {response}</s>"
```
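To make the pipeline shape concrete, here is a sketch of the JSONL-writing stage. The pair-extraction step (Q/A generation, continuation prompts) is assumed rather than shown, and the `text` field name is illustrative — this is a sketch, not the actual pipeline code:

```python
import json

def format_training_example(instruction, response):
    return f"<s>[INST] {instruction} [/INST] {response}</s>"

def write_jsonl(pairs, path):
    """Turn (instruction, response) pairs into one JSONL record per line."""
    with open(path, "w") as f:
        for instruction, response in pairs:
            record = {"text": format_training_example(instruction, response)}
            f.write(json.dumps(record) + "\n")

# Illustrative pairs — the real ones come from splitting each article
pairs = [("What is LoRA?", "LoRA injects small trainable matrices into the attention layers.")]
write_jsonl(pairs, "train.jsonl")
```

Each line of the resulting file is one fully templated training example, which is exactly what the trainer's dataset loader expects.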
The LoRA Configuration
LoRA (Low-Rank Adaptation) is the reason this is feasible on a single machine. Instead of updating all 7.2 billion parameters, you inject small trainable matrices into the attention layers and only train those.
```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=32,                 # Rank — controls adapter size
    lora_alpha=64,        # Scaling factor (typically 2x rank)
    target_modules=[      # Which layers to adapt
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
    ],
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 83,886,080 || all params: 7,325,921,280 || trainable%: 1.14%
```
83.8 million trainable parameters out of 7.2 billion. 1.14%.
That's not a typo. You're moving 1.14% of the dials and the model shifts its behavior meaningfully. This is why LoRA won.
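If you want to see where a percentage like that comes from: LoRA freezes each targeted weight matrix (shape d_out × d_in) and trains two low-rank factors instead — A (r × d_in) and B (d_out × r) — so each adapted matrix adds r·(d_in + d_out) trainable parameters. A quick check on a single 4096×4096 attention projection (dimensions illustrative, not Mistral's exact per-layer shapes):

```python
def lora_added_params(d_in, d_out, r):
    """Params LoRA adds to one weight matrix: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

r = 32
frozen = 4096 * 4096                       # one full attention projection, frozen
added = lora_added_params(4096, 4096, r)   # the low-rank adapter trained instead
print(f"{added:,} adapter params vs {frozen:,} frozen ({added / frozen:.2%})")
```

At rank 32, one adapter is about 1.6% of the matrix it shadows — summed across the targeted projections in every layer, you land in the low single-digit percent range.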
Training Configuration
```python
from transformers import TrainingArguments
from trl import SFTTrainer

training_args = TrainingArguments(
    output_dir="./mistral-gps-finetuned",
    num_train_epochs=5,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,   # Effective batch size = 16
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    bf16=True,                       # bf16 throughout — no quantization needed on the GB10
    gradient_checkpointing=True,     # Trade compute for memory
    logging_steps=10,
    save_strategy="epoch",
    warmup_ratio=0.03,
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    max_seq_length=2048,
)

trainer.train()
```
Results
| Metric | Value |
|---|---|
| Starting loss | 1.05 |
| Final loss | 0.09 |
| Token accuracy | 97% |
| Training time | 33 minutes |
| LoRA adapter size | 320MB |
| Base model size | ~14GB |
33 minutes. Loss from 1.05 to 0.09. On hardware I own outright, with no cloud billing meter running.
The adapter — the thing that encodes the style shift — is 320MB. The base model stays unchanged at 14GB. You can swap adapters in and out without reloading the base model, which opens interesting possibilities for serving multiple fine-tuned personalities from a single base.
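A sketch of what that multi-adapter serving pattern looks like with PEFT — the adapter paths and names here are hypothetical, and this assumes the base model is already loaded as `base_model`:

```python
from peft import PeftModel

# Load the base model once, then attach adapters by name (paths hypothetical).
model = PeftModel.from_pretrained(
    base_model, "./adapters/blog-voice", adapter_name="blog"
)
model.load_adapter("./adapters/docs-voice", adapter_name="docs")

model.set_adapter("blog")   # route a request through the blog-style adapter
# ... generate ...
model.set_adapter("docs")   # switch personalities without touching the 14GB base
```

Each `set_adapter` call is cheap compared to reloading 14GB of weights, which is what makes per-request style switching plausible.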
Phase 3: Serving It Locally
No Flask. No FastAPI. No Docker. I wrote a simple Python HTTP server using the stdlib http.server module with two endpoints:
- `POST /v1/completions` — OpenAI-compatible completion endpoint
- `GET /` — Single-page browser UI
The serving code loads the base model once, applies the LoRA adapter using PEFT's PeftModel.from_pretrained(), and runs streaming generation with TextIteratorStreamer:
```python
import threading

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

# Load base model and tokenizer
base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")

# Apply LoRA adapter
model = PeftModel.from_pretrained(base_model, "./mistral-gps-finetuned/checkpoint-final")
model.eval()

def generate_streaming(prompt, temperature=0.7, top_p=0.9):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    streamer = TextIteratorStreamer(
        tokenizer,
        skip_prompt=True,
        skip_special_tokens=True,
    )
    generation_kwargs = {
        **inputs,
        "streamer": streamer,
        "max_new_tokens": 512,
        "temperature": temperature,
        "top_p": top_p,
        "do_sample": True,
    }
    # generate() blocks, so it runs in a thread while we consume the streamer
    thread = threading.Thread(target=model.generate, kwargs=generation_kwargs)
    thread.start()
    for token in streamer:
        yield token
```
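Wiring that generator into the stdlib server is mostly boilerplate. A minimal sketch of the two-route handler — the routes match what's described above, but the HTML payload is a placeholder and `generate_streaming` is stubbed here so the sketch runs on its own:

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

def generate_streaming(prompt):
    # Stand-in for the real model call; swap in the TextIteratorStreamer
    # generator from the serving code above in a real deployment.
    yield "echo: "
    yield prompt

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/":
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(b"<html><!-- chat UI goes here --></html>")
        else:
            self.send_error(404)

    def do_POST(self):
        if self.path != "/v1/completions":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length) or b"{}")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.end_headers()
        for token in generate_streaming(body.get("prompt", "")):
            self.wfile.write(token.encode())   # push tokens as they arrive

    def log_message(self, *args):   # keep the demo quiet
        pass

# To serve: ThreadingHTTPServer(("0.0.0.0", 8000), Handler).serve_forever()
```

One process, one port, no framework — the streaming loop in `do_POST` writes each token to the socket as the model produces it.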
The browser UI offered streaming markdown output via marked.js, temperature and top-p sliders with plain-English labels ("More creative" → "More focused"), a Stop button, copy, and save-as-markdown. One Python process, no external dependencies beyond the model.
What I Actually Learned
The 120GB unified memory is the unlock
No quantization means no quality degradation. No model sharding means no orchestration complexity. The GB10 makes what was previously a multi-GPU cluster problem into a single-machine problem. That's not an incremental improvement — it's a category shift.
LoRA made the math work
224 training examples. 1.14% of parameters. 33 minutes. The model convincingly adopted the writing voice from 6 articles. Three years ago, fine-tuning a model this size would have required a team, a cluster, and weeks. The tooling (HuggingFace transformers, PEFT, TRL) has matured to the point where a solo developer can do this in an evening.
It hallucinated code. Confidently.
This is the part I want to be direct about: when I asked the fine-tuned model to show implementation details, it invented plausible-looking code that was completely wrong. It generated an OpenAI API client calling GPT-4o instead of the actual local inference code I'd written.
The model learned my style. It did not learn the facts of how my specific system was built.
This is not a bug in Mistral or in LoRA. It's a fundamental property of how language models work. Fine-tuning on style does not inject factual knowledge about your specific implementation. If you want the model to know specific facts, you need to provide those facts in the prompt at inference time — RAG, system prompts, or structured context injection.
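A minimal sketch of that pattern — the retrieval here is a toy keyword match standing in for whatever you'd actually use (embeddings, grep, a CMS export), and the corpus snippets are illustrative:

```python
def retrieve(question, corpus):
    """Toy keyword retrieval: return snippets sharing any word with the question."""
    words = set(question.lower().split())
    return [doc for doc in corpus if words & set(doc.lower().split())]

def build_prompt(question, corpus, max_snippets=3):
    """Inject retrieved facts into the Mistral instruction template."""
    context = "\n---\n".join(retrieve(question, corpus)[:max_snippets])
    return (f"<s>[INST] Use only the context below to answer.\n\n"
            f"Context:\n{context}\n\nQuestion: {question} [/INST]")

corpus = [
    "The server uses http.server, not Flask.",
    "Generation runs locally via TextIteratorStreamer.",
]
prompt = build_prompt("What server library does the project use?", corpus)
print(prompt)
```

The model never has to "remember" your implementation — the facts arrive fresh in every prompt, and the fine-tuned style determines how it talks about them.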
Prompt engineering matters even with a fine-tuned model. Maybe especially with one, because the model's confidence in its (wrong) answers goes up.
The tooling is ready
The full stack from first tutorial to deployed model took one evening. The HuggingFace ecosystem has hit an inflection point for accessibility. If you have a GPU with enough memory and a willingness to read documentation, this is within reach for any experienced developer.
The Files That Matter
After training, your directory structure looks like this:
```
mistral-gps-finetuned/
├── checkpoint-epoch-1/
│   ├── adapter_config.json          # LoRA configuration
│   ├── adapter_model.safetensors    # The 320MB adapter weights
│   └── tokenizer files
├── checkpoint-epoch-2/
│   └── ...
└── checkpoint-final/
    └── ...                          # The checkpoint you actually use
```
The adapter_model.safetensors file is what you trained. Everything else in that directory is configuration and tokenizer state. Back this up — it represents 33 minutes of compute and your training data.
What's Next
A few directions I'm planning to explore:
RAG over the real codebase — Instead of hoping the model remembers implementation details, inject the actual source files as context at inference time. This is the correct architecture for "chat with your codebase" use cases.
Multiple adapters, one base model — If I fine-tune one adapter per content domain (databases, Azure DevOps, Node.js), I can swap adapters at request time without reloading the 14GB base model. That's an interesting serving architecture for a content site.
Longer training data pipeline — The bot protection on the site killed the crawler after 6 articles. The right fix is generating training data directly from the CMS export, not scraping the live site.
Fine-tuning a production-grade language model is no longer a research-lab activity. With the right hardware, the right tooling, and the willingness to debug your way through a few error messages, it's an evening project. The ceiling moved — substantially.
Shane is the founder of Grizzly Peak Software — a technical content hub for software engineers who've been in the industry long enough to have opinions. He writes from a cabin in Caswell Lakes, Alaska.