Practice10 Apr 2026 · 10 min read

QLoRA on a single GPU: real costs, real trade-offs

Training Born-9B v0 on a single RTX 6000 Ada cost under €10. Here's the full hardware breakdown, the hyperparameters, and what we'd change for v1.

Hardware setup

Born-9B v0 was trained on a single Runpod instance with an RTX 6000 Ada (48 GB VRAM). The total compute bill for the v0 training run was under €10. This is not a typo — QLoRA training of a 9B parameter model is remarkably accessible on current hardware.

ParameterValue

GPU1 × NVIDIA RTX 6000 Ada (48 GB)

Cloud providerRunpod

Base modelQwen/Qwen3.5-9B

MethodQLoRA (4-bit NF4 quantization)

LoRA rank64

LoRA alpha128

Learning rate2e-4 (cosine schedule)

Batch size4 (gradient accumulation 4)

Epochs3

Max seq length4096 tokens

Total compute cost< €10

Why QLoRA and not full fine-tuning?

A full fine-tune of a 9B parameter model requires at least 4× A100 80 GB GPUs (or equivalent) and costs hundreds to thousands of euros per training run. QLoRA reduces this to a single 48 GB GPU by quantizing the base model to 4-bit and training only the low-rank adapters in 16-bit.

The trade-off is real: QLoRA produces a smaller behavioral shift than full fine-tuning. For a distillation project like Born-9B, this means the dataset quality becomes even more critical. A small adapter has a smaller capacity for error correction — every noisy row in the training data has a proportionally larger effect on the final model.

Training dynamics

The v0 training run completed in approximately 45 minutes for 3 epochs over 225 rows. Loss converged cleanly without obvious overfitting, though the small dataset makes epoch-level loss an unreliable signal.

For v1, we expect the training time to increase to roughly 3-5 hours as the dataset grows to 2,000+ rows. The hardware will remain the same — the entire Born training infrastructure is designed to run on a single high-end consumer GPU.

What we'd change for v1

Higher LoRA rank. Rank 64 was conservative. For v1, we'll experiment with rank 128, which gives the adapter more capacity to shift model behavior without significantly increasing VRAM usage on the RTX 6000 Ada.

Longer max sequence length. Many agent-style tasks produce responses in the 4,000-6,000 token range. Truncating at 4,096 may have harmed quality on the longer tasks. v1 will train with a 6,144 token context.

DPO pass. The v1 dataset includes preference pairs (DPO format). After the SFT pass, we plan to run a DPO training phase to further refine response quality.

Accessibility as a design principle

Born exists partly to prove that meaningful model development does not require a 100-GPU cluster. The constraint of a single RTX 6000 Ada is not a limitation — it's a design choice. If we can produce useful models on accessible hardware with transparent data, the barrier to entry for applied AI drops by orders of magnitude.

Every hyperparameter, every cost number, and every training log from Born-9B will be published. If someone wants to reproduce this work on their own Runpod instance, they should be able to do so from the documentation alone.

Evaluation-first post All posts