Back to lab notes
Qwen

April 2026 · Foundry

We fine-tuned a 4B to outrun models five times its size

Starting from Qwen3-4B-Thinking, we built a reasoning-distilled variant using LoRA via Unsloth to approximate the advanced reasoning of the larger Qwen3.6-plus teacher. The distillation target is not “more tokens” but less noise: we reduce rambling and hedging in favour of structured, checkable answer paths.

What we changed

Smaller “thinking” models often imitate long chain-of-thought without the calibration of a frontier teacher: they meander, second-guess in public, and bury the actual algorithm under performative self-dialogue. We distilled with traces aligned to Qwen3.6-plus so the student learns when to be brief, when to decompose, and when to state invariants up front.

The result is LeemerLabs/Qwen3-4B-Qwen3.6-plus-Reasoning-Distilled — same backbone family as the base checkpoint, with post-training that rewards concise structure over exploratory volume. In practice that means fewer “Hmm / Wait / Actually” loops and more problem → approach → steps → complexity → pseudocode, which is what human reviewers and downstream tool callers actually need.

Multi-domain benchmark

We evaluate on a multi-domain reasoning suite (same methodology as LeemerLabs/Multi-Domain-Reasoning-Benchmark). The chart below compares our distilled 4B against the Qwen3-4B-Thinking-2507 base on success rate per task class.

Multi-Model Reasoning Performance Comparison

Benchmark: LeemerLabs/Multi-Domain-Reasoning-Benchmark

Success rate (%)

LeemerLabs (distilled)Qwen (base)

LeemerLabs/Qwen3-4B-Qwen3.6-plus-Reasoning-DistilledQwen/Qwen3-4B-Thinking-2507

Scientific explanation (RAG)

100.0%
100.0%

Ethical dilemma

66.4%
74.5%

Complex scenario analysis

61.8%
60.9%

Constrained creative writing

36.4%
34.5%

Logical reasoning

60.0%
60.0%

Mathematical reasoning

100.0%
100.0%

Planning & optimization

56.4%
38.2%

Python code analysis & debugging

95.5%
69.1%

SQL query generation

81.8%
100.0%

Causal reasoning (RAG)

98.2%
100.0%

Largest relative lifts for the distilled model show up in Python code analysis & debugging and planning / optimisation — the kinds of problems where a clean plan matters more than vocabulary size. Trade-offs remain (for example, SQL in this run still favours the base checkpoint); we publish the numbers so you can map them to your own stack.

Reasoning style: base vs. distilled

Base (Qwen3-4B-Thinking). Stream-of-consciousness and exploratory. The model often auditions several interpretations, rereads the prompt mid-trace, and can loop on self-correction before converging. Useful for research-style exploration — noisy as a default production interface.

Distilled (LeemerLabs, Qwen3.6-plus–aligned). Report-shaped: it separates input, output, constraints early, names an algorithmic plan (for graph-style tasks, e.g. state-space search), and holds a steadier line from reasoning to pseudocode. It is the same small backbone with behaviour closer to an engineering handoff than a live stream of consciousness.

Coming soon · Free tier · EU

The 4B on a free endpoint — then something much larger, still local.

We are wiring the distilled 4B to the public, free OpenAI-compatible gateway under the model id leemerlabs/qwen3-4b-power. Same EU residency story as the rest of LeemerLabs: requests stay on our infrastructure in Europe — no silent routing to the US for inference. When the slot is live, it will show up in the same console flow as the other public models; until then, request access and we will add you to the allowlist.

The 4B is the opening move. In parallel we are fine-tuning from a ~27B open-weight family, and we are not stopping there — the pipeline also targets an even stronger open-source teacher so we can distil a model that feels like a local, free-inference beast: serious reasoning depth without shipping prompts overseas or paying Silicon Valley by the token. Training and eval loops run on EU metal in Ireland, next to where we will serve the weights. More detail on aliases and rate limits when we cut the release notes.

Foundry · EU

The distilled weights are a LeemerLabs build — not a relabel of a public small-model drop.

If you want to fine-tune from the same family or run evals in our environment, start with access; we can point you at compatible tooling (Unsloth, LoRA targets, and eval harnesses) without mixing this up with the open-beta inference pair on the public gateway.