
Founding lineup · April 2026

Two models live

Our starting models.

Every catalogue has to start somewhere. We chose two: a 350M Liquid model that is almost impossibly fast, and a 26B Gemma 4 MoE that is almost impossibly capable for its active parameter count. One handles the hot path of LeemerChat. The other handles the hard questions. Both are open-weight, both are hosted in Ireland.

Liquid
Speed tier

Liquid LFM2.5-350M

alias · lfm2.5-350m-free

Liquid AI's 350M-parameter model, built on Linear Input-Varying Systems rather than transformers. Trained on 28 trillion tokens. It runs on a Raspberry Pi. The model is capable of 40,400 tokens per second on an H100 — on the free tier we serve it at a throttled but still blazing 200 tok/s.

350M

Parameters

32K

Context

200

tok/s free tier

40.4K

tok/s peak (H100)

Gemma
Depth tier

Gemma 4 26B A4B

alias · gemma4-26b-a4b

Google DeepMind's mixture-of-experts Gemma 4: 25.2B total parameters, only 3.8B active per token. Multimodal, 256K context, reasoning mode built in. Runs almost as fast as a dense 4B while thinking like a 26B.

25.2B

Total params

3.8B

Active

256K

Context

40+

tok/s served

Why these two

A fast model and a smart model. Not one trying to be both.

Most inference stacks pick a single mid-size generalist and pay for it on every request — including the ones that are really just “pick a route” or “summarise this into a title.” That is expensive, and it is slower than it needs to be. We split the work.

Liquid LFM2.5-350M is our hot path. It is not the smartest model in the world — Liquid AI is transparent about that — but at 350M parameters with an 80,000-to-one token-to-param ratio, it is extraordinarily dense for structured work. Gemma 4 26B A4B is our depth tier: an MoE that activates only 3.8B parameters per token, so it runs fast for its size, but reasons, codes, and parses images like a model ten times the active count.

Together they cover the full shape of a real product. The small one is what you throw nine out of ten requests at. The big one is what you reach for when the request deserves it.
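The split described above can be sketched as a two-step call against an OpenAI-compatible gateway. Only the two aliases and the wire format come from this page; the base URL is a placeholder and the helper names (`choose_alias`, `chat`, `route_and_answer`) are hypothetical, not part of any published SDK.

```python
import json
import urllib.request

# Placeholder base URL -- substitute your real gateway endpoint.
BASE_URL = "https://gateway.example.com/v1"

SPEED_ALIAS = "lfm2.5-350m-free"  # Liquid LFM2.5-350M: hot path, routing
DEPTH_ALIAS = "gemma4-26b-a4b"    # Gemma 4 26B A4B: reasoning, code, vision

def choose_alias(label: str) -> str:
    """Map a classifier label to a model alias; only 'hard' hits the depth tier."""
    return DEPTH_ALIAS if label == "hard" else SPEED_ALIAS

def chat(alias: str, messages: list, api_key: str = "") -> str:
    """One chat-completions round trip using the standard OpenAI wire format."""
    body = json.dumps({"model": alias, "messages": messages}).encode()
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

def route_and_answer(user_prompt: str) -> str:
    """Ask the 350M model to grade the request, then dispatch accordingly."""
    label = chat(SPEED_ALIAS, [
        {"role": "system",
         "content": "Answer with exactly one word: 'simple' or 'hard'."},
        {"role": "user", "content": user_prompt},
    ]).strip().lower()
    return chat(choose_alias(label), [{"role": "user", "content": user_prompt}])
```

The key property is that the classification call costs a handful of tokens on the fastest model, so the overhead of routing is negligible next to the request it routes.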

Liquid

The speed model · Liquid

The model that touches every request in LeemerChat.

LFM2.5-350M is everywhere in LeemerChat. Before a message ever reaches a frontier model it passes through Liquid: intent routing, tool selection, conversation title generation, safety gating, the tiny classifications that make the product feel fast. The model is capable of 40,400 tokens per second; on the public free tier we throttle it to 200 tok/s — still faster than anything you will feel, and comfortably within our fair-use envelope for a free model.

It is also the model we run on the edge. An offline mode for LeemerChat is coming — the user downloads the weights once, and the whole thing runs in the browser, on-device, with no round-trip to our servers. No cloud, no API key, no latency floor. A 350M model with an 81MB minimum memory footprint is the only class of model where that is actually possible today.

We are honest about what it is not: not a writer, not a mathematician, not a coder. It is the fastest correct answer to a simple question, and the fastest correct routing decision to a hard one.

Routing

Every LeemerChat request starts here

hot path

Titles

Conversation titles, auto-summaries

background

Offline mode

Full browser-side inference — soon

coming

Agent tool-use

Function calling, JSON extraction

production
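The agent tool-use row above can be illustrated with the standard OpenAI function-calling schema. The tool name and its fields here are purely illustrative; only the request shape (`tools`, `tool_choice`) follows the established wire format.

```python
# Hypothetical tool definition -- the name and parameters are illustrative;
# the schema shape is the standard OpenAI function-calling format.
SET_TITLE_TOOL = {
    "type": "function",
    "function": {
        "name": "set_conversation_title",
        "description": "Store a short title for the current conversation.",
        "parameters": {
            "type": "object",
            "properties": {"title": {"type": "string", "maxLength": 60}},
            "required": ["title"],
        },
    },
}

def tool_call_payload(transcript: str) -> dict:
    """Request body asking the 350M model to emit a structured tool call
    instead of free-form prose -- the shape of work it is built for."""
    return {
        "model": "lfm2.5-350m-free",
        "messages": [{
            "role": "user",
            "content": f"Title this conversation:\n{transcript}",
        }],
        "tools": [SET_TITLE_TOOL],
        # Force the model to call the tool, guaranteeing parseable JSON out.
        "tool_choice": {"type": "function",
                        "function": {"name": "set_conversation_title"}},
    }
```

Forcing `tool_choice` turns a generation problem into a constrained extraction problem, which is exactly where a small, fast model holds its own.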
Gemma

intelligence budget

10× active params

vs. our 350M speed model

Gemma 4 26B A4B activates 3.8B parameters per token — more than ten times what LFM2.5-350M has in total. That is the difference between “route this request” and “read this repo and propose a fix.” You pay for it in latency, but we keep the tier at a sustained 40+ tokens per second.
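The “more than ten times” claim is simple arithmetic on the two figures quoted on this page:

```python
# Figures from this page: 3.8B params active per token (Gemma 4 26B A4B)
# versus 350M total params (Liquid LFM2.5-350M).
gemma_active = 3.8e9
liquid_total = 350e6

ratio = gemma_active / liquid_total
print(f"{ratio:.1f}x more parameters active per token")  # 10.9x
```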

82.6%

MMLU Pro

77.1%

LiveCodeBench v6

88.3%

AIME 2026

Gemma

The depth model · Gemma 4

Slower than Liquid. Worth every millisecond.

Gemma 4 is Google DeepMind's open-weight family, and 26B A4B is the mixture-of-experts variant we chose: 25.2B total parameters with 8 of 128 experts (plus one shared) active per token, giving an effective footprint of 3.8B. The architecture interleaves local sliding-window attention with global attention, with unified KVs and proportional RoPE on the global layers, so a 256K context actually fits in memory instead of theoretically fitting.

It is multimodal — text and image in, text out, with variable-aspect vision and a 550M-parameter vision encoder. It has a real thinking mode. Native function calling. Native system prompts. Trained for coding and agentic work. On LiveCodeBench v6 it beats the previous Gemma 3 27B by almost fifty points.

It is a lot slower than a 350M model. That is the point — this is the tier you reach when correctness is worth the wait. We still serve it at a sustained 40+ tokens per second, which is the open-source industry bar for a model this size.

Side by side

Two models, one job description each.

Dimension

Liquid LFM2.5-350M

Gemma 4 26B A4B

Role in LeemerChat

Routing, titles, offline

Reasoning, long context, vision

Architecture

LIV (non-transformer)

MoE · 8 of 128 experts

Parameters

350M dense

25.2B total / 3.8B active

Context window

32K tokens

256K tokens

Modalities

Text

Text, Image

Served speed (free)

200 tok/s (capable of 40,400)

40+ tok/s

Runs locally

Phone, Pi, laptop

Workstation / consumer GPU

Best at

Structured tasks, function calls

Coding, math, agentic work

Liquid · Gemma

Two models · hosted in Ireland · OpenAI-compatible

Route to Liquid. Reason with Gemma. Pay for neither until you do.

Both models are available on the free tier of our OpenAI-compatible gateway today. Fine-tuning and custom variants run through Foundry.