Inference vs training for LLMs — who pays for what.
Faisal Al-Anqoodi · Founder & CEO
Training might run once, for hours or days, and you pay a cluster bill. Inference runs forever and turns a model into a per-token Opex line. This article separates the two checkbooks so pilot budgets are not mixed with product bills [1].
In a finance review, someone asks how much the model costs, and one side answers with a training-job quote while the other answers with per-token API pricing. Both are right, in different contexts. Mixing them breaks Capex vs Opex planning [4].
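To make the two checkbooks concrete, here is a minimal sketch of the split. Every rate and volume below is an illustrative placeholder, not a vendor quote.

```python
# Hypothetical budget split: one-time training (Capex-like)
# vs recurring inference (Opex). Numbers are placeholders.

def training_cost(gpu_hours: float, rate_per_gpu_hour: float,
                  storage_overhead: float = 0.1) -> float:
    """One-time (or periodic) training bill: GPU hours plus a
    storage/checkpoint overhead fraction."""
    compute = gpu_hours * rate_per_gpu_hour
    return compute * (1 + storage_overhead)

def inference_cost_monthly(tokens_millions: float,
                           price_per_million: float) -> float:
    """Recurring monthly bill: production tokens priced per million."""
    return tokens_millions * price_per_million

train = training_cost(gpu_hours=2_000, rate_per_gpu_hour=2.5)  # finetune-scale job
serve = inference_cost_monthly(tokens_millions=300, price_per_million=4.0)

print(f"Training (one-time): ${train:,.0f}")
print(f"Inference (monthly): ${serve:,.0f}")
print(f"Inference over 12 months: ${serve * 12:,.0f}")
```

The point of the sketch is the shape, not the numbers: one bill is paid and done, the other compounds every month.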
Inference is running the model for a user request. Training is updating weights (or adapters) on batches. The team that treats them as one line item is surprised when usage scales [1]; see also the SLM vs API article and the 2026 LLM guide.
What you pay in training.
Pretraining or fine-tuning consumes GPU hours: memory-heavy runs, data pipelines, and repeated batch passes. You usually price it as $/GPU-hour (or cluster-hour), plus storage, checkpoints, and validation [4].
Even LoRA is still training: it still produces adapter files to ship and still needs evaluation discipline [3].
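To show why LoRA artifacts are small but still real shipped files: a rank-r adapter adds two low-rank matrices per adapted layer. The layer count and widths below are illustrative, not taken from any specific model.

```python
# Rough LoRA adapter size, assuming rank-r updates on selected
# linear layers. All model dimensions here are placeholders.

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """A rank-r delta adds two matrices: (d_in x r) and (r x d_out)."""
    return rank * (d_in + d_out)

layers = 32 * 2  # e.g. q and v projections in 32 attention blocks
per_layer = lora_params(4096, 4096, rank=8)
total = layers * per_layer

print(f"Adapter params: {total:,} (~{total * 2 / 1e6:.0f} MB in fp16)")
```

A few megabytes instead of a full weight copy, yet it still goes through the same release and evaluation pipeline.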
What you pay in inference.
Inference is billed per user flow: request volume, context length, KV-cache memory footprint, and your SLO all drive the number. Cloud vendors price per million tokens; private serving prices GPU hours and power [1][5].
A model trained once is not free to run: you convert a one-time build story into a continuous token economy [4].
Training buys behavior. Inference bills you every time that behavior is used. If you mix the two, you mix your budget too.
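The KV-cache footprint mentioned above has a back-of-envelope formula worth knowing, because it caps concurrency on private hardware. The dimensions below are illustrative, assuming fp16 keys and values.

```python
# Back-of-envelope KV-cache memory per request, assuming fp16
# keys and values. Model dimensions are illustrative placeholders.

def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_val: int = 2) -> int:
    """Two tensors (K and V) x layers x heads x head_dim x tokens x dtype size."""
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_val

per_req = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, context_len=8192)
print(f"~{per_req / 2**30:.1f} GiB of KV cache per 8k-token request")
```

This is why context length shows up in the bill twice: once as tokens charged, once as memory that limits how many requests a GPU can serve at once.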
A decision table: when inference dominates the math.
- When production tokens/month beat the margin you hoped to get from more training [5].
- When product latency and concurrency matter — then PagedAttention and vLLM class engines enter the cost model [2].
- When data crosses borders — PDPL + AI arguments apply to both training and serving [4].
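The first bullet above can be sketched as a breakeven check: after how many months does cumulative inference spend exceed a one-time training investment? Both figures are placeholders.

```python
# Breakeven sketch: months until cumulative inference spend
# overtakes a one-time training bill. Numbers are placeholders.

def breakeven_months(training_cost: float,
                     monthly_inference_cost: float) -> float:
    return training_cost / monthly_inference_cost

months = breakeven_months(training_cost=50_000, monthly_inference_cost=8_000)
print(f"Inference overtakes the training bill after ~{months:.1f} months")
```

If that number is small relative to your product's life, inference dominates the math and deserves the bigger row in the budget.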
Frequently asked questions.
- Is API always inference? Economically, yes: you are paying a vendor for inference and surrounding infrastructure, not for owning weights [1].
- Is training always more expensive? Not always: a short heavy training project can be cheaper than years of very large-scale inference at list price [5].
- What do I put in a contract? Separate caps for inference (tokens) from caps for finetune / adapter update cycles [3].
- How do I compare quotes? Request the same workload: same precision, p95, and $/1M tokens — not incomparable microbenchmarks [5].
- Where do GPU families show up? Different chips change the cost of both training and inference, just at different $/useful token [2].
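The "same workload" rule from the FAQ can be mechanized: convert any $/hour serving quote into $/1M tokens at a measured throughput, then compare against API list price on equal footing. The quote figures below are hypothetical.

```python
# Normalize vendor quotes to $ per million tokens for the SAME
# workload, so microbenchmarks at different precisions or
# percentiles don't mislead. All quote figures are hypothetical.

def dollars_per_million_tokens(hourly_cost: float,
                               tokens_per_second: float) -> float:
    """Convert a $/hour serving quote into $/1M tokens at a measured throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost / tokens_per_hour * 1_000_000

quote_a = dollars_per_million_tokens(hourly_cost=12.0, tokens_per_second=900)
quote_b = 4.0  # API list price, $/1M tokens
print(f"Private serving: ${quote_a:.2f}/1M tokens vs API: ${quote_b:.2f}/1M")
```

Insist that the throughput figure comes from your workload at your precision and p95, or the normalization is fiction.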
Closing and invitation.
Split the spreadsheet: one row for one-time or periodic training, one row for monthly inference tokens. Without that split, engineering's wins and finance's pain points get argued past each other [4].
This quarter, write one number: how many million tokens per month in production. If actual usage blows past the plan, you are not tuning the model; you are learning inference economics [5].
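The two-row split, as a minimal sketch with placeholder dollar amounts:

```python
# The two budget rows from above, with illustrative placeholder figures.
budget = {
    "training (one-time or periodic)": 50_000,   # $/year, e.g. quarterly finetunes
    "inference (monthly tokens)": 8_000 * 12,    # $/month x 12
}
for row, dollars in budget.items():
    print(f"{row}: ${dollars:,}")
print(f"Year total: ${sum(budget.values()):,}")
```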
Sources.
[1] OpenAI — API pricing (per-token, verify current).
[2] NVIDIA — Data center GPU product families.
[3] Hu et al. — LoRA (ICLR 2022).
[4] Nuqta — internal TCO templates, April 2026.
[5] Cloud pricing patterns — match to your provider and contract tier.
Related posts
- What is a large language model — complete guide for 2026.
This is not a glossary entry. It is the operating calculation behind LLM decisions in 2026: how the model works, where it fails, and how to choose the right deployment path.
- What is LoRA — and how it cuts fine-tuning cost.
When people say fine-tuning, many still picture updating billions of weights in an expensive full pass. LoRA freezes the base and injects a low-rank delta into selected linear paths — often enough to shift behavior on a narrow task without shipping a full weight copy. This article explains the idea without hype, and when savings move from slides to investment [1].
- When a small on-prem model beats a cloud API subscription.
This is not anti-cloud. It is a spreadsheet: when an open small or medium model on your own GPU wins on three-year TCO and compliance — and year-one math lies if you ignore context and labor.
- GPT-4 vs Claude vs Gemini — an objective comparison.
This is not a popularity vote. It is a decision frame: what differentiates each family, where each leads, where each weakens, and how to choose without buying the myth of a single "best" model.
- How the Transformer works — a plain-language guide.
"Attention Is All You Need" changed the industry, but it does not belong in a product review meeting. This is the version for builders: one mechanism called attention, reweighting importance between tokens based on context — without a single equation.