Inference vs training for LLMs — who pays for what.
Faisal Al-Anqoodi · Founder & CEO
Training might run once, for hours or days, and you pay a cluster bill. Inference runs forever and turns a model into a per-token Opex line. This article separates the two checkbooks so pilot budgets are not mixed with product bills [1].
In a finance review, someone asks how much the model costs, and one side answers with a training-job quote while the other answers with per-token API pricing. Both are right, in different contexts. Mixing them breaks Capex vs Opex planning [4].
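To make the two checkbooks concrete, here is a minimal sketch of the split. Every rate and volume below is an illustrative placeholder, not a vendor quote.

```python
# Hypothetical budget split: one-time training (Capex-like)
# vs recurring inference (Opex). Numbers are placeholders.

def training_cost(gpu_hours: float, rate_per_gpu_hour: float,
                  storage_overhead: float = 0.1) -> float:
    """One-time (or periodic) training bill: GPU hours plus a
    storage/checkpoint overhead fraction."""
    compute = gpu_hours * rate_per_gpu_hour
    return compute * (1 + storage_overhead)

def inference_cost_monthly(tokens_millions: float,
                           price_per_million: float) -> float:
    """Recurring monthly bill: production tokens priced per million."""
    return tokens_millions * price_per_million

train = training_cost(gpu_hours=2_000, rate_per_gpu_hour=2.5)  # finetune-scale job
serve = inference_cost_monthly(tokens_millions=300, price_per_million=4.0)

print(f"Training (one-time): ${train:,.0f}")
print(f"Inference (monthly): ${serve:,.0f}")
print(f"Inference over 12 months: ${serve * 12:,.0f}")
```

The point of the sketch is the shape, not the numbers: one bill is paid and done, the other compounds every month.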
Inference is running the model for a user request. Training is updating weights (or adapters) on batches. The team that treats them as one line item is surprised when usage scales [1]; see also the SLM vs API article and the 2026 LLM guide.
What you pay in training.
Pretraining or fine-tuning consumes GPU hours: memory-heavy runs, data pipelines, and repeated batch passes. You usually price it as $/GPU-hour (or cluster-hour), plus storage, checkpoints, and validation [4].
Even LoRA is still training: it still produces adapter files to ship and still needs evaluation discipline [3].
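To show why LoRA artifacts are small but still real shipped files: a rank-r adapter adds two low-rank matrices per adapted layer. The layer count and widths below are illustrative, not taken from any specific model.

```python
# Rough LoRA adapter size, assuming rank-r updates on selected
# linear layers. All model dimensions here are placeholders.

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """A rank-r delta adds two matrices: (d_in x r) and (r x d_out)."""
    return rank * (d_in + d_out)

layers = 32 * 2  # e.g. q and v projections in 32 attention blocks
per_layer = lora_params(4096, 4096, rank=8)
total = layers * per_layer

print(f"Adapter params: {total:,} (~{total * 2 / 1e6:.0f} MB in fp16)")
```

A few megabytes instead of a full weight copy, yet it still goes through the same release and evaluation pipeline.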
What you pay in inference.
Inference is billed per user flow: request volume, context length, KV-cache memory footprint, and your SLO all drive the number. Cloud vendors price per million tokens; private serving prices GPU hours and power [1][5].
A model trained once is not free to run: you convert a one-time build story into a continuous token economy [4].
Training buys behavior. Inference bills you every time that behavior is used. If you mix the two, you mix your budget too.
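The KV-cache footprint mentioned above has a back-of-envelope formula worth knowing, because it caps concurrency on private hardware. The dimensions below are illustrative, assuming fp16 keys and values.

```python
# Back-of-envelope KV-cache memory per request, assuming fp16
# keys and values. Model dimensions are illustrative placeholders.

def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_len: int, bytes_per_val: int = 2) -> int:
    """Two tensors (K and V) x layers x heads x head_dim x tokens x dtype size."""
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_val

per_req = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, context_len=8192)
print(f"~{per_req / 2**30:.1f} GiB of KV cache per 8k-token request")
```

This is why context length shows up in the bill twice: once as tokens charged, once as memory that limits how many requests a GPU can serve at once.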
A decision table: when inference dominates the math.
- When production tokens/month beat the margin you hoped to get from more training [5].
- When product latency and concurrency matter — then PagedAttention and vLLM class engines enter the cost model [2].
- When data crosses borders — PDPL + AI arguments apply to both training and serving [4].
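The first bullet above can be sketched as a breakeven check: after how many months does cumulative inference spend exceed a one-time training investment? Both figures are placeholders.

```python
# Breakeven sketch: months until cumulative inference spend
# overtakes a one-time training bill. Numbers are placeholders.

def breakeven_months(training_cost: float,
                     monthly_inference_cost: float) -> float:
    return training_cost / monthly_inference_cost

months = breakeven_months(training_cost=50_000, monthly_inference_cost=8_000)
print(f"Inference overtakes the training bill after ~{months:.1f} months")
```

If that number is small relative to your product's life, inference dominates the math and deserves the bigger row in the budget.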
Frequently asked questions.
- Is API always inference? Economically, yes: you are paying a vendor for inference and surrounding infrastructure, not for owning weights [1].
- Is training always more expensive? Not always: a short heavy training project can be cheaper than years of very large-scale inference at list price [5].
- What do I put in a contract? Separate caps for inference (tokens) from caps for finetune / adapter update cycles [3].
- How do I compare quotes? Request the same workload: same precision, p95, and $/1M tokens — not incomparable microbenchmarks [5].
- Where do GPU families show up? Different chips change the cost of both training and inference, just at different $/useful token [2].
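The "same workload" rule from the FAQ can be mechanized: convert any $/hour serving quote into $/1M tokens at a measured throughput, then compare against API list price on equal footing. The quote figures below are hypothetical.

```python
# Normalize vendor quotes to $ per million tokens for the SAME
# workload, so microbenchmarks at different precisions or
# percentiles don't mislead. All quote figures are hypothetical.

def dollars_per_million_tokens(hourly_cost: float,
                               tokens_per_second: float) -> float:
    """Convert a $/hour serving quote into $/1M tokens at a measured throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost / tokens_per_hour * 1_000_000

quote_a = dollars_per_million_tokens(hourly_cost=12.0, tokens_per_second=900)
quote_b = 4.0  # API list price, $/1M tokens
print(f"Private serving: ${quote_a:.2f}/1M tokens vs API: ${quote_b:.2f}/1M")
```

Insist that the throughput figure comes from your workload at your precision and p95, or the normalization is fiction.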
Closing and invitation.
Split the spreadsheet: one row for one-time or periodic training, one row for monthly inference tokens. Without that split, engineering's wins and finance's pain points get argued past each other [4].
This quarter, write one number: how many million tokens per month in production. If actual usage blows past the plan, you are not tuning the model; you are learning inference economics [5].
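The two-row split, as a minimal sketch with placeholder dollar amounts:

```python
# The two budget rows from above, with illustrative placeholder figures.
budget = {
    "training (one-time or periodic)": 50_000,   # $/year, e.g. quarterly finetunes
    "inference (monthly tokens)": 8_000 * 12,    # $/month x 12
}
for row, dollars in budget.items():
    print(f"{row}: ${dollars:,}")
print(f"Year total: ${sum(budget.values()):,}")
```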
Sources.
[1] OpenAI — API pricing (per-token, verify current).
[2] NVIDIA — Data center GPU product families.
[3] Hu et al. — LoRA (ICLR 2022).
[4] Nuqta — internal TCO templates, April 2026.
[5] Cloud pricing patterns — match to your provider and contract tier.
Related posts
- What is a large language model — complete guide for 2026.
This is not a glossary entry. It is the operating calculation behind LLM decisions in 2026: how the model works, where it fails, and how to choose the right deployment path.
- What is LoRA — and how it cuts fine-tuning cost.
When people say fine-tuning, many still picture updating billions of weights in an expensive full pass. LoRA freezes the base and injects a low-rank delta into selected linear paths — often enough to shift behavior on a narrow task without shipping a full weight copy. This article explains the idea without hype, and when savings move from slides to investment [1].
- When a small on-prem model beats a cloud API subscription.
This is not anti-cloud. It is a spreadsheet: when an open small or medium model on your own GPU wins on three-year TCO and compliance — and year-one math lies if you ignore context and labor.
- GPT-4 vs Claude vs Gemini — an objective comparison.
This is not a popularity vote. It is a decision frame: what differentiates each family, where each leads, where each weakens, and how to choose without buying the myth of a single "best" model.
- How the Transformer works — a plain-language guide.
"Attention Is All You Need" changed the industry, but it does not belong in a product review meeting. This is the version for builders: one mechanism called attention, reweighting importance between tokens based on context — without a single equation.