# Inference vs training for LLMs — who pays for what.


*AI · Models · April 2026 · 7 min read*


Training runs once, or a few times, and you pay a cluster bill. Inference runs forever and turns a model into a per-token Opex line. This article separates the two checkbooks so pilot budgets do not get mixed with product bills [1].

In a finance review, someone asks how much the model costs. One side answers with a training job quote; another answers with per-token API pricing. Both are right, in different contexts. Mixing them breaks Capex vs Opex planning [4].

Inference is running the model for a user request. Training is updating weights (or adapters) on batches. The team that treats them as one line item is surprised when usage scales [1]; see also the [SLM vs API](/en/journal/slm-local-vs-api-economics-2026) article and the [2026 LLM guide](/en/journal/what-is-llm-complete-guide-2026).


## What you pay in training.
Pretraining or finetuning consumes GPU hours: memory-heavy runs, data pipelines, and repeated batch passes. You usually price it as $/GPU-hour (or cluster-hour), plus storage, checkpoints, and validation [4].
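The $/GPU-hour framing can be sketched in a few lines. All figures below are placeholders for illustration, not real quotes; plug in your own cluster rate, run size, and overheads.

```python
# Sketch: rough cost of one training run priced in GPU-hours plus overheads.
# Every number here is an assumption for illustration.
gpu_hour_usd = 2.50    # assumed $/GPU-hour for your cluster or cloud tier
gpus = 64              # GPUs in the run
run_hours = 72         # wall-clock hours for the job
storage_usd = 400      # checkpoints, datasets, artifacts
eval_usd = 300         # validation / evaluation passes

compute_usd = gpu_hour_usd * gpus * run_hours
total_usd = compute_usd + storage_usd + eval_usd
print(f"compute ${compute_usd:,.0f}, total ${total_usd:,.0f}")
```

The point of the sketch is the shape of the bill: compute dominates, but storage, checkpoints, and evaluation are recurring lines that a pure $/GPU-hour quote hides.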

Even [LoRA](/en/journal/what-is-lora-efficient-fine-tuning-2026) is still training: it produces adapter files that must be shipped, and it still needs evaluation discipline [3].


## What you pay in inference.
Inference is billed along the user flow: request volume, context length, KV-cache memory footprint, and your SLO. Cloud vendors price per million tokens; private serving is priced in hours and power [1][5].

A model trained once is not free to run: you convert a one-time build story into a continuous token economy [4].
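The "continuous token economy" is easy to make concrete. A minimal sketch, assuming placeholder per-token prices (not any vendor's actual rates; check your provider's current price list) and a made-up traffic profile:

```python
# Sketch: monthly inference bill at per-token pricing.
# Prices and traffic figures are assumptions, not vendor rates.
price_in_per_m = 3.00         # assumed $ per 1M input tokens
price_out_per_m = 12.00       # assumed $ per 1M output tokens
requests_per_month = 500_000
tokens_in_per_req = 1_200     # prompt + retrieved context
tokens_out_per_req = 300      # generated answer

m_in = requests_per_month * tokens_in_per_req / 1e6    # millions of input tokens
m_out = requests_per_month * tokens_out_per_req / 1e6  # millions of output tokens
monthly_usd = m_in * price_in_per_m + m_out * price_out_per_m
print(f"{m_in:.0f}M in, {m_out:.0f}M out -> ${monthly_usd:,.0f}/month")
```

Note that this line scales with usage, not with training effort: double the traffic and the bill doubles, whatever the training run cost.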


> Training buys behavior. Inference bills you every time that behavior is used. If you mix the two, you mix your budget too.


## A decision table: when inference dominates the math.
- When production tokens/month cost more than the margin you hoped to win from more training [5].
- When product latency and concurrency matter — then [PagedAttention](/en/journal/what-is-pagedattention-llm-serving-2026) and vLLM class engines enter the cost model [2].
- When data crosses borders — [PDPL + AI](/en/journal/oman-pdpl-2022-impact-on-ai-2026) arguments apply to both training and serving [4].
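The first bullet is a break-even question, and it can be sketched directly. All numbers below are assumptions: a hypothetical one-time finetune cost, a hypothetical serving bill, and an assumed per-token saving (for example from shorter prompts or a smaller model).

```python
# Sketch: when does a one-time finetune pay for itself via cheaper serving?
# All figures are assumptions for illustration.
finetune_cost_usd = 12_000       # one-time adapter/finetune project
baseline_usd_per_month = 3_600   # serving bill before the finetune
reduction = 0.30                 # assumed 30% cheaper tokens afterwards

monthly_saving = baseline_usd_per_month * reduction
breakeven_months = finetune_cost_usd / monthly_saving
print(f"saves ${monthly_saving:,.0f}/month, break-even ~{breakeven_months:.1f} months")
```

If the break-even horizon is longer than the model's expected shelf life, inference dominates the math and more training does not pay.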


## Frequently asked questions.
- Is API always inference? Economically, yes: you are paying a vendor for inference and surrounding infrastructure, not for owning weights [1].
- Is training always more expensive? Not always: a short heavy training project can be cheaper than years of very large-scale inference at list price [5].
- What do I put in a contract? Separate caps for inference (tokens) from caps for finetune / adapter update cycles [3].
- How do I compare quotes? Request the same workload: same precision, p95, and $/1M tokens — not incomparable microbenchmarks [5].
- Where do [GPU families](/en/journal/l40s-a100-h100-gpu-task-matrix-2026) show up? Different chips move both training and inference — just at different $/useful token [2].
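To compare an hourly private-serving quote against per-token API pricing, normalize both to $/1M tokens. A minimal sketch, assuming a placeholder hourly rate and a sustained throughput figure you would have to measure at your own precision and p95 target:

```python
# Sketch: normalize an hourly serving quote to $/1M tokens.
# Rate and throughput are assumptions; measure your own at the target p95.
gpu_usd_per_hour = 4.00      # assumed blended $/hour for one serving GPU
tokens_per_second = 2_500    # assumed sustained throughput at your SLO

tokens_per_hour = tokens_per_second * 3600
usd_per_m_tokens = gpu_usd_per_hour / (tokens_per_hour / 1e6)
print(f"~${usd_per_m_tokens:.2f} per 1M tokens")
```

This is the "same workload" discipline from the FAQ: the number is only comparable if both sides use the same precision, context length, and latency target.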


## Closing and invitation.
Split the spreadsheet: a row for one-time or periodic training, and a row for monthly inference tokens. Without that split, engineering claims a win while finance absorbs the pain, and the two point at each other [4].
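The two-row split is trivial to encode, which is the point. Illustrative numbers only:

```python
# Sketch: the two-row budget split in code form (illustrative numbers only).
budget = {
    "training_capex_usd": 12_000,            # one-time or periodic runs
    "inference_opex_usd_per_month": 3_600,   # monthly token bill
}
first_year_usd = (budget["training_capex_usd"]
                  + 12 * budget["inference_opex_usd_per_month"])
print(f"first-year total ${first_year_usd:,.0f}")
```

Kept as two rows, the recurring line visibly outgrows the one-time line as usage scales, which is exactly the conversation the finance review needs.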

This quarter, write down one number: how many million tokens per month you serve in production. If usage beats the plan, you are no longer tuning a model; you are learning inference economics [5].


## Sources.
[1] OpenAI — API pricing (per-token, verify current). https://openai.com/api/pricing/

[2] NVIDIA — Data center GPU product families. https://www.nvidia.com/en-us/data-center/

[3] Hu et al. — LoRA (ICLR 2022). https://arxiv.org/abs/2106.09685

[4] Nuqta — internal TCO templates, April 2026.

[5] Cloud pricing patterns — match to your provider and contract tier.
