# L40S vs A100 vs H100 — which GPU for which job.


*AI · Infrastructure · April 2026 · 8 min read*


The question is not which SKU is fastest on a slide; it is workload fit: heavy training, broad inference, or cost-per-watt chat serving. One matrix places the L40S, A100, and the [H100 reference](/en/journal/nvidia-h100-gpu-ai-standard-2026) on the same decision axis, without hand-waving in procurement [1].

Cloud catalogs make chip names look like currencies. The failure mode is picking the newest generation when your workload is mostly steady-state serving, where memory capacity, batching, and a good serving engine matter as much as peak TFLOPS [1].

This article complements the [H100 explainer](/en/journal/nvidia-h100-gpu-ai-standard-2026): it pulls L40S and A100 into the same practical comparison [2].


## Three families, fast.
H100 (Hopper): the datacenter class most cited for heavy training and large-scale inference; Tensor Cores, wide HBM, and the ecosystem default for new GenAI TCO models [2].

A100 (Ampere): still common in broad training and HPC-style stacks; a balanced historical workhorse in many price lists [2].

L40S (Ada): often a strong play for efficient inference, visualization-adjacent stacks, and power-conscious deployments; not a universal H100 replacement for every training run [1].


> The right question is not which chip. It is: which chip hits your SLO at your context size and your concurrency for the same $/M tokens.
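That framing can be made concrete with a small calculator: convert a measured, sustained throughput and an hourly rental price into $/1M tokens. A minimal sketch, assuming per-GPU hourly pricing; all numbers below are hypothetical placeholders, not vendor quotes.

```python
# Cost per million generated tokens from a measured throughput.
# Inputs are hypothetical placeholders, not vendor quotes.

def cost_per_million_tokens(hourly_usd: float,
                            tokens_per_second: float,
                            utilization: float = 1.0) -> float:
    """$/1M tokens for one GPU at a measured, sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return hourly_usd / tokens_per_hour * 1_000_000

# Example: a card rented at $1.80/h sustaining 900 tok/s at 70% utilization.
print(round(cost_per_million_tokens(1.80, 900.0, 0.70), 3))  # → 0.794
```

The utilization term matters: a cheaper card that idles between bursts can cost more per useful token than a pricier card kept busy.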


## FIG. 1 — train vs serve (simplified).
*[Figure: FIG. 1 — GPU FAMILY ↔ PRIMARY WORKLOAD (SCHEMATIC)]*


## A Nuqta-field rule.
We often start pilots on smaller cards and only scale once tokens/second and latency are measured on identical prompts; that keeps procurement honest [5].
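The pilot measurement itself can be a few lines: run the same prompt set against each candidate and record tokens/second plus p95 latency. A minimal sketch; `generate` is a stand-in for your serving client (vLLM, TGI, or similar) and is assumed to return the number of tokens produced.

```python
import statistics
import time

def bench(generate, prompts):
    """Run identical prompts, return aggregate tokens/s and p95 latency."""
    latencies, tokens = [], 0
    t0 = time.perf_counter()
    for p in prompts:
        start = time.perf_counter()
        tokens += generate(p)  # stand-in client; returns tokens generated
        latencies.append(time.perf_counter() - start)
    wall = time.perf_counter() - t0
    p95 = statistics.quantiles(latencies, n=20)[-1]  # 95th percentile
    return {"tok_per_s": tokens / wall, "p95_s": p95}
```

Run it with the exact same prompt list on every card under comparison; changing prompts between runs quietly changes context length, and with it the result.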

Pair this with [inference vs training economics](/en/journal/inference-vs-training-llm-economics-2026): training spends hours; inference spends tokens forever [4].


## Frequently asked questions.
- Is the L40S enough for demanding Arabic workloads? It depends on model size and context length, not on the language name [3].
- Should I always jump from A100 to H100? Not if the bottleneck is the serving engine; sometimes software lifts the ceiling before you buy new silicon [2].
- How do I make vendor A vs vendor B comparable? Fix precision, driver, and engine versions before comparing tokens/sec [3].
- What about [vLLM](/en/journal/what-is-vllm-production-serving-2026)? It raises throughput; it does not change GPU physics [2].
- Does this apply in Oman? Supply, contracts, and colocation still filter which SKUs you can actually land; also read [digital sovereignty in Oman](/en/journal/digital-sovereignty-oman) [5].
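The comparability point above can even be enforced in code: refuse to compare two benchmark runs unless precision, driver, and engine match. A minimal sketch; the field names and version strings are illustrative, not a standard schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RunConfig:
    precision: str  # e.g. "fp8", "bf16"
    driver: str     # e.g. "550.54" (illustrative)
    engine: str     # e.g. "vllm-0.8.1" (illustrative)

def comparable(a: RunConfig, b: RunConfig) -> bool:
    """Only compare tokens/sec when every setting matches."""
    return a == b

a = RunConfig("fp8", "550.54", "vllm-0.8.1")
b = RunConfig("bf16", "550.54", "vllm-0.8.1")
print(comparable(a, b))  # → False: different precision, numbers not comparable
```

A vendor quoting fp8 numbers against your bf16 baseline is the most common way a "2x faster" claim evaporates in production.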


## Closing.
NVIDIA's public pages describe the product lines; the decision still needs a measured load, not a catalog guess [1][2].

Ask your vendor for one line per configuration: same load, same batch, same context; then compare $/1M useful tokens against your SLO [4].


## Sources.
[1] NVIDIA — L40S GPU (product). https://www.nvidia.com/en-us/data-center/l40s/

[2] NVIDIA — H100 / A100 data center overviews. https://www.nvidia.com/en-us/data-center/

[3] MLCommons — MLPerf Inference. https://mlcommons.org/

[4] Nuqta — internal GPU procurement notes, April 2026.

[5] Nuqta — pilot-to-prod playbooks, April 2026.
