Vision · Infrastructure · April 2026 · 13 min read

Running a language model inside Oman.

Faisal Al-Anqoodi · Founder & CEO

The vision, the engineering, the open-source models we would deploy, and the real cost — for a full year. This is not a sales deck. It is the calculation we put on the table before any client conversation that starts with: why build instead of rent?

For two years, the volume of calls we receive about "hosting a private model in Oman" has doubled year over year. Most begin with one line: "We cannot send our customers' data out of the country." Behind that line is a legal decision, a commercial decision, and sometimes a sovereignty decision. But the practical question that follows is not answered by a sentence: what is the real cost? Which model do we use? How do we serve it? And how do we know the move is worth it?

This piece is a serious attempt at an answer. Not a marketing summary, but an open design: the same calculation we put on the table for any client asking for a quote. We start with the vision, then name the models, then open the cost ledger, and end with the conditions that make the decision sensible — or not.

The vision: why Oman, not just "a local server"?

There is a difference between "we host on a local server" and "we run a language model inside the Sultanate as part of a national stack." The first is a technical choice. The second is a posture. The vision we work on rests on three axes:

  • Legal sovereignty: the Omani Personal Data Protection Law (2022) gives a clear framework. When the model and the data stay under that law, compliance does not need complex contracts with foreign parties.
  • Infrastructure readiness: Omani Tier III/IV data centers are available today (Madayn, Muscat, Salalah), connected to international subsea cables, with relatively cheap power compared to the northern Gulf.
  • Regional position: Oman is commercially neutral, close to Gulf, Indian, and African markets. An inference center in Muscat serves bots in Riyadh, Dubai, Salalah, and Mumbai with less than 60 ms latency.

The operational vision is not "a server in an office." It is an Inference-as-a-Service operated from Oman, on local hardware, with an open-source model fine-tuned to Gulf Arabic context, and a data pipeline that stays fully inside the border.

The proposed open-source models.

At Nuqta, over the past months, we evaluated more than twelve open models on internal Arabic benchmarks (Gulf conversations, translation, contract summarization, entity extraction from banking text). We settled on five serious candidates, with one primary recommendation. The figure below summarizes the matrix:

Fig. 1 — Candidate models. X-axis: parameters (billions). Y-axis: internal Arabic quality (0–100). Bubble size reflects relative run cost.

Why these five.

  • Qwen2.5-72B-Instruct (primary): excellent balance between Arabic quality, a workable license (the 72B ships under Qwen's own license rather than Apache; its 100M-MAU clause does not bind at our scale), and hardware footprint. Runs on 4× H100 in FP8 via vLLM at production throughput (see the sketch after this list).
  • Jais-70B: the only model on the list trained natively in Arabic (G42 + MBZUAI + Cerebras). Best at dialect and classical grammar, slightly weaker on coding and logical reasoning.
  • Qwen2.5-32B: "middle tier" for fast tasks that do not need peak quality (classification, routing, short summaries). Runs on 2× H100, or 4× L40S.
  • Llama-3.3-70B: an excellent alternative when Meta's ecosystem is preferred, with the note that its license restricts use above 700M MAU (irrelevant to us in practice).
  • DeepSeek-V3: MoE model with 671B total parameters, 37B active. Exceptional in reasoning and coding. Needs larger hardware (8× H100) but delivers outstanding quality when the size budget is there.
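To make the primary recommendation concrete, here is a minimal sketch of loading Qwen2.5-72B-Instruct under vLLM across 4× H100, the deployment the first bullet describes. The rough arithmetic: 72B weights in FP8 occupy about 72 GB, comfortably inside 4× 80 GB with room left for KV cache. The flags, context cap, and prompt below are illustrative assumptions, not our production config:

```python
# A minimal sketch, assuming a recent vLLM with FP8 support on H100s.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-72B-Instruct",
    tensor_parallel_size=4,        # shard the model across the 4 H100s
    quantization="fp8",            # FP8 weights: ~72 GB, fits with headroom
    gpu_memory_utilization=0.90,   # leave a margin for activations
    max_model_len=8192,            # cap context to preserve KV-cache room
)

out = llm.generate(
    ["Summarize the attached contract clause in two sentences."],
    SamplingParams(temperature=0.2, max_tokens=200),
)
print(out[0].outputs[0].text)
```

In production the same engine sits behind an OpenAI-compatible HTTP server rather than this offline API; the offline form just keeps the sketch self-contained.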

The serving architecture.

Running a 70B model in production is not "load the model, then serve it." It is an eight-layer system where each layer solves a problem that only surfaces under load. The figure below shows the full architecture as we build it:

Fig. 2 — Full serving architecture. Solid line: live inference path. Dashed line: training/tuning pipeline running on a separate node.

Two techniques deserve a pause: PagedAttention and continuous batching. Together they raise throughput 3–5× compared to naïve serving. Without them, the same hardware delivers 20–30% of its potential. That is the difference between a competitive cost per token and a cost that makes a global API look cheap in comparison.
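To see why continuous batching alone moves the needle, here is a toy accounting simulation, not vLLM's actual scheduler. Static batching holds a batch of slots until its longest request finishes; continuous batching refills a slot the moment a request completes. The workload is synthetic:

```python
import random

random.seed(0)
# Synthetic workload: 64 requests with output lengths of 10-200 tokens.
lengths = [random.randint(10, 200) for _ in range(64)]
BATCH = 8  # decode slots the GPU fills per step (illustrative)

# Static batching: a batch occupies the GPU until its longest request
# finishes; a short request leaves its slot idle for the remainder.
static_steps = sum(max(lengths[i:i + BATCH])
                   for i in range(0, len(lengths), BATCH))

# Continuous batching: a finished request frees its slot immediately,
# so every step decodes a full batch until the queue drains.
continuous_steps = -(-sum(lengths) // BATCH)  # ceil(total / BATCH)

total = sum(lengths)
print(f"static batching:     {total / static_steps:.2f} tokens/step")
print(f"continuous batching: {total / continuous_steps:.2f} tokens/step")
```

The toy yields roughly a 1.8× gap from scheduling alone; PagedAttention contributes the rest of the 3–5× by cutting KV-cache waste so that much larger batches fit in memory.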

Fine-tuning (LoRA) runs on a separate node at only $700–1,200 per month, producing small adapters (200–500 MB instead of 140 GB) that hot-load into the base model without reloading, which means we can ship a monthly improvement with zero service downtime.
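A hedged sketch of what hot-loading looks like with vLLM's LoRA support; the adapter name and path are hypothetical placeholders, and the rank limit is an assumption:

```python
# A sketch, assuming a vLLM build with LoRA serving enabled.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="Qwen/Qwen2.5-72B-Instruct",
    tensor_parallel_size=4,
    enable_lora=True,     # allow adapters on top of the frozen base
    max_lora_rank=64,     # assumed ceiling for our monthly adapters
)

# Hypothetical adapter produced by the monthly tuning run.
banking_adapter = LoRARequest(
    lora_name="gulf-banking-v3",
    lora_int_id=1,
    lora_path="/models/adapters/gulf-banking-v3",
)

out = llm.generate(
    ["Extract the parties and amounts from this transfer notice: ..."],
    SamplingParams(temperature=0.0, max_tokens=120),
    lora_request=banking_adapter,  # base model stays loaded throughout
)
print(out[0].outputs[0].text)
```

The base weights never leave GPU memory; only the few hundred megabytes of adapter move, which is what makes a monthly ship cadence practical.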

The savings do not come in the first hour. They come in the second year. The first year is paid for in sovereignty, speed, and the ability to tune.

The real cost — a full year.

Most quotes that reach the client talk about "server cost." That is only 29% of the equation. The figure below shows the full-year composition for a production service on an 8×H100 node hosted in Muscat, at a scale serving 4–6 mid-size concurrent clients:

Fig. 3 — Year-one composition: $365K total. Operating monthly figure: $30,400.

Real math: what does a million tokens cost us?

At moderate service scale (2.5B tokens/month, i.e. 4–6 mid-size clients), our internal cost per million tokens is $30,400 / 2,500M ≈ $12.2. That is higher than a global API at 2026 prices ($8–10 per million tokens for top-tier models).

At 5B tokens/month (8–12 clients): $6.08/M tokens. Clearly lower.

At 10B tokens/month: $3.04/M tokens. Roughly a third of the global API price.
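The three scale points above come straight out of one division; a few lines make the sensitivity easy to re-run with your own volumes:

```python
# Reproducing the article's unit economics from its own figures.
MONTHLY_OPEX_USD = 30_400  # operating monthly figure from Fig. 3

def cost_per_million(tokens_per_month_billions: float) -> float:
    millions = tokens_per_month_billions * 1_000
    return MONTHLY_OPEX_USD / millions

for scale in (2.5, 5.0, 10.0):
    print(f"{scale:>4} B tokens/month -> "
          f"${cost_per_million(scale):.2f}/M tokens")
# 2.5 B -> $12.16, 5.0 B -> $6.08, 10.0 B -> $3.04
```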

The honest conclusion: private AI in Oman does not win on price in year one for a single client. It wins for several clients, or for one client at large scale, or in year two after hardware is amortized. This, in its simplest form, is a shared-infrastructure model.

What the number hides.

  • Latency: a global API adds 200–500 ms per call from Oman. A local service runs under 50 ms. In a chatbot, that is a felt difference in experience (a measurement sketch follows this list).
  • Tuning on private data: API models cannot be tuned on confidential banking documents. With private AI, tuning is a monthly routine.
  • Price stability: global API prices changed five times in two years. The cost of owned hardware does not fluctuate.
  • No data leakage into training: many API terms implicitly allow inputs to be used for future model improvement. On-prem, the matter is settled.
  • Decision independence: when a foreign party changes its terms, you are a hostage. When you own the stack, you are the decider.
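The latency claim is easy to verify yourself. Here is a measurement sketch against a local OpenAI-compatible endpoint (vLLM exposes one); the host URL is a hypothetical placeholder:

```python
# Measures round-trip latency of a 1-token completion, 20 samples.
import json
import statistics
import time
import urllib.request

URL = "http://inference.local:8000/v1/completions"  # hypothetical host
body = json.dumps({
    "model": "Qwen/Qwen2.5-72B-Instruct",
    "prompt": "ping",
    "max_tokens": 1,
}).encode()

samples = []
for _ in range(20):
    req = urllib.request.Request(
        URL, data=body, headers={"Content-Type": "application/json"})
    t0 = time.perf_counter()
    urllib.request.urlopen(req).read()
    samples.append((time.perf_counter() - t0) * 1000)

print(f"p50 latency: {statistics.median(samples):.0f} ms")
```

Run the same loop against a global API endpoint from Muscat and the 200–500 ms gap shows up in the first few samples.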

A phased rollout.

We never recommend starting with 8×H100 directly. Staging is the difference between success and bleeding cash:

  • Phase 1 (months 1–3): a pilot node of 4×L40S or 2×A100. CAPEX $60–80K, operating $3–4K/month. Enough for a 32B model and 500M tokens/month. Goal: validate scenarios, collect real data, and ship the first LoRA.
  • Phase 2 (months 4–9): upgrade to 8×H100 and launch Qwen2.5-72B into production. Additional CAPEX $240–260K. Onboard 3–5 paying clients.
  • Phase 3 (month 10+): second node for geo-redundancy + elasticity. Target 5B+ tokens/month. This is where real price savings materialize.

When does it become sensible?

We distill the equation into five conditions. If four hold, we advise starting. If all five hold, starting is a must:

  • Expected aggregate consumption exceeding 3B tokens/month within 18 months.
  • At least 2–4 clients with explicit data-sovereignty needs (banks, health, government, law).
  • Willingness to commit to at least a 24-month horizon; below that, the CAPEX is not recovered (see the break-even sketch after this list).
  • A technical team able to operate the stack, or a partnership with an operator (we offer this).
  • A clear commercial vision: private AI is not a goal, it is a channel to a different service — custom bank bots, legal assistants, intelligent archives for government documents.
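The 24-month condition is not arbitrary. A back-of-envelope break-even using this article's own figures lands just under it; the API benchmark price and the token volume are illustrative assumptions:

```python
# Break-even under the article's own numbers; assumptions marked inline.
CAPEX_USD = 320_000        # midpoint of phase 1 + phase 2 ($60-80K + $240-260K)
OPEX_MONTHLY = 30_400      # operating figure from Fig. 3
API_PRICE_PER_M = 9.0      # assumed midpoint of the $8-10/M API range
TOKENS_M_PER_MONTH = 5_000 # 5B tokens/month, the phase-3 target

api_monthly = API_PRICE_PER_M * TOKENS_M_PER_MONTH  # $45,000/month on an API
savings = api_monthly - OPEX_MONTHLY                # $14,600/month saved
print(f"break-even after ~{CAPEX_USD / savings:.0f} months")  # ~22 months
```

At lower volumes the savings shrink and the break-even point slides past the horizon, which is exactly why the consumption condition comes first.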

Closing — the invitation.

Building AI infrastructure inside Oman is not only a technical project. It is a small national choice: that the models serving an Omani bank, the Ministry of Health, or an insurance company remain under Omani law, in an Omani building, operated by an Omani team. The number we presented — $365K year one, $12.2 per million tokens at the start — is not cheap. But it is realistic, improves with scale, and does not require fantasy funding.

At Nuqta, we have already built phase one (4×L40S) as a test platform. We run Qwen2.5-32B on it today for two clients. We plan phase two in the next quarter. If you are reading this and have a use case that meets three of the five conditions above, we are happy to share the detailed calculation for your specific case — for free, in one meeting, with no contract. The numbers are built together, or they are not built.
