Your AI. Your infrastructure.
Run powerful AI models without sending data anywhere. Built for Middle East businesses that take data privacy seriously.
How it all fits together
Three pieces of software that work better together than apart. We've tested this combination extensively—it's what we'd use ourselves.
NVIDIA Dynamo
Orchestration Layer
Disaggregated prefill and decode for reasoning models. KV cache offloading via NIXL. Handles traffic spikes without crashes.
SGLang
Control Layer
Structured generation with JSON schema enforcement. RadixAttention caches system prompts. No more retry loops or formatting errors.
vLLM
Inference Engine
PagedAttention for memory efficiency. Supports Jais, ALLAM, Qwen 2.5, Llama 3.1. 2-4x more users per GPU than standard deployments.
In practice: Dynamo handles the heavy lifting of routing and resource management. SGLang makes sure outputs are formatted correctly (no more parsing errors). vLLM does the actual AI inference efficiently. We've run this setup on everything from a single A100 to clusters with 100+ GPUs—it scales well.
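From the application's side, all of this sits behind a single HTTP endpoint: vLLM exposes an OpenAI-compatible API, so a client just sends a standard chat-completions request. A minimal sketch of building that request is below; the endpoint URL and model name are placeholders for your own deployment, not fixed values.

```python
import json

# Hypothetical on-prem endpoint exposed by the vLLM OpenAI-compatible server.
# Substitute the URL and model id from your own deployment.
ENDPOINT = "http://10.0.0.5:8000/v1/chat/completions"

def build_chat_request(model: str, system_prompt: str, user_msg: str) -> str:
    """Build the JSON body for an OpenAI-style chat-completions call."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_msg},
        ],
        "temperature": 0.2,
    }
    return json.dumps(payload, ensure_ascii=False)

body = build_chat_request(
    model="inceptionai/jais-13b-chat",  # any vLLM-served model id works here
    system_prompt="Answer in formal Arabic.",
    user_msg="لخص هذا العقد",  # "Summarize this contract"
)
```

Because the interface is the standard OpenAI schema, swapping Jais for ALLAM or Qwen 2.5 is a one-line change to the model id; the client code does not move.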
Models that actually understand Arabic
We've tested these extensively in production. Some are regional favorites (Jais, ALLAM), others are just really good at Arabic. All of them run smoothly on our stack.
Jais
UAE (G42, MBZUAI)
The UAE's sovereign model. Trained specifically on Arabic-English business text. We support it natively with full PagedAttention optimization.
ALLAM
KSA (SDAIA)
Saudi Arabia's national AI model. Required for many government contracts in the Kingdom. Runs on our vLLM backend.
Qwen 2.5
Alibaba Cloud
Outstanding Arabic performance. Its large context windows rely on Dynamo's KV cache offloading to fit in GPU memory.
Llama 3.1
Meta
The reliable workhorse. Great multilingual support including Arabic. Works well for general-purpose tasks across industries.
We also support DeepSeek, Mistral, and pretty much anything on Hugging Face. If it runs on vLLM, we can deploy it.
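To see why long context windows force KV cache off the GPU, here is back-of-the-envelope arithmetic for a 72B-class model. The layer and head counts below are illustrative assumptions, not official specs for any particular model.

```python
def kv_cache_bytes(num_tokens: int,
                   num_layers: int = 80,
                   num_kv_heads: int = 8,
                   head_dim: int = 128,
                   bytes_per_value: int = 2) -> int:
    """KV cache size: 2 tensors (K and V) per layer, per token, in fp16."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return num_tokens * per_token

# Assumed 72B-class config: 80 layers, 8 KV heads (GQA), head_dim 128, fp16.
per_token_kib = kv_cache_bytes(1) / 1024                 # 320 KiB per token
full_context_gib = kv_cache_bytes(128 * 1024) / 2**30    # 40 GiB at 128K tokens
```

At roughly 40 GiB of cache for a single 128K-token request, even an 80 GB GPU has little room left for weights and activations. Offloading that cache to host memory via NIXL is what keeps the GPU free for active decode.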
Actually simple to use
SGLang lets you define exactly what format you want back. The model literally can't output invalid JSON—it's constrained at the token level. No more regex parsing or retry loops.
- No parsing headaches: The output always matches your schema. No exceptions.
- System prompts cached: Those long instruction blocks? Cached automatically. Saves compute and time.
- Model agnostic: Works with Jais, ALLAM, Llama, Qwen—whatever you prefer. Same API.
# Deploy a Jais model with SGLang
from sglang import assistant, function, gen, system, user

@function
def extract_visa_application(s, image_input):
    s += system("Extract applicant details from Arabic documents.")
    s += user(image_input)
    s += assistant(
        gen("json_output",
            regex=r'\{"name":"[^"]+","passport":"[A-Z0-9]+","nationality":"[^"]+"\}')
    )

# Guaranteed JSON output. No retry loops.
result = extract_visa_application.run(image_input="scan.jpg")
Real numbers from real deployments
These are actual benchmarks from systems we've deployed. No cherry-picked best-case scenarios—just honest performance data.
Benchmarks cover throughput, memory efficiency, and reasoning models (DeepSeek-R1).
All benchmarks from H100 and A100 deployments. Results vary based on model size, prompt length, and your specific hardware setup. Happy to run tests on your infrastructure if you want.
The tech stack built for sovereign AI
NVIDIA Dynamo for orchestration. SGLang for structured generation. vLLM for inference. All running on your infrastructure.
NVIDIA Dynamo orchestration
Disaggregated prefill and decode. Handles reasoning models like DeepSeek-R1 at scale. 30x better throughput than standard deployments.
Native Arabic support
Optimized for Jais, ALLAM, and Qwen 2.5. Efficient tokenization for Arabic script. Works with Gulf dialects, not just MSA.
SGLang structured generation
Enforces JSON schemas. Caches system prompts with RadixAttention. No more retry loops when outputs need strict formatting.
PDPL & NDMO compliant
Built for Saudi and UAE data residency laws
vLLM inference engine
PagedAttention for 2-4x more throughput per GPU
KV cache offloading
Handle traffic spikes without crashes
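The memory win from PagedAttention comes from storing the KV cache in small fixed-size blocks instead of one contiguous worst-case reservation per user. The toy sketch below illustrates the idea; the block size matches a common vLLM default, but the class names and numbers are ours, not vLLM's internals.

```python
# Toy sketch of PagedAttention's core idea: each sequence keeps a "block
# table" mapping its logical token positions to physical KV blocks, and
# blocks are allocated one at a time as the sequence grows.
BLOCK_SIZE = 16  # tokens per KV block (a common vLLM default)

class BlockAllocator:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))

    def allocate(self) -> int:
        return self.free.pop()

    def release(self, block_ids):
        self.free.extend(block_ids)

class Sequence:
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table: list[int] = []
        self.num_tokens = 0

    def append_token(self):
        # Grab a new block only when the current one is full -- no need to
        # reserve memory for the full context length up front.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

allocator = BlockAllocator(num_blocks=1024)
seq = Sequence(allocator)
for _ in range(40):  # generate 40 tokens
    seq.append_token()
# 40 tokens occupy only 3 blocks; the other 1021 stay free for other users.
```

Because unused blocks are never reserved, many more concurrent sequences fit on the same GPU, which is where the 2-4x users-per-GPU figure comes from.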
Where we've actually deployed
Real projects in KSA and UAE. Government ministries, banks, energy companies. Each one had different constraints—here's how we solved them.
Government & Public Sector
Riyadh Ministry
Challenge
Launch a citizen services app with sensitive National ID data. Can't use foreign APIs due to PDPL Article 29.
Our approach
On-premise deployment in the Ministry's private cloud. Handles traffic spikes during budget announcements. SGLang ensures the bot cites specific regulation articles.
Outcome
Full PDPL compliance. Zero data egress. Handles 10,000 concurrent users.
Banking & Fintech
Dubai Financial Center
Challenge
Extract data from Arabic loan PDFs and feed it to a legacy mainframe that only accepts strict JSON. Can't risk formatting errors.
Our approach
SGLang enforces rigid JSON schema—model can't generate syntax errors. RadixAttention caches the bank's 3,000-token underwriting policy.
Outcome
Zero retry loops. Near-instant response times. Mainframe integration works perfectly.
Energy & Petrochemicals
Edge deployment
Challenge
Analyze sensor logs from offshore drilling rigs. Poor connectivity. Terabytes of data. Can't send to cloud.
Our approach
Compact vLLM server on a single A100 at the edge. Dynamo handles batch processing overnight, separate from real-time safety queries.
Outcome
Predictive maintenance running locally. No cloud dependency. Works offline.
Compliance for KSA & UAE
Built for the strictest data residency laws in the Gulf. PDPL Article 29 compliant. NDMO-approved architectures.
PDPL Compliant
Saudi data protection law
UAE Data Decrees
Federal & DIFC requirements
NDMO Standards
KSA data classification
NCA Framework
Cybersecurity controls
Technology partners