CogOS
A deterministic uptime loop for production AI. Same call → same bytes out. Same call next month → same bytes out. Same call under load → no rate limit, no throttle, no provider drift. The mechanism that makes AI-backed features safe to ship.
🟢 Live now: this gateway is serving real traffic. Hit /health for the heartbeat. Every claim below is verifiable in the public bench — open-source, MIT, run it yourself with any provider's credentials.
The mechanism
Deterministic
Every call is a closed function: input → bytes out. Schema-locked at the decoder level (the model physically can't emit non-conforming JSON). Sampling settings pinned, temperature 0 by default. Run the same prompt 20 times, get 20 identical responses. Verifiable via the public bench — we re-run it against our live inference path on a published cadence so determinism is something you can audit, not something we ask you to take on faith.
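The "20 identical responses" check can be expressed as a tiny audit harness. This is a minimal sketch of the kind of comparison the bench performs, not the bench itself: `fixed_response` is a stand-in for a real temperature-0 request to the gateway, and the scoring function works on any callable that returns response bytes.

```python
import hashlib

def determinism_score(call, n=20):
    """Invoke `call` n times and return the fraction of responses whose
    bytes match the first one. 1.0 means fully deterministic output."""
    digests = [hashlib.sha256(call()).hexdigest() for _ in range(n)]
    matches = sum(1 for d in digests if d == digests[0])
    return matches / n

# Stand-in for a live request; a real check would POST the same prompt
# at temperature 0 and hash the raw response body each time.
def fixed_response():
    return b'{"country": "France", "capital": "Paris"}'

print(determinism_score(fixed_response))  # 1.0 when every call is byte-identical
```

A drifting provider shows up immediately as a score below 1.0, which is exactly the number the bench publishes.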
Uptime
Local inference, no third-party rate limit, no provider snapshot rotation, no ToS surface that can change under you. Your plan's request budget is yours — burst as hard as you need within it. The loop stays up because there's no remote dependency to fail.
Loop
Request → constrained decode → schema-validated response → provenance event → metered usage. Every step deterministic, every step observable, every step replayable from the hash-chained event log. The substrate isn't an LLM endpoint; it's a loop you can build production code on.
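The hash-chained event log works like any append-only chain: each entry's hash covers the previous entry's hash, so editing one event invalidates every link after it. The field names below are illustrative assumptions, not the gateway's actual event schema.

```python
import hashlib
import json

def append_event(log, event):
    """Append an event whose hash covers the previous entry's hash,
    so tampering with any earlier event breaks all later links."""
    prev = log[-1]["hash"] if log else "0" * 64
    body = json.dumps(event, sort_keys=True)
    digest = hashlib.sha256((prev + body).encode()).hexdigest()
    log.append({"event": event, "prev": prev, "hash": digest})
    return log

def verify_chain(log):
    """Recompute every link; any edit anywhere returns False."""
    prev = "0" * 64
    for entry in log:
        body = json.dumps(entry["event"], sort_keys=True)
        expected = hashlib.sha256((prev + body).encode()).hexdigest()
        if entry["prev"] != prev or entry["hash"] != expected:
            return False
        prev = entry["hash"]
    return True

log = []
append_event(log, {"step": "request", "model": "cogos-tier-b"})
append_event(log, {"step": "decode", "schema": "answer"})
print(verify_chain(log))  # True; flip any byte upstream and this turns False
```

Replayability falls out of the same structure: the chain fixes the order and content of every step, so a replay either reproduces the same hashes or exposes where it diverged.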
What breaks without it
| What breaks in production today | What CogOS guarantees |
|---|---|
| The model returned malformed JSON in prod. Worked fine in dev. You're debugging the LLM, not your code. | Schema-locked decoding at the token level. Pass a JSON Schema, the decoder is physically constrained. Non-conforming output is impossible — not retried, prevented. |
| Your code stopped working two weeks ago. No one touched it. The provider rotated the model behind the same name. | The public bench runs against our live path on a published cadence. Drift shows up in the CSV the same day. Customers see the same audit we see. No "trust us" — the receipts are open. |
| 3 requests per minute on the starter tier. Your batch job runs at 3am. You wake to angry customers at 7. | 100,000 requests/month, no per-minute throttle. Burst as hard as your business needs. No tier ladder to climb before you can scale. |
| "Temperature zero" is best-effort. Same input, different bytes, no reproducible test runs. | Byte-identical outputs at temperature 0. Verifiable — 20 identical calls return 1 unique output. Determinism = 1.0000. Provable. |
| Compliance asks where the inference happens. You don't know exactly. Their counsel doesn't sign off. | Local inference, no data egress to third-party clouds. Your provenance log is hash-chained, queryable, auditable. |
How the loop is built
A runtime, not a model
Open-weight models (Qwen, Llama, Mistral) are commodities. CogOS is the runtime layer above them — grammar-constrained decoders, tier routing per task shape, provenance events on every call, and an open determinism bench that audits the inference path on a published cadence. The model is the CPU. CogOS is the OS that makes it operable. The loop is what you ship against.
Drop-in for your existing chat-completions client
The API speaks the same POST /v1/chat/completions shape your current SDK already sends. Point your client at https://cogos.5ceos.com/v1 and try it. If you don't like it, change it back in ten seconds.
Tier-routed by task, not by guess
Use model: "cogos-tier-b" for classification-shaped work, "cogos-tier-a" for narrative. The router picks the right size of open-weight model per shape — sufficient is sufficient, the GreenOps doctrine.
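A client-side sketch of the same doctrine, assuming nothing beyond the two model names above: pick the tier from the task shape when building the standard chat-completions payload. The `build_request` helper and the `shape` parameter are conveniences invented for this example, not part of the API.

```python
def build_request(content, shape="classification"):
    """Build a chat-completions payload, choosing the tier by task shape.
    Classification-shaped work goes to the smaller Tier B model;
    narrative work goes to Tier A. Sufficient is sufficient."""
    model = "cogos-tier-b" if shape == "classification" else "cogos-tier-a"
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "temperature": 0,  # pinned so replays are byte-identical
    }

req = build_request("Label this ticket: refund request")
print(req["model"])  # cogos-tier-b
```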
Pricing
Operator Starter
$29/mo
100,000 requests/mo · Tier B · schema-locked decoding · deterministic at temp=0
100,000 schema-locked requests per month on Tier-B (classification-shaped workloads).
Operator Pro
$99/mo
500,000 requests/mo · Tier A + Tier B · schema-locked decoding · deterministic at temp=0
500K requests/month, Tier-A narrative + Tier-B classification.
Operator Team
$299/mo
2,000,000 requests/mo · Tier A + Tier B · schema-locked decoding · deterministic at temp=0
Small startup, multiple engineers. 2M requests/month, both tiers, 99.0% SLA, multi-key rotation.
Compliance
$1,500/mo
5,000,000 requests/mo · Tier A + Tier B · schema-locked decoding · deterministic at temp=0
Regulated industries. 5M requests/month, both tiers, 99.5% SLA, SOC 2 Type II, DPA + BAA, phone support.
Enterprise
$100,000/yr
50,000,000 requests/mo · dedicated GPU container · single-tenant · 99.9% SLA · SOC 2 Type II · MSA + DPA + BAA · quarterly business review · 12-month minimum
Real deals close at $100K–$250K depending on add-ons (extra GPUs, 99.95% SLA, on-prem deployment, dedicated CSM).
Talk to sales →
Try it in 30 seconds (after signup)
```shell
curl https://cogos.5ceos.com/v1/chat/completions \
  -H "Authorization: Bearer sk-cogos-..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "cogos-tier-b",
    "messages": [{"role": "user", "content": "Capital of France?"}],
    "response_format": {
      "type": "json_schema",
      "json_schema": {
        "name": "answer",
        "strict": true,
        "schema": {
          "type": "object",
          "required": ["country", "capital"],
          "properties": {
            "country": {"type": "string"},
            "capital": {"type": "string"}
          }
        }
      }
    }
  }'
```
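On the consuming side, schema-locked decoding means the client can parse `content` directly, with no retry-and-repair loop around malformed JSON. The response body below is illustrative (hand-written in the standard chat-completions shape, not captured from the live gateway):

```python
import json

# Illustrative chat-completions response; with strict schema-locked
# decoding, the embedded `content` string is guaranteed to parse and
# to carry exactly the required fields.
body = """{
  "choices": [
    {"message": {"role": "assistant",
                 "content": "{\\"country\\": \\"France\\", \\"capital\\": \\"Paris\\"}"}}
  ]
}"""

answer = json.loads(json.loads(body)["choices"][0]["message"]["content"])
print(answer["capital"])  # Paris
```

No `try/except` around the inner parse, no "the model added prose before the JSON" branch: that entire class of defensive code goes away.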
FAQ
Why should I trust you on determinism?
Don't. Clone the bench and run it. MIT-licensed, open methodology, hand-coded rubrics — every claim on this page becomes a CSV you can publish or attack.
What models?
Qwen 2.5 (3B and 7B) today. Open-weight, content-addressed. New tiers (Llama 3.3, Mistral) land as discrete versioned upgrades — no silent swaps. The bench is re-run against the live inference path so any drift is published, not hidden.
What happens at your monthly quota?
A clean 429 with X-Cogos-Quota-Reset pointing at the start of the next billing cycle. Upgrade to a higher-quota plan or wait for the next cycle. Plans aren't lottery tickets — you know what you're getting.
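Because the reset moment is in a header, quota handling collapses to one computation. A minimal sketch, assuming X-Cogos-Quota-Reset carries a Unix timestamp (an assumption — check the format your responses actually return):

```python
import time

def seconds_until_reset(headers, now=None):
    """Given a 429's headers, return how long to wait before retrying.
    Assumes X-Cogos-Quota-Reset is a Unix timestamp; a missing header
    yields 0.0 (retry immediately)."""
    now = time.time() if now is None else now
    reset = float(headers.get("X-Cogos-Quota-Reset", now))
    return max(0.0, reset - now)

wait = seconds_until_reset({"X-Cogos-Quota-Reset": "1700003600"}, now=1700000000)
print(wait)  # 3600.0 — sleep an hour, or surface an upgrade prompt instead
```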