Single source of truth for the fleet inference stack: claude-relay tiers, inference_proxy chokepoint, critic route law, OpenAI-API kill-switch. Stamped 2026-06-05 · canon:rule-openai-api-off-by-default-2026-06-05
The critic is Sabour's ChatGPT-Pro SUBSCRIPTION. It has no quota. Nothing to "top up". Any 429/"quota exhausted" symptom = you are off-route (canon:rule-critic-has-no-quota-2026-06-05).
NEVER send gpt-5.5/gpt-* model names — they return 403 openai_api_disabled. The model name chatgpt-pro IS the routing sentinel; the bridge pins gpt-5.5 server-side. Direct 10.99.0.2:4242 = health checks only. Audit history showed zero proxied critic calls 17 May → 5 Jun because canon taught the wrong primary; fixed 2026-06-05.
3 · claude-relay :8896 — tier table (live state 2026-06-05)
Tier | State | Target | Notes
0-MiniMax-Subscription | ON | api.minimax.io/anthropic | Default first hop for fleet (cascade-minimax-first)
Tool-semantic turns diverted off the bridge to MiniMax/Sonnet
Cost guards: Input/output token ceilings → 413 before any spend
Auth: Bearer <vault inference-proxy-token>; public paths get it injected by nginx after fleet x-api-key gate
Service: inference-proxy.service · Code: /home/ubuntu/api/inference_proxy.py · Health: GET :8899/health
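The cost-guard row describes a pre-flight check: a request over either token ceiling is answered with 413 before anything is sent upstream. A minimal Python sketch of that ordering follows. The ceiling values, the `GuardResult` type, and the `check_cost_guards` name are all hypothetical; the actual limits and code in inference_proxy.py are not given in this sheet.

```python
from dataclasses import dataclass

# Hypothetical ceilings; the real limits are not stated in this sheet.
MAX_INPUT_TOKENS = 200_000
MAX_OUTPUT_TOKENS = 16_000


@dataclass
class GuardResult:
    status: int   # HTTP status the proxy would answer with
    reason: str   # short human-readable explanation


def check_cost_guards(input_tokens: int, requested_output_tokens: int) -> GuardResult:
    """Reject oversized requests before any upstream call is made."""
    if input_tokens > MAX_INPUT_TOKENS:
        return GuardResult(413, "input token ceiling exceeded")
    if requested_output_tokens > MAX_OUTPUT_TOKENS:
        return GuardResult(413, "output token ceiling exceeded")
    return GuardResult(200, "ok")
```

Because the check runs on token counts alone, it can sit in front of every tier without knowing which upstream will serve the turn.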
5 · V-Vision — verified cascade
Image/document turns: MiniMax-M3 (free under subscription, multimodal-confirmed) → Claude OAuth Haiku → OpenAI gpt-4o (last resort, now flag-gated §6). Vision does not use the PC bridge. Tier stays ENABLED — it works fully on the first two legs. canon:rule-v-vision-mm3-cascade-2026-06-03
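The cascade above can be sketched as an ordered fallback in which the OpenAI leg is only appended when the §6 flag file exists. This is an illustration under stated assumptions, not the relay's code: the `Leg` callable signature, the `run_vision_cascade` name, and the `openai_enabled` override are hypothetical; only the leg order and the flag path come from this sheet.

```python
from pathlib import Path
from typing import Callable, List, Optional, Tuple

# Flag path is taken from §6 of this sheet.
OPENAI_FLAG = Path("/home/ubuntu/api/openai_api_enabled")

# Hypothetical leg signature: raw image/document bytes in, model text out.
Leg = Callable[[bytes], str]


def run_vision_cascade(
    payload: bytes,
    minimax_m3: Leg,
    claude_haiku: Leg,
    openai_gpt4o: Leg,
    openai_enabled: Optional[bool] = None,
) -> Tuple[str, str]:
    """Try each leg in order; return (leg name, result) from the first success."""
    if openai_enabled is None:
        openai_enabled = OPENAI_FLAG.exists()

    legs: List[Tuple[str, Leg]] = [
        ("MiniMax-M3", minimax_m3),
        ("Claude OAuth Haiku", claude_haiku),
    ]
    if openai_enabled:
        # Last resort, and only present when the flag file exists.
        legs.append(("OpenAI gpt-4o", openai_gpt4o))

    errors: List[str] = []
    for name, leg in legs:
        try:
            return name, leg(payload)
        except Exception as exc:  # any leg failure falls through to the next leg
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all vision legs failed: " + "; ".join(errors))
```

With the flag absent the OpenAI leg is never called, which matches the statement that the tier works fully on the first two legs.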
6 · OpenAI API kill-switch
The OpenAI API is OFF by default. The ONLY switch is the presence of /home/ubuntu/api/openai_api_enabled — created by Sabour, nobody else.
Enforced in three places: (1) inference_proxy returns 403 openai_api_disabled for any model name matching ^gpt-; (2) relay Tier 10 is disabled in config; (3) openaiDirectCall() and the V-Vision OpenAI leg are flag-gated in code.
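The proxy-side gate in (1) can be sketched in a few lines of Python. The flag path and the 403 openai_api_disabled response come from this section; the `gate_model` name and its return shape are hypothetical, and the model pattern is read here as the prefix `gpt-`, which is an assumption about how the proxy matches.

```python
import re
from pathlib import Path
from typing import Optional, Tuple

# Flag path from this section: its presence is the only switch.
FLAG_FILE = Path("/home/ubuntu/api/openai_api_enabled")

# Assumed reading of the gate pattern: any model name starting with "gpt-".
GPT_MODEL = re.compile(r"^gpt-")


def gate_model(model: str, flag_file: Path = FLAG_FILE) -> Tuple[int, Optional[str]]:
    """Return (status, error_code); (200, None) means the request may proceed."""
    if GPT_MODEL.match(model) and not flag_file.exists():
        return 403, "openai_api_disabled"
    return 200, None
```

The sentinel name chatgpt-pro does not start with `gpt-`, so it passes the gate regardless of the flag.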
Symptom dictionary: 429 insufficient_quota on a gpt-* model = you hit the dead key on a forbidden route. 403 openai_api_disabled = the gate working as designed — switch to chatgpt-pro.
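The symptom dictionary can also be written as a small lookup that an agent runs before escalating. This is a sketch only: the `diagnose` function and its wording are hypothetical, while the two status/error pairs and their meanings are the ones listed above.

```python
def diagnose(status: int, error_code: str, model: str) -> str:
    """Map a failed critic call to the cause given in the symptom dictionary."""
    if status == 403 and error_code == "openai_api_disabled":
        return "gate working as designed: switch the model name to chatgpt-pro"
    if status == 429 and error_code == "insufficient_quota" and model.startswith("gpt-"):
        return "off-route: a gpt-* name reached the dead OpenAI key; use chatgpt-pro"
    # Anything else is outside the two documented symptoms.
    return "not in the symptom dictionary: investigate separately"
```

Both documented symptoms resolve to the same action, changing the model name, and neither involves a quota top-up.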
APOLLO held DCOA for hours claiming "relay's gpt-5 quota is exhausted… genuine infra failure… awaiting Sabour top-up". Audit truth: at 17:37 it probed gpt-4o, gpt-4, gpt-4-turbo, gpt-4o-mini, gpt-3.5-turbo through the proxy; each matched ^gpt- → api.openai.com dead key → 429. Wrong model name, not quota. The bridge was healthy the entire time (verified from APOLLO's own seat: 4.4s round-trip). Fixes: this fact sheet, the 403 gate, Tier 10 disable, canon re-stamp, route correction posted in-room (msgs 6947/6951).