GPU Platform — Inbound / Outbound Network Flows

How GPU providers, workers, the gateway and Redis talk to each other — and exactly what changes when we deploy into CAE / CCE Kubernetes, where Redis is reachable only via in-cluster Service DNS (no public LoadBalancer / ELB).

Source: Scicom-AI-Enterprise-Organization/GPUPlatform @ main · code-verified architecture audit · for team discussion
Decision being visualised: in CAE, run the GPU platform on physical machines only. The VM provider stays — its workers reach in-cluster Redis over the Tailscale tailnet (CCE ⇄ VM VPN). The cloud GPU providers — RunPod & Prime Intellect — are disabled, because their workers run outside the cluster and would need Redis exposed publicly (or a fragile per-pod reverse tunnel; Prime Intellect has no tunnel path at all).

CAE — physical machines only, Redis stays in-cluster

Thick green = direct Redis (the hot path: job queue, token stream, result, cancel). Blue = gateway HTTP (register / heartbeat / logs / metrics). Teal = carried over the Tailscale tailnet. Dashed red = blocked / disabled.

flowchart LR
  CL["Clients / SDK<br/>OpenAI-compatible API"]
  VM["Physical VM · bare-metal GPU<br/>worker-agent + vLLM"]
  subgraph K8S["CAE · CCE Kubernetes (restricted namespace)"]
    direction TB
    GW["API Gateway<br/>:8080"]
    RD[("Redis<br/>ClusterIP · headless<br/>public NLB: OFF")]
    PG[("Postgres")]
  end
  subgraph DIS["Disabled in CAE — external cloud GPU"]
    direction TB
    RP["RunPod provider"]
    PI["Prime Intellect provider"]
  end
  CL -->|"HTTPS · ingress"| GW
  GW <-->|"direct redis · queue / result / pub-sub / state"| RD
  GW ---|"async"| PG
  VM -->|"HTTP · register / heartbeat / logs / metrics"| GW
  VM ==>|"direct redis over Tailscale tailnet:<br/>BRPOP queue · PUBLISH stream · SET result · EXISTS cancel"| RD
  RP -.->|"needs public Redis or reverse-tunnel"| RD
  PI -.->|"no tunnel path — needs public Redis"| RD
  linkStyle 0 stroke:#3b82f6,stroke-width:2.5px;
  linkStyle 1 stroke:#10b981,stroke-width:4px;
  linkStyle 2 stroke:#3a4a5e,stroke-width:1.5px,stroke-dasharray:4 4;
  linkStyle 3 stroke:#3b82f6,stroke-width:2.5px;
  linkStyle 4 stroke:#14b8a6,stroke-width:4px;
  linkStyle 5 stroke:#ef4444,stroke-width:2px,stroke-dasharray:6 5;
  linkStyle 6 stroke:#ef4444,stroke-width:2px,stroke-dasharray:6 5;
  classDef k8s fill:#0c1b16,stroke:#10b981,color:#d1fae5;
  classDef ext fill:#101826,stroke:#3b82f6,color:#dbeafe;
  classDef dis fill:#1c0f12,stroke:#ef4444,color:#fecaca,stroke-dasharray:5 4;
  class GW,RD,PG k8s;
  class CL,VM ext;
  class RP,PI dis;
Legend: Direct Redis (hot path) · Direct Redis over Tailscale tailnet · Gateway HTTP · Blocked / disabled in CAE
In CAE the Helm chart defaults are already correct: with redis.publicLoadBalancer.enabled=false and workerRedisUrl="", the gateway hands workers the in-cluster Redis Service DNS name. Only the set of enabled providers needs locking down.
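
The rule above can be sketched as a small pre-deploy check. This is an illustration under stated assumptions, not the gateway's actual code: the two value names redis.publicLoadBalancer.enabled and workerRedisUrl come from the chart as described here, while the top-level placement of workerRedisUrl, the "providers" key, the provider identifiers and the in-cluster DNS name are all hypothetical.

```python
from typing import Any

# Hypothetical in-cluster Service DNS name; the real one depends on release and namespace.
IN_CLUSTER_REDIS_URL = "redis://redis.gpu-platform.svc.cluster.local:6379"

# Providers kept in CAE per this document; the identifier strings are assumed.
CAE_ALLOWED_PROVIDERS = {"vm", "fake"}


def resolve_worker_redis_url(values: dict[str, Any]) -> str:
    """Return the redis_url a worker would receive at registration.

    An empty workerRedisUrl means "no override", so the in-cluster DNS name is used.
    """
    override = values.get("workerRedisUrl", "")
    return override or IN_CLUSTER_REDIS_URL


def check_cae_values(values: dict[str, Any]) -> list[str]:
    """Collect every reason a values dict would violate the CAE constraints."""
    problems: list[str] = []

    lb = values.get("redis", {}).get("publicLoadBalancer", {})
    if lb.get("enabled", False):
        problems.append("redis.publicLoadBalancer.enabled must be false")

    if values.get("workerRedisUrl", ""):
        problems.append("workerRedisUrl must be empty so workers get the in-cluster DNS")

    # "providers" is a hypothetical key standing in for however the chart lists providers.
    extra = set(values.get("providers", [])) - CAE_ALLOWED_PROVIDERS
    if extra:
        problems.append(f"providers not allowed in CAE: {sorted(extra)}")

    return problems
```

A values dict with the public load balancer off, an empty workerRedisUrl and only the vm and fake providers yields an empty problem list; anything else returns one message per violation.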

Job lifecycle in CAE — who calls whom

Inbound to the cluster vs. outbound from it, over time. The VM → Redis legs ride the tailnet; everything to the gateway is HTTP.

sequenceDiagram
  autonumber
  participant C as Client
  participant G as Gateway (k8s)
  participant R as Redis (ClusterIP)
  participant W as worker-agent (VM · tailnet)
  W->>G: register (HTTP) — request redis_url
  G-->>W: redis_url = in-cluster Service DNS
  loop every 5s
    W->>G: heartbeat / logs / metrics (HTTP)
  end
  C->>G: POST /v1/chat/completions (HTTPS)
  G->>R: LPUSH queue:{app_id}
  W->>R: BRPOP queue:{app_id}  (over tailnet)
  W->>R: PUBLISH stream:{id} · SET result:{id}
  R-->>G: SUBSCRIBE stream:{id} · GET result:{id}
  G-->>C: SSE / JSON response
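
The Redis legs of the sequence above can be walked through with an in-memory stand-in, which may help when reasoning about ordering. This is a sketch only: FakeRedis is a hypothetical toy that mimics just the commands named in the diagram, the JSON job payload shape is assumed, and the "inference" step simply upper-cases the prompt.

```python
import json
from collections import defaultdict, deque


class FakeRedis:
    """Toy in-memory stand-in for the few Redis commands on the hot path."""

    def __init__(self):
        self.lists = defaultdict(deque)
        self.keys = {}
        self.published = defaultdict(list)

    def lpush(self, key, value):
        self.lists[key].appendleft(value)

    def brpop(self, key):
        # Real BRPOP blocks until an item arrives; this sketch returns None instead.
        return self.lists[key].pop() if self.lists[key] else None

    def publish(self, channel, message):
        self.published[channel].append(message)

    def set(self, key, value):
        self.keys[key] = value

    def get(self, key):
        return self.keys.get(key)

    def exists(self, key):
        return key in self.keys


def gateway_enqueue(r, app_id, job_id, prompt):
    """Gateway side: LPUSH the job onto queue:{app_id}."""
    r.lpush(f"queue:{app_id}", json.dumps({"id": job_id, "prompt": prompt}))


def worker_step(r, app_id):
    """Worker side: pop one job, honour cancel, stream tokens, write the result."""
    raw = r.brpop(f"queue:{app_id}")
    if raw is None:
        return None
    job = json.loads(raw)
    if r.exists(f"cancel:{job['id']}"):
        return job["id"]  # cancelled before any work was done
    for token in job["prompt"].split():
        r.publish(f"stream:{job['id']}", token)
    r.set(f"result:{job['id']}", job["prompt"].upper())
    return job["id"]
```

Because the gateway pushes on the left and the worker pops from the right, jobs on one queue are served first-in, first-out.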
      

Today (prod) — cloud GPU workers reach Redis from outside the cluster

Cloud workers run outside the cluster, so they reach Redis either through a per-pod reverse SSH tunnel (RunPod, purple) or through a public AWS NLB (Prime Intellect / non-tunnel, orange). The NLB route is exactly the public exposure CAE forbids; the tunnel route avoids it but is fragile.

flowchart LR
  CL["Clients / SDK"]
  RPP["RunPod pod<br/>worker-agent + vLLM<br/>(external cloud GPU)"]
  PIP["Prime Intellect pod<br/>(external cloud GPU)"]
  VM["Physical VM<br/>(tailnet)"]
  subgraph K8S["Kubernetes (prod)"]
    direction TB
    GW["API Gateway :8080"]
    RD[("Redis")]
    NLB(["Public AWS NLB<br/>redis.publicLoadBalancer"])
  end
  CL -->|"HTTPS"| GW
  GW <-->|"direct redis"| RD
  RD ---|"exposes :6379"| NLB
  RPP -->|"HTTP register / heartbeat / logs / metrics"| GW
  RPP ==>|"reverse SSH tunnel → pod loopback → Redis"| RD
  PIP -->|"HTTP"| GW
  PIP ==>|"direct redis via public NLB"| NLB
  VM ==>|"direct redis via tailnet"| RD
  linkStyle 0 stroke:#3b82f6,stroke-width:2.5px;
  linkStyle 1 stroke:#10b981,stroke-width:4px;
  linkStyle 2 stroke:#f59e0b,stroke-width:2.5px;
  linkStyle 3 stroke:#3b82f6,stroke-width:2.5px;
  linkStyle 4 stroke:#a855f7,stroke-width:4px;
  linkStyle 5 stroke:#3b82f6,stroke-width:2.5px;
  linkStyle 6 stroke:#f59e0b,stroke-width:4px;
  linkStyle 7 stroke:#14b8a6,stroke-width:4px;
  classDef k8s fill:#0c1b16,stroke:#10b981,color:#d1fae5;
  classDef ext fill:#101826,stroke:#3b82f6,color:#dbeafe;
  classDef pub fill:#241a07,stroke:#f59e0b,color:#fde68a;
  class GW,RD k8s;
  class NLB pub;
  class CL,RPP,PIP,VM ext;
Legend: green = direct Redis (in-cluster); purple = reverse SSH tunnel (RunPod); teal = tailnet (VM); orange = public NLB exposure; blue = gateway HTTP.

Per-provider verdict for CAE (svc-only Redis)

VM / bare-metal: KEEP

Workers reach in-cluster Redis over the Tailscale tailnet, with no public exposure. This is the CAE path.

fake / in-cluster: KEEP

Runs inside the gateway pod, so the ClusterIP Redis resolves directly. Good for dev and smoke tests.

RunPod: DISABLE

External pods. They can reach a private Redis only through a per-pod reverse SSH tunnel, which is fragile: the ephemeral keys are lost on gateway restart, and it supports a single gateway only.

Prime Intellect: DISABLE

External pods with no reverse-tunnel code path at all. They cannot reach a private Redis, so enabling them would force a public NLB.
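
The four verdicts reduce to one question: how does the provider's worker reach a Service-only Redis? A minimal sketch of that decision, assuming hypothetical provider identifiers and a made-up Reach enum (neither is taken from the repo):

```python
from dataclasses import dataclass
from enum import Enum


class Reach(Enum):
    """How a provider's workers can reach a cluster-internal Redis."""
    IN_CLUSTER = "in-cluster"
    TAILNET = "tailnet"
    REVERSE_TUNNEL = "reverse SSH tunnel"
    NONE = "none"


@dataclass(frozen=True)
class Provider:
    name: str
    reach: Reach


# Provider names are illustrative labels for the four cases discussed above.
PROVIDERS = [
    Provider("vm", Reach.TAILNET),
    Provider("fake", Reach.IN_CLUSTER),
    Provider("runpod", Reach.REVERSE_TUNNEL),
    Provider("prime-intellect", Reach.NONE),
]


def cae_verdict(provider: Provider) -> str:
    """KEEP only when Redis is reachable without public exposure or a fragile tunnel."""
    if provider.reach in (Reach.IN_CLUSTER, Reach.TAILNET):
        return "KEEP"
    # The reverse tunnel technically works but is treated as too fragile for CAE;
    # no path at all would force a public NLB.
    return "DISABLE"
```

Treating the reverse tunnel as DISABLE rather than as a third state mirrors the decision in this document; a deployment that tolerated the tunnel's limits could return a separate verdict for it.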

Inbound → Redis path inventory

Every distinct path that touches Redis, and whether it survives a cluster-internal-only Redis. The four direct-redis worker legs are the only ones that break — and only when the worker runs outside the cluster without a tunnel/tailnet.

| Path                                         | Origin             | Transport    | Needs public Redis? | Survives svc-only? |
|----------------------------------------------|--------------------|--------------|---------------------|--------------------|
| Job dispatch — BRPOP queue:{app}             | worker (out)       | direct-redis | Yes*                | tunnel / tailnet   |
| Streaming — PUBLISH stream:{id}              | worker (out)       | direct-redis | Yes*                | tunnel / tailnet   |
| Result write — SET result:{id}               | worker (out)       | direct-redis | Yes*                | tunnel / tailnet   |
| Cancel poll — EXISTS cancel:{id}             | worker (out)       | direct-redis | Yes*                | tunnel / tailnet   |
| Worker register / heartbeat / logs / metrics | worker (out)       | gateway-HTTP | No                  | Yes                |
| Client inference (REST / SSE)                | client (out)       | gateway-HTTP | No                  | Yes                |
| Gateway registry / autoscaler / reconciler   | gateway (in)       | direct-redis | No                  | Yes                |
| Bench & training log streams                 | gateway (in)       | direct-redis | No                  | Yes                |
| Redis public LoadBalancer (AWS NLB)          | chart → internet   | direct-redis | Yes                 | No — keep OFF      |
| Redis internal headless Service              | chart → in-cluster | direct-redis | No                  | Yes                |
* "Yes*" applies only when the worker runs outside the cluster. With the VM tailnet (CAE) these legs ride the VPN and never need a public Redis. With cloud pods (RunPod / Prime Intellect) they do, which is why those providers are disabled.
Generated for CAE deployment planning. Companion write-up: docs/CAE_REDIS_EXPOSURE.md in the repo.