GPU Platform — Inbound / Outbound Network Flows

How GPU providers, workers, the gateway and Redis talk to each other — and exactly what changes when we deploy into CAE / CCE Kubernetes, where Redis is reachable only via in-cluster Service DNS (no public LoadBalancer / ELB).

Source: Scicom-AI-Enterprise-Organization/GPUPlatform @ main · code-verified architecture audit · for team discussion

Decision being visualised: in CAE, run the GPU platform on physical machines only. The VM provider stays — its workers reach in-cluster Redis over the Tailscale tailnet (CCE ⇄ VM VPN). The cloud GPU providers — RunPod & Prime Intellect — are disabled, because their workers run outside the cluster and would need Redis exposed publicly (or a fragile per-pod reverse tunnel; Prime Intellect has no tunnel path at all).

CAE — physical machines only, Redis stays in-cluster

Thick green = direct Redis (the hot path: job queue, token stream, result, cancel). Blue = gateway HTTP (register / heartbeat / logs / metrics). Teal = carried over the Tailscale tailnet. Dashed red = blocked / disabled.

flowchart LR
  CL["Clients / SDK
OpenAI-compatible API"]
  VM["Physical VM · bare-metal GPU
worker-agent + vLLM"]
  subgraph K8S["CAE · CCE Kubernetes (restricted namespace)"]
    direction TB
    GW["API Gateway
:8080"]
    RD[("Redis
ClusterIP · headless
public NLB: OFF")]
    PG[("Postgres")]
  end
  subgraph DIS["Disabled in CAE — external cloud GPU"]
    direction TB
    RP["RunPod provider"]
    PI["Prime Intellect provider"]
  end
  CL -->|"HTTPS · ingress"| GW
  GW <-->|"direct redis · queue / result / pub-sub / state"| RD
  GW ---|"async"| PG
  VM -->|"HTTP · register / heartbeat / logs / metrics"| GW
  VM ==>|"direct redis over Tailscale tailnet:
BRPOP queue · PUBLISH stream · SET result · EXISTS cancel"| RD
  RP -.->|"needs public Redis or reverse-tunnel"| RD
  PI -.->|"no tunnel path — needs public Redis"| RD
  linkStyle 0 stroke:#3b82f6,stroke-width:2.5px;
  linkStyle 1 stroke:#10b981,stroke-width:4px;
  linkStyle 2 stroke:#3a4a5e,stroke-width:1.5px,stroke-dasharray:4 4;
  linkStyle 3 stroke:#3b82f6,stroke-width:2.5px;
  linkStyle 4 stroke:#14b8a6,stroke-width:4px;
  linkStyle 5 stroke:#ef4444,stroke-width:2px,stroke-dasharray:6 5;
  linkStyle 6 stroke:#ef4444,stroke-width:2px,stroke-dasharray:6 5;
  classDef k8s fill:#0c1b16,stroke:#10b981,color:#d1fae5;
  classDef ext fill:#101826,stroke:#3b82f6,color:#dbeafe;
  classDef dis fill:#1c0f12,stroke:#ef4444,color:#fecaca,stroke-dasharray:5 4;
  class GW,RD,PG k8s;
  class CL,VM ext;
  class RP,PI dis;

Legend: green = direct Redis (hot path); teal = direct Redis over Tailscale tailnet; blue = gateway HTTP; dashed red = blocked / disabled in CAE.

In CAE the Helm chart is already correct by default: redis.publicLoadBalancer.enabled=false and workerRedisUrl="" → the gateway hands workers the in-cluster Redis DNS. Only the provider set needs locking down.
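
A minimal sketch of a pre-deploy check for the two chart defaults named above, plus the provider lock-down. It assumes the Helm values are already loaded into a plain dict. The `providers.enabled` key, the provider identifiers, the top-level position of `workerRedisUrl`, and the function name `check_cae_values` are hypothetical and are not taken from the chart.

```python
# Hypothetical pre-deploy check. Only redis.publicLoadBalancer.enabled and
# workerRedisUrl come from the document; every other name is an assumption.
ALLOWED_CAE_PROVIDERS = {"vm", "fake"}  # assumed provider identifiers


def check_cae_values(values):
    """Return a list of problems; an empty list means the values are CAE-safe."""
    problems = []

    redis_cfg = values.get("redis", {})
    if redis_cfg.get("publicLoadBalancer", {}).get("enabled", False):
        problems.append("redis.publicLoadBalancer.enabled must be false")

    # An empty workerRedisUrl means workers are handed the in-cluster Redis DNS.
    if values.get("workerRedisUrl", ""):
        problems.append("workerRedisUrl must be empty")

    # Assumed location of the enabled-provider list.
    enabled = set(values.get("providers", {}).get("enabled", []))
    extra = sorted(enabled - ALLOWED_CAE_PROVIDERS)
    if extra:
        problems.append("providers not allowed in CAE: " + ", ".join(extra))
    return problems


cae_values = {
    "redis": {"publicLoadBalancer": {"enabled": False}},
    "workerRedisUrl": "",
    "providers": {"enabled": ["vm", "fake"]},
}
prod_like_values = {
    "redis": {"publicLoadBalancer": {"enabled": True}},
    "workerRedisUrl": "redis://redis.example.invalid:6379",
    "providers": {"enabled": ["vm", "runpod", "primeintellect"]},
}

print(check_cae_values(cae_values))  # []
for problem in check_cae_values(prod_like_values):
    print("-", problem)
```

A check like this could run in CI before `helm upgrade`, so that a values override cannot silently re-enable the public load balancer or a cloud provider.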

Job lifecycle in CAE — who calls whom

Inbound to the cluster vs. outbound from it, over time. The VM → Redis legs ride the tailnet; everything to the gateway is HTTP.

sequenceDiagram
  autonumber
  participant C as Client
  participant G as Gateway (k8s)
  participant R as Redis (ClusterIP)
  participant W as worker-agent (VM · tailnet)
  W->>G: register (HTTP) — request redis_url
  G-->>W: redis_url = in-cluster Service DNS
  loop every 5s
    W->>G: heartbeat / logs / metrics (HTTP)
  end
  C->>G: POST /v1/chat/completions (HTTPS)
  G->>R: LPUSH queue:{app_id}
  W->>R: BRPOP queue:{app_id}  (over tailnet)
  W->>R: PUBLISH stream:{id} · SET result:{id}
  R-->>G: stream:{id} messages (gateway SUBSCRIBE) · result:{id} (gateway GET)
  G-->>C: SSE / JSON response
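
The sequence above can be sketched end to end with an in-memory stand-in for Redis. This is an illustration only: `FakeRedis`, `gateway_enqueue`, and `worker_step` are hypothetical names, the blocking `BRPOP` is replaced by a non-blocking pop, token generation is faked by splitting the prompt, and writing a partial result after a cancel is an assumption.

```python
from collections import defaultdict, deque


class FakeRedis:
    """In-memory stand-in for the few Redis commands used in the diagram."""

    def __init__(self):
        self.lists = defaultdict(deque)
        self.keys = {}
        self.subscribers = defaultdict(list)

    def lpush(self, key, value):
        self.lists[key].appendleft(value)

    def rpop(self, key):
        # Non-blocking stand-in for BRPOP: returns None on an empty queue.
        return self.lists[key].pop() if self.lists[key] else None

    def subscribe(self, channel, callback):
        self.subscribers[channel].append(callback)

    def publish(self, channel, message):
        for callback in self.subscribers[channel]:
            callback(message)

    def set(self, key, value):
        self.keys[key] = value

    def get(self, key):
        return self.keys.get(key)

    def exists(self, key):
        return key in self.keys


def gateway_enqueue(r, app_id, job_id, prompt, on_token):
    # Subscribe before enqueueing so no streamed token is missed.
    r.subscribe(f"stream:{job_id}", on_token)
    r.lpush(f"queue:{app_id}", (job_id, prompt))


def worker_step(r, app_id):
    job = r.rpop(f"queue:{app_id}")
    if job is None:
        return False
    job_id, prompt = job
    tokens = []
    for token in prompt.split():  # stand-in for vLLM token generation
        if r.exists(f"cancel:{job_id}"):
            break
        r.publish(f"stream:{job_id}", token)
        tokens.append(token)
    r.set(f"result:{job_id}", " ".join(tokens))
    return True


r = FakeRedis()
streamed = []
gateway_enqueue(r, "app-1", "job-1", "hello from the worker", streamed.append)
worker_step(r, "app-1")
print(streamed)               # ['hello', 'from', 'the', 'worker']
print(r.get("result:job-1"))  # hello from the worker
```

Every call the worker makes here goes to Redis directly, which is why the worker needs a network route to Redis itself and not only to the gateway.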

Today (prod) — cloud GPU workers reach Redis from outside the cluster

Cloud workers run outside the cluster, so they reach Redis either through a per-pod reverse SSH tunnel (RunPod, purple) or a public AWS NLB (Prime Intellect / non-tunnel, orange). This is exactly the public exposure CAE forbids.

flowchart LR
  CL["Clients / SDK"]
  RPP["RunPod pod
worker-agent + vLLM
(external cloud GPU)"]
  PIP["Prime Intellect pod
(external cloud GPU)"]
  VM["Physical VM
(tailnet)"]
  subgraph K8S["Kubernetes (prod)"]
    direction TB
    GW["API Gateway :8080"]
    RD[("Redis")]
    NLB(["Public AWS NLB
redis.publicLoadBalancer"])
  end
  CL -->|"HTTPS"| GW
  GW <-->|"direct redis"| RD
  RD ---|"exposes :6379"| NLB
  RPP -->|"HTTP register / heartbeat / logs / metrics"| GW
  RPP ==>|"reverse SSH tunnel → pod loopback → Redis"| RD
  PIP -->|"HTTP"| GW
  PIP ==>|"direct redis via public NLB"| NLB
  VM ==>|"direct redis via tailnet"| RD
  linkStyle 0 stroke:#3b82f6,stroke-width:2.5px;
  linkStyle 1 stroke:#10b981,stroke-width:4px;
  linkStyle 2 stroke:#f59e0b,stroke-width:2.5px;
  linkStyle 3 stroke:#3b82f6,stroke-width:2.5px;
  linkStyle 4 stroke:#a855f7,stroke-width:4px;
  linkStyle 5 stroke:#3b82f6,stroke-width:2.5px;
  linkStyle 6 stroke:#f59e0b,stroke-width:4px;
  linkStyle 7 stroke:#14b8a6,stroke-width:4px;
  classDef k8s fill:#0c1b16,stroke:#10b981,color:#d1fae5;
  classDef ext fill:#101826,stroke:#3b82f6,color:#dbeafe;
  classDef pub fill:#241a07,stroke:#f59e0b,color:#fde68a;
  class GW,RD k8s;
  class NLB pub;
  class CL,RPP,PIP,VM ext;

Legend: green = direct Redis (in-cluster); purple = reverse SSH tunnel (RunPod); teal = tailnet (VM); orange = public NLB exposure; blue = gateway HTTP.

Per-provider verdict for CAE (svc-only Redis)

VM / bare-metal: KEEP

Workers reach in-cluster Redis over the Tailscale tailnet, with no public exposure. This is the CAE path.

fake / in-cluster: KEEP

Runs inside the gateway pod, so ClusterIP Redis resolves directly. Good for dev and smoke tests.

RunPod: DISABLE

Pods run outside the cluster and can reach private Redis only through a per-pod reverse SSH tunnel. The tunnel is fragile: its ephemeral keys are lost on gateway restart, and it works only with a single gateway.

Prime Intellect: DISABLE

Pods run outside the cluster, and there is no reverse-tunnel code path at all. They cannot reach a private Redis, so enabling them would force a public NLB.
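
The four verdicts follow one rule: a provider is kept only if its workers have a stable private route to ClusterIP Redis. A minimal sketch of that rule is shown here; the `Provider` fields, the provider identifiers, and `cae_verdict` are hypothetical names chosen for illustration, not code from the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    name: str
    in_cluster: bool          # worker runs inside the gateway pod
    on_tailnet: bool          # worker has a Tailscale route into the cluster
    has_reverse_tunnel: bool  # worker can use a per-pod reverse SSH tunnel


def cae_verdict(p):
    """Return (verdict, reason) for a svc-only Redis deployment."""
    if p.in_cluster:
        return ("KEEP", "ClusterIP Redis resolves directly")
    if p.on_tailnet:
        return ("KEEP", "reaches Redis over the tailnet")
    if p.has_reverse_tunnel:
        return ("DISABLE", "reverse SSH tunnel only, too fragile")
    return ("DISABLE", "no private route, would force a public NLB")


PROVIDERS = [
    Provider("vm", in_cluster=False, on_tailnet=True, has_reverse_tunnel=False),
    Provider("fake", in_cluster=True, on_tailnet=False, has_reverse_tunnel=False),
    Provider("runpod", in_cluster=False, on_tailnet=False, has_reverse_tunnel=True),
    Provider("primeintellect", in_cluster=False, on_tailnet=False, has_reverse_tunnel=False),
]

for p in PROVIDERS:
    verdict, reason = cae_verdict(p)
    print(f"{p.name:15} {verdict:8} {reason}")
```

Treating the reverse tunnel as a DISABLE case is a policy choice that mirrors the verdict above; a deployment that tolerated the tunnel's fragility could change that one branch.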

Inbound → Redis path inventory

Every distinct path that touches Redis, and whether it survives a cluster-internal-only Redis. The four direct-redis worker legs are the only ones that break — and only when the worker runs outside the cluster without a tunnel/tailnet.

Path	Origin	Transport	Needs public Redis?	Survives svc-only Redis?
Job dispatch: `BRPOP queue:{app_id}`	worker (out)	direct-redis	Yes*	Only via tunnel / tailnet
Streaming: `PUBLISH stream:{id}`	worker (out)	direct-redis	Yes*	Only via tunnel / tailnet
Result write: `SET result:{id}`	worker (out)	direct-redis	Yes*	Only via tunnel / tailnet
Cancel poll: `EXISTS cancel:{id}`	worker (out)	direct-redis	Yes*	Only via tunnel / tailnet
Worker register / heartbeat / logs / metrics	worker (out)	gateway-HTTP	No	Yes
Client inference (REST / SSE)	client (out)	gateway-HTTP	No	Yes
Gateway registry / autoscaler / reconciler	gateway (in)	direct-redis	No	Yes
Bench & training log streams	gateway (in)	direct-redis	No	Yes
Redis public LoadBalancer (AWS NLB)	chart → internet	direct-redis	Yes	No (keep OFF)
Redis internal headless Service	chart → in-cluster	direct-redis	No	Yes

* "Yes*" means public Redis is needed only when the worker runs outside the cluster and has no private route to it. With the VM tailnet (CAE), these legs ride the VPN and never need public Redis. With cloud pods (RunPod / Prime Intellect) they do, which is why those providers are disabled.
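
The inventory can also be read as a small lookup: given how a worker reaches the cluster, which paths stop working once Redis is svc-only. The sketch below encodes the table's worker, client, and gateway rows under that reading. The route labels (`"in-cluster"`, `"tailnet"`, `"tunnel"`, `"none"`) and the function name `broken_paths` are hypothetical.

```python
# (path name, transport, originates from a worker)
PATHS = [
    ("job dispatch", "direct-redis", True),
    ("streaming", "direct-redis", True),
    ("result write", "direct-redis", True),
    ("cancel poll", "direct-redis", True),
    ("register / heartbeat / logs / metrics", "gateway-HTTP", True),
    ("client inference", "gateway-HTTP", False),
    ("gateway registry / autoscaler / reconciler", "direct-redis", False),
    ("bench & training log streams", "direct-redis", False),
]

# Assumed labels for routes that reach Redis without public exposure.
PRIVATE_ROUTES = {"in-cluster", "tailnet", "tunnel"}


def broken_paths(worker_route):
    """Paths that fail with svc-only Redis for a worker on the given route."""
    if worker_route in PRIVATE_ROUTES:
        return []
    # Only worker-originated direct-redis legs depend on the worker's route;
    # gateway-HTTP and in-cluster gateway paths are unaffected.
    return [
        name
        for name, transport, from_worker in PATHS
        if from_worker and transport == "direct-redis"
    ]


print(broken_paths("tailnet"))  # []
print(broken_paths("none"))     # ['job dispatch', 'streaming', 'result write', 'cancel poll']
```

A worker with no private route loses exactly the four hot-path legs and keeps its HTTP traffic to the gateway, which matches the table.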

Generated for CAE deployment planning. Diagrams render client-side via Mermaid (CDN). For an offline copy, print to PDF. Companion write-up: docs/CAE_REDIS_EXPOSURE.md in the repo.