VPS-per-User
Operator notes for customer VPS provisioning, backup, and recovery.
Matrix OS production user runtime is VPS-native: one customer VPS per active user. The platform remains the control plane for auth, routing, provisioning, integrations, R2 host-bundle publication, upgrades, and recovery.
Implementation history and production verification notes are tracked in specs/070-vps-per-user/changelog.md.
Production Scope
Included:
- Lazy provisioning through `POST /vps/provision` for an authenticated internal operator request.
- One-time host registration through `POST /vps/register`.
- VPS-first routing for users with a `running` `user_machines` row.
- Shared `code.matrix-os.com` routing to the authenticated user's VPS-hosted code-server gateway.
- Customer host restore gate, hourly Postgres backups, and R2 metadata pointers.
- Manual recovery through `POST /vps/recover` or `matrixctl recover`.
- Host-bundle based updates for shell, gateway, code, default apps, and runtime CLIs.
- Local owner-controlled Postgres on each customer VPS at `127.0.0.1:5432`.
Not included:
- Automatic unreachable detection and replacement.
- Sleep, warm pools, idle deletion, and geographic routing.
- Data deletion from R2 during phase-1 VPS deletion.
Cost And Quota
The default server type is controlled by HETZNER_SERVER_TYPE and currently targets cpx22. Before adding a customer, confirm:
- The Hetzner customer project has quota for one additional server.
- The expected monthly cost is accepted by the operator.
- The customer is explicitly opted in.
- `CUSTOMER_VPS_ENABLED=true` is set only in the intended environment.
Quota ceiling: one VPS per active Clerk user. Do not batch-enable users until recovery and rollback have been exercised for a non-production account.
Routing
code.matrix-os.com is a single public entrypoint. The platform authenticates the Clerk session or matrix_code_session, resolves the user to a running VPS, strips user cookies and authorization headers, then forwards to that VPS over HTTPS with platform proof headers. If no running VPS exists, the platform returns an unavailable response or uses explicitly configured legacy fallback paths for old deployments only. New production users should be provisioned as customer VPSes.
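The forwarding rule above can be sketched as a small pure function. This is a minimal illustration, not the platform's routing code: `UserMachine`, `decideRoute`, the `host` field, and the shape of the proof headers are all hypothetical names chosen for the example, and the legacy fallback path is left out.

```typescript
// Hypothetical types for illustration; the real user_machines row may differ.
interface UserMachine {
  machineId: string;
  status: string; // only "running" machines receive traffic
  host: string;   // assumed field: address the platform forwards to
}

type RouteDecision =
  | { kind: "forward"; target: string; headers: Record<string, string> }
  | { kind: "unavailable" };

// End-user credentials that must never reach the customer VPS.
const STRIPPED_HEADERS = new Set(["cookie", "authorization"]);

function decideRoute(
  machine: UserMachine | undefined,
  incoming: Record<string, string>,
  proofHeaders: Record<string, string>,
): RouteDecision {
  if (!machine || machine.status !== "running") {
    return { kind: "unavailable" };
  }
  const headers: Record<string, string> = {};
  // Header names are lowercased so casing tricks cannot bypass the filter.
  for (const [name, value] of Object.entries(incoming)) {
    const key = name.toLowerCase();
    if (!STRIPPED_HEADERS.has(key)) headers[key] = value;
  }
  // Proof headers are written last so client-supplied values cannot win.
  for (const [name, value] of Object.entries(proofHeaders)) {
    headers[name.toLowerCase()] = value;
  }
  return { kind: "forward", target: `https://${machine.host}`, headers };
}
```

Writing the proof headers after the filtered client headers means a client that sends its own copy of a proof header is silently overridden rather than trusted.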
Customer VPS Runtime
Each customer VPS gets:
- `/opt/matrix/env/host.env` with machine ID, Clerk user ID, handle, `DATABASE_URL`, `PLATFORM_INTERNAL_URL`, and per-host `UPGRADE_TOKEN`.
- `/opt/matrix/env/r2.env` with R2 credentials scoped for backup/sync.
- `/opt/matrix/app` from the host bundle: gateway package, shell build, shared packages, and bundled default apps.
- `/opt/matrix/runtime` from the host bundle: Node, code-server, and bundled coding-agent CLIs.
- `/opt/matrix/bin` launchers for `matrix-gateway`, `matrix-shell`, `matrix-code`, `matrix-sync-agent`, and `matrix-update`.
- `/home/matrix/home` for owner files and apps.
- A local Postgres database endpoint at `127.0.0.1:5432`.
The current bootstrap runs Postgres as a single local postgres:16 service container named matrix-postgres with a machine-local volume. Gateway/shell/code/default apps are not user runtime containers; they run through systemd host services from the host bundle.
Gateway Identity
Gateway routes resolve owner identity through the request principal seam. A validated JWT subject wins first. If no JWT is present, a trusted single-user/container gateway may use the platform-provisioned configured identity from runtime configuration, with MATRIX_USER_ID as the canonical user id. This value must come from Matrix OS provisioning, not from request headers, query params, cookies, route params, or request bodies.
Open local development may use the dev-default principal only when auth is disabled, production is false, the environment is local/development, and no configured container identity exists. Production and auth-enabled deployments refuse that fallback.
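The resolution order described in this section can be summarized in a short sketch. It assumes the JWT has already been validated elsewhere and that the configured identity was read from `MATRIX_USER_ID`; the `PrincipalInput` shape, the `resolvePrincipal` name, and the dev-default id value are illustrative assumptions, not the gateway's actual seam.

```typescript
// Hypothetical input shape; every field is assumed to come from trusted
// runtime state, never from request headers, params, cookies, or bodies.
interface PrincipalInput {
  jwtSubject?: string;       // subject of an already validated JWT, if any
  configuredUserId?: string; // MATRIX_USER_ID from Matrix OS provisioning
  authEnabled: boolean;
  production: boolean;
  environment: string;       // e.g. "local", "development", "production"
}

type Principal =
  | { source: "jwt" | "configured" | "dev-default"; userId: string }
  | { source: "none" };

const DEV_DEFAULT_USER_ID = "dev-default"; // placeholder value (assumption)

function resolvePrincipal(input: PrincipalInput): Principal {
  // 1. A validated JWT subject always wins.
  if (input.jwtSubject) return { source: "jwt", userId: input.jwtSubject };
  // 2. Otherwise fall back to the platform-provisioned configured identity.
  if (input.configuredUserId) {
    return { source: "configured", userId: input.configuredUserId };
  }
  // 3. The dev default applies only to open local development.
  const localEnv =
    input.environment === "local" || input.environment === "development";
  if (!input.authEnabled && !input.production && localEnv) {
    return { source: "dev-default", userId: DEV_DEFAULT_USER_ID };
  }
  // 4. Production and auth-enabled deployments refuse the fallback.
  return { source: "none" };
}
```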
Required Environment
| Variable | Required | Notes |
|---|---|---|
| `CUSTOMER_VPS_ENABLED` | Yes | Enables the VPS provisioning path for the intended environment. |
| `CUSTOMER_VPS_IMAGE_VERSION` | Yes | Selects the host bundle key at `system-bundles/<imageVersion>/matrix-host-bundle.tar.gz`. |
| `MATRIX_HOST_BUNDLE_URL` | No | Optional override for the exact bundle URL. By default cloud-init downloads through the platform tunnel at `/system-bundles/<imageVersion>/matrix-host-bundle.tar.gz`. |
| `MATRIX_HOST_BUNDLE_BASE_URL` | No | Optional base URL for default bundle URL generation when not using `MATRIX_HOST_BUNDLE_URL`; defaults to `PLATFORM_PUBLIC_URL`. |
| `CUSTOMER_VPS_TLS_VERIFY` | No | Defaults to `false` because phase-1 customer hosts use self-signed local TLS on `:443`. Set `true` only after installing publicly trusted host certificates. |
| `HETZNER_API_TOKEN` | Yes | Hetzner Cloud API token for provisioning and deletion. |
| `R2_BUCKET` / `S3_BUCKET` | Yes | Bucket used for metadata, DB snapshots, and host bundles. |
| `MATRIX_USER_ID` | Yes for trusted single-user/container gateways | Platform-provisioned stable owner id used as the configured container identity when no validated JWT principal is present. |
| `NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY` | Yes when building a host bundle | Baked into the Next.js shell bundle. If missing, production browsers may try to load `clerk.example.com`. |
| `PLATFORM_INTERNAL_URL` | Yes on customer VPSes | Base URL used by customer gateways for platform-owned integration and bundle/update APIs. |
| `UPGRADE_TOKEN` | Yes on customer VPSes | Per-host bearer token used for platform internal calls. |
| `DATABASE_URL` | Yes on customer VPSes | Points the gateway at the customer-local Postgres database. |
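How the three bundle-related variables in the table interact can be sketched as follows. This is an illustration of the precedence the notes describe (exact URL override first, then base URL, then `PLATFORM_PUBLIC_URL`); the `BundleEnv` type and the `bundleKey` and `resolveBundleUrl` helpers are hypothetical names, not functions from the codebase.

```typescript
// Hypothetical view of the relevant environment variables.
interface BundleEnv {
  CUSTOMER_VPS_IMAGE_VERSION: string;
  PLATFORM_PUBLIC_URL: string;
  MATRIX_HOST_BUNDLE_URL?: string;
  MATRIX_HOST_BUNDLE_BASE_URL?: string;
}

// Object key of the host bundle for a given image version.
function bundleKey(imageVersion: string): string {
  return `system-bundles/${imageVersion}/matrix-host-bundle.tar.gz`;
}

function resolveBundleUrl(env: BundleEnv): string {
  // An explicit bundle URL overrides everything else.
  if (env.MATRIX_HOST_BUNDLE_URL) return env.MATRIX_HOST_BUNDLE_URL;
  // Otherwise build the URL from the base, defaulting to PLATFORM_PUBLIC_URL.
  const base = (env.MATRIX_HOST_BUNDLE_BASE_URL || env.PLATFORM_PUBLIC_URL)
    .replace(/\/+$/, ""); // drop trailing slashes to avoid "//" in the path
  return `${base}/${bundleKey(env.CUSTOMER_VPS_IMAGE_VERSION)}`;
}
```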
Host Bundle Updates
Per-user VPSes do not run the legacy Matrix OS Docker user image. They run host services installed from:
```
system-bundles/<CUSTOMER_VPS_IMAGE_VERSION>/matrix-host-bundle.tar.gz
system-bundles/<CUSTOMER_VPS_IMAGE_VERSION>/matrix-host-bundle.tar.gz.sha256
```
Rebuild and publish this bundle whenever shell, gateway, bundled apps, host scripts, or runtime CLIs change. Existing VPSes need an explicit in-place refresh or recovery/reprovision until automated bundle upgrades are implemented.
Before rollout, verify:
- Served shell HTML and client chunks do not reference `clerk.example.com`.
- Gateway health returns OK and uses the VPS Postgres database.
- `/api/bridge/query` can list app schemas from the customer-local Postgres database.
- Fresh browser loads do not call legacy `/api/canvas`.
- Missing app icons resolve through stable fallbacks rather than a repeated Gemini 503 loop.
- Canvas pan/zoom starts only from the canvas surface, not from wheel events inside a selected app window.
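The first check in the list above lends itself to a scripted scan. The sketch below assumes the shell HTML and client chunks have already been fetched into memory as strings; `findPlaceholderClerkRefs` is a hypothetical helper, and how the assets are collected is left to the operator.

```typescript
const PLACEHOLDER_CLERK_HOST = "clerk.example.com";

// Returns the names of assets that still reference the placeholder host,
// sorted for stable output. An empty result means the check passes.
function findPlaceholderClerkRefs(assets: Record<string, string>): string[] {
  return Object.entries(assets)
    .filter(([, content]) => content.includes(PLACEHOLDER_CLERK_HOST))
    .map(([name]) => name)
    .sort();
}
```

A non-empty result usually points back at a bundle built without `NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY`.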
App Runtime And Postgres
First-party and polished default apps are Vite + React apps with runtime: "vite" and build.output: "dist" in matrix.json. The host-bundle build runs scripts/build-default-apps.mjs, and customer VPS startup copies the built app dist/ assets plus manifests into /home/matrix/home/apps.
Apps use the local Postgres database through the gateway bridge:
```js
// Insert one row into the "tasks" table of the "todo" app through the bridge.
await fetch("/api/bridge/query", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    action: "insert",
    app: "todo",
    table: "tasks",
    data: { text: "Ship it", done: false },
  }),
});
```
The gateway registers manifest-declared `storage.tables` into schema-per-app Postgres tables. App child processes do not receive raw `DATABASE_URL`; the bridge is the intended scoped API for easy, safe access to the owner-local database.
Backup Retention
The customer host runs matrix-db-backup.timer hourly. The backup script must upload a timestamped snapshot before updating system/db/latest.
R2 keys:
- `system/vps-meta.json`: current machine metadata and heartbeat timestamp.
- `system/db/latest`: latest successful snapshot pointer.
- `system/db/snapshots/<timestamp>.dump`: Postgres custom-format snapshot restored directly with `pg_restore`.
Retention pruning is deferred in this slice: the hourly backup path uploads a new snapshot and updates system/db/latest, and no prune step runs.
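The ordering requirement (snapshot first, pointer second) is what keeps `system/db/latest` from ever naming a snapshot that was not uploaded. A minimal sketch of that ordering follows; the `ObjectStore` interface, the `publishBackup` name, the timestamp format, and the assumption that the pointer stores the snapshot key are all illustrative, not taken from the backup script.

```typescript
// Hypothetical minimal object-store interface standing in for the R2 client.
interface ObjectStore {
  put(key: string, body: string): Promise<void>;
}

const LATEST_POINTER_KEY = "system/db/latest";

function snapshotKey(timestamp: string): string {
  return `system/db/snapshots/${timestamp}.dump`;
}

// Uploads the snapshot, then moves the pointer. If the snapshot upload
// rejects, the pointer write is never attempted and the previous pointer
// stays valid.
async function publishBackup(
  store: ObjectStore,
  timestamp: string,
  dump: string,
): Promise<string> {
  const key = snapshotKey(timestamp);
  await store.put(key, dump);               // 1. timestamped snapshot
  await store.put(LATEST_POINTER_KEY, key); // 2. pointer (assumed to hold the key)
  return key;
}
```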
Manual Recovery
Use recovery when a customer VPS has failed, is unrecoverable, or is being intentionally replaced.
```bash
curl -sS -X POST "$PLATFORM_PUBLIC_URL/vps/recover" \
  -H "Authorization: Bearer $PLATFORM_SECRET" \
  -H "Content-Type: application/json" \
  -d '{"clerkUserId":"user_test_vps"}'
```
Expected behavior:
- The platform verifies `system/db/latest` unless `allowEmpty` is explicitly true.
- The active machine row moves to `recovering` with a new `machineId`.
- The old Hetzner server is deleted if it exists.
- The replacement server boots from cloud-init and restores before gateway startup.
- The VPS registers and eventually returns `running`.
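The first step in that list, the snapshot pointer check, can be sketched as a pure precondition. The request fields match the JSON bodies used in this section; the `checkRecoverable` name, the result shape, and the idea that the pointer is passed in as a string or `null` are assumptions made for the example.

```typescript
// Request body fields as used by POST /vps/recover in this guide.
interface RecoverRequest {
  clerkUserId: string;
  allowEmpty?: boolean;
}

type RecoverCheck =
  | { ok: true; restoreFrom: string | null }
  | { ok: false; reason: string };

// latestPointer is the value read from system/db/latest, or null if absent.
function checkRecoverable(
  req: RecoverRequest,
  latestPointer: string | null,
): RecoverCheck {
  if (latestPointer) return { ok: true, restoreFrom: latestPointer };
  // Only an explicit boolean true permits recovery without a snapshot.
  if (req.allowEmpty === true) return { ok: true, restoreFrom: null };
  return {
    ok: false,
    reason: "system/db/latest is missing and allowEmpty is not true",
  };
}
```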
Use allowEmpty only for a new or intentionally empty user:
```bash
curl -sS -X POST "$PLATFORM_PUBLIC_URL/vps/recover" \
  -H "Authorization: Bearer $PLATFORM_SECRET" \
  -H "Content-Type: application/json" \
  -d '{"clerkUserId":"user_test_vps","allowEmpty":true}'
```
Restored State
Restored:
- Postgres app data included in the latest successful snapshot.
- VPS metadata needed for routing and operator checks.
Not restored in this slice:
- Any data that was never uploaded to R2.
- In-memory process state.
- A failed backup that did not update `system/db/latest`.
If restore fails, matrix-restore.service exits non-zero and matrix-gateway.service remains gated by ConditionPathExists=/opt/matrix/restore-complete.
Rollback
Rollback is a routing/operator decision:
- Users without a `running` `user_machines` row should not be treated as successfully provisioned production users.
- To stop serving a VPS user, delete or move the machine out of `running` state and verify the user sees the intended unavailable/reprovisioning path.
- `DELETE /vps/:machineId` soft-deletes the platform row and deletes the Hetzner server, but does not remove R2 data.
Do not request review or rollout approval while still pushing commits to the branch.