Matrix OS

VPS-per-User

Operator notes for customer VPS provisioning, backup, and recovery.

The Matrix OS production user runtime is VPS-native: each active user gets one customer VPS. The platform remains the control plane for auth, routing, provisioning, integrations, R2 host-bundle publication, upgrades, and recovery.

Implementation history and production verification notes are tracked in specs/070-vps-per-user/changelog.md.

Production Scope

Included:

  • Lazy provisioning through POST /vps/provision for an authenticated internal operator request.
  • One-time host registration through POST /vps/register.
  • VPS-first routing for users with a running user_machines row.
  • Shared code.matrix-os.com routing to the authenticated user's VPS-hosted code-server gateway.
  • Customer host restore gate, hourly Postgres backups, and R2 metadata pointers.
  • Manual recovery through POST /vps/recover or matrixctl recover.
  • Host-bundle based updates for shell, gateway, code, default apps, and runtime CLIs.
  • Local owner-controlled Postgres on each customer VPS at 127.0.0.1:5432.

Not included:

  • Automatic detection and replacement of unreachable hosts.
  • Sleep, warm pools, idle deletion, and geographic routing.
  • Data deletion from R2 during phase-1 VPS deletion.

Cost And Quota

The default server type is controlled by HETZNER_SERVER_TYPE and currently targets cpx22. Before adding a customer, confirm:

  • The Hetzner customer project has quota for one additional server.
  • The expected monthly cost is accepted by the operator.
  • The customer is explicitly opted in.
  • CUSTOMER_VPS_ENABLED=true is set only in the intended environment.

Quota ceiling: one VPS per active Clerk user. Do not batch-enable users until recovery and rollback have been exercised for a non-production account.

Routing

code.matrix-os.com is a single public entrypoint. The platform authenticates the Clerk session or matrix_code_session, resolves the user to a running VPS, strips user cookies and authorization headers, then forwards to that VPS over HTTPS with platform proof headers. If no running VPS exists, the platform returns an unavailable response or uses explicitly configured legacy fallback paths for old deployments only. New production users should be provisioned as customer VPSes.
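
The forwarding step described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the platform's implementation: the `UserMachine` shape, the `buildUpstreamRequest` function, and the `x-platform-proof` header name are hypothetical, and the legacy fallback path is left out. Only the order of operations (resolve a running VPS, strip cookies and authorization headers, attach platform proof, forward over HTTPS) comes from the text.

```typescript
// Sketch only: type, function, and header names are assumptions, not the platform's API.
type UserMachine = { machineId: string; host: string; status: string };

type RouteResult =
  | { kind: "forward"; url: string; headers: Record<string, string> }
  | { kind: "unavailable" };

// Headers that carry user credentials and must not reach the customer VPS.
const STRIPPED_HEADERS = new Set(["cookie", "authorization"]);

function buildUpstreamRequest(
  machine: UserMachine | undefined,
  path: string,
  incoming: Record<string, string>,
  platformProof: string, // hypothetical opaque proof value issued by the platform
): RouteResult {
  // Only a machine in the running state is routable.
  if (!machine || machine.status !== "running") {
    return { kind: "unavailable" };
  }

  // Copy every header except the user credentials, normalizing names to lower case.
  const headers: Record<string, string> = {};
  for (const [name, value] of Object.entries(incoming)) {
    const lower = name.toLowerCase();
    if (!STRIPPED_HEADERS.has(lower)) {
      headers[lower] = value;
    }
  }

  // Set the proof header last so a client-supplied value can never survive.
  headers["x-platform-proof"] = platformProof;

  return { kind: "forward", url: `https://${machine.host}${path}`, headers };
}
```

Writing the proof header after the copy loop means a spoofed header with the same name from the browser is overwritten rather than forwarded.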

Customer VPS Runtime

Each customer VPS gets:

  • /opt/matrix/env/host.env with machine ID, Clerk user ID, handle, DATABASE_URL, PLATFORM_INTERNAL_URL, and per-host UPGRADE_TOKEN.
  • /opt/matrix/env/r2.env with R2 credentials scoped for backup/sync.
  • /opt/matrix/app from the host bundle: gateway package, shell build, shared packages, and bundled default apps.
  • /opt/matrix/runtime from the host bundle: Node, code-server, and bundled coding-agent CLIs.
  • /opt/matrix/bin launchers for matrix-gateway, matrix-shell, matrix-code, matrix-sync-agent, and matrix-update.
  • /home/matrix/home for owner files and apps.
  • A local Postgres database endpoint at 127.0.0.1:5432.

The current bootstrap runs Postgres as a single local postgres:16 service container named matrix-postgres with a machine-local volume. Gateway/shell/code/default apps are not user runtime containers; they run through systemd host services from the host bundle.

Gateway Identity

Gateway routes resolve owner identity through the request principal seam. A validated JWT subject wins first. If no JWT is present, a trusted single-user/container gateway may use the platform-provisioned configured identity from runtime configuration, with MATRIX_USER_ID as the canonical user id. This value must come from Matrix OS provisioning, not from request headers, query params, cookies, route params, or request bodies.

Open local development may use the dev-default principal only when auth is disabled, production is false, the environment is local/development, and no configured container identity exists. Production and auth-enabled deployments refuse that fallback.
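
The resolution order above can be summarized in a short sketch. The `Principal` and `PrincipalInputs` types, the `resolvePrincipal` function, and the `"dev-default"` placeholder user ID are hypothetical names chosen for illustration. The sketch also assumes JWT validation has already happened and that the gateway is a trusted single-user/container gateway whenever a configured identity is present.

```typescript
// Sketch only: names and shapes are illustrative assumptions.
type Principal = {
  userId: string;
  source: "jwt" | "configured" | "dev-default";
};

interface PrincipalInputs {
  jwtSubject?: string; // subject of an already validated JWT, if any
  configuredUserId?: string; // MATRIX_USER_ID from platform provisioning
  authEnabled: boolean;
  production: boolean;
  environment: string; // for example "local", "development", "production"
}

function resolvePrincipal(input: PrincipalInputs): Principal | null {
  // 1. A validated JWT subject wins first.
  if (input.jwtSubject) {
    return { userId: input.jwtSubject, source: "jwt" };
  }

  // 2. Otherwise use the platform-provisioned configured identity.
  //    It comes from runtime configuration, never from the request itself.
  if (input.configuredUserId) {
    return { userId: input.configuredUserId, source: "configured" };
  }

  // 3. The dev default applies only to open local development.
  const isLocal =
    input.environment === "local" || input.environment === "development";
  if (!input.authEnabled && !input.production && isLocal) {
    return { userId: "dev-default", source: "dev-default" }; // placeholder ID
  }

  // Production and auth-enabled deployments refuse the fallback.
  return null;
}
```

Note that the function takes no request headers, query params, cookies, route params, or body as input for the configured identity, which mirrors the constraint in the text.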

Required Environment

| Variable | Required | Notes |
| --- | --- | --- |
| CUSTOMER_VPS_ENABLED | Yes | Enables the VPS provisioning path for the intended environment. |
| CUSTOMER_VPS_IMAGE_VERSION | Yes | Selects the host bundle key at system-bundles/<imageVersion>/matrix-host-bundle.tar.gz. |
| MATRIX_HOST_BUNDLE_URL | No | Optional override for the exact bundle URL. By default cloud-init downloads through the platform tunnel at /system-bundles/<imageVersion>/matrix-host-bundle.tar.gz. |
| MATRIX_HOST_BUNDLE_BASE_URL | No | Optional base URL for default bundle URL generation when not using MATRIX_HOST_BUNDLE_URL; defaults to PLATFORM_PUBLIC_URL. |
| CUSTOMER_VPS_TLS_VERIFY | No | Defaults to false because phase-1 customer hosts use self-signed local TLS on :443. Set true only after installing publicly trusted host certificates. |
| HETZNER_API_TOKEN | Yes | Hetzner Cloud API token for provisioning and deletion. |
| R2_BUCKET / S3_BUCKET | Yes | Bucket used for metadata, DB snapshots, and host bundles. |
| MATRIX_USER_ID | Yes for trusted single-user/container gateways | Platform-provisioned stable owner id used as the configured container identity when no validated JWT principal is present. |
| NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY | Yes when building a host bundle | Baked into the Next.js shell bundle. If missing, production browsers may try to load clerk.example.com. |
| PLATFORM_INTERNAL_URL | Yes on customer VPSes | Base URL used by customer gateways for platform-owned integration and bundle/update APIs. |
| UPGRADE_TOKEN | Yes on customer VPSes | Per-host bearer token used for platform internal calls. |
| DATABASE_URL | Yes on customer VPSes | Points the gateway at the customer-local Postgres database. |
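
A startup check for the required variables might look like the sketch below. The grouping into a "platform" and a "customer-vps" role, the `missingRequired` function, and the reading of R2_BUCKET / S3_BUCKET as "either one is enough" are assumptions made for illustration; the variable names themselves come from the table above.

```typescript
// Sketch only: the role split and function name are assumptions.
type Env = Record<string, string | undefined>;
type Role = "platform" | "customer-vps";

const REQUIRED: Record<Role, readonly string[]> = {
  // Rows marked plain "Yes" in the table.
  platform: [
    "CUSTOMER_VPS_ENABLED",
    "CUSTOMER_VPS_IMAGE_VERSION",
    "HETZNER_API_TOKEN",
  ],
  // Rows marked "Yes on customer VPSes".
  "customer-vps": ["PLATFORM_INTERNAL_URL", "UPGRADE_TOKEN", "DATABASE_URL"],
};

function isSet(env: Env, name: string): boolean {
  const value = env[name];
  return value !== undefined && value.trim() !== "";
}

// Returns the names of required variables that are unset or blank.
function missingRequired(env: Env, role: Role): string[] {
  const missing = REQUIRED[role].filter((name) => !isSet(env, name));
  // Assumption: the bucket row is satisfied by either variable.
  if (role === "platform" && !isSet(env, "R2_BUCKET") && !isSet(env, "S3_BUCKET")) {
    missing.push("R2_BUCKET or S3_BUCKET");
  }
  return missing;
}
```

Passing the environment in as a plain record keeps the check easy to test; a caller would typically hand it the process environment at startup and refuse to continue when the returned list is non-empty.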

Host Bundle Updates

Per-user VPSes do not run the legacy Matrix OS Docker user image. They run host services installed from:

system-bundles/<CUSTOMER_VPS_IMAGE_VERSION>/matrix-host-bundle.tar.gz
system-bundles/<CUSTOMER_VPS_IMAGE_VERSION>/matrix-host-bundle.tar.gz.sha256

Rebuild and publish this bundle whenever shell, gateway, bundled apps, host scripts, or runtime CLIs change. Existing VPSes need an explicit in-place refresh or recovery/reprovision until automated bundle upgrades are implemented.
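
As a rough sketch of how the two keys relate, the helper below derives both from an image version and checks a downloaded bundle against its checksum file. The function names are hypothetical, and the sketch assumes the .sha256 file follows the common sha256sum layout (a hex digest, optionally followed by whitespace and a file name); the actual file format is not specified here.

```typescript
import { createHash } from "node:crypto";

// Derive the bundle and checksum keys for a given CUSTOMER_VPS_IMAGE_VERSION.
function bundleKeys(imageVersion: string): { bundle: string; checksum: string } {
  const bundle = `system-bundles/${imageVersion}/matrix-host-bundle.tar.gz`;
  return { bundle, checksum: `${bundle}.sha256` };
}

// Assumption: checksumFile looks like "<64 hex chars>  <optional file name>".
function verifyBundle(bundleBytes: Uint8Array, checksumFile: string): boolean {
  const expected = checksumFile.trim().split(/\s+/)[0]?.toLowerCase() ?? "";
  const actual = createHash("sha256").update(bundleBytes).digest("hex");
  // Reject malformed checksum files instead of comparing against garbage.
  return expected.length === 64 && expected === actual;
}
```

A refresh script could call `verifyBundle` after download and before unpacking, so a truncated or mismatched bundle never replaces the installed host services.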

Before rollout, verify:

  • Served shell HTML and client chunks do not reference clerk.example.com.
  • Gateway health returns OK and uses the VPS Postgres database.
  • /api/bridge/query can list app schemas from the customer-local Postgres database.
  • Fresh browser loads do not call legacy /api/canvas.
  • Missing app icons resolve through stable fallbacks rather than a repeated Gemini 503 loop.
  • Canvas pan/zoom starts only from the canvas surface, not from wheel events inside a selected app window.
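
The first check in the list above lends itself to a small script. The sketch below assumes the served shell HTML and client chunks have already been fetched into strings keyed by asset name; `findPlaceholderReferences` is a hypothetical helper, not part of any Matrix OS tooling.

```typescript
// Returns the names of assets whose body still mentions the placeholder host.
function findPlaceholderReferences(
  assets: Record<string, string>,
  placeholder: string = "clerk.example.com",
): string[] {
  return Object.entries(assets)
    .filter(([, body]) => body.includes(placeholder))
    .map(([name]) => name);
}
```

An empty result means none of the inspected assets reference the placeholder; any returned name points at a bundle that was probably built without NEXT_PUBLIC_CLERK_PUBLISHABLE_KEY.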

App Runtime And Postgres

First-party and polished default apps are Vite + React apps with runtime: "vite" and build.output: "dist" in matrix.json. The host-bundle build runs scripts/build-default-apps.mjs, and customer VPS startup copies the built app dist/ assets plus manifests into /home/matrix/home/apps.

Apps use the local Postgres database through the gateway bridge:

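// Insert one row into the "tasks" table of the "todo" app through the gateway bridge.
const res = await fetch("/api/bridge/query", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    action: "insert",
    app: "todo",
    table: "tasks",
    data: { text: "Ship it", done: false },
  }),
});
// Surface HTTP-level failures instead of silently ignoring them.
if (!res.ok) {
  throw new Error(`Bridge query failed with HTTP ${res.status}`);
}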

The gateway registers manifest-declared storage.tables as Postgres tables in a per-app schema. App child processes do not receive the raw DATABASE_URL; the bridge is the intended scoped API for accessing the owner-local database.

Backup Retention

The customer host runs matrix-db-backup.timer hourly. The backup script must upload a timestamped snapshot before updating system/db/latest.

R2 keys:

  • system/vps-meta.json: current machine metadata and heartbeat timestamp.
  • system/db/latest: latest successful snapshot pointer.
  • system/db/snapshots/<timestamp>.dump: Postgres custom-format snapshot restored directly with pg_restore.

Retention pruning is deferred in this slice. The hourly backup path uploads a new snapshot and updates system/db/latest, and it does not invoke a prune step.
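
The ordering requirement (snapshot first, pointer second) can be illustrated with a minimal sketch. The `ObjectStore` interface and `publishSnapshot` function are hypothetical stand-ins for whatever the backup script actually uses, and both the timestamp format and the choice to store the snapshot key as the pointer content are assumptions.

```typescript
// Minimal object-store interface; a real implementation would talk to R2.
interface ObjectStore {
  put(key: string, body: string | Uint8Array): Promise<void>;
}

async function publishSnapshot(
  store: ObjectStore,
  dump: Uint8Array,
  now: Date,
): Promise<string> {
  // Assumed timestamp format: ISO 8601 with ":" and "." replaced by "-".
  const timestamp = now.toISOString().replace(/[:.]/g, "-");
  const snapshotKey = `system/db/snapshots/${timestamp}.dump`;

  // 1. Upload the timestamped snapshot. If this throws, nothing else happens.
  await store.put(snapshotKey, dump);

  // 2. Only after a successful upload, move the latest pointer.
  //    Assumption: the pointer body is the snapshot key.
  await store.put("system/db/latest", snapshotKey);

  return snapshotKey;
}
```

Because the pointer is written last, a failed upload leaves system/db/latest pointing at the previous successful snapshot, which is the property recovery depends on.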

Manual Recovery

Use recovery when a customer VPS has failed, is unrecoverable, or is being intentionally replaced.

curl -sS -X POST "$PLATFORM_PUBLIC_URL/vps/recover" \
  -H "Authorization: Bearer $PLATFORM_SECRET" \
  -H "Content-Type: application/json" \
  -d '{"clerkUserId":"user_test_vps"}'

Expected behavior:

  • The platform verifies system/db/latest unless allowEmpty is explicitly true.
  • The active machine row moves to recovering with a new machineId.
  • The old Hetzner server is deleted if it exists.
  • The replacement server boots from cloud-init and restores before gateway startup.
  • The VPS registers and eventually returns running.

Use allowEmpty only for a new or intentionally empty user:

curl -sS -X POST "$PLATFORM_PUBLIC_URL/vps/recover" \
  -H "Authorization: Bearer $PLATFORM_SECRET" \
  -H "Content-Type: application/json" \
  -d '{"clerkUserId":"user_test_vps","allowEmpty":true}'
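
The snapshot gate described under expected behavior can be sketched as a small decision function. `RecoverRequest` mirrors the request bodies shown above; `decideRecovery`, `RecoverDecision`, and the `hasLatestSnapshot` callback are hypothetical names, and the real platform check may do more than test for the pointer's existence.

```typescript
// Mirrors the JSON bodies sent to POST /vps/recover.
type RecoverRequest = { clerkUserId: string; allowEmpty?: boolean };

type RecoverDecision =
  | { proceed: true; snapshotVerified: boolean }
  | { proceed: false; reason: string };

function decideRecovery(
  req: RecoverRequest,
  hasLatestSnapshot: () => boolean, // hypothetical check for system/db/latest
): RecoverDecision {
  // allowEmpty must be explicitly true; a missing or false value does not count.
  if (req.allowEmpty === true) {
    return { proceed: true, snapshotVerified: false };
  }

  // Otherwise recovery requires a verified latest snapshot pointer.
  if (hasLatestSnapshot()) {
    return { proceed: true, snapshotVerified: true };
  }

  return {
    proceed: false,
    reason: "system/db/latest is missing and allowEmpty is not true",
  };
}
```

The strict `=== true` comparison reflects the word "explicitly" in the text: omitting the field must never be treated as permission to recover into an empty database.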

Restored State

Restored:

  • Postgres app data included in the latest successful snapshot.
  • VPS metadata needed for routing and operator checks.

Not restored in this slice:

  • Any data that was never uploaded to R2.
  • In-memory process state.
  • Data from a failed backup that did not update system/db/latest.

If restore fails, matrix-restore.service exits non-zero and matrix-gateway.service remains gated by ConditionPathExists=/opt/matrix/restore-complete.

Rollback

Rollback is a routing/operator decision:

  • Users without a running user_machines row should not be treated as successfully provisioned production users.
  • To stop serving a VPS user, delete the machine or move it out of the running state, then verify the user sees the intended unavailable/reprovisioning path.
  • DELETE /vps/:machineId soft-deletes the platform row and deletes the Hetzner server, but does not remove R2 data.

Do not request review or rollout approval while still pushing commits to the branch.
