Tags: platform, infrastructure, self-hosting, postgres, docker

Gaia: the platform layer under my own personal site

Why I built a multi-tenant platform on one €5/month VPS instead of using a managed PaaS — and what owning the substrate has actually taught me.

The stack, briefly

Four personal projects — a portfolio site, a family recipe app, a workout tracker, and a privacy-friendly analytics dashboard — all run on one €5/month VPS in a Hetzner datacenter in Falkenstein. They share a Postgres cluster, an nginx ingress, a JWT auth service, and a wildcard TLS cert. None of them know about each other; all of them inherit those four things for free.

The shared layer is called gaia. It is, depending on which afternoon you ask me, either an elegant minimalist platform or a fragile pile of bash scripts. Both are correct.

This post is about why I built it, what it cost in time and money, and the specific decisions that have held up under the test of reality. It is also, in a quiet way, an argument against using Kubernetes for things you could have done with a for loop.

The actual components

gaia owns four things and nothing else:

  • Postgres 16 as a shared cluster, one container, per-app roles and databases.
  • nginx as the only thing listening on :80 and :443 from the public internet, doing TLS termination, ingress routing by subdomain, and per-route rate limiting.
  • A small FastAPI auth-service that issues JWTs as cookies scoped to .karanbanga.com. This single decision is what gives me cross-subdomain SSO — the user logs in once at auth.karanbanga.com and is logged into everything.
  • lego for wildcard cert issuance and renewal via DNS-01 against GoDaddy. I would have used certbot, but certbot has no official GoDaddy plugin and I learned this the hard way after an evening spent debugging a Docker image that did not exist.

Each consumer app lives in its own repo, its own docker-compose, its own /opt/<app>/ directory on the box. They join gaia's shared Docker network as external participants, get their Postgres provisioning from a config-driven script, and inherit TLS and auth without owning any of that complexity themselves. Adding a new app is mostly an exercise in copy-pasting yesterday's app and changing the names.

The pattern that makes it work

Every app gets a Postgres role and database via one comma-separated environment variable on the gaia stack:

PLATFORM_DBS=hestia_auth_user:HESTIA_AUTH_DB_PASSWORD:hestia_auth,demeter_user:DEMETER_DB_PASSWORD:demeter,umami_user:UMAMI_DB_PASSWORD:umami

Three colon-separated fields per app: role name, name of the env var that holds the password, database name. A 30-line init.sh runs on Postgres first-boot, iterates the triples, and uses bash indirect expansion — ${!pw_var} — to dereference each password from the environment without ever logging it. Adding a new app's database is: append a triple to the env var, recreate Postgres, done. No Terraform, no Helm chart, no Pulumi stack.

The middle field being an env var name rather than the value matters. It means the comma-delimited string never contains a secret. The script is the only thing that ever sees the actual password, and it sees it through the symbol table. The whole pattern fits on one screen.
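For readers who think more easily in Python than in bash, here is a minimal sketch of the same triple-parsing idea. It is not the init.sh described above: the names DbSpec, parse_platform_dbs, and resolve_password are hypothetical, and the dictionary lookup stands in for bash's ${!pw_var}.

```python
import os
from typing import List, Mapping, NamedTuple


class DbSpec(NamedTuple):
    role: str
    password_var: str  # the NAME of the env var holding the password, never the secret
    database: str


def parse_platform_dbs(raw: str) -> List[DbSpec]:
    """Split 'role:PW_VAR:db,role:PW_VAR:db' into triples."""
    specs = []
    for entry in raw.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate a trailing comma
        # Raises ValueError if an entry does not have exactly three fields.
        role, password_var, database = entry.split(":")
        specs.append(DbSpec(role, password_var, database))
    return specs


def resolve_password(spec: DbSpec, env: Mapping[str, str] = os.environ) -> str:
    # Analogue of bash indirect expansion: look up the variable named by the field.
    # Deliberately unvalidated, so a mistyped name yields an empty string.
    return env.get(spec.password_var, "")


if __name__ == "__main__":
    for spec in parse_platform_dbs(os.environ.get("PLATFORM_DBS", "")):
        # Print only the variable name, never the resolved password.
        print(spec.role, spec.database, "password from", spec.password_var)
```

Because the config string carries only variable names, it can be printed or committed without leaking anything. The secret exists only at the moment of lookup.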

This is the kind of design that makes infrastructure people physically uncomfortable. There's no schema, no validation framework, no idempotency tests. As first written, a typo in the env var name would silently provision a role with an empty password and the script would move on cheerfully. I am aware of all of this. I have been adding apps to this platform for several months and it has never been the bottleneck.

The lesson I keep relearning: the size of the solution should match the size of the problem. Provisioning four database roles for personal projects does not need any of the things I keep wanting to add. It needs a bash loop.

(I did eventually add tests for the bash loop. They run in 3 seconds, validate empty-password and missing-field cases, and saved me about ten minutes once. Worth it. But still: bash loop.)

The auth trick

The piece I'm proudest of is the single sign-on, and the trick is one cookie attribute.

The auth-service issues a JWT as a cookie with Domain=.karanbanga.com. That Domain attribute is the entire mechanism: it tells the browser to send the cookie to karanbanga.com and every subdomain under it. (Browsers that follow RFC 6265 ignore the leading dot; what matters is that the attribute is present.) The user logs in at auth.karanbanga.com, gets the cookie, and from then on demeter.karanbanga.com, ares.karanbanga.com, and the admin UI on the apex all see the same auth_token on every request. Each backend validates it locally using a shared JWT_SECRET. No auth round-trips per request. No service to maintain whose job is "remember if you're logged in."

The invariants are sharp. The cookie must carry Domain=.karanbanga.com: omit the attribute and the cookie becomes host-only, scoped to auth.karanbanga.com alone. The JWT_SECRET must be byte-identical in every consumer's .env: let it drift by one character in one app and that backend rejects every token with a 401, and you spend an evening wondering why a working system stopped working. I have an entry in my notes called "JWT / cookie cross-app invariants" that exists because the alternative is wondering, again.
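A minimal sketch of both invariants, using only the Python standard library. Several things here are assumptions, not details of the real auth-service: the HS256 algorithm, the claim names, the Secure/HttpOnly/SameSite flags, and the function names. A production service would more likely use a JWT library than hand-rolled signing.

```python
import base64
import hashlib
import hmac
import json
import time
from http.cookies import SimpleCookie
from typing import Optional


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input: str, secret: str) -> bytes:
    return hmac.new(secret.encode(), signing_input.encode(), hashlib.sha256).digest()


def issue_token(claims: dict, secret: str) -> str:
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps(claims).encode())
    return f"{header}.{payload}.{_b64url(_sign(f'{header}.{payload}', secret))}"


def verify_token(token: str, secret: str) -> Optional[dict]:
    """Return the claims if the token is valid, else None (the caller answers 401)."""
    try:
        header_b64, payload_b64, sig_b64 = token.split(".")
        header = json.loads(_b64url_decode(header_b64))
        claims = json.loads(_b64url_decode(payload_b64))
        signature = _b64url_decode(sig_b64)
    except ValueError:
        return None
    if not isinstance(header, dict) or header.get("alg") != "HS256":
        return None
    # Any difference in the secret, even one character, changes the expected signature.
    if not hmac.compare_digest(_sign(f"{header_b64}.{payload_b64}", secret), signature):
        return None
    exp = claims.get("exp") if isinstance(claims, dict) else None
    if not isinstance(exp, (int, float)) or exp < time.time():
        return None
    return claims


def auth_cookie_header(token: str) -> str:
    """Build the Set-Cookie value; the Domain attribute is what enables SSO."""
    cookie = SimpleCookie()
    cookie["auth_token"] = token
    cookie["auth_token"]["domain"] = ".karanbanga.com"
    cookie["auth_token"]["path"] = "/"
    cookie["auth_token"]["secure"] = True      # assumed flag
    cookie["auth_token"]["httponly"] = True    # assumed flag
    cookie["auth_token"]["samesite"] = "Lax"   # assumed flag
    return cookie["auth_token"].OutputString()
```

Every consumer backend would run only verify_token, each with its own copy of the secret. That is why a single drifted .env fails closed for that one app while the others keep working.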

nginx is doing more than nginx usually does

One nginx container fronts everything. Apps don't bind host ports in production — they join the shared Docker network, and nginx proxies to them by service name. Each app gets a vhost in gaia/nginx/conf.d/:

server {
    listen 443 ssl;
    server_name demeter.karanbanga.com;

    ssl_certificate     /etc/letsencrypt/live/karanbanga.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/karanbanga.com/privkey.pem;

    location / {
        set $demeter_frontend demeter-frontend:3000;
        proxy_pass http://$demeter_frontend;
        # ...
    }
}

That set $var ...; proxy_pass http://$var; two-step is load-bearing. If you write the upstream as a literal hostname, nginx resolves it when the config is loaded, which means nginx -t fails any time the upstream container isn't running. Putting it in a variable defers DNS resolution to request time, where it goes through Docker's embedded resolver (this needs a resolver directive pointing at 127.0.0.11), and the config tests clean even with half the platform down.

I found this out the slow way. I added an nginx -t job to CI months after I'd started writing vhosts with literal hostnames, and watched it stay red on every commit for two weeks before I finally read the error message. There is a category of bug you only discover by adding the test that catches it, and that's an argument for adding the test sooner.
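One cheap way to catch the literal-hostname mistake before nginx -t does is a text check over the vhost files. This is a hypothetical lint, not part of the CI described above, and it is a heuristic: it would also flag a named upstream block, which is legitimate nginx.

```python
import re
from typing import List

# Matches e.g. "proxy_pass http://demeter-frontend:3000;" and captures the host part.
PROXY_PASS = re.compile(r"^\s*proxy_pass\s+https?://([^/:;\s]+)", re.MULTILINE)
IPV4 = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


def literal_upstreams(conf_text: str) -> List[str]:
    """Return proxy_pass hosts that nginx would try to resolve at config load."""
    hosts = []
    for host in PROXY_PASS.findall(conf_text):
        if host.startswith("$"):
            continue  # variable: resolution is deferred to request time
        if IPV4.match(host):
            continue  # literal IP: no DNS lookup involved
        hosts.append(host)
    return hosts
```

Run over every file in conf.d/, a non-empty result is a reason to fail the build.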

The wildcard cert is the other quiet load-bearer. A *.karanbanga.com cert from lego, renewed weekly via cron, means every new subdomain just works. The first time I added a new app and https://newapp.karanbanga.com returned 200 with valid TLS on the first request — no certbot dance, no DNS challenge ceremony, no waiting — it was one of those moments where you suspect you've done something illegal.

The hardening, which is half the project

The initial cutover onto this platform was the easy half. The interesting half has been the slow accumulation of "make this harder to break."

Phase 8 added daily Postgres dumps and MinIO volume tarballs to /root/backups/, kept for 14 and 7 days respectively, with a non-destructive restore drill that loads the dump into a throwaway container and verifies the data. The drill caught nothing, which is the point: a restore drill exists so that you find out backups don't restore before you need them to.
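Retention is the part of a backup job that is easiest to get subtly wrong, so here is a minimal sketch of age-based pruning. The function name and the file patterns in the usage note are hypothetical; the real job may well be a one-line find command.

```python
import time
from pathlib import Path
from typing import List, Optional


def prune_old_backups(directory: Path, pattern: str, keep_days: int,
                      now: Optional[float] = None) -> List[Path]:
    """Delete files matching pattern whose mtime is older than keep_days."""
    now = time.time() if now is None else now
    cutoff = now - keep_days * 86400
    removed = []
    for path in sorted(directory.glob(pattern)):
        if path.is_file() and path.stat().st_mtime < cutoff:
            path.unlink()
            removed.append(path)
    return removed
```

Under the assumed naming, the two retention windows would be two calls: prune_old_backups(Path("/root/backups"), "*.sql.gz", 14) for the dumps and the same with "*.tar.gz" and 7 for the volume tarballs.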

The next layer was offsite. The local dumps protect against application bugs and accidental deletes; they do not protect against the box catching fire. So now there's a third cron at 05:00 that takes each local backup file, gpg-encrypts it to a public key whose private half lives only in my password manager, and rclones the encrypted blob to a Backblaze B2 bucket. The threat model is deliberate: even if the box is rooted, the attacker cannot decrypt the offsite copies. The full chain — encrypt-on-box → upload → download-to-laptop → decrypt with my offline key → restore into a throwaway Postgres — got tested end-to-end the day it shipped. It cost about $0.01.
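The encrypt-then-upload step can be sketched as two subprocess calls. The recipient key ID and the rclone remote below are placeholders, the function names are hypothetical, and the real cron job is presumably a shell script. The gpg and rclone flags are standard ones.

```python
import subprocess
from pathlib import Path
from typing import List


def encrypt_cmd(src: Path, recipient: str) -> List[str]:
    # Public-key encryption: the box only ever needs the public half of the key.
    return ["gpg", "--batch", "--yes", "--encrypt",
            "--recipient", recipient,
            "--output", f"{src}.gpg", str(src)]


def upload_cmd(encrypted: Path, remote: str) -> List[str]:
    # "rclone copy FILE REMOTE:PATH" places the file under that remote path.
    return ["rclone", "copy", str(encrypted), remote]


def ship_offsite(src: Path, recipient: str, remote: str,
                 dry_run: bool = False) -> List[List[str]]:
    """Encrypt one local backup and upload the ciphertext; return the commands used."""
    commands = [encrypt_cmd(src, recipient),
                upload_cmd(Path(f"{src}.gpg"), remote)]
    if not dry_run:
        for cmd in commands:
            subprocess.run(cmd, check=True)  # abort the chain on the first failure
    return commands
```

A dry run such as ship_offsite(Path("/root/backups/example.sql.gz"), "BACKUP_KEY_ID", "b2:example-bucket/gaia", dry_run=True) returns the two argv lists without touching gpg or the network.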

Uptime Kuma watches every endpoint on a 5-minute interval and pages me through Slack on failure. There are now seven monitors, all green. The interesting thing about adding monitoring is how it changes your reaction to a problem: I stopped checking that things were up, and started trusting that I'd hear about it if they weren't. That trust took longer to build than the monitors did.

None of this is impressive engineering individually. Together it's the difference between "I run a platform" and "I run a platform that will probably still be up tomorrow."

What it actually costs

The VPS is a Hetzner CX-series instance with 4 GB RAM, 2 vCPUs, and 75 GB of SSD in Falkenstein, and runs about €5/month. Domain: $20/year from GoDaddy. Backblaze B2 for offsite: pennies right now, plausibly $1/month if the MinIO data ever grows. Call it $7/month all in.

The honest comparison to a managed equivalent is messier than internet writers usually admit. Still, at monthly list prices, Vercel Pro is $20 per seat, Neon's paid tier starts at $19, Auth0's cheapest plan is $35 once you outgrow the free tier, and you'd want some kind of object storage on top. The point isn't that any specific number is right; the point is that you'd be looking at $60-100/month, every month, forever, to host four small personal projects nobody but me uses every day.
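The arithmetic behind that range, using only the figures quoted in this post. It assumes one Vercel seat and leaves object storage out, so the managed total is a floor.

```python
# Monthly prices in USD as quoted in this post; not checked against current price lists.
managed = {
    "Vercel Pro (1 seat)": 20,
    "Neon paid tier": 19,
    "Auth0 cheapest paid plan": 35,
}
self_hosted = 7  # the rounded all-in figure from the previous paragraph

managed_total = sum(managed.values())       # 74
monthly_delta = managed_total - self_hosted  # 67
yearly_delta = monthly_delta * 12            # 804

print(f"managed: ${managed_total}/month, self-hosted: ${self_hosted}/month")
print(f"delta: ${monthly_delta}/month, ${yearly_delta}/year")
```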

I am not saving that delta. I'm spending it in different currency: an hour here, an afternoon there, occasional Saturday mornings lost to figuring out why a thing I touched yesterday broke today. The exchange rate is not great if you bill yourself out at engineering rates. It is excellent if you value understanding the substrate your projects sit on.

The parts that aren't fun

The platform breaks in ways managed services don't, and the fixes are mine to find. A few from earlier this month:

The PLATFORM_DBS script silently provisioned a role with an empty password because I typoed an env var name and the script didn't bother to check. I noticed three days later when the new app couldn't connect. The fix was a five-line validation block and a tiny test suite. It would never have shipped on Auth0, but Auth0 also doesn't let me iterate on the underlying pattern in 30 seconds.
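The shape of that validation, rendered in Python, not the bash that actually shipped. The function name is hypothetical. The one property worth copying is that error messages mention variable names only, never values.

```python
from typing import List, Mapping


def validate_platform_dbs(raw: str, env: Mapping[str, str]) -> List[str]:
    """Return human-readable problems; an empty list means it is safe to provision."""
    errors = []
    for entry in (e.strip() for e in raw.split(",")):
        if not entry:
            continue  # ignore trailing commas
        fields = entry.split(":")
        if len(fields) != 3 or not all(fields):
            errors.append(f"{entry!r}: expected role:PASSWORD_VAR:database")
            continue
        role, password_var, _database = fields
        if not env.get(password_var):
            # Covers both the mistyped name (unset) and the empty value.
            errors.append(f"{role}: env var {password_var} is unset or empty")
    return errors
```

Running this before any CREATE ROLE and exiting non-zero on a non-empty result turns the three-day discovery into an immediate one.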

Recreating the Postgres container (which I do every time I add an app) silently invalidated the asyncpg connection pools in every consumer service. SQLAlchemy doesn't pre-ping pooled connections by default, async or otherwise. The pools sat there returning "connection is closed" 500s for four days before I happened to try logging into the admin UI and noticed. The fix is one keyword argument on the engine: pool_pre_ping=True. The harder fix was changing my mental model so that "I touched Postgres" automatically meant "I need to restart the backends."
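To make the failure mode concrete, here is a toy pool that shows what a pre-ping does. It is not SQLAlchemy's implementation: ToyPool is a hypothetical class, sqlite3 stands in for Postgres so the sketch runs anywhere, and closing the connection behind the pool's back simulates the database container being recreated.

```python
import sqlite3


class ToyPool:
    """A one-connection pool that optionally checks liveness on checkout."""

    def __init__(self, pre_ping: bool):
        self.pre_ping = pre_ping
        self._conn = self._connect()

    def _connect(self) -> sqlite3.Connection:
        return sqlite3.connect(":memory:")

    def checkout(self) -> sqlite3.Connection:
        if self.pre_ping:
            try:
                self._conn.execute("SELECT 1")  # cheap liveness probe
            except sqlite3.Error:
                self._conn = self._connect()    # stale: replace it transparently
        return self._conn


if __name__ == "__main__":
    for pre_ping in (False, True):
        pool = ToyPool(pre_ping=pre_ping)
        pool._conn.close()  # simulate the server going away
        try:
            pool.checkout().execute("SELECT 1")
            print(f"pre_ping={pre_ping}: query succeeded")
        except sqlite3.ProgrammingError as exc:
            print(f"pre_ping={pre_ping}: query failed ({exc})")
```

Without the probe, the first caller after a restart gets the dead connection and the error. With it, the pool pays one extra round-trip per checkout and the caller never notices.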

I rotated my admin password and the rotation worked the third time. The first attempt nested heredocs in a way that broke an SSH session silently. The second attempt's shell-quoting accidentally wrote an empty bcrypt hash to the database, with cheerful UPDATE 1 confirmation from psql. The third attempt finally used psql's native parameter substitution and worked on the first try. I learned a generalizable rule: never trust UPDATE N as proof of correctness on a sensitive column. Always follow with a SELECT to verify the value is what you intended.
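The "UPDATE, then SELECT to verify" rule can be sketched as one small function. sqlite3 stands in for Postgres here, and the users table, its columns, and the function name are all hypothetical.

```python
import sqlite3


def set_password_hash(conn: sqlite3.Connection, username: str, new_hash: str) -> None:
    """Write a new hash, then read it back before committing."""
    if not new_hash:
        raise ValueError("refusing to write an empty password hash")
    # Bound parameters: no shell or SQL quoting layer can mangle the value.
    cur = conn.execute(
        "UPDATE users SET password_hash = ? WHERE username = ?",
        (new_hash, username),
    )
    if cur.rowcount != 1:
        conn.rollback()
        raise RuntimeError(f"expected to update 1 row, updated {cur.rowcount}")
    # A row count of 1 says a row changed, not that it holds the intended value.
    stored = conn.execute(
        "SELECT password_hash FROM users WHERE username = ?", (username,)
    ).fetchone()[0]
    if stored != new_hash:
        conn.rollback()
        raise RuntimeError("stored hash does not match the intended value")
    conn.commit()
```

The empty-value guard at the top would have stopped the second attempt outright. The read-back covers the cases a guard cannot anticipate.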

None of these are sophisticated bugs. All of them are the texture of running infrastructure — the moments where you stop building features and learn how the substrate actually behaves under pressure. Some weeks that texture is the whole project. Some weeks I'd trade it for not having it.

What I'd tell my past self

Build the smallest version of every piece first. The platform exists because I needed one place to put a portfolio site, then I noticed I had a second project that could share the auth, then a third that could share the database. None of it was planned as platform-thinking up front. It accreted. I think that's the only way these things actually work, because the alternative is you spend six months building "a platform" before there's a single app that wants to live on it.

Write down the cross-cutting invariants — the JWT secret, the cookie domain, the network name, the Postgres role pattern — in a place you will actually re-read. They are exactly the kind of thing you will forget the moment you stop thinking about them, and exactly the kind of thing whose failure mode is silent. My workspace has a CLAUDE.md at the root that exists for this reason and this reason only.

Don't try to make it look like a real platform. There will be a moment, around month two, where you will think: should I add a service mesh? An observability stack? Vault? The answer is no. You are not running Google. The thing that makes a one-person platform tractable is that it stays one person's worth of complexity, which means you have to actively resist every shiny tool that would multiply that.

And: every time you reach for a tool that promises to save you 30 minutes today by spending 4 hours of your future, ask whether the leverage is real or whether you are just enjoying the build. Sometimes that's the right answer — building is its own reward. Sometimes the answer is "Vercel is fine, ship the actual project." I've gotten that one wrong in both directions, and I'm not always sure I'm getting it right now.

The platform under this blog post is a few weeks old in its current consolidated shape. It runs four apps, costs about the price of a coffee per month, and is more thoroughly tested and documented than most professional infrastructure I've worked on. The repos are on GitHub, if you want to see the substrate.


This post is the first of a series of project deep-dives — each app on the platform gets its own. Demeter is next, after Phase 2's recipe-import flow ships. Then ares.