debugging · e2e · docker · nginx · nextjs · testing · ci-cd

The bug was cookies. The bug was never cookies.

The design doc said thirty minutes. It took a day. A debugging story about the gap between knowing what's broken and knowing what it'll take to fix it — told in six layers, each one hidden behind the last.

May 31, 2026

A green checkmark, eventually

For two weeks, one row in my CI dashboard was red. The E2E suite for my portfolio site — the Playwright tests that log into the admin panel and create a blog post the way I actually do — had been failing on every single push since the 16th of May. Everything else was green. The frontend built. The backend tests passed. The site was live and working in production. But Post CRUD had been failing for fourteen days, and I had a written diagnosis explaining exactly why.

The diagnosis was wrong. Not factually wrong — it correctly identified a real problem. It was wrong in the way a weather forecast that says "rain" is wrong when what actually happens is hail, then a flood, then a tree through your roof. It described the first thing and none of the things that mattered.

This is a story about the difference between knowing what's broken and knowing what it'll take to fix it. They feel like the same knowledge. They are not.

The diagnosis I trusted

Here's what I knew going in, written up in a design doc with a Playwright trace ID attached as evidence.

My E2E rig runs four processes on four different localhost ports: the Next.js frontend on :3000, the FastAPI backend on :8000, the shared auth-service on :8100, and a Postgres database. The test logs in (the browser POSTs to the auth-service on :8100, which replies with a Set-Cookie: auth_token=...) and then tries to create a post, which means the browser POSTs to the backend on :8000. And that second request goes out without the cookie. The backend sees an anonymous request, returns 401, and the browser console lights up with a misleading CORS error that sends you chasing the wrong thing for an afternoon.

The root cause, confirmed from the network log in the trace: Chromium will store a cookie scoped to Domain=localhost but won't send it on a cross-port request. localhost:8100 setting a cookie and localhost:8000 reading it are, as far as the browser's cookie policy is concerned, different enough to matter. Production was unaffected — there, everything is same-origin under karanbanga.com with a Domain=.karanbanga.com cookie, so the whole class of problem evaporates. This was purely a test-rig artifact.

The fix was correspondingly clean, and the design doc estimated it at thirty minutes of code plus one CI cycle: stop having the browser talk to backend ports directly. Route every browser API call through the Next.js server as a proxy — /api/auth/* and /api/* as catch-all route handlers that forward to the Docker-internal services. The browser only ever talks to :3000. Cookies become first-party. No cross-port problem because there's no cross-port traffic. As a bonus, this matches production's architecture exactly, where nginx already terminates everything under one origin.
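For a sense of scale, here is a minimal sketch of what such a catch-all handler could look like. It is not the code from the repo: the file location, the `http://backend:8000` address, and the names `buildUpstreamUrl` and `forward` are all assumptions for illustration, and it only covers the backend half of the proxy.

```typescript
// Hypothetical location: app/api/[...path]/route.ts

// Assumption: the Docker-internal address of the backend service.
const BACKEND_ORIGIN = "http://backend:8000";

// Map the browser-facing URL (/api/posts?draft=1) to the upstream URL
// (/posts?draft=1), keeping the query string intact.
function buildUpstreamUrl(incoming: string, upstreamOrigin: string): string {
  const url = new URL(incoming);
  const path = url.pathname.replace(/^\/api(?=\/|$)/, "") || "/";
  return new URL(path + url.search, upstreamOrigin).toString();
}

// Forward method, headers (including Cookie), and body, then hand the
// upstream response (including any Set-Cookie) back to the browser.
async function forward(request: Request): Promise<Response> {
  const hasBody = request.method !== "GET" && request.method !== "HEAD";
  const headers = new Headers(request.headers);
  headers.delete("host"); // let fetch set Host for the upstream

  const upstream = await fetch(buildUpstreamUrl(request.url, BACKEND_ORIGIN), {
    method: request.method,
    headers,
    body: hasBody ? await request.arrayBuffer() : undefined,
    redirect: "manual",
  });

  // fetch has already decoded the body, so these two headers no longer
  // describe what is being sent on to the browser.
  const responseHeaders = new Headers(upstream.headers);
  responseHeaders.delete("content-encoding");
  responseHeaders.delete("content-length");
  return new Response(upstream.body, { status: upstream.status, headers: responseHeaders });
}

// In a real route file the same function would be exported per method:
// export { forward as GET, forward as POST, forward as PUT, forward as DELETE };
```

The only part with any logic in it is the path rewrite. Everything else is plumbing, which is a large part of why thirty minutes sounded reasonable.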

Thirty minutes. I had done the hard part — the diagnosis — and the rest was typing. You can probably guess where this is going, because you've read the title.

Surprise one: `docker compose up --wait` is a lie

I wrote the two proxy routes, wired them up, and added a CI step to verify the proxy actually forwarded requests before running the full Playwright suite. The verify step curled the proxy endpoint right after docker compose up -d --wait returned. And it failed: curl: (56) Failure when receiving network data.

The log lines made it look like the proxy route wasn't registering — like Next.js had silently ignored my new catch-all handler. I spent a while staring at the route file convinced I'd misnamed it.

The actual problem: docker compose up -d --wait doesn't wait for anything useful unless your containers declare a healthcheck. Mine didn't. The frontend and backend dev containers have no healthcheck, so --wait reported them "Healthy" the instant the container started — which is the moment the process inside begins booting, not the moment it's ready to serve. The curl was racing a process that hadn't opened its socket yet. The TCP handshake completed through Docker's port-proxy, but there was no listener behind it to send any data back. Hence exit 56, which reads like a network failure and is actually a "you arrived early" failure.

--wait had been quietly meaningless this whole time. I'd just never written a CI step fast enough to notice. The fix was a poll loop — curl, sleep, retry up to ninety seconds — instead of trusting a readiness signal that was never real.
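The CI step itself was curl in a shell loop. The same idea in TypeScript, as a sketch under assumptions: the ninety-second budget comes from above, while the two-second interval, the ten-second per-attempt timeout, and the names `waitForReady` and `httpProbe` are illustrative choices.

```typescript
type Probe = (url: string) => Promise<boolean>;

interface WaitOptions {
  timeoutMs?: number;  // total budget; ninety seconds, as in the CI step
  intervalMs?: number; // pause between attempts (assumed value)
  probe?: Probe;       // swappable so the loop can be exercised without a server
  pause?: (ms: number) => Promise<void>;
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// One attempt: ready means the server answered with any HTTP status.
// A refused, reset, or timed-out connection rejects and counts as "not yet".
const httpProbe: Probe = async (url) => {
  try {
    await fetch(url, { signal: AbortSignal.timeout(10_000) });
    return true;
  } catch {
    return false;
  }
};

// Probe, pause, retry until the probe succeeds or the budget runs out.
async function waitForReady(url: string, options: WaitOptions = {}): Promise<boolean> {
  const { timeoutMs = 90_000, intervalMs = 2_000, probe = httpProbe, pause = sleep } = options;
  const deadline = Date.now() + timeoutMs;
  do {
    if (await probe(url)) return true;
    await pause(intervalMs);
  } while (Date.now() < deadline);
  return false;
}
```

A caller would `await waitForReady("http://localhost:3000/")` and fail the job if it resolves to `false`. The point of the design is that readiness is measured from the outside, by the same kind of request the tests will make, rather than inferred from container state.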

Surprise two: Next.js compiles your routes when you knock, not when you build

Even with the poll loop, the first request to the proxy was slow and weird in a way the later ones weren't. This one I should have known.

Next.js in dev mode compiles route handlers just-in-time, on first request. The catch-all /api/[...path]/route.ts doesn't exist as runnable code until the first time something hits it, at which point Turbopack compiles it on the spot — which can take one to five seconds. So the first curl in the poll loop wasn't just racing the boot; it was triggering the compile, and timing out before the compile finished.

The thing that made this expensive to diagnose is that a half-compiled route fails the same way an absent route does. The only way I localized it was to put a console.log and a try/catch around every single step inside the proxy handler — read the incoming request, build the upstream URL, fetch, stream the response back — so that the CI logs would tell me how far into the handler execution actually got before dying. This is the move when a failure refuses to localize itself: you stop guessing and make the code narrate. That instrumentation is the only reason surprise two and surprise three are distinguishable from each other at all, because they both present as "the POST 500s with an unhelpful body."
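A sketch of that kind of instrumentation, assuming a small wrapper rather than hand-written try/catch blocks. The `step` helper, the `[proxy]` log prefix, and `narratedProxy` are hypothetical names, not the handler from the repo.

```typescript
type Log = (message: string) => void;

// Run one named step, logging entry, success, and failure, then rethrow
// so the caller still sees the original error.
async function step<T>(label: string, fn: () => Promise<T> | T, log: Log = console.log): Promise<T> {
  log(`[proxy] ${label}: start`);
  try {
    const result = await fn();
    log(`[proxy] ${label}: ok`);
    return result;
  } catch (error) {
    log(`[proxy] ${label}: FAILED ${error instanceof Error ? error.message : String(error)}`);
    throw error;
  }
}

// Hypothetical handler walking through the four stages named above.
async function narratedProxy(request: Request, upstreamOrigin: string): Promise<Response> {
  const body = await step("read request", () => request.arrayBuffer());
  const target = await step("build upstream URL", () =>
    new URL(new URL(request.url).pathname, upstreamOrigin));
  const upstream = await step("fetch upstream", () =>
    fetch(target, {
      method: request.method,
      headers: request.headers,
      body: body.byteLength > 0 ? body : undefined,
    }));
  return step("stream response", () => new Response(upstream.body, { status: upstream.status }));
}
```

The last "start" line without a matching "ok" is the step that died. If no line appears at all, the handler never ran, which points at the route not being compiled or registered rather than at anything inside it.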

Surprise three: the bucket that was never there

With the proxy compiling and the cookie now riding along first-party, the login worked, the cookie reached the backend, the backend authorized the request — and POST /posts returned a 500. Progress! A new failure is progress. It meant the original bug was actually fixed and I'd uncovered the next layer down.

The 500 was an S3 error: InvalidAccessKeyId. The backend stores blog media in MinIO, and it was trying to write to a bucket using a service account that didn't exist.

It didn't exist because the container that creates it never ran. In my compose file, the MinIO init step — the one-shot job that creates the portfolio-blog bucket and provisions the service account — is gated behind a Docker Compose profile (profiles: ["init"]). This is deliberate: in local dev you don't want to re-run bucket initialization on every up. But a plain docker compose up skips profiled services entirely, and my CI rig didn't know to ask for the profile. So CI had a MinIO with no buckets and no credentials, and the failure was invisible right up until the first request that actually tried to write — because reads against a never-written bucket fail open, and silently.

The whole time Post CRUD had been "failing because of cookies," it had also been sitting on a backend that couldn't have stored a post even if the cookie had arrived. Two independent bugs, perfectly stacked, the second one completely hidden behind the first. Fixing the cookie was what exposed it. The fix was one word in the CI invocation: --profile init.
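Where the flag goes matters: `--profile` is an option of `docker compose` itself, so it sits before `up`, not after it. A small sketch of building that invocation from a script, where `composeUpArgs` is a hypothetical helper and the `up -d --wait` tail is taken from the CI step described earlier:

```typescript
// Build the argv for `docker compose up`, optionally enabling profiles.
// Each profile becomes its own `--profile <name>` pair, placed before `up`.
function composeUpArgs(profiles: string[] = []): string[] {
  const profileFlags = profiles.flatMap((name) => ["--profile", name]);
  return ["compose", ...profileFlags, "up", "-d", "--wait"];
}
```

With Node, the result could be handed to `spawnSync("docker", composeUpArgs(["init"]), { stdio: "inherit" })` from `node:child_process`. Called with no profiles it produces the plain invocation, the one that silently skips every profiled service.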

Surprise four: the rate limit that only failed sometimes

Now the suite passed. Then it failed. Then it passed. Then it failed — thirteen of sixteen tests one run, all sixteen the next, with no code change in between. The worst kind of green: the kind you can't trust.

The Post CRUD describe block logs in fresh before every test: six-plus logins per run, all from the CI runner's single IP. The auth-service has a hardcoded rate limit of five logins per minute, keyed by client address. There's no environment knob to turn it off; the only thing that disables it is a flag the auth-service flips inside its own test fixtures. So whether a run passed came down to how its logins happened to fall against that one-minute window. If no more than five landed inside the same window, the run passed. If a sixth arrived before the window had moved on, it got a 429, the login failed, the browser bounced back to the login page, and Playwright failed with "expected /admin/posts, got /admin/login."
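The arithmetic is easy to check with a toy model. This sketch assumes a sliding-window limiter, which may not be how the auth-service actually implements its limit; `simulateLogins` and `everySeconds` are illustrative names, and only the five-per-minute figure comes from the service.

```typescript
// Sliding-window limiter: allow at most `limit` accepted logins in any
// `windowMs` span. Returns the HTTP status each attempt would receive.
function simulateLogins(attemptTimesMs: number[], limit = 5, windowMs = 60_000): number[] {
  let accepted: number[] = [];
  return attemptTimesMs.map((now) => {
    // Forget accepted logins that have aged out of the window.
    accepted = accepted.filter((t) => now - t < windowMs);
    if (accepted.length >= limit) return 429;
    accepted.push(now);
    return 200;
  });
}

// Six logins, one per test, evenly spaced `gapSeconds` apart.
const everySeconds = (gapSeconds: number): number[] =>
  Array.from({ length: 6 }, (_, i) => i * gapSeconds * 1000);

const packed = simulateLogins(everySeconds(5));  // all six inside 25 seconds
const spread = simulateLogins(everySeconds(15)); // six spread over 75 seconds
```

Under these assumptions `packed` ends in a 429, because the sixth login arrives while the first five are still inside the window, and `spread` is 200 all the way through. Same tests, same code, different pacing.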

Every previous "green" run I'd celebrated had been lucky timing — a coin that kept landing heads. I had not fixed the test. I had gotten away with it. The fix was to reach into the auth-service's source from the consumer repo's CI workflow, append the line that disables the limiter, and rebuild the image so the patch actually baked in rather than getting overwritten by the published image from the registry.

Surprise five: production had its own opinion

At this point CI was genuinely, repeatably green — sixteen of sixteen, several runs in a row. I shipped it. And then, because the whole architecture had changed how the browser talks to the backend, I did the thing the original 30-minute estimate never imagined I'd need to do: I tested logging into production.

Production login returned a 404.

The new same-origin model has the browser POST to karanbanga.com/api/auth/login. In dev and CI, the Next.js catch-all proxy intercepts that and forwards it to the auth-service. But in production, nginx sits in front of everything and terminates /api/* requests before they ever reach Next.js — routing them to the backend, which has no /auth/* routes at all. nginx was stripping the prefix and handing POST /auth/login to a service that had never heard of it. Clean 404.

The fix was an nginx location block for /api/auth/ next to the generic /api/ block. For plain prefix locations nginx picks the longest matching prefix, so /api/auth/ catches authentication requests and sends them to the auth-service instead of the backend. I placed it above the generic block, but that ordering is for the reader, not for nginx. One block. But it only existed as a problem in production, which the test rig, by construction, could never have shown me. The thing I'd built to make the test environment match production had a seam at exactly the layer the test environment didn't include.
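A toy model of how nginx chooses among plain prefix locations makes the before and after easy to see. It deliberately ignores exact-match (`=`), `^~`, and regex locations, and the upstream names in the table are assumptions rather than the real config.

```typescript
// Among plain prefix locations, the longest prefix that matches the
// request path wins, whatever order the blocks appear in.
function pickLocation(path: string, prefixes: string[]): string | undefined {
  return prefixes
    .filter((prefix) => path.startsWith(prefix))
    .sort((a, b) => b.length - a.length)[0];
}

// Hypothetical routing table after the fix: location prefix -> upstream.
const upstreams = new Map<string, string>([
  ["/", "nextjs"],
  ["/api/", "backend"],
  ["/api/auth/", "auth-service"],
]);

function routeFor(path: string, table: Map<string, string> = upstreams): string {
  const location = pickLocation(path, Array.from(table.keys()));
  return location === undefined ? "no match" : table.get(location) ?? "no match";
}

// The table before the fix: no /api/auth/ entry, so /api/ takes the request.
const beforeFix = new Map<string, string>([
  ["/", "nextjs"],
  ["/api/", "backend"],
]);
```

With the `beforeFix` table, `/api/auth/login` lands on the backend, which is the service that had never heard of it. With the full table the same path goes to the auth-service, and `/api/posts` still goes to the backend.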

The actual lesson

Add it up: a cookie-policy quirk in the browser, a readiness flag that lied in Docker, a JIT compile in Next.js, a profile-gated init container in Compose, a hardcoded rate limit in the auth-service, and a routing gap in nginx. One red row, with causes in six different layers of the stack, and it didn't stay green until I'd found all six. The written diagnosis described one layer, and described it accurately. The estimate was thirty minutes. The real thing took a day, and most of that day was spent discovering that each fix didn't finish the job; it just unlocked the next failure.

I want to be honest about what that felt like, because the tidy retrospective hides it. In the middle of this, watching the build go red again and again, I kept reacting like things were breaking. "Failed again." "Still failing." But nothing was breaking. Every one of those failures was a fix working — peeling off the top layer of the onion and exposing the next one, which had been there the entire time, hidden. In a stacked failure, a new error message isn't a setback; it's a receipt for the fix that just exposed it. The cookie fix didn't cause the MinIO bug; it revealed it. The proxy didn't cause the nginx 404; it required the routing that production was missing. A diagnosis that stops at the first true thing makes all the subsequent true things feel like regressions, when they're really just the parts of the problem your flashlight didn't reach.

So the lesson isn't "diagnose more carefully," exactly — the original diagnosis was correct, it was just shallow. The lesson is to hold a confident root cause more loosely than its confidence invites. "I know what's wrong" and "I know what it'll take to make it right" are different claims, and the gap between them is where the day goes. When a doc says thirty minutes and the problem has been unsolved for two weeks, the thirty minutes is measuring the part someone already understood. The two weeks is measuring everything they didn't.

The dashboard row is green now. Sixteen of sixteen, and I trust it this time, because I finally know what all six layers were. The repos are on GitHub if you want to read the commit trail — it's a tidy little staircase of fix(e2e) messages, each one a layer of the onion, none of which I knew about when I wrote "30 minutes" in a design doc and believed it.

This is one of a series of posts about running my own little platform. The infrastructure underneath it all is its own story.