Engineering · 6 min read

How we debugged a broken MCP connection in five minutes — using our own observability stack

A user pinged us today: their Claude Code MCP connection to Launchmatic kept failing. The auth flow completed cleanly — Authentication successful — but then immediately: server reconnection failed. You may need to manually restart Claude Code.

This is a story about how the platform we ship was the platform that found the bug.

The wrong way to debug this

The first instinct on a Bearer-token-protected SSE transport is to start guessing. Token audience mismatch? Cookie domain? Outbound firewall? CORS preflight on the SSE channel? Each of those is a real failure mode and any of them could produce that exact client-side message.

Burning ten minutes on a wrong hypothesis is easy. We needed actual signal.

The right way

Step one: pull a live tail of the production api pod logs. Two seconds:

kubectl logs -n default launchmatic-api-7486889656-8c7xk -c api --tail=200 \
  | grep -iE "mcp|session|oauth|error"

What came back wasn't an error at all:

{"name":"mcp-route","sessionId":"f94e53ac-...","userId":"cmnnj9yoq...",
 "total":4,"message":"MCP session created"}
{"req":{"method":"POST","url":"/","host":"mcp.launchmatic.io"},
 "message":"incoming request"}
{"req":{"method":"POST","url":"/","host":"mcp.launchmatic.io"},
 "message":"incoming request"}
{"req":{"method":"GET","url":"/","host":"mcp.launchmatic.io"},
 "message":"incoming request"}
{"name":"mcp-route","sessionId":"951a2355-...","reason":"idle_timeout",
 "remaining":3,"message":"MCP session closed"}

The session was being created. Three follow-up requests landed on it. An SSE channel opened. Then thirty minutes later it closed for idle timeout. From the server's perspective, the user was successfully connected.
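
If you'd rather not eyeball a grep, the same read of the logs can be sketched in a few lines of Python. This is a minimal sketch, not our tooling: it assumes one JSON object per line (the excerpt above is wrapped for display) and keys off the `message`, `sessionId`, `reason`, and `req.method` fields that appear in that excerpt. The `summarize` helper is a hypothetical name.

```python
import json
from collections import Counter

# Sample lines modeled on the log excerpt above, one JSON object per line.
LOG_LINES = [
    '{"name":"mcp-route","sessionId":"f94e53ac-...","userId":"cmnnj9yoq...","total":4,"message":"MCP session created"}',
    '{"req":{"method":"POST","url":"/","host":"mcp.launchmatic.io"},"message":"incoming request"}',
    '{"req":{"method":"POST","url":"/","host":"mcp.launchmatic.io"},"message":"incoming request"}',
    '{"req":{"method":"GET","url":"/","host":"mcp.launchmatic.io"},"message":"incoming request"}',
    '{"name":"mcp-route","sessionId":"951a2355-...","reason":"idle_timeout","remaining":3,"message":"MCP session closed"}',
]


def summarize(lines):
    """Collect session lifecycle events and count requests by HTTP method."""
    summary = {"created": [], "closed": [], "requests": Counter()}
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if not isinstance(entry, dict):
            continue
        message = entry.get("message")
        if message == "MCP session created":
            summary["created"].append(entry.get("sessionId"))
        elif message == "MCP session closed":
            summary["closed"].append((entry.get("sessionId"), entry.get("reason")))
        elif message == "incoming request":
            summary["requests"][entry.get("req", {}).get("method")] += 1
    return summary


summary = summarize(LOG_LINES)
print("created:", summary["created"])
print("closed:", summary["closed"])
print("requests:", dict(summary["requests"]))
```

Sessions created, requests arriving, and no error-level entries is the pattern that says "look somewhere other than the server."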

So the failure was somewhere between "connection established" and "Claude Code thinks it's connected." Two layers we could check next.

The client log

Claude Code writes its own structured logs at ~/AppData/Local/claude-cli-nodejs/Cache/<project-hash>/mcp-logs-launchmatic/. The latest file:

{"debug":"HTTP connection dropped after 600s uptime"}
{"debug":"Connection error: SSE stream disconnected: TimeoutError"}
{"debug":"Terminal connection error 2/3"}

Six hundred seconds. Ten minutes. That matched the proxy-read-timeout in effect on our nginx-ingress (the stock ingress-nginx default is 60 seconds, so check what your controller is actually configured with). The MCP SSE transport keeps a connection open indefinitely for server-initiated messages, and the proxy was killing it at the ten-minute mark every time. Claude Code retried twice, then gave up on the third disconnect and surfaced the cryptic "reconnection failed" message, even though the underlying transport had been working fine.
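
The tell here is that the drops land on the same round number each time. A rough way to check for that pattern in a client log is sketched below. It assumes the `{"debug": "..."}` line shape shown above and the "dropped after Ns uptime" wording. The repeated sample lines are hypothetical (the excerpt above shows only one drop), and `drop_uptimes` and `likely_fixed_timeout` are names made up for this sketch.

```python
import json
import re
from collections import Counter

UPTIME_RE = re.compile(r"dropped after (\d+)s uptime")

# Hypothetical sample: the first line is from the excerpt above, repeated to
# stand in for successive reconnect attempts.
SAMPLE = [
    '{"debug":"HTTP connection dropped after 600s uptime"}',
    '{"debug":"Connection error: SSE stream disconnected: TimeoutError"}',
    '{"debug":"HTTP connection dropped after 600s uptime"}',
    '{"debug":"HTTP connection dropped after 600s uptime"}',
]


def drop_uptimes(lines):
    """Extract the uptime, in seconds, from every 'dropped after Ns' entry."""
    uptimes = []
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        debug = entry.get("debug") if isinstance(entry, dict) else None
        if not isinstance(debug, str):
            continue
        match = UPTIME_RE.search(debug)
        if match:
            uptimes.append(int(match.group(1)))
    return uptimes


def likely_fixed_timeout(uptimes, min_repeats=2):
    """Return the most common uptime if it repeats, else None."""
    if not uptimes:
        return None
    value, count = Counter(uptimes).most_common(1)[0]
    return value if count >= min_repeats else None


uptimes = drop_uptimes(SAMPLE)
print("drop uptimes:", uptimes)
print("suspected timeout (s):", likely_fixed_timeout(uptimes))
```

Connections that die at varying uptimes suggest flaky networking; connections that die at an identical uptime suggest a configured timeout somewhere in the path.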

The fix is one annotation on the mcp Ingress:

nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"

One line of YAML. But that's not the interesting part. The interesting part is that the actual diagnosis took five minutes once we stopped guessing, because the platform was telling us exactly what was happening, on both sides of the wire, the whole time.
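
For reference, one way to apply that annotation without hand-editing the manifest is a merge patch. The sketch below only builds and prints the `kubectl patch` command and does not run it. The Ingress name `mcp` and the `default` namespace are assumptions taken from the names used in this post, and `patch_command` is a hypothetical helper; substitute your own resource names.

```python
import json
import shlex

ANNOTATION = "nginx.ingress.kubernetes.io/proxy-read-timeout"


def patch_command(ingress, namespace, seconds):
    """Build a kubectl merge-patch command that sets the read timeout."""
    # Annotation values must be strings, hence str(seconds).
    patch = {"metadata": {"annotations": {ANNOTATION: str(seconds)}}}
    args = [
        "kubectl", "patch", "ingress", ingress,
        "-n", namespace,
        "--type", "merge",
        "-p", json.dumps(patch),
    ]
    return shlex.join(args)  # shell-quotes the JSON payload


# Assumed names: Ingress "mcp" in namespace "default".
cmd = patch_command("mcp", "default", 3600)
print(cmd)
```

Printing the command rather than executing it keeps a human in the loop before a change touches a production Ingress.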

What the platform is for

This is the moment a lot of platform-as-a-service tools fail their users. Heroku-style platforms are great for the happy path; the moment something looks weird, you're shipped off to "check Datadog" or "check New Relic" or "configure your own log drain to Papertrail."

Launchmatic ships the observability primitives in the box because we use them ourselves. That same incident, framed for an end user instead of an internal engineer, would have been:

  1. Open the service in the dashboard. The new Insights tab shows a green Health pill, four ready replicas, zero restarts, last deploy 2h ago: the service is fine. Already a useful signal: whatever is failing sits between the service and the client, not in the service itself.
  2. Watch the Activity timeline tick: incoming requests are landing successfully every few seconds.
  3. Hit Connected Apps → Recent agent activity to see Claude Code's tool calls succeeding, then suddenly stopping.
  4. Click Explain on the Insights tab. The agent reads the logs, the events, the deploy history, and writes back: "Your API is healthy and accepting requests. The pattern of disconnects every 10 minutes points at an idle-timeout in your ingress. Look at the proxy-read-timeout annotation on the mcp Ingress."

That's not an aspirational sketch. Every one of those views shipped this week:

  • Insights tab lives on every service — health rollup, pods + cluster events, deploy + audit timeline, all in one auto-refreshing view.
  • Recent agent activity lists every tool call any MCP client made on your behalf, with input/output preview.
  • Explain is a button on the Insights health card; one click and the agent has the full context.
  • Security activity is its own tab under Settings — every consent, token issue, refresh, revocation with timestamp + IP.

All of it is read-only over telemetry we were already collecting (Pino logs, Prometheus, Kubernetes events, audit log). The new work was wiring it into the dashboard and giving the agent the right prompt template.

Why this is the right shape of platform

The hosted PaaS market separates into two camps. The opinionated "git push, we figure it out" tier (Heroku, Vercel) keeps you out of the operational mess at the cost of visibility: when something breaks, the platform is a black box. The "rent the EKS cluster" tier hands you the keys and walks away, leaving you with total visibility and all the operational burden.

Launchmatic's bet is that the right answer is a third thing: keep the opinionated, push-and-go developer experience, but never hide the operational data. Logs, metrics, events, audit, tool-call traces — they're all in your dashboard, on your timeline, attributed to your services. The agent has the same access you do, so when you don't want to read 500 lines of pod log, you can ask the agent to read them for you.

Same shape, by the way, as the bug we chased this morning: client said "broken," dashboard said "fine," logs said "everything's working except this one annotation." The product told us the answer. We just had to look.

— The Launchmatic team

Try the new Insights tab on any service in your dashboard, and the Explain button when something's red. Recent agent activity is on Settings → Connected Apps; Security activity is on Settings → Security.