Sophon

Idempotency & resume

Bridges retry. Networks blip. Phones go to sleep mid-stream. Sophon is built so that all of these turn out to be no-ops on the client side — every mutating route is keyed, the SSE stream has a ring buffer, and there's a snapshot endpoint for everything older than the buffer.

Why bother

Without idempotency, every retry risks double-posting on the user's behalf (a duplicate bubble), corrupting the in-flight tool card (two task_completed events for one task), or worse. With it, the contract is simple: the bridge POSTs until it gets a 2xx, and the server makes sure that's exactly-once on the user's side.

Per-route keys

Every mutating bridge route accepts an idempotency key. The key's shape is per-route:

Route            | Key field                             | Notes
-----------------|---------------------------------------|------------------------------------------
sendMessage      | idempotency_key (UUID)                | Unique on (session_id, idempotency_key)
sendMessageDelta | idempotency_key (UUID)                | Unique on (message_id, idempotency_key)
sendMessageEnd   | idempotency_key (UUID)                | Unique on (message_id, idempotency_key)
createTask       | task_id                               | The natural id is the key
updateTask       | task_id (+ optional idempotency_key)  | Distinct key per distinct event
finishTask       | task_id                               | The natural id is the key
requestApproval  | approval_id                           | The natural id is the key

For routes with an explicit idempotency_key:

  • The pair (token, idempotency_key) is unique within 24 hours.
  • Same key, same body → returns the cached response with idempotent: true.
  • Same key, different body → 409 idempotency_conflict.
  • After 24 h, keys are recycled.

Use UUIDv4. If you want logs to be greppable, prefix with something semantic: int_abc-msg-attempt-1.
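
A minimal sketch of key construction, assuming Node's crypto module — the makeIdempotencyKey helper and the int_abc-msg prefix are illustrative, not part of the API:

import { randomUUID } from 'node:crypto'

// One key per logical operation; reuse it verbatim across network retries.
function makeIdempotencyKey(prefix: string): string {
  return `${prefix}-${randomUUID()}`
}

const key = makeIdempotencyKey('int_abc-msg')
// → "int_abc-msg-<uuid>" — greppable, and still unique per operation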

Retry pattern

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms))

async function postWithRetry(path: string, body: object) {
  // TOKEN is your bridge token; body carries the same idempotency_key on every attempt.
  for (let attempt = 0; attempt < 5; attempt++) {
    const r = await fetch(`https://api.sophon.at${path}`, {
      method: 'POST',
      headers: { Authorization: `Bearer ${TOKEN}`, 'Content-Type': 'application/json' },
      body: JSON.stringify(body),
    })
    if (r.ok) return r.json()
    if (r.status === 429 || r.status === 503) {
      // Rate-limited or briefly unavailable: honor Retry-After, plus a little jitter.
      const retryAfter = Number(r.headers.get('Retry-After') ?? '1')
      await sleep((retryAfter + Math.random() * 0.25) * 1000)
      continue
    }
    if (r.status >= 500) {
      // Other server errors: exponential backoff (1 s, 2 s, 4 s, …).
      await sleep(2 ** attempt * 1000)
      continue
    }
    // Remaining 4xx won't succeed on retry — surface them.
    throw new Error(`POST ${path} → ${r.status}`)
  }
  throw new Error(`POST ${path} → exhausted retries`)
}

The body — including the idempotency_key — is identical every attempt.
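
For example — the route path and field names here are illustrative assumptions, not the documented wire format:

// Build the body once, key included, and hand the same object to every attempt.
const body = {
  session_id: 'ses_123',                               // hypothetical id
  idempotency_key: makeIdempotencyKey('int_abc-msg'),
  text: 'hello',
}
const res = await postWithRetry('/v1/sendMessage', body) // path is illustrative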

At-least-once delivery

The server delivers updates at-least-once to your bridge. Network drops between ack and processing happen. Your bridge must dedupe by update_id:

-- Postgres: assumes update_id is the primary key (or has a unique constraint)
INSERT INTO processed_updates (update_id) VALUES ($1)
  ON CONFLICT DO NOTHING
  RETURNING update_id;
-- if no row returned, this is a duplicate — skip processing

Your processed_updates table can be tiny — keep the last 10 k ids and TTL the older ones. The server's buffer is 5 minutes, so anything older than that is ancient history.
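
Wired into a bridge, that looks something like this — a sketch assuming node-postgres (pg); handleUpdate and processUpdate are hypothetical names:

import { Pool } from 'pg'

const pool = new Pool()

async function handleUpdate(update: { update_id: string }) {
  const res = await pool.query(
    `INSERT INTO processed_updates (update_id) VALUES ($1)
       ON CONFLICT DO NOTHING RETURNING update_id`,
    [update.update_id],
  )
  if (res.rowCount === 0) return // duplicate delivery — already processed
  await processUpdate(update)    // your actual handler (hypothetical)
}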

SSE resume on the iOS side

iOS connects to GET /v1/me/stream with Last-Event-ID set to the highest event id it persisted. The server replays from there through a 5-minute / 256-event ring buffer, then continues live.

GET /v1/me/stream HTTP/1.1
Authorization: Bearer <user_session>
Last-Event-ID: 12842

← id: 12843
  event: message_delta
  data: { ... }

← id: 12844
  event: message_finalized
  data: { ... }

← (continues live)
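
A minimal sketch of the client side of that resume, using raw fetch (the browser EventSource API sets Last-Event-ID itself on reconnect but won't let you seed it); persistLastEventId and handleEvent are hypothetical, and single-line data fields are assumed:

async function resumeStream(token: string, lastEventId: string) {
  const res = await fetch('https://api.sophon.at/v1/me/stream', {
    headers: { Authorization: `Bearer ${token}`, 'Last-Event-ID': lastEventId },
  })
  const reader = res.body!.pipeThrough(new TextDecoderStream()).getReader()
  let buf = ''
  for (;;) {
    const { value, done } = await reader.read()
    if (done) break
    buf += value
    let end
    while ((end = buf.indexOf('\n\n')) !== -1) {  // frames end with a blank line
      const frame = buf.slice(0, end)
      buf = buf.slice(end + 2)
      const ev: Record<string, string> = {}
      for (const line of frame.split('\n')) {
        const i = line.indexOf(':')
        if (i > 0) ev[line.slice(0, i)] = line.slice(i + 1).trim()
      }
      if (ev.id) persistLastEventId(ev.id)        // hypothetical: durable cursor
      handleEvent(ev.event, ev.data)              // hypothetical: dispatch by type
    }
  }
}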

If the gap is larger than the buffer (the user went to sleep, the TLS connection died on a flight), the server emits a special event telling the client "you missed too much, fetch a snapshot".

Cold-launch snapshot

GET /v1/me/snapshot

Sessions and messages are already durable — iOS hydrates them via /v1/me and /v1/me/sessions/:id/messages. Tool calls are ephemeral by design (W12 v1). What the SSE ring buffer DOES drop on the floor when it rolls is the live state of pending approvals — the user opens the app after lunch and the "may I run rm -rf?" sheet should still be on screen.

So the snapshot endpoint returns just that:

{
  "ok": true,
  "result": {
    "ts": 1730345700000,                  // server time the snapshot was taken
    "pending_approvals": [
      {
        "approval_id":     "apr_…",
        "session_id":      "ses_…",
        "installation_id": "inst_…",
        "agent_id":        null,           // null on bridge approvals
        "interaction_id":  "int_…",
        "action":          "exec_command",
        "title":           "Run rm -rf …",
        "message":         "…",
        "severity":        "high",
        "command":         "rm -rf node_modules",
        "host":            "rom-MacBook-Pro",
        "tool_call_id":    "tc_…",
        "expires_at":      1730345900000,
        "ts":              1730345600000
      }
    ]
  }
}

iOS calls this on cold launch — PlatformAdapter.refreshSnapshot() runs between refreshMe() and stream.start(). Each pending_approvals[] entry folds through the same handler the SSE approval_requested event uses, so re-emits dedupe by approval_id. Then the SSE stream attaches and state is live from there on.
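
In code, the launch order is roughly this (adapter and stream names as above; onApprovalRequested is an illustrative stand-in for the shared handler):

await adapter.refreshMe()                    // durable sessions + messages
const snap = await adapter.refreshSnapshot() // pending approvals
for (const a of snap.result.pending_approvals) {
  onApprovalRequested(a)                     // same path as the SSE event; dedupes by approval_id
}
await stream.start()                         // live from here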

Future waves may extend the response — there's room to add pending_tool_calls, unread_message_counts, etc. without breaking clients (the response is forward-compatible).
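
One way to keep a client tolerant of those additions — a TypeScript sketch, with the field list trimmed:

type PendingApproval = {
  approval_id: string
  session_id: string
  action: string
  severity: string
  expires_at: number
  // … remaining fields as in the example above
}

type Snapshot = {
  ok: boolean
  result: {
    ts: number
    pending_approvals: PendingApproval[]
    [future: string]: unknown // fields added in later waves parse without breaking
  }
}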

Putting the rules together

Three invariants every connector author should hold in their head:

  1. Mutating POST = retry until 2xx, with the same key. The server makes it exactly-once.
  2. update from the WS = dedupe by update_id. The server delivers at-least-once.
  3. Long-gap clients = call snapshot first, then SSE. Don't try to walk back more than 5 minutes through Last-Event-ID.

If you're building on top of OpenClaw, connectors/openclaw-bridge already does (1) and the server handles (2) and (3) for the iOS side — you don't need to think about them.