# Idempotency & resume
Bridges retry. Networks blip. Phones go to sleep mid-stream. SAP is built so that all of these turn out to be no-ops on the client side — every mutating route is keyed, the SSE stream has a ring buffer, and there's a snapshot endpoint for everything older than the buffer.
## Why bother
Without idempotency, every retry risks double-posting a message (a duplicate bubble), corrupting the in-flight tool card (two `task_completed` events for one task), or worse. With it, the contract is simple: the bridge POSTs until it gets a 2xx, and the server makes sure that's exactly-once on the user's side.
## Per-route keys
Every mutating bridge route accepts an idempotency key. The key's shape is per-route:
| Route | Key field | Notes |
|---|---|---|
| `sendMessage` | `idempotency_key` (UUID) | Unique on `(session_id, idempotency_key)` |
| `sendMessageDelta` | `idempotency_key` (UUID) | Unique on `(message_id, idempotency_key)` |
| `sendMessageEnd` | `idempotency_key` (UUID) | Unique on `(message_id, idempotency_key)` |
| `createTask` | `task_id` | The natural id is the key |
| `updateTask` | `task_id` (+ optional `idempotency_key` for distinct events) | |
| `finishTask` | `task_id` | The natural id is the key |
| `requestApproval` | `approval_id` | The natural id is the key |
For routes with an explicit `idempotency_key`:

- The pair `(token, idempotency_key)` is unique within 24 hours.
- Same key, same body → returns the cached response with `idempotent: true`.
- Same key, different body → `409 idempotency_conflict`.
- After 24 h, keys are recycled.

Use UUIDv4. If you want logs to be greppable, prefix with something semantic: `int_abc-msg-attempt-1`.
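Concretely: the same key replayed with the same body is safe; the same key with a different body is rejected. A minimal sketch, where the route path and the body fields beyond `idempotency_key` are purely illustrative (`TOKEN` is the bridge token, as in the retry example below):

```ts
// Sketch only: route path and body fields are illustrative, not the real schema.
const key = crypto.randomUUID()
const body = JSON.stringify({ session_id: 'ses_123', text: 'hi', idempotency_key: key })

const send = () =>
  fetch('https://api.sophon.at/v1/bridge/sendMessage', { // path assumed
    method: 'POST',
    headers: { Authorization: `Bearer ${TOKEN}`, 'Content-Type': 'application/json' },
    body, // byte-identical every time
  })

await send() // 200 — processed once
await send() // 200 — cached replay; response carries idempotent: true
// The same key with a *different* body would get 409 idempotency_conflict.
```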
## Retry pattern
```ts
// Retry helper: re-sends the identical body (including idempotency_key)
// until the server acknowledges with a 2xx.
const sleep = (ms: number) => new Promise((res) => setTimeout(res, ms))

async function postWithRetry(path: string, body: object) {
  for (let attempt = 0; attempt < 5; attempt++) {
    const r = await fetch(`https://api.sophon.at${path}`, {
      method: 'POST',
      headers: { Authorization: `Bearer ${TOKEN}`, 'Content-Type': 'application/json' },
      body: JSON.stringify(body),
    })
    if (r.ok) return r.json()
    if (r.status === 429 || r.status === 503) {
      // Rate-limited or overloaded: honour Retry-After, with a little jitter.
      const retryAfter = Number(r.headers.get('Retry-After') ?? '1')
      await sleep((retryAfter + Math.random() * 0.25) * 1000)
      continue
    }
    if (r.status >= 500) {
      await sleep(2 ** attempt * 1000) // exponential backoff: 1s, 2s, 4s, …
      continue
    }
    throw new Error(`POST ${path} → ${r.status}`) // other 4xx: retrying won't help
  }
  throw new Error(`POST ${path} → exhausted retries`)
}
```

The body — including the `idempotency_key` — is identical on every attempt.
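Called like so, with the key minted once, outside the helper, so every retry re-sends the identical body (route and fields again illustrative):

```ts
const result = await postWithRetry('/v1/bridge/sendMessage', { // path assumed
  session_id: 'ses_123',
  text: 'Build finished',
  idempotency_key: crypto.randomUUID(), // minted once per logical message
})
```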
## At-least-once delivery
The server delivers updates at-least-once to your bridge. Network drops between ack and processing happen. Your bridge must dedupe by `update_id`:
```sql
-- Postgres: first writer wins; duplicates insert nothing.
INSERT INTO processed_updates (update_id) VALUES ($1)
ON CONFLICT DO NOTHING
RETURNING update_id;
-- if no row is returned, this is a duplicate — skip processing
```

Your `processed_updates` table can be tiny — keep the last 10 k ids and TTL the older ones. The server's buffer is 5 minutes, so anything older than that is ancient history.
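Wired into a bridge, that query gates the handler. A sketch assuming a node-postgres (`pg`) client; `handleUpdate` is a stand-in for whatever your bridge actually does with a fresh update:

```ts
import { Pool } from 'pg'

declare function handleUpdate(u: { update_id: string }): Promise<void> // stand-in

const pool = new Pool() // connection config from PG* env vars

// True if this update_id was already processed (the insert was a no-op).
async function isDuplicate(updateId: string): Promise<boolean> {
  const res = await pool.query(
    'INSERT INTO processed_updates (update_id) VALUES ($1) ON CONFLICT DO NOTHING',
    [updateId],
  )
  return res.rowCount === 0
}

async function onUpdate(update: { update_id: string }) {
  if (await isDuplicate(update.update_id)) return // at-least-once → drop the dupe
  await handleUpdate(update)
}
```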
## SSE resume on the iOS side
iOS connects to `GET /v1/me/stream` with `Last-Event-ID` set to the highest event id it persisted. The server replays from there through a 5-minute / 256-event ring buffer, then continues live.
```
GET /v1/me/stream HTTP/1.1
Authorization: Bearer <user_session>
Last-Event-ID: 12842

← id: 12843
  event: message_delta
  data: { ... }

← id: 12844
  event: message_finalized
  data: { ... }

← (continues live)
```
If the gap is larger than the buffer (the user went to sleep, the TLS connection died on a flight), the server emits a special event telling the client "you missed too much, fetch a snapshot".
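On the client side the contract reduces to: persist the last id, resume from it, and fall back to the snapshot when told to. A sketch with stand-in helpers (`connectSSE`, `refreshSnapshot`, `applyEvent`), since the browser `EventSource` can't set an `Authorization` header; the gap event's name `snapshot_required` is hypothetical, as the actual event name isn't specified above:

```ts
// Stand-ins for illustration — not real SAP client APIs.
declare function connectSSE(url: string, opts: {
  headers: Record<string, string>
  onEvent: (e: { id: string; event: string; data: string }) => void
}): void
declare function refreshSnapshot(): Promise<void>
declare function applyEvent(e: { event: string; data: string }): void
declare const USER_SESSION: string

let lastEventId = localStorage.getItem('lastEventId') ?? ''

connectSSE('https://api.sophon.at/v1/me/stream', {
  headers: {
    Authorization: `Bearer ${USER_SESSION}`,
    'Last-Event-ID': lastEventId,
  },
  onEvent(e) {
    if (e.event === 'snapshot_required') { // hypothetical event name
      // Gap exceeded the 5-minute / 256-event buffer: resync, then resume.
      void refreshSnapshot()
    } else {
      applyEvent(e) // message_delta, message_finalized, …
    }
    lastEventId = e.id
    localStorage.setItem('lastEventId', e.id) // persist for the next connect
  },
})
```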
## Cold-launch snapshot
```
GET /v1/me/snapshot
```
Sessions and messages are already durable — iOS hydrates them via `/v1/me` and `/v1/me/sessions/:id/messages`. Tool calls are ephemeral by design (W12 v1). What the SSE ring buffer *does* drop on the floor when it rolls is the live state of pending approvals — the user opens the app after lunch and the "may I run `rm -rf`?" sheet should still be on screen.
So the snapshot endpoint returns just that:
```jsonc
{
  "ok": true,
  "result": {
    "ts": 1730345700000,            // server time the snapshot was taken
    "pending_approvals": [
      {
        "approval_id": "apr_…",
        "session_id": "ses_…",
        "installation_id": "inst_…",
        "agent_id": null,           // null on bridge approvals
        "interaction_id": "int_…",
        "action": "exec_command",
        "title": "Run rm -rf …",
        "message": "…",
        "severity": "high",
        "command": "rm -rf node_modules",
        "host": "rom-MacBook-Pro",
        "tool_call_id": "tc_…",
        "expires_at": 1730345900000,
        "ts": 1730345600000
      }
    ]
  }
}
```

iOS calls this on cold launch — `PlatformAdapter.refreshSnapshot()` runs between `refreshMe()` and `stream.start()`. Each `pending_approvals[]` entry folds through the same handler the SSE `approval_requested` event uses, so re-emits dedupe by `approval_id`. Then the SSE stream attaches and state is live from there forward.
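In TypeScript terms (the real implementation is the Swift client), the fold looks roughly like this; `presentApprovalSheet` is a hypothetical UI hook, and the `Set` mirrors the dedupe-by-`approval_id` behaviour described above:

```ts
declare function presentApprovalSheet(a: Approval): void // hypothetical UI hook
declare const USER_SESSION: string

type Approval = { approval_id: string; [k: string]: unknown }
const seenApprovals = new Set<string>()

// Shared handler: SSE approval_requested events and snapshot entries both land here.
function onApprovalRequested(a: Approval) {
  if (seenApprovals.has(a.approval_id)) return // SSE replay after snapshot → no-op
  seenApprovals.add(a.approval_id)
  presentApprovalSheet(a)
}

// Cold launch: refreshMe() → refreshSnapshot() → stream.start()
async function refreshSnapshot() {
  const r = await fetch('https://api.sophon.at/v1/me/snapshot', {
    headers: { Authorization: `Bearer ${USER_SESSION}` },
  })
  const { result } = await r.json()
  for (const a of result.pending_approvals as Approval[]) onApprovalRequested(a)
}
```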
Future waves may extend the response — there's room to add `pending_tool_calls`, `unread_message_counts`, etc. without breaking clients (the response is forward-compatible).
## Putting the rules together
Three invariants every connector author should hold in their head:
1. Mutating POST = retry until 2xx, with the same key. The server makes it exactly-once.
2. `update` from the WS = dedupe by `update_id`. The server delivers at-least-once.
3. Long-gap clients = call snapshot first, then SSE. Don't try to walk older than 5 minutes through `Last-Event-ID`.
If you're building on top of OpenClaw, `connectors/openclaw-bridge` already does (1), and the server handles (2) and (3) for the iOS side — you don't need to think about them.