Workflow Execution Retry — Runbook

Bulk-replay failed workflow executions sitting in or_ramp_executions with status_code='ERROR'. Use when a prod regression breaks Project Create and customers have accounts without projects.

History: built for ONRAMP-4887 (PR #8570). See the PR for the audited skip-filter design.

When to use

  • A code regression in the workflow path is fixed and deployed.
  • One or more or_ramp_executions rows are stuck in ERROR since the regression window.
  • You want to re-drive them without re-firing CRM webhooks or running raw SQL by hand.

When NOT to use

  • The underlying bug is not fixed in prod yet — replay will re-fail immediately. Confirm Sentry shows the regression issue at zero new events before retrying.
  • The execution failed with a workflow-config error (missing trigger field, missing related account, etc.). Fix the config first; replay does not solve it. The endpoint will skip these unless you pass force_retry: true.

Endpoint

POST /api/ramps/executions/retry-batch

Requires ONRAMP_ADMIN. Accepts either your browser session cookie (after Google SSO into the admin UI) or internal API-key headers for automation. To use the session cookie:

  1. Sign in to https://app.onramp.us as an ONRAMP_ADMIN user via Google SSO.
  2. Open browser devtools → Application (Chrome) or Storage (Firefox) → Cookies → https://app.onramp.us.
  3. Copy the value of the session cookie.
  4. Use it as -b "session=<value>" in the curl examples below.

If you have an internal API key + secret (for runbook automation), use -H "Authorization: <key>" -H "or-internal-api-auth-secret: <secret>" instead of -b. Same endpoint, same response.
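For automation, the API-key variant might be wrapped as in the sketch below. The endpoint and header names are the ones given above; the environment variable names ONRAMP_API_KEY and ONRAMP_API_SECRET and the CURL_BIN override are assumptions made for illustration, not names the service defines.

```shell
# Hypothetical variable names: export these from your secret store first.
#   ONRAMP_API_KEY, ONRAMP_API_SECRET
# CURL_BIN is an illustrative override so the command line can be previewed
# (CURL_BIN=echo) without sending a request.

retry_batch_with_api_key() {
  # $1 = JSON request body
  "${CURL_BIN:-curl}" -sS -X POST \
    "https://app.onramp.us/api/ramps/executions/retry-batch" \
    -H "Authorization: ${ONRAMP_API_KEY:?set ONRAMP_API_KEY}" \
    -H "or-internal-api-auth-secret: ${ONRAMP_API_SECRET:?set ONRAMP_API_SECRET}" \
    -H "Content-Type: application/json" \
    -d "$1"
}
```

A dry run would then be retry_batch_with_api_key '{"since": "2026-05-11T12:30:00Z", "dry_run": true}'. Setting CURL_BIN=echo first prints the assembled arguments instead of calling the endpoint.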

Workflow

Always do a dry run first, review the eligible and skipped lists, then execute.

1. Size the blast radius

```sql
SELECT COUNT(*), vendor_id, ramp_id
FROM   or_ramp_executions
WHERE  status_code = 'ERROR'
  AND  created_at >= '<incident-start-utc>'
GROUP  BY vendor_id, ramp_id;
```

2. Dry run

```bash
curl -X POST https://app.onramp.us/api/ramps/executions/retry-batch \
  -b "session=<your-session-cookie>" \
  -H "Content-Type: application/json" \
  -d '{
    "since": "2026-05-11T12:30:00Z",
    "error_pattern": "new_task",
    "dry_run": true
  }'
```

Response includes eligible[] (will replay) and skipped[] (with reason). Review both.

3. Execute

```bash
curl -X POST https://app.onramp.us/api/ramps/executions/retry-batch \
  -b "session=<your-session-cookie>" \
  -H "Content-Type: application/json" \
  -d '{
    "since": "2026-05-11T12:30:00Z",
    "error_pattern": "new_task",
    "suppress_notifications": true,
    "dry_run": false
  }'
```

Always pass suppress_notifications: true during mass recovery — otherwise every replay re-fires PROJECT_CREATED webhooks to the customer's CRM and re-sends the "project created" email.

Batch cap is 25 per call. If matched > 25, run repeatedly — each call picks up the still-ERROR remainder.
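Repeating the call by hand gets tedious for a large backlog, so a loop along these lines may help. It is a sketch under stated assumptions: the response is assumed to carry a top-level numeric "matched" field (the text above implies one but does not spell out the response shape), and post_batch, drain_backlog, ONRAMP_SESSION, and MAX_CALLS are hypothetical names introduced here.

```shell
# Sketch only. Assumptions (not confirmed by this runbook):
#   - the response contains a numeric "matched" field
#   - ONRAMP_SESSION holds your session cookie value (hypothetical name)
BATCH_CAP=25
MAX_CALLS=20   # safety stop: rows that fail again stay in ERROR and re-match

# Hypothetical wrapper around the execute call shown above.
post_batch() {
  curl -sS -X POST "https://app.onramp.us/api/ramps/executions/retry-batch" \
    -b "session=${ONRAMP_SESSION:?set ONRAMP_SESSION}" \
    -H "Content-Type: application/json" \
    -d '{"since": "2026-05-11T12:30:00Z", "error_pattern": "new_task",
         "suppress_notifications": true, "dry_run": false}'
}

drain_backlog() {
  calls=0
  while [ "$calls" -lt "$MAX_CALLS" ]; do
    calls=$((calls + 1))
    response=$(post_batch) || return 1
    # Extract the numeric "matched" value without requiring jq.
    matched=$(printf '%s' "$response" |
      sed -n 's/.*"matched"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' |
      head -n 1)
    if [ -z "$matched" ]; then
      echo "call $calls: no \"matched\" field in response; stopping" >&2
      return 1
    fi
    echo "call $calls: matched=$matched"
    # At or under the cap, this call covered every matching row.
    if [ "$matched" -le "$BATCH_CAP" ]; then
      return 0
    fi
  done
  echo "stopped after $MAX_CALLS calls; inspect the remaining ERROR rows" >&2
  return 1
}
```

Run drain_backlog only after the dry run has been reviewed. The MAX_CALLS guard exists because an execution that fails again stays in ERROR and would otherwise keep the loop going indefinitely.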

4. Verify

  • summary.success matches the expected project count.
  • Sentry shows no new UnboundLocalError (or whatever the original failure was).
  • A WORKFLOW_HISTORY_ACTION_REPLAYED entry exists per replay (or_workflow_history).

Request fields

| Field | Required | Notes |
|---|---|---|
| since | conditional | UTC. Required unless execution_uuids is provided. |
| until | no | Defaults to now(). Half-open: [since, until). |
| vendor_ids | no | Restrict to specific tenants. |
| ramp_uuids | no | Restrict to specific workflows. |
| execution_uuids | no | Hand-picked list. Mutually exclusive with since/until. |
| error_pattern | no | Substring match against the last log entry's error message. |
| suppress_notifications | no | Default false. Set true for mass recovery. |
| force_retry | no | Bypass config-error skip (filter 7). Debugging only. |
| dry_run | yes | Default true. Always preview first. |
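Because the window is half-open, two consecutive windows that share a boundary never select the same row. The sketch below illustrates that rule only; in_window is a hypothetical helper, not part of the endpoint, and it assumes every timestamp uses the same ISO-8601 UTC layout so that string order equals time order.

```shell
# in_window TS LO HI -> exit 0 when LO <= TS < HI.
# Illustrative helper only. All three values must share one ISO-8601 UTC
# layout (e.g. 2026-05-11T12:30:00Z) for string comparison to be valid.
in_window() {
  ts=$1 lo=$2 hi=$3
  # expr exits 0 when the string comparison holds; its printed 1/0 is discarded.
  expr "$ts" '>=' "$lo" >/dev/null && expr "$ts" '<' "$hi" >/dev/null
}

# A row created exactly on the shared boundary lands in the second window only.
in_window "2026-05-11T13:00:00Z" "2026-05-11T12:30:00Z" "2026-05-11T13:00:00Z" \
  || echo "12:30-13:00 window: excluded"
in_window "2026-05-11T13:00:00Z" "2026-05-11T13:00:00Z" "2026-05-11T13:30:00Z" \
  && echo "13:00-13:30 window: included"
```

In practice this means that splitting an incident into several since/until ranges is safe as long as each range starts where the previous one ended.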

Skip reasons

If skipped[].reason is one of these, the row was intentionally excluded:

  • project_already_exists — a project is already linked to this execution. Don't replay; investigate manually.
  • ramp_archived / ramp_deleted / workflow_is_draft — the workflow has changed state since failure.
  • status_not_error — the execution moved out of ERROR (someone else fixed it).
  • error_pattern_mismatch — when error_pattern is set and this row's error doesn't match.
  • duplicate_trigger_succeeded — the CRM re-fired the same trigger and the later execution succeeded.
  • config_error — the failure was a workflow-config problem, not the regression. Fix config, then force_retry: true.
  • integration_disconnected — the CRM integration is no longer authorized. Reconnect, then retry.
  • outside_time_window — created_at falls outside [since, until).
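When a dry run skips many rows, a per-reason count makes the review faster. The helper below is a rough, jq-free sketch: tally_skip_reasons is a hypothetical name, and it counts every "reason" string anywhere in the response body, so it assumes only skipped[] entries carry that key.

```shell
# tally_skip_reasons: read a retry-batch response on stdin and print
# "<count> <reason>" for each distinct "reason" value.
# Assumption: no object other than skipped[] entries has a "reason" key.
tally_skip_reasons() {
  grep -o '"reason"[[:space:]]*:[[:space:]]*"[^"]*"' |
    sed 's/.*:[[:space:]]*"\([^"]*\)"$/\1/' |
    sort | uniq -c | awk '{print $1, $2}'
}
```

Pipe the dry-run curl output into tally_skip_reasons. A large config_error or integration_disconnected count suggests fixing those causes before executing.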

Idempotency

  • Account step is idempotent. Account lookup by external_id (ramps_project_creation_service.py:322-330) reuses the existing ORObjectEntityId linkage if present (archived rows are excluded — see PR #8570).
  • Project step is NOT idempotent on its own. Filter project_already_exists is the only guard. Don't disable it.
  • Each replay is its own transaction. One failure doesn't abort the batch.

Race safety

Eligibility query uses SELECT ... FOR UPDATE OF e SKIP LOCKED. Concurrent operators or async-automation-handler inserts won't double-replay the same row.

Limitations

  • Synchronous, with a batch cap of 25 per HTTP call. Larger backlogs require repeated calls, each picking up the still-ERROR remainder.
  • Filter 8 (integration disconnected) checks user_configured: bool only — a token revoked after configuration won't be caught here; replay will surface it as summary.error.
  • Workflow version drift is advisory (eligible[].workflow_version_drift: true) — the OLD workflow version still runs. If you republished after the failure and want the new version, retrigger from the CRM instead of replaying.

See also

Internal documentation — gated behind Cloudflare Access.