Workflow Execution Retry — Runbook
Bulk-replay failed workflow executions sitting in or_ramp_executions with status_code='ERROR'. Use when a prod regression breaks Project Create and customers have accounts without projects.
History: built for ONRAMP-4887 (PR #8570). See the PR for the audited skip-filter design.
When to use
- A code regression in the workflow path is fixed and deployed.
- One or more or_ramp_executions rows have been stuck in ERROR since the regression window.
- You want to re-drive them without re-firing CRM webhooks or running raw SQL by hand.
When NOT to use
- The underlying bug is not fixed in prod yet — replay will re-fail immediately. Confirm Sentry shows the regression issue at zero new events before retrying.
- The execution failed with a workflow-config error (missing trigger field, missing related account, etc.). Fix the config first; replay does not solve it. The endpoint will skip these unless you pass force_retry: true.
Endpoint
POST /api/ramps/executions/retry-batch
Requires ONRAMP_ADMIN. Accepts your browser session cookie (after Google SSO into the admin UI) OR internal API-key headers for automation.
Auth — grab your session cookie
- Sign in to https://app.onramp.us as an ONRAMP_ADMIN user via Google SSO.
- Open browser devtools → Application (Chrome) or Storage (Firefox) → Cookies → https://app.onramp.us.
- Copy the value of the session cookie.
- Use it as -b "session=<value>" in the curl examples below.
If you have an internal API key + secret (for runbook automation), use -H "Authorization: <key>" -H "or-internal-api-auth-secret: <secret>" instead of -b. Same endpoint, same response.
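For automation written in Python rather than curl, the two auth styles above can be sketched as follows. This is a minimal illustration, not the team's tooling: build_retry_request is a hypothetical helper, and rejecting a mix of both credential types is a choice made for this sketch, not documented endpoint behaviour. It only builds the request; nothing is sent.

```python
from typing import Optional
from urllib.request import Request

RETRY_URL = "https://app.onramp.us/api/ramps/executions/retry-batch"


def build_retry_request(
    body: bytes,
    *,
    session_cookie: Optional[str] = None,
    api_key: Optional[str] = None,
    api_secret: Optional[str] = None,
) -> Request:
    """Build (but do not send) a POST using exactly one of the two auth styles."""
    headers = {"Content-Type": "application/json"}
    if session_cookie and not (api_key or api_secret):
        # Equivalent of curl's -b "session=<value>"
        headers["Cookie"] = f"session={session_cookie}"
    elif api_key and api_secret and not session_cookie:
        # Equivalent of the two -H headers used for runbook automation
        headers["Authorization"] = api_key
        headers["or-internal-api-auth-secret"] = api_secret
    else:
        # Assumption of this sketch: refuse ambiguous or missing credentials
        raise ValueError("pass either session_cookie, or api_key plus api_secret")
    return Request(RETRY_URL, data=body, headers=headers, method="POST")
```

Pass the returned object to urllib.request.urlopen to actually send it. Header names are case-insensitive over HTTP, so urllib's capitalisation of them does not matter to the server.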
Workflow
Always run in dry-run first, review the eligible + skipped lists, then execute.
1. Size the blast radius
SELECT COUNT(*), vendor_id, ramp_id
FROM or_ramp_executions
WHERE status_code = 'ERROR'
AND created_at >= '<incident-start-utc>'
GROUP BY vendor_id, ramp_id;
2. Dry run
curl -X POST https://app.onramp.us/api/ramps/executions/retry-batch \
-b "session=<your-session-cookie>" \
-H "Content-Type: application/json" \
-d '{
"since": "2026-05-11T12:30:00Z",
"error_pattern": "new_task",
"dry_run": true
}'
Response includes eligible[] (will replay) and skipped[] (with reason). Review both.
3. Execute
curl -X POST https://app.onramp.us/api/ramps/executions/retry-batch \
-b "session=<your-session-cookie>" \
-H "Content-Type: application/json" \
-d '{
"since": "2026-05-11T12:30:00Z",
"error_pattern": "new_task",
"suppress_notifications": true,
"dry_run": false
}'
Always pass suppress_notifications: true during mass recovery. Otherwise every replay re-fires PROJECT_CREATED webhooks to the customer's CRM and re-sends the "project created" email.
Batch cap is 25 per call. If matched > 25, run repeatedly — each call picks up the still-ERROR remainder.
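The "run repeatedly" step can be scripted. The sketch below assumes the response carries a summary object with an integer success count (the runbook mentions summary.success but does not specify the full shape), and takes a caller-supplied post callable so the transport and auth stay out of the loop. drain_backlog and max_rounds are hypothetical names.

```python
from typing import Callable, List


def drain_backlog(
    post: Callable[[dict], dict],
    payload: dict,
    max_rounds: int = 20,
) -> List[dict]:
    """Call the retry endpoint until a round replays nothing, or max_rounds is hit."""
    summaries = []
    for _ in range(max_rounds):
        # Force a real run; the dry run is expected to have been reviewed already.
        response = post({**payload, "dry_run": False})
        summary = response.get("summary", {})
        summaries.append(summary)
        # Rows that fail again stay in ERROR and would match on every call,
        # so stop as soon as a round makes no progress.
        if summary.get("success", 0) == 0:
            break
    return summaries
```

Stopping on "no successes this round" costs one extra call at the end, but it avoids looping forever on rows that keep failing. Those rows still need a look via summary.error.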
4. Verify
- summary.success matches the expected project count.
- Sentry shows no new UnboundLocalError (or whatever the original failure was).
- A WORKFLOW_HISTORY_ACTION_REPLAYED entry exists per replay (or_workflow_history).
Request fields
| Field | Required | Notes |
|---|---|---|
| since | conditional | UTC. Required unless execution_uuids provided. |
| until | no | Defaults to now(). Half-open: [since, until). |
| vendor_ids | no | Restrict to specific tenants. |
| ramp_uuids | no | Restrict to specific workflows. |
| execution_uuids | no | Hand-picked list. Mutually exclusive with since/until. |
| error_pattern | no | Substring against the last log entry's error message. |
| suppress_notifications | no | Default false. Set true for mass recovery. |
| force_retry | no | Bypass config-error skip (filter 7). Debugging only. |
| dry_run | yes | Default true. Always preview first. |
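The field rules in the table can be checked client-side before anything is sent. This is a sketch under assumptions: build_retry_payload is a hypothetical helper, the server may validate differently, and the element types of the list fields are not stated in this runbook.

```python
from typing import Optional


def build_retry_payload(
    *,
    since: Optional[str] = None,
    until: Optional[str] = None,
    vendor_ids: Optional[list] = None,
    ramp_uuids: Optional[list] = None,
    execution_uuids: Optional[list] = None,
    error_pattern: Optional[str] = None,
    suppress_notifications: bool = False,
    force_retry: bool = False,
    dry_run: bool = True,
) -> dict:
    """Apply the table's rules and return a JSON-ready request body."""
    if execution_uuids:
        # Hand-picked list: a time window must not be combined with it.
        if since or until:
            raise ValueError("execution_uuids is mutually exclusive with since/until")
    elif not since:
        raise ValueError("since is required unless execution_uuids is provided")

    payload = {
        "dry_run": dry_run,  # defaults to a preview, as the table recommends
        "suppress_notifications": suppress_notifications,
        "force_retry": force_retry,
    }
    optional = {
        "since": since,
        "until": until,
        "vendor_ids": vendor_ids,
        "ramp_uuids": ramp_uuids,
        "execution_uuids": execution_uuids,
        "error_pattern": error_pattern,
    }
    # Only send the optional fields that were actually set.
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload
```

For example, build_retry_payload(since="2026-05-11T12:30:00Z", error_pattern="new_task") yields the dry-run body from step 2 plus explicit false values for the two boolean flags.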
Skip reasons
If skipped[].reason is one of these, the row was intentionally excluded:
- project_already_exists: a project is already linked to this execution. Don't replay; investigate manually.
- ramp_archived / ramp_deleted / workflow_is_draft: the workflow has changed state since failure.
- status_not_error: the execution moved out of ERROR (someone else fixed it).
- error_pattern_mismatch: error_pattern is set and this row's error doesn't match.
- duplicate_trigger_succeeded: the CRM re-fired the same trigger and the later execution succeeded.
- config_error: the failure was a workflow-config problem, not the regression. Fix config, then force_retry: true.
- integration_disconnected: the CRM integration is no longer authorized. Reconnect, then retry.
- outside_time_window: created_at falls outside [since, until).
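When a dry run returns many rows, tallying the skip reasons makes the review quicker. The sketch below assumes the response is a dict with eligible and skipped lists and that each skipped item has a reason key, as described above; any other fields are unknown here. The follow-up set contains the three reasons this section says need manual action.

```python
from collections import Counter

# Reasons that this runbook says need a human before any retry.
NEEDS_FOLLOW_UP = {"project_already_exists", "config_error", "integration_disconnected"}


def summarize_dry_run(response: dict) -> dict:
    """Count eligible rows and group skipped rows by reason."""
    skipped = response.get("skipped", [])
    reasons = Counter(item.get("reason", "unknown") for item in skipped)
    return {
        "eligible": len(response.get("eligible", [])),
        "skipped_by_reason": dict(reasons),
        # Sorted so the output is stable across runs.
        "follow_up": sorted(r for r in reasons if r in NEEDS_FOLLOW_UP),
    }
```

A non-empty follow_up list is a cue to stop and investigate those rows before running with dry_run set to false.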
Idempotency
- Account step is idempotent. Account lookup by external_id (ramps_project_creation_service.py:322-330) reuses the existing ORObjectEntityId linkage if present (archived rows are excluded; see PR #8570).
- Project step is NOT idempotent on its own. Filter project_already_exists is the only guard. Don't disable it.
- Each replay is its own transaction. One failure doesn't abort the batch.
Race safety
Eligibility query uses SELECT ... FOR UPDATE OF e SKIP LOCKED. Concurrent operators or async-automation-handler inserts won't double-replay the same row.
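To see why SKIP LOCKED prevents double replays, here is an in-memory analogy. It is not Postgres and not the eligibility query: each row gets a threading.Lock, and a non-blocking acquire plays the role of "take the row lock, or skip the row if someone else holds it". claim_unlocked is a hypothetical name.

```python
import threading
from typing import Dict, List


def claim_unlocked(row_locks: Dict[str, threading.Lock], limit: int) -> List[str]:
    """Claim up to `limit` rows, skipping any row whose lock is already held."""
    claimed = []
    for row_id, lock in row_locks.items():
        if len(claimed) == limit:
            break
        # Non-blocking acquire: never wait on another operator's row.
        if lock.acquire(blocking=False):
            claimed.append(row_id)
    return claimed


# Three ERROR rows; another operator already holds row "b".
locks = {"a": threading.Lock(), "b": threading.Lock(), "c": threading.Lock()}
locks["b"].acquire()
mine = claim_unlocked(locks, limit=25)    # claims "a" and "c", skips "b"
theirs = claim_unlocked(locks, limit=25)  # nothing left to claim
```

In the database the locks are released when the transaction ends; in this sketch the caller would release them explicitly.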
Limitations
- Synchronous, batch cap 25 per HTTP call. Larger backlogs require pagination.
- Filter 8 (integration disconnected) checks user_configured: bool only. A token revoked after configuration won't be caught here; replay will surface it as summary.error.
- Workflow version drift is advisory (eligible[].workflow_version_drift: true). The OLD workflow version still runs. If you republished after the failure and want the new version, retrigger from the CRM instead of replaying.
See also
- devtools/mock-integrations/README.md: fire a fresh workflow execution against a local Flask instance.
- PR #8570: the endpoint implementation and skip-filter audit.
- Jira ONRAMP-4887: incident this was built for.