Migrations — Expand-and-Contract
OnRamp runs zero-downtime deploys against a live, multi-tenant Postgres. Schema changes that lock tables or break existing queries will cause 500s during the deploy window. Expand-and-contract is the pattern that prevents this.
Reference: Prisma expand-and-contract guide
The four phases
| Phase | What happens | OnRamp example |
|---|---|---|
| Expand | Add the new structure alongside the old. Both old and new app code work. | Add nullable attempt_account_name_match column to or_ramps with server_default. |
| Migrate data | Backfill existing rows; run in batches to avoid lock contention. | UPDATE or_ramps SET attempt_account_name_match = true WHERE ... in a follow-on migration. |
| Contract | Remove the old structure once no app code references it. Separate PR, separate deploy — see Contract phase. | |
| Cleanup | Remove any dual-write compatibility shims in application code. | Delete fallback logic that read from both old and new column. |
Anti-patterns Squawk catches
Click for the full rule-by-rule table
| Rule | Why it breaks zero-downtime | Expand-and-contract alternative |
|---|---|---|
ban-drop-column | Existing queries/ORMs break mid-deploy if they reference the column. | Ship code removal first; drop column in a later contract-phase migration. |
ban-drop-table | Same as above at table scope. | Remove all references from app code before dropping. |
ban-drop-not-null | Dropping NOT NULL on a column used by active queries can change query plans and behavior unexpectedly. | Audit all callers first. |
renaming-column | Any in-flight query using the old name fails immediately. | Expand: add new column, dual-write, migrate, contract. |
renaming-table | Same as renaming-column at table scope. | Expand: new table, dual-write, contract. |
changing-column-type | Postgres may rewrite the entire table, holding an ACCESS EXCLUSIVE lock. | Add new typed column, backfill, update app, drop old. |
adding-required-field | NOT NULL without a default triggers a full table rewrite (pre-PG 11) or blocks reads on large tables. | Nullable first, backfill, add constraint with NOT VALID, validate separately. |
constraint-missing-not-valid | Validating a constraint inline locks the table. | ADD CONSTRAINT ... NOT VALID, then VALIDATE CONSTRAINT in a separate transaction. |
require-concurrent-index-creation | CREATE INDEX without CONCURRENTLY holds a write-blocking lock for the duration. | Always use CREATE INDEX CONCURRENTLY. Requires op.execute() outside a transaction block. |
require-concurrent-index-deletion | Same issue for DROP INDEX. | DROP INDEX CONCURRENTLY. |
disallowed-unique-constraint | ADD CONSTRAINT ... UNIQUE blocks writes during build. | CREATE UNIQUE INDEX CONCURRENTLY, then ADD CONSTRAINT ... USING INDEX. |
transaction-nesting | Wrapping a concurrent-index operation in a transaction negates the concurrency. | Run concurrent DDL outside BEGIN/COMMIT blocks. |
prefer-robust-stmts | Migrations that fail halfway leave the DB in a partial state. | Use IF EXISTS / IF NOT EXISTS guards on all DDL statements. |
Full rule documentation: https://squawkhq.com/docs/
Running locally
make lint.migrations # offline, no DB required — same as CI
make lint.migrations.local # also pings 127.0.0.1:5432/onramp_local for connectivity sanityBoth render new migrations to SQL via uv run flask db upgrade <down>:<new> --sql and pipe to bun run squawk. Neither needs schema knowledge from a live DB.
Contract phase — how to safely drop
TL;DR. Dropping is the dangerous half. Use two releases minimum: release N stops the app from reading/writing the column, release N+1 drops it. Squawk's
ban-drop-columnblocks accidents — the workflow below shows how to drop on purpose. Time waits are a floor, not a gate; the real proof is "zero traffic on the old path."
Quick start — the contract migration template
"""contract: drop or_ramps.attempt_account_name_match
Revision ID: ...
Revises: ...
Create Date: 2026-06-01 09:00:00.000000
Contract phase. Parent expand PR: #8543 (ONRAMP-4810).
Column removed from app code + ORM in PR #8612, deployed 2026-05-20.
Cooldown: 12 days; zero references in app/, onramp-agents/, app/ui-*.
"""
import sqlalchemy as sa
from alembic import op
revision = "..."
down_revision = "..."
branch_labels = None
depends_on = None
def upgrade():
op.execute(
"-- squawk-ignore ban-drop-column,require-timeout-settings,prefer-robust-stmts"
)
op.drop_column("or_ramps", "attempt_account_name_match")
def downgrade():
# Re-add as nullable; backfill is lost.
op.add_column(
"or_ramps",
sa.Column("attempt_account_name_match", sa.Boolean(), nullable=True),
)The five gates
A contract PR must satisfy all five before merge:
- [ ] 1. Code path removed in a prior, already-deployed release. No reads, no writes, no analytics, no agent SQL-tool exposure.
- [ ] 2. ORM ignores the column in that prior release. SQLAlchemy
mapped_columndeleted from*_models.py, or excluded via__mapper_args__. Done in release N — not the same PR as the drop. - [ ] 3. Cooldown observed. At minimum, the expand/migrate PR has been merged + deployed to prod ≥ 14 days. Preferably gated on a feature flag at 100% with zero error events.
- [ ] 4. Grep evidence in PR description.
rg "<column_name>" --type py --type vue --type ts onramp-agents/ app/ playwright/returns zero hits outsidemigrations/. - [ ] 5. Second reviewer — platform / DBA owner, not the original author.
Each gate is independent. Skipping any one of them is how production incidents happen.
Why two releases? Why can't I drop in the same PR that removed the code?
Rolling deploys mean old pods keep serving traffic for minutes-to-hours after new pods come up. If release N+0 both removes the ORM mapping AND drops the column, the old pod's in-flight transactions still reference the column → 500s. Same applies to async workers, scheduled jobs, replication consumers, agent SQL-tool queries cached for a session.
Strong Migrations (Rails ecosystem) puts it this way: "Ignoring and dropping columns should not occur simultaneously in the same release — first ignore the column (release M), then drop it in the next release (release M+1)." Same logic applies to SQLAlchemy. (Strong Migrations docs)
Squawk's own ban-drop-column guidance: "Update your application code to no longer read or write the column," then "delete the column once queries no longer select or modify it" — explicitly two steps. (Squawk ban-drop-column)
The ignore-then-drop pattern in SQLAlchemy
Release N (expand-PR follow-up, before contract):
- Delete the
mapped_columndeclaration entirely fromapp/api/<domain>/or_models.py. Or, if the column is referenced elsewhere transitively, mark itdeferred=Trueand stop reading it. - Remove every
Model.column_namereference in services, helpers, controllers, DTOs, mappers. - Remove every raw-SQL reference (search
app/,onramp-agents/,migrations/not-yet-applied). - Ship + deploy to prod. Verify error rates clean for ≥ cooldown window.
Release N+1 (contract PR — the one in this template):
- Migration file only. No app code changes. Drop the column.
Release N+2 (optional cleanup):
- Remove any
deferredmarkers or compat shims that bridged the gap.
Preferred: feature flag as the cooldown gate
Calendar waits are a heuristic. The real signal is zero traffic on the old path. A feature flag gives you that proof:
- Expand-phase PR introduces a flag (e.g.,
ramps.use-new-account-match-flow). Default off. - Code reads/writes both columns when the flag is off; writes only the new column when on.
- Roll the flag to 1% → 10% → 100% over your normal feature-flag cadence.
- Observe error rates + agent-SQL-tool query patterns at 100% for 7 days.
- Open the contract PR. The "cooldown" is now provable: 7 days of 100%-flag traffic with no errors.
This pattern is the 2026 industry default — beats time-based cooldowns because it surfaces forgotten readers during rollout instead of after the drop. (Featureflow guide)
Extra safety: the _deprecated_ rename trick
For high-value columns (anything in or_projects, or_tasks, anything customer-facing), insert an extra ceremonial step between release N and release N+1:
- Release N+0.5: rename
column_name→_deprecated_column_name. Any forgotten reader will crash on a missing column instead of silently working until drop day. Cheap, durable canary. - Wait one deploy window.
- Release N+1: drop the renamed column.
The rename itself triggers Squawk's renaming-column rule, so it needs the same -- squawk-ignore ceremony. Atlas v0.37+ supports this as a first-class pattern via the allow_column { match = "_deprecated_.+" } policy — we don't use Atlas, but the convention is the same. (Atlas v0.37 release notes)
Objective usage proof (advanced)
When grep + feature flag aren't enough — typically for columns referenced via dynamic SQL or agent-generated queries — query Postgres directly:
-- Has this column been touched in the last 14 days?
SELECT query, calls, last_call
FROM pg_stat_statements
WHERE query ILIKE '%attempt_account_name_match%'
AND last_call > now() - interval '14 days';Zero rows = no live SQL traffic. Paste output into the contract PR description. Also worth scanning agent SQL-tool MLflow traces (see docs/sql-tool-architecture.md) since the agent generates SQL at runtime that won't appear in your grep.
Preventing forgotten contracts
The most common failure isn't dropping too early — it's never dropping at all, and the schema accumulates stale columns for years.
Suggested countermeasures:
- Auto-file a follow-up ticket on expand merge. When a PR labeled
migrationadds a new column intended to replace an old one, the merge bot opens anONRAMP-####"Contractor_ramps.<col>" ticket with due date = today + 21 days and assignee = expand-PR author. - Quarterly stale-column report. Scheduled agent (we have AgentCore + intel runtime) runs the
pg_stat_statementsquery above against every column not auto-classified inAUTO_EXPOSED_COLUMN_NAMES. Posts contract candidates to#engineeringmonthly. - Tracking field in expand PR template. "Contract follow-up ticket: ONRAMP-####" — required before expand can merge.
None of these are implemented yet — file an issue if you want to take one on.
Escape hatch syntax (mechanics)
The contract-migration template above uses Squawk's -- squawk-ignore directive. Mechanics worth knowing:
- Directive applies to the next statement only.
- Multiple rules comma-separated:
-- squawk-ignore ban-drop-column,require-timeout-settings. scripts/lint-migrations.shstrips the trailing;and blank line that Alembic emits aroundop.execute(), so the directive lands flush against the target statement after rendering. You don't need to do anything special — write the comment, let the script normalize.- Never use
-- squawk-ignore-file— too broad. Always name the specific rules. - The migration file header must reference the parent expand PR and Jira ticket (see template above).
If you find yourself adding squawk-ignore for any reason other than a contract-phase drop, stop and ask. There's almost always a more idiomatic expansion that doesn't need the ignore.
Further reading
- Prisma: Expand and Contract Pattern — primary reference for the pattern
- Squawk rule docs — rule-by-rule explanations with fix examples
- Squawk
ban-drop-column— official drop workflow - Strong Migrations (Rails) — Rails ecosystem; the
ignored_columnspattern is the canonical reference for two-release drops - GitLab: avoiding downtime in migrations —
db/post_migratedirectory pattern; useful precedent if we ever want to enforce sequencing at the directory level - pgroll — full expand/contract orchestration tool; instructive even if we don't adopt
- PlanetScale Rails gem — what a fully automated ignore-then-drop pipeline looks like
- Feature flags + DB migrations — flag-driven cooldowns