Migrations — Expand-and-Contract

OnRamp runs zero-downtime deploys against a live, multi-tenant Postgres. Schema changes that lock tables or break existing queries will cause 500s during the deploy window. Expand-and-contract is the pattern that prevents this.

Reference: Prisma expand-and-contract guide

The four phases

Phase	What happens	OnRamp example
Expand	Add the new structure alongside the old. Both old and new app code work.	Add nullable `attempt_account_name_match` column to `or_ramps` with `server_default`.
Migrate data	Backfill existing rows; run in batches to avoid lock contention.	`UPDATE or_ramps SET attempt_account_name_match = true WHERE ...` in a follow-on migration.
Contract	Remove the old structure once no app code references it. Separate PR, separate deploy — see Contract phase.
Cleanup	Remove any dual-write compatibility shims in application code.	Delete fallback logic that read from both old and new column.

Anti-patterns Squawk catches

Click for the full rule-by-rule table

Rule	Why it breaks zero-downtime	Expand-and-contract alternative
`ban-drop-column`	Existing queries/ORMs break mid-deploy if they reference the column.	Ship code removal first; drop column in a later contract-phase migration.
`ban-drop-table`	Same as above at table scope.	Remove all references from app code before dropping.
`ban-drop-not-null`	Dropping NOT NULL on a column used by active queries can change query plans and behavior unexpectedly.	Audit all callers first.
`renaming-column`	Any in-flight query using the old name fails immediately.	Expand: add new column, dual-write, migrate, contract.
`renaming-table`	Same as renaming-column at table scope.	Expand: new table, dual-write, contract.
`changing-column-type`	Postgres may rewrite the entire table, holding an `ACCESS EXCLUSIVE` lock.	Add new typed column, backfill, update app, drop old.
`adding-required-field`	`NOT NULL` without a default triggers a full table rewrite (pre-PG 11) or blocks reads on large tables.	Nullable first, backfill, add constraint with `NOT VALID`, validate separately.
`constraint-missing-not-valid`	Validating a constraint inline locks the table.	`ADD CONSTRAINT ... NOT VALID`, then `VALIDATE CONSTRAINT` in a separate transaction.
`require-concurrent-index-creation`	`CREATE INDEX` without `CONCURRENTLY` holds a write-blocking lock for the duration.	Always use `CREATE INDEX CONCURRENTLY`. Requires `op.execute()` outside a transaction block.
`require-concurrent-index-deletion`	Same issue for `DROP INDEX`.	`DROP INDEX CONCURRENTLY`.
`disallowed-unique-constraint`	`ADD CONSTRAINT ... UNIQUE` blocks writes during build.	`CREATE UNIQUE INDEX CONCURRENTLY`, then `ADD CONSTRAINT ... USING INDEX`.
`transaction-nesting`	Wrapping a concurrent-index operation in a transaction negates the concurrency.	Run concurrent DDL outside `BEGIN`/`COMMIT` blocks.
`prefer-robust-stmts`	Migrations that fail halfway leave the DB in a partial state.	Use `IF EXISTS` / `IF NOT EXISTS` guards on all DDL statements.

Full rule documentation: https://squawkhq.com/docs/

Running locally

bash

make lint.migrations           # offline, no DB required — same as CI
make lint.migrations.local     # also pings 127.0.0.1:5432/onramp_local for connectivity sanity

Both render new migrations to SQL via uv run flask db upgrade <down>:<new> --sql and pipe to bun run squawk. Neither needs schema knowledge from a live DB.

Contract phase — how to safely drop

TL;DR. Dropping is the dangerous half. Use two releases minimum: release N stops the app from reading/writing the column, release N+1 drops it. Squawk's ban-drop-column blocks accidents — the workflow below shows how to drop on purpose. Time waits are a floor, not a gate; the real proof is "zero traffic on the old path."

Quick start — the contract migration template

python

"""contract: drop or_ramps.attempt_account_name_match

Revision ID: ...
Revises: ...
Create Date: 2026-06-01 09:00:00.000000

Contract phase. Parent expand PR: #8543 (ONRAMP-4810).
Column removed from app code + ORM in PR #8612, deployed 2026-05-20.
Cooldown: 12 days; zero references in app/, onramp-agents/, app/ui-*.
"""

import sqlalchemy as sa
from alembic import op

revision = "..."
down_revision = "..."
branch_labels = None
depends_on = None


def upgrade():
    op.execute(
        "-- squawk-ignore ban-drop-column,require-timeout-settings,prefer-robust-stmts"
    )
    op.drop_column("or_ramps", "attempt_account_name_match")


def downgrade():
    # Re-add as nullable; backfill is lost.
    op.add_column(
        "or_ramps",
        sa.Column("attempt_account_name_match", sa.Boolean(), nullable=True),
    )

The five gates

A contract PR must satisfy all five before merge:

[ ] 1. Code path removed in a prior, already-deployed release. No reads, no writes, no analytics, no agent SQL-tool exposure.
[ ] 2. ORM ignores the column in that prior release. SQLAlchemy mapped_column deleted from *_models.py, or excluded via __mapper_args__. Done in release N — not the same PR as the drop.
[ ] 3. Cooldown observed. At minimum, the expand/migrate PR has been merged + deployed to prod ≥ 14 days. Preferably gated on a feature flag at 100% with zero error events.
[ ] 4. Grep evidence in PR description. rg "<column_name>" --type py --type vue --type ts onramp-agents/ app/ playwright/ returns zero hits outside migrations/.
[ ] 5. Second reviewer — platform / DBA owner, not the original author.

Each gate is independent. Skipping any one of them is how production incidents happen.

Why two releases? Why can't I drop in the same PR that removed the code?

Rolling deploys mean old pods keep serving traffic for minutes-to-hours after new pods come up. If release N+0 both removes the ORM mapping AND drops the column, the old pod's in-flight transactions still reference the column → 500s. Same applies to async workers, scheduled jobs, replication consumers, agent SQL-tool queries cached for a session.

Strong Migrations (Rails ecosystem) puts it this way: "Ignoring and dropping columns should not occur simultaneously in the same release — first ignore the column (release M), then drop it in the next release (release M+1)." Same logic applies to SQLAlchemy. (Strong Migrations docs)

Squawk's own ban-drop-column guidance: "Update your application code to no longer read or write the column," then "delete the column once queries no longer select or modify it" — explicitly two steps. (Squawk ban-drop-column)

The ignore-then-drop pattern in SQLAlchemy

Release N (expand-PR follow-up, before contract):

Delete the mapped_column declaration entirely from app/api/<domain>/or_models.py. Or, if the column is referenced elsewhere transitively, mark it deferred=True and stop reading it.
Remove every Model.column_name reference in services, helpers, controllers, DTOs, mappers.
Remove every raw-SQL reference (search app/, onramp-agents/, migrations/ not-yet-applied).
Ship + deploy to prod. Verify error rates clean for ≥ cooldown window.

Release N+1 (contract PR — the one in this template):

Migration file only. No app code changes. Drop the column.

Release N+2 (optional cleanup):

Remove any deferred markers or compat shims that bridged the gap.

Preferred: feature flag as the cooldown gate

Calendar waits are a heuristic. The real signal is zero traffic on the old path. A feature flag gives you that proof:

Expand-phase PR introduces a flag (e.g., ramps.use-new-account-match-flow). Default off.
Code reads/writes both columns when the flag is off; writes only the new column when on.
Roll the flag to 1% → 10% → 100% over your normal feature-flag cadence.
Observe error rates + agent-SQL-tool query patterns at 100% for 7 days.
Open the contract PR. The "cooldown" is now provable: 7 days of 100%-flag traffic with no errors.

This pattern is the 2026 industry default — beats time-based cooldowns because it surfaces forgotten readers during rollout instead of after the drop. (Featureflow guide)

Extra safety: the _deprecated_ rename trick

For high-value columns (anything in or_projects, or_tasks, anything customer-facing), insert an extra ceremonial step between release N and release N+1:

Release N+0.5: rename column_name → _deprecated_column_name. Any forgotten reader will crash on a missing column instead of silently working until drop day. Cheap, durable canary.
Wait one deploy window.
Release N+1: drop the renamed column.

The rename itself triggers Squawk's renaming-column rule, so it needs the same -- squawk-ignore ceremony. Atlas v0.37+ supports this as a first-class pattern via the allow_column { match = "_deprecated_.+" } policy — we don't use Atlas, but the convention is the same. (Atlas v0.37 release notes)

Objective usage proof (advanced)

When grep + feature flag aren't enough — typically for columns referenced via dynamic SQL or agent-generated queries — query Postgres directly:

sql

-- Has this column been touched in the last 14 days?
SELECT query, calls, last_call
FROM pg_stat_statements
WHERE query ILIKE '%attempt_account_name_match%'
  AND last_call > now() - interval '14 days';

Zero rows = no live SQL traffic. Paste output into the contract PR description. Also worth scanning agent SQL-tool MLflow traces (see docs/sql-tool-architecture.md) since the agent generates SQL at runtime that won't appear in your grep.

Preventing forgotten contracts

The most common failure isn't dropping too early — it's never dropping at all, and the schema accumulates stale columns for years.

Suggested countermeasures:

Auto-file a follow-up ticket on expand merge. When a PR labeled migration adds a new column intended to replace an old one, the merge bot opens an ONRAMP-#### "Contract or_ramps.<col>" ticket with due date = today + 21 days and assignee = expand-PR author.
Quarterly stale-column report. Scheduled agent (we have AgentCore + intel runtime) runs the pg_stat_statements query above against every column not auto-classified in AUTO_EXPOSED_COLUMN_NAMES. Posts contract candidates to #engineering monthly.
Tracking field in expand PR template. "Contract follow-up ticket: ONRAMP-####" — required before expand can merge.

None of these are implemented yet — file an issue if you want to take one on.

Escape hatch syntax (mechanics)

The contract-migration template above uses Squawk's -- squawk-ignore directive. Mechanics worth knowing:

Directive applies to the next statement only.
Multiple rules comma-separated: -- squawk-ignore ban-drop-column,require-timeout-settings.
scripts/lint-migrations.sh strips the trailing ; and blank line that Alembic emits around op.execute(), so the directive lands flush against the target statement after rendering. You don't need to do anything special — write the comment, let the script normalize.
Never use -- squawk-ignore-file — too broad. Always name the specific rules.
The migration file header must reference the parent expand PR and Jira ticket (see template above).

If you find yourself adding squawk-ignore for any reason other than a contract-phase drop, stop and ask. There's almost always a more idiomatic expansion that doesn't need the ignore.

Migrations — Expand-and-Contract ​

The four phases ​

Anti-patterns Squawk catches ​

Running locally ​

Contract phase — how to safely drop ​

Quick start — the contract migration template ​

The five gates ​

Escape hatch syntax (mechanics) ​

Further reading ​