Skip to content

Migrations — Expand-and-Contract

OnRamp runs zero-downtime deploys against a live, multi-tenant Postgres. Schema changes that lock tables or break existing queries will cause 500s during the deploy window. Expand-and-contract is the pattern that prevents this.

Reference: Prisma expand-and-contract guide


The four phases

PhaseWhat happensOnRamp example
ExpandAdd the new structure alongside the old. Both old and new app code work.Add nullable attempt_account_name_match column to or_ramps with server_default.
Migrate dataBackfill existing rows; run in batches to avoid lock contention.UPDATE or_ramps SET attempt_account_name_match = true WHERE ... in a follow-on migration.
ContractRemove the old structure once no app code references it. Separate PR, separate deploy — see Contract phase.
CleanupRemove any dual-write compatibility shims in application code.Delete fallback logic that read from both old and new column.

Anti-patterns Squawk catches

Click for the full rule-by-rule table
RuleWhy it breaks zero-downtimeExpand-and-contract alternative
ban-drop-columnExisting queries/ORMs break mid-deploy if they reference the column.Ship code removal first; drop column in a later contract-phase migration.
ban-drop-tableSame as above at table scope.Remove all references from app code before dropping.
ban-drop-not-nullDropping NOT NULL on a column used by active queries can change query plans and behavior unexpectedly.Audit all callers first.
renaming-columnAny in-flight query using the old name fails immediately.Expand: add new column, dual-write, migrate, contract.
renaming-tableSame as renaming-column at table scope.Expand: new table, dual-write, contract.
changing-column-typePostgres may rewrite the entire table, holding an ACCESS EXCLUSIVE lock.Add new typed column, backfill, update app, drop old.
adding-required-fieldNOT NULL without a default triggers a full table rewrite (pre-PG 11) or blocks reads on large tables.Nullable first, backfill, add constraint with NOT VALID, validate separately.
constraint-missing-not-validValidating a constraint inline locks the table.ADD CONSTRAINT ... NOT VALID, then VALIDATE CONSTRAINT in a separate transaction.
require-concurrent-index-creationCREATE INDEX without CONCURRENTLY holds a write-blocking lock for the duration.Always use CREATE INDEX CONCURRENTLY. Requires op.execute() outside a transaction block.
require-concurrent-index-deletionSame issue for DROP INDEX.DROP INDEX CONCURRENTLY.
disallowed-unique-constraintADD CONSTRAINT ... UNIQUE blocks writes during build.CREATE UNIQUE INDEX CONCURRENTLY, then ADD CONSTRAINT ... USING INDEX.
transaction-nestingWrapping a concurrent-index operation in a transaction negates the concurrency.Run concurrent DDL outside BEGIN/COMMIT blocks.
prefer-robust-stmtsMigrations that fail halfway leave the DB in a partial state.Use IF EXISTS / IF NOT EXISTS guards on all DDL statements.

Full rule documentation: https://squawkhq.com/docs/


Running locally

bash
make lint.migrations           # offline, no DB required — same as CI
make lint.migrations.local     # also pings 127.0.0.1:5432/onramp_local for connectivity sanity

Both render new migrations to SQL via uv run flask db upgrade <down>:<new> --sql and pipe to bun run squawk. Neither needs schema knowledge from a live DB.


Contract phase — how to safely drop

TL;DR. Dropping is the dangerous half. Use two releases minimum: release N stops the app from reading/writing the column, release N+1 drops it. Squawk's ban-drop-column blocks accidents — the workflow below shows how to drop on purpose. Time waits are a floor, not a gate; the real proof is "zero traffic on the old path."

Quick start — the contract migration template

python
"""contract: drop or_ramps.attempt_account_name_match

Revision ID: ...
Revises: ...
Create Date: 2026-06-01 09:00:00.000000

Contract phase. Parent expand PR: #8543 (ONRAMP-4810).
Column removed from app code + ORM in PR #8612, deployed 2026-05-20.
Cooldown: 12 days; zero references in app/, onramp-agents/, app/ui-*.
"""

import sqlalchemy as sa
from alembic import op

revision = "..."
down_revision = "..."
branch_labels = None
depends_on = None


def upgrade():
    op.execute(
        "-- squawk-ignore ban-drop-column,require-timeout-settings,prefer-robust-stmts"
    )
    op.drop_column("or_ramps", "attempt_account_name_match")


def downgrade():
    # Re-add as nullable; backfill is lost.
    op.add_column(
        "or_ramps",
        sa.Column("attempt_account_name_match", sa.Boolean(), nullable=True),
    )

The five gates

A contract PR must satisfy all five before merge:

  • [ ] 1. Code path removed in a prior, already-deployed release. No reads, no writes, no analytics, no agent SQL-tool exposure.
  • [ ] 2. ORM ignores the column in that prior release. SQLAlchemy mapped_column deleted from *_models.py, or excluded via __mapper_args__. Done in release N — not the same PR as the drop.
  • [ ] 3. Cooldown observed. At minimum, the expand/migrate PR has been merged + deployed to prod ≥ 14 days. Preferably gated on a feature flag at 100% with zero error events.
  • [ ] 4. Grep evidence in PR description. rg "<column_name>" --type py --type vue --type ts onramp-agents/ app/ playwright/ returns zero hits outside migrations/.
  • [ ] 5. Second reviewer — platform / DBA owner, not the original author.

Each gate is independent. Skipping any one of them is how production incidents happen.

Why two releases? Why can't I drop in the same PR that removed the code?

Rolling deploys mean old pods keep serving traffic for minutes-to-hours after new pods come up. If release N+0 both removes the ORM mapping AND drops the column, the old pod's in-flight transactions still reference the column → 500s. Same applies to async workers, scheduled jobs, replication consumers, agent SQL-tool queries cached for a session.

Strong Migrations (Rails ecosystem) puts it this way: "Ignoring and dropping columns should not occur simultaneously in the same release — first ignore the column (release M), then drop it in the next release (release M+1)." Same logic applies to SQLAlchemy. (Strong Migrations docs)

Squawk's own ban-drop-column guidance: "Update your application code to no longer read or write the column," then "delete the column once queries no longer select or modify it" — explicitly two steps. (Squawk ban-drop-column)

The ignore-then-drop pattern in SQLAlchemy

Release N (expand-PR follow-up, before contract):

  1. Delete the mapped_column declaration entirely from app/api/<domain>/or_models.py. Or, if the column is referenced elsewhere transitively, mark it deferred=True and stop reading it.
  2. Remove every Model.column_name reference in services, helpers, controllers, DTOs, mappers.
  3. Remove every raw-SQL reference (search app/, onramp-agents/, migrations/ not-yet-applied).
  4. Ship + deploy to prod. Verify error rates clean for ≥ cooldown window.

Release N+1 (contract PR — the one in this template):

  • Migration file only. No app code changes. Drop the column.

Release N+2 (optional cleanup):

  • Remove any deferred markers or compat shims that bridged the gap.
Preferred: feature flag as the cooldown gate

Calendar waits are a heuristic. The real signal is zero traffic on the old path. A feature flag gives you that proof:

  1. Expand-phase PR introduces a flag (e.g., ramps.use-new-account-match-flow). Default off.
  2. Code reads/writes both columns when the flag is off; writes only the new column when on.
  3. Roll the flag to 1% → 10% → 100% over your normal feature-flag cadence.
  4. Observe error rates + agent-SQL-tool query patterns at 100% for 7 days.
  5. Open the contract PR. The "cooldown" is now provable: 7 days of 100%-flag traffic with no errors.

This pattern is the 2026 industry default — beats time-based cooldowns because it surfaces forgotten readers during rollout instead of after the drop. (Featureflow guide)

Extra safety: the _deprecated_ rename trick

For high-value columns (anything in or_projects, or_tasks, anything customer-facing), insert an extra ceremonial step between release N and release N+1:

  • Release N+0.5: rename column_name_deprecated_column_name. Any forgotten reader will crash on a missing column instead of silently working until drop day. Cheap, durable canary.
  • Wait one deploy window.
  • Release N+1: drop the renamed column.

The rename itself triggers Squawk's renaming-column rule, so it needs the same -- squawk-ignore ceremony. Atlas v0.37+ supports this as a first-class pattern via the allow_column { match = "_deprecated_.+" } policy — we don't use Atlas, but the convention is the same. (Atlas v0.37 release notes)

Objective usage proof (advanced)

When grep + feature flag aren't enough — typically for columns referenced via dynamic SQL or agent-generated queries — query Postgres directly:

sql
-- Has this column been touched in the last 14 days?
SELECT query, calls, last_call
FROM pg_stat_statements
WHERE query ILIKE '%attempt_account_name_match%'
  AND last_call > now() - interval '14 days';

Zero rows = no live SQL traffic. Paste output into the contract PR description. Also worth scanning agent SQL-tool MLflow traces (see docs/sql-tool-architecture.md) since the agent generates SQL at runtime that won't appear in your grep.

Preventing forgotten contracts

The most common failure isn't dropping too early — it's never dropping at all, and the schema accumulates stale columns for years.

Suggested countermeasures:

  1. Auto-file a follow-up ticket on expand merge. When a PR labeled migration adds a new column intended to replace an old one, the merge bot opens an ONRAMP-#### "Contract or_ramps.<col>" ticket with due date = today + 21 days and assignee = expand-PR author.
  2. Quarterly stale-column report. Scheduled agent (we have AgentCore + intel runtime) runs the pg_stat_statements query above against every column not auto-classified in AUTO_EXPOSED_COLUMN_NAMES. Posts contract candidates to #engineering monthly.
  3. Tracking field in expand PR template. "Contract follow-up ticket: ONRAMP-####" — required before expand can merge.

None of these are implemented yet — file an issue if you want to take one on.


Escape hatch syntax (mechanics)

The contract-migration template above uses Squawk's -- squawk-ignore directive. Mechanics worth knowing:

  • Directive applies to the next statement only.
  • Multiple rules comma-separated: -- squawk-ignore ban-drop-column,require-timeout-settings.
  • scripts/lint-migrations.sh strips the trailing ; and blank line that Alembic emits around op.execute(), so the directive lands flush against the target statement after rendering. You don't need to do anything special — write the comment, let the script normalize.
  • Never use -- squawk-ignore-file — too broad. Always name the specific rules.
  • The migration file header must reference the parent expand PR and Jira ticket (see template above).

If you find yourself adding squawk-ignore for any reason other than a contract-phase drop, stop and ask. There's almost always a more idiomatic expansion that doesn't need the ignore.


Further reading

Internal documentation — gated behind Cloudflare Access.