I Built a Multi-Tenant SaaS for 50+ Tenants — Here’s the Complete Architecture
Six months into building CitizenApp — a GDPR-compliant citizen management SaaS — a customer asked: “Are you sure my data is completely separate from other organisations using this?”
I said yes. Then I went and checked the code. I found three endpoints that could theoretically return cross-tenant data if a specific race condition hit. Nothing had leaked. But it could have.
That conversation triggered a complete rearchitecture of our tenant isolation stack. This post documents every significant decision I made — and what I’d do differently if I started over today.
The First Decision: Choosing Your Isolation Model
Everything downstream depends on how you isolate tenants at the data layer. Three options:
| Strategy | Isolation | Complexity | Migration cost | Cross-tenant queries |
|---|---|---|---|---|
tenant_id column | Logical | Low | Low | Easy |
| Schema per tenant | Physical | Medium | High | Hard |
| Database per tenant | Full | High | Very high | Impossible |
For CitizenApp, I chose tenant_id column on every table, enforced at two layers: the application repository and PostgreSQL Row-Level Security. Here’s why:
- Scale: Under 500 tenants at launch. Schema-per-tenant complexity pays off at 10,000+ tenants when you need schema-level customisation per client.
- GDPR compliance: Logical isolation with RLS is sufficient for GDPR. Physical isolation adds ops complexity without a compliance benefit at this scale.
- Query patterns: I need cross-tenant analytics in the admin dashboard. Schema-per-tenant makes this significantly harder.
- Migration tooling: Alembic works cleanly with a single schema. Per-schema migrations require custom tooling.
Every data model inherits from a base mixin:
class TenantMixin:
tenant_id: Mapped[int] = mapped_column(
ForeignKey("tenants.id"), nullable=False, index=True
)
class VatandasModel(TenantMixin, Base):
__tablename__ = "vatandas"
id: Mapped[int] = mapped_column(primary_key=True)
ad_soyad: Mapped[str] = mapped_column(String(200))
tc_no: Mapped[str] = mapped_column(String(11)) # encrypted at rest
The Lesson I Learned the Hard Way: Application-Layer Isolation Isn’t Enough
Most teams stop here:
@router.get("/users")
async def list_users(current_user: User = Depends(get_current_user), db: Session = Depends(get_db)):
return db.query(User).filter(User.tenant_id == current_user.tenant_id).all()
This looks safe. It isn’t — not as a guarantee:
- Forgotten filters: A new endpoint queries
Userwithout the tenant check. Easy mistake, invisible in tests. - Scope creep: An admin panel needs to see all users, so you bypass the filter. Now that code path exists, and someone copies it.
- N+1 relationships: You load users, loop through, load their audit logs. The second query forgets the tenant filter.
- Background jobs: A Celery task runs as a “system user” with
tenant_id = None. It can see everything.
These bugs are invisible until they’re exploited. Your tests pass because they run within a single tenant context.
The Fix: PostgreSQL Row-Level Security as the Second Enforcement Layer
RLS lives in the database. PostgreSQL evaluates it before returning any row. You cannot read data you’re not allowed to read — the database won’t let you, regardless of what application code runs.
Here’s the pattern:
-- Enable RLS on every tenant-scoped table
ALTER TABLE vatandas ENABLE ROW LEVEL SECURITY;
ALTER TABLE vatandas FORCE ROW LEVEL SECURITY;
-- Policy: only see rows matching the session's tenant
CREATE POLICY vatandas_tenant_isolation ON vatandas
FOR ALL
USING (tenant_id = current_setting('app.current_tenant_id')::int);
-- Superuser/admin bypass
CREATE POLICY vatandas_admin_all ON vatandas
FOR ALL
USING (current_setting('app.is_admin', true)::boolean = true);
The key is current_setting('app.current_tenant_id'). This is a PostgreSQL session variable your application sets after authentication. The database reads it on every query — automatically, everywhere.
Cascade this to every table. Once you set the session context once, every query across all tenant-scoped tables is filtered — JOINs, subqueries, transactions touching three tables:
ALTER TABLE audit_logs ENABLE ROW LEVEL SECURITY;
CREATE POLICY audit_tenant_isolation ON audit_logs
FOR ALL USING (tenant_id = current_setting('app.current_tenant_id')::int);
ALTER TABLE documents ENABLE ROW LEVEL SECURITY;
CREATE POLICY documents_tenant_isolation ON documents
FOR ALL USING (tenant_id = current_setting('app.current_tenant_id')::int);
Wiring RLS into FastAPI + SQLAlchemy
Set the context once per request, immediately after JWT verification:
from sqlalchemy import create_engine, text
from sqlalchemy.orm import sessionmaker, Session
engine = create_engine("postgresql://...", echo=False)
SessionLocal = sessionmaker(bind=engine)
def set_rls_context(session: Session, tenant_id: int, is_admin: bool = False):
"""Set PostgreSQL session variables for RLS enforcement."""
session.execute(text("SET LOCAL app.current_tenant_id = :tid"), {"tid": tenant_id})
session.execute(text("SET LOCAL app.is_admin = :admin"), {"admin": str(is_admin).lower()})
async def get_db(current_user: User = Depends(get_current_user)) -> Session:
session = SessionLocal()
try:
set_rls_context(
session,
tenant_id=current_user.tenant_id,
is_admin=current_user.role == "admin"
)
yield session
finally:
session.close()
Now your endpoints are clean — no manual tenant filtering:
@router.get("/citizens")
async def list_citizens(db: Session = Depends(get_db)):
# No .filter(tenant_id == ...) needed — RLS handles it
return db.query(VatandasModel).all()
The Connection Pooling Gotcha
Session variables don’t persist across physical connections in a connection pool. If you use PgBouncer or similar, a subsequent request may get a different connection with no context set.
Solution: Use SET LOCAL (not SET) — it scopes the variable to the current transaction. Alternatively, set it at the start of every request via middleware:
class TenantMiddleware(BaseHTTPMiddleware):
async def dispatch(self, request, call_next):
token = request.headers.get("authorization", "").replace("Bearer ", "")
payload = jwt.decode(token, settings.SECRET_KEY, algorithms=["HS256"])
request.state.tenant_id = payload["tenant_id"]
response = await call_next(request)
return response
The Migration Gotcha
Alembic migrations run as a privileged role. If FORCE ROW LEVEL SECURITY is set and that role has RLS applied, migrations can fail. Handle it:
# alembic/env.py
def run_migrations_online():
with connectable.connect() as connection:
# Temporarily bypass RLS for migration scripts
connection.execute(text("SET app.current_tenant_id = '0'"))
connection.execute(text("SET app.is_admin = 'true'"))
context.configure(connection=connection, target_metadata=target_metadata)
with context.begin_transaction():
context.run_migrations()
Local Development Gotcha
current_setting() raises an error if not set (unless you use the two-argument form current_setting('app.current_tenant_id', true) which returns NULL). A NULL tenant_id means zero rows returned — which is the safe default, but confusing in dev. Always set a test context in fixtures:
@pytest.fixture
def db_with_rls():
session = SessionLocal()
set_rls_context(session, tenant_id=1)
yield session
session.close()
JWT Design: What’s in the Token and Why
Token design has permanent consequences. Tokens issued today will be used until they expire — you can’t retroactively change their structure without forcing re-login.
CitizenApp’s access token payload:
{
"sub": "42", # user_id
"tenant_id": 7, # embedded — no DB lookup per request
"role": "operator", # embedded — no DB lookup per request
"email": "u@co.com", # for audit logging without a DB hit
"exp": 1716307200, # 15-minute expiry
"type": "access"
}
Why embed tenant_id and role directly: Every request needs these. If they’re not in the token, every API call requires a database lookup. At 100 requests/second, that’s 100 unnecessary DB queries/second. Token-embedded claims mean zero DB round-trips for auth.
The tradeoff: A role change takes up to 15 minutes to propagate. Acceptable for CitizenApp. If you need immediate revocation, use short-lived tokens (5 minutes) plus a Redis blocklist.
Refresh token rotation with theft detection:
async def rotate_refresh_token(db: AsyncSession, old_token: str) -> tuple[str, str]:
stored = await get_refresh_token(db, old_token)
if not stored or stored.is_revoked:
# Token reuse detected — possible theft
await revoke_token_family(db, stored.family_id)
raise HTTPException(status_code=401, detail="Token reuse detected")
await revoke_token(db, old_token)
new_access = create_access_token(stored.user_id, stored.tenant_id)
new_refresh = await create_refresh_token(db, stored.user_id, stored.family_id)
return new_access, new_refresh
If a refresh token is used twice, the entire token family is revoked. This stops an attacker even if they intercepted the token first.
RBAC: Three Roles, One Decision Point
CitizenApp has three roles: admin, operator, viewer. I deliberately did not implement fine-grained permissions.
Why not: Premature flexibility. At three roles with clear semantics, a permission matrix adds indirection without value. The hierarchy:
viewer→ read-only access to citizen dataoperator→ read + write citizen data, run importsadmin→ everything + user management + billing + tenant settings
def require_role(*roles: str):
def dependency(current_user: UserModel = Depends(get_current_user)):
if current_user.role not in roles:
raise HTTPException(status_code=403, detail="Insufficient permissions")
return current_user
return dependency
@router.delete("/citizens/{id}")
async def delete_citizen(
id: int,
user: UserModel = Depends(require_role("admin", "operator")),
repo: CitizenRepository = Depends(get_citizen_repo),
):
...
Clean, readable, testable. When the permission model needs to evolve, it’s one function.
Encryption at Rest: Fernet Over pgcrypto
PII like the TC identity number (Turkish national ID) must be encrypted at rest. Three options:
- Application-level (Fernet/AES) — encrypt in Python before INSERT
- pgcrypto — encrypt at the database level
- Transparent Data Encryption — filesystem/disk level
I chose application-level Fernet:
from cryptography.fernet import Fernet
class EncryptionService:
def __init__(self, key: bytes):
self.fernet = Fernet(key)
def encrypt(self, value: str) -> str:
return self.fernet.encrypt(value.encode()).decode()
def decrypt(self, value: str) -> str:
return self.fernet.decrypt(value.encode()).decode()
# SQLAlchemy event listener — transparent to the rest of the code
@event.listens_for(VatandasModel.tc_no, "set")
def encrypt_tc_no(target, value, oldvalue, initiator):
if value and not value.startswith("gAAAAA"): # not already encrypted
target.tc_no = encryption_service.encrypt(value)
Why not pgcrypto: The encryption key must be available to PostgreSQL, meaning it’s accessible to anyone with database access. Application-level keeps the key in the application environment (Render secrets), completely separate from the database.
Key rotation is zero-downtime with MultiFernet:
# During rotation: old ciphertext decrypts with old key, new data uses new key
fernet = MultiFernet([new_key_fernet, old_key_fernet])
Async SQLAlchemy: The Patterns That Bit Me
CitizenApp uses FastAPI + asyncpg + async SQLAlchemy. Correct choice, but with sharp edges.
Session scoping — auto-commit/rollback:
async def get_db() -> AsyncGenerator[AsyncSession, None]:
async with AsyncSessionLocal() as session:
try:
yield session
await session.commit()
except Exception:
await session.rollback()
raise
The N+1 trap — async doesn’t protect you from it:
# BAD: N queries inside a loop
citizens = await repo.get_all()
for c in citizens:
c.audit_count = await audit_repo.count_for(c.id) # N extra queries
# GOOD: single query with aggregation
stmt = (
select(VatandasModel, func.count(AuditModel.id).label("audit_count"))
.outerjoin(AuditModel, AuditModel.citizen_id == VatandasModel.id)
.where(VatandasModel.tenant_id == tenant_id)
.group_by(VatandasModel.id)
)
The lazy loading trap — async SQLAlchemy doesn’t support lazy loading. Accessing a relationship outside a transaction raises MissingGreenlet. Declare what you need upfront:
stmt = (
select(UserModel)
.options(selectinload(UserModel.tenant)) # explicit eager load
.where(UserModel.id == user_id)
)
Alembic Migrations: Zero-Downtime Rules
Every schema change is an Alembic migration. No exceptions.
Three rules I follow strictly:
1. Never rename a column directly. Add the new column → backfill → update application code → drop old column in a separate migration.
2. New NOT NULL columns need a DEFAULT or two-step migration. Adding NOT NULL without a default takes a table lock.
3. Always create indexes CONCURRENTLY:
# In Alembic migration — standard CREATE INDEX locks the table
op.execute(
"CREATE INDEX CONCURRENTLY IF NOT EXISTS ix_vatandas_tenant_created "
"ON vatandas(tenant_id, created_at DESC)"
)
Alembic doesn’t generate CONCURRENTLY by default. Write it manually.
Frontend: React 19 + TanStack Query
State management: No Redux, no Zustand. TanStack Query for server state, useState for local UI state. The vast majority of React state in a SaaS is server data — citizen lists, audit logs, user lists. TanStack Query handles all of it.
React 19 useOptimistic for instant UX:
const [optimisticCitizens, addOptimistic] = useOptimistic(citizens);
async function handleDelete(id: number) {
addOptimistic(prev => prev.filter(c => c.id !== id)); // instant UI update
try {
await deleteCitizen(id);
await queryClient.invalidateQueries(["citizens"]);
} catch {
// useOptimistic rolls back automatically on error
toast.error("Delete failed");
}
}
The UI updates immediately. The API catches up. If the call fails, the optimistic update rolls back automatically — no manual state cleanup.
Tenant context from JWT (frontend-side):
function useTenant() {
const token = localStorage.getItem("access_token");
if (!token) return null;
const payload = JSON.parse(atob(token.split(".")[1]));
return { id: payload.tenant_id, plan: payload.plan };
}
The frontend decodes the JWT without verification (signature verification is the server’s job) to personalise the UI. The backend always validates the token independently — showing wrong UI data is a UX problem, returning wrong API data is a security problem.
What I’d Change Starting Over
1. RLS from day one.
Retrofitting RLS on an existing production schema with live tenants is painful. It should be in the first migration.
2. Separate read and write models.
List endpoints do too much: filter, sort, paginate, aggregate audit counts. Dedicated read models or database views optimised for list queries would be cleaner than the combined write models handling both.
3. Formal API versioning from day one.
CitizenApp is on v1 with no formal versioning. For multi-tenant SaaS where different tenants may be on different frontend versions, /api/v1/ from the start with a migration plan matters.
4. Event sourcing for audit.
The current audit log records events but not full before/after state. Reconstructing “what did this record look like at this point in time” requires replaying events. Full snapshots in the audit table would make compliance reporting significantly easier.
The Result
After implementing RLS alongside the repository pattern:
- Zero cross-tenant data incidents in 6 months of production
- Application bugs that would have been security holes became no-ops — the database simply returned nothing
- New developers on the team can’t accidentally bypass isolation — the DB layer doesn’t let them
Multi-tenant architecture is about defaults that are safe by construction. The goal isn’t to write perfect code — it’s to make dangerous mistakes structurally impossible.
Building a multi-tenant SaaS and want a second opinion on the architecture? I’m available for consulting.