Building multi-tenant SaaS is a sequence of architecture decisions where the wrong choices compound. A tenant isolation model that seemed reasonable at 10 tenants becomes a migration nightmare at 1000. An auth pattern that “works” creates a security gap under load.
This post documents every significant architecture decision in CitizenApp — a GDPR-compliant citizen management SaaS — and the reasoning behind each one. I’ll be direct about the tradeoffs and what I’d do differently.
Tenant Isolation: Row-Level vs Schema-Per-Tenant vs Database-Per-Tenant
This is the first decision that constrains everything downstream.
Three options:
| Strategy | Isolation | Complexity | Migration cost | Cross-tenant queries |
|---|---|---|---|---|
tenant_id column | Logical | Low | Low | Easy |
| Schema per tenant | Physical | Medium | High | Hard |
| Database per tenant | Full | High | Very high | Impossible |
For CitizenApp, I chose tenant_id column on every table. Here’s the reasoning:
- Scale: Under 500 tenants at launch. Schema-per-tenant is worth the complexity at 10,000+ tenants when you need schema-level customisation.
- GDPR compliance: Logical isolation is sufficient for GDPR if implemented correctly. Physical isolation adds ops complexity without a compliance benefit at this scale.
- Query patterns: I need cross-tenant analytics for the admin dashboard (aggregate counts, not row data). Schema-per-tenant makes this significantly harder.
- Migration tooling: Alembic works cleanly with a single schema. Per-schema migrations require custom tooling.
The implementation:
Every data model inherits from a base that enforces tenant_id:
class TenantMixin:
tenant_id: Mapped[int] = mapped_column(
ForeignKey("tenants.id"), nullable=False, index=True
)
class VatandasModel(TenantMixin, Base):
__tablename__ = "vatandas"
id: Mapped[int] = mapped_column(primary_key=True)
ad_soyad: Mapped[str] = mapped_column(String(200))
tc_no: Mapped[str] = mapped_column(String(11)) # encrypted at rest
# ...
The enforcement layer — Repository pattern:
class CitizenRepository:
def __init__(self, db: AsyncSession, tenant_id: int):
self.db = db
self.tenant_id = tenant_id
async def get_all(self, skip: int = 0, limit: int = 100) -> list[VatandasModel]:
result = await self.db.execute(
select(VatandasModel)
.where(VatandasModel.tenant_id == self.tenant_id)
.offset(skip)
.limit(limit)
)
return result.scalars().all()
async def get_by_id(self, citizen_id: int) -> VatandasModel | None:
result = await self.db.execute(
select(VatandasModel)
.where(
VatandasModel.id == citizen_id,
VatandasModel.tenant_id == self.tenant_id # always scoped
)
)
return result.scalar_one_or_none()
The tenant_id filter is in the repository, not the endpoint. Endpoints never construct raw queries — they call repository methods. This makes it structurally difficult to accidentally return cross-tenant data.
What I’d add now: PostgreSQL Row-Level Security policies as a second enforcement layer. Even if application code has a bug, RLS at the database level ensures no cross-tenant data leaks. It’s belt-and-suspenders isolation.
JWT Design: What’s in the Token and Why
Token design has permanent consequences. Tokens issued today will be used until they expire — you can’t retroactively change their structure without forcing re-login.
CitizenApp’s access token payload:
{
"sub": "42", # user_id
"tenant_id": 7, # tenant scope — embedded, not looked up per request
"role": "operator", # rbac role — same reasoning
"email": "user@co.com", # for audit logging without a DB lookup
"exp": 1716307200, # 15 minutes
"type": "access"
}
Why tenant_id and role are in the token:
Every request needs these. If they’re not in the token, every API call requires a database lookup to find the user’s tenant and role. At 100 requests/second, that’s 100 unnecessary DB queries/second. With token-embedded claims, the FastAPI middleware extracts everything it needs from the token signature verification — zero DB round-trips for auth.
The tradeoff: If you change a user’s role, the old token is still valid for up to 15 minutes. This is acceptable for CitizenApp. If immediate revocation were required, I’d use short-lived tokens (5 minutes) plus a Redis token blocklist.
Refresh token rotation:
async def rotate_refresh_token(
db: AsyncSession,
old_token: str
) -> tuple[str, str]:
stored = await get_refresh_token(db, old_token)
if not stored or stored.is_revoked:
# Possible token theft — revoke entire family
await revoke_token_family(db, stored.family_id)
raise HTTPException(status_code=401, detail="Token reuse detected")
await revoke_token(db, old_token)
new_access = create_access_token(stored.user_id, stored.tenant_id)
new_refresh = await create_refresh_token(db, stored.user_id, stored.family_id)
return new_access, new_refresh
Refresh tokens use family-based rotation. If a refresh token is used twice (token theft scenario), the entire family is revoked. This forces re-login and stops the attacker even if they got the token first.
RBAC: Three Roles, One Decision Point
CitizenApp has three roles: admin, operator, viewer.
What I didn’t do: Fine-grained permissions per action. CASL, custom permission tables, or middleware that checks user.permissions.includes("citizen:delete").
Why not: Premature flexibility. At three roles with clear semantics, a permission matrix adds indirection without value. The role hierarchy is:
viewer→ read-only access to citizen dataoperator→ read + write citizen data, run importsadmin→ everything + user management + billing + tenant settings
The FastAPI implementation:
def require_role(*roles: str):
def dependency(current_user: UserModel = Depends(get_current_user)):
if current_user.role not in roles:
raise HTTPException(status_code=403, detail="Insufficient permissions")
return current_user
return dependency
# Usage
@router.delete("/citizens/{id}")
async def delete_citizen(
id: int,
user: UserModel = Depends(require_role("admin", "operator")),
repo: CitizenRepository = Depends(get_citizen_repo),
):
...
@router.get("/users")
async def list_users(
user: UserModel = Depends(require_role("admin")),
...
):
...
Clean, readable, testable. When the permission model needs to evolve, it’s one function.
The audit trail connects here: Every destructive action logs user_id, tenant_id, action, target_id, and before/after states. Compliance requires knowing who did what to which record at what time. The role doesn’t change the requirement — it just determines whether the action was permitted.
Encryption at Rest: Fernet Over Column-Level Encryption
PII like the TC identity number (Turkish national ID) must be encrypted at rest. Three options:
- Application-level encryption (Fernet/AES) — encrypt in Python before INSERT
- PostgreSQL column encryption (pgcrypto) — encrypt at the database level
- Transparent Data Encryption — encrypt at the filesystem/disk level
I chose application-level Fernet encryption:
from cryptography.fernet import Fernet
class EncryptionService:
def __init__(self, key: bytes):
self.fernet = Fernet(key)
def encrypt(self, value: str) -> str:
return self.fernet.encrypt(value.encode()).decode()
def decrypt(self, value: str) -> str:
return self.fernet.decrypt(value.encode()).decode()
# SQLAlchemy event listener — transparent to the rest of the code
@event.listens_for(VatandasModel.tc_no, "set")
def encrypt_tc_no(target, value, oldvalue, initiator):
if value and not value.startswith("gAAAAA"): # not already encrypted
target.tc_no = encryption_service.encrypt(value)
Why not pgcrypto: pgcrypto requires the encryption key to be available to PostgreSQL. That means it’s accessible to anyone with database access. Application-level encryption keeps the key in the application environment (Render secrets), separate from the database.
Why not TDE: TDE protects against physical disk theft. It doesn’t protect against a compromised database user or a SQL injection that returns raw table data. Application-level encryption does.
Key rotation: Fernet supports key rotation via MultiFernet. Old ciphertext decrypts with old key; new data encrypts with new key. Zero-downtime rotation:
# During rotation window
fernet = MultiFernet([new_key_fernet, old_key_fernet])
Database: Async SQLAlchemy Patterns
CitizenApp uses FastAPI with asyncpg and async SQLAlchemy. This is the right choice but it has sharp edges.
Session scoping:
async def get_db() -> AsyncGenerator[AsyncSession, None]:
async with AsyncSessionLocal() as session:
try:
yield session
await session.commit()
except Exception:
await session.rollback()
raise
Each request gets its own session. Auto-commit on success, auto-rollback on exception. No manual transaction management in endpoint code.
The N+1 trap:
# BAD: N+1 queries
citizens = await repo.get_all()
for c in citizens:
c.audit_count = await audit_repo.count_for(c.id) # N extra queries
# GOOD: single query with subquery
stmt = (
select(VatandasModel, func.count(AuditModel.id).label("audit_count"))
.outerjoin(AuditModel, AuditModel.citizen_id == VatandasModel.id)
.where(VatandasModel.tenant_id == tenant_id)
.group_by(VatandasModel.id)
)
Async SQLAlchemy doesn’t protect you from N+1. You have to think about it explicitly. I caught several in code review — a query in a loop is always a red flag.
The lazy loading trap:
Async SQLAlchemy doesn’t support lazy loading. Any relationship access outside a transaction raises MissingGreenlet. You must declare what you need upfront:
stmt = (
select(UserModel)
.options(selectinload(UserModel.tenant)) # explicit eager load
.where(UserModel.id == user_id)
)
Migration Strategy: Alembic in Production
Every schema change in CitizenApp is an Alembic migration. No exceptions. No CREATE TABLE in the application startup code.
The pattern that works:
# Never autogenerate directly to production
alembic revision --autogenerate -m "add_ai_credits_to_users"
# Review the generated file
# Add explicit data migrations if needed
alembic upgrade head
Zero-downtime migrations — the rules:
- Never rename a column directly. Add the new column, backfill, update application code, then drop the old column in a separate migration run.
- New NOT NULL columns need a DEFAULT or a two-step migration. Adding
NOT NULLwithout a default locks the table. - Index creation should be CONCURRENT.
CREATE INDEX CONCURRENTLYdoesn’t lock reads/writes. StandardCREATE INDEXdoes.
# In Alembic migration — concurrent index
op.execute("CREATE INDEX CONCURRENTLY IF NOT EXISTS ix_vatandas_tenant_created ON vatandas(tenant_id, created_at DESC)")
Alembic doesn’t generate CONCURRENTLY by default. You have to write it manually.
Frontend Architecture: React 19 + TanStack Query
State management decision: No Redux, no Zustand. TanStack Query for server state, useState for local UI state.
The vast majority of React state is server data — citizen lists, user lists, audit logs. TanStack Query handles fetching, caching, background refresh, and optimistic updates. There’s no global client-side state that needs a store.
React 19 useOptimistic for instant UX:
const [optimisticCitizens, addOptimistic] = useOptimistic(citizens);
async function handleDelete(id: number) {
addOptimistic(prev => prev.filter(c => c.id !== id)); // instant
try {
await deleteCitizen(id);
await queryClient.invalidateQueries(["citizens"]);
} catch {
// useOptimistic rolls back automatically on error
toast.error("Delete failed");
}
}
The UI updates immediately. If the API call fails, the optimistic update rolls back. Users get instant feedback without fake loading states.
Tenant context:
The tenant is embedded in the JWT. The frontend decodes it (without verification — signature verification is the server’s job) to personalise the UI:
function useTenant() {
const token = localStorage.getItem("access_token");
if (!token) return null;
const payload = JSON.parse(atob(token.split(".")[1]));
return { id: payload.tenant_id, plan: payload.plan };
}
This is fine. The frontend showing wrong data is a UX problem. The backend returning wrong data would be a security problem. The backend always validates the token independently.
What I’d Change
1. PostgreSQL RLS from day one.
Application-level tenant_id filtering is correct but not sufficient as a guarantee. RLS at the database layer would make it impossible for a bug in the repository layer to expose cross-tenant data. Retrofitting RLS on an existing schema is painful — it should be designed in from the start.
2. Separate read and write models (CQRS-lite).
The list endpoints in CitizenApp do a lot of work: filter, sort, paginate, aggregate audit counts. This would be cleaner with dedicated read models or database views optimised for list queries, separate from the write models.
3. Stricter API versioning.
CitizenApp is currently on v1 with no formal versioning. For a multi-tenant SaaS where different tenants may be on different frontend versions, /api/v1/ from the start with a migration plan for v2 would have been smarter.
4. Event sourcing for audit.
The audit log records events but not the full before/after state for every field. Reconstructing “what did this record look like at this point in time” requires replaying events. A proper event-sourced audit table with full snapshots would make compliance reporting significantly easier.
The Bottom Line
Multi-tenant SaaS architecture is about defaults that are safe by construction. Tenant isolation in the repository layer, role enforcement in dependencies, encryption at the ORM level — the goal is to make the dangerous thing hard to do accidentally.
Every architectural decision in CitizenApp was made with the assumption that I would make mistakes later. The architecture’s job is to make those mistakes non-catastrophic.
Building a multi-tenant SaaS and want a second opinion on the architecture? I’m available for consulting.