How I Built 9 Claude AI Features into a Production SaaS

Nine AI features. One production SaaS. Zero hallucinated data in the user’s face.

That last part is harder than it sounds.

This is a full technical breakdown of how I designed, built, and shipped nine Claude Haiku features into CitizenApp — a GDPR-compliant citizen management platform. I’ll cover the architecture, the security decisions, the failure modes I had to defend against, and what I’d do differently.

Why Haiku, Not Sonnet or Opus

Every AI feature in CitizenApp uses claude-haiku-4-5. The reasoning was deliberate:

Cost. This is a multi-tenant SaaS. Each AI call deducts from the tenant’s credit balance. Haiku is significantly cheaper per token than Sonnet — which means I can offer meaningful AI functionality at the Free tier without destroying margins.

Speed. Most features are interactive — the user is waiting. Haiku’s latency is typically under 2 seconds for the prompt sizes I’m sending. Sonnet would double or triple that.

Sufficient capability. For structured tasks — SQL generation, data summarisation, duplicate record comparison — Haiku is good enough. I don’t need Opus-level reasoning. I need fast, cheap, reliable JSON out.

Consistent timeout budget. Every AI call has a 20-second timeout. Haiku almost always completes in under 5. That matters when you’re running on Render’s free tier.

The Credit System

Before the features themselves: every AI call in CitizenApp costs credits. This is enforced at the API layer, not the frontend.

async def deduct_credits(db: AsyncSession, tenant_id: int, amount: int = 1) -> None:
    tenant = await db.get(TenantModel, tenant_id)
    if tenant.ai_credits < amount:
        raise HTTPException(status_code=402, detail="Insufficient AI credits")
    tenant.ai_credits -= amount
    await db.commit()

Simple. No credits, no AI call. The 402 Payment Required response tells the frontend to show the upgrade prompt. Credits are deducted before the Claude API call — if the call fails, I refund them. This prevents users from draining credits on broken requests.

Stats queries cost 1 credit. Live DB queries cost 2. The pricing reflects actual Claude token consumption.

Feature 1: CSV Column Mapping

The problem: Users import citizen data from spreadsheets. The column headers are inconsistent — TC No, tc_no, Kimlik No, ID Number — all mean the same thing. Asking users to manually map every import is a UX failure.

The implementation:

async def ai_map_columns(headers: list[str]) -> dict:
    prompt = f"""You are mapping CSV column headers to a fixed schema.
    
Schema fields: tc_no, ad_soyad, dogum_tarihi, cinsiyet
Headers: {headers}

Return JSON only:
{{"tc_no": "header or null", "ad_soyad": "header or null", 
  "dogum_tarihi": "header or null", "cinsiyet": "header or null"}}"""

    response = await anthropic_client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=200,
        messages=[{"role": "user", "content": prompt}]
    )
    return json.loads(response.content[0].text)

What I learned: Keep the schema description short and the output format rigid. Early versions asked Claude to “suggest the best mapping” — vague enough that it would sometimes return explanatory text instead of JSON. Specifying Return JSON only: with the exact structure eliminated this.

Failure mode handled: If json.loads throws, fall back to fuzzy string matching. AI is the first attempt, not the only one.

Feature 2: Natural Language Search

The problem: Users need to filter 50,000+ citizen records. SQL is not the interface.

The implementation:

ALLOWED_COLUMNS = {"ad_soyad", "dogum_tarihi", "cinsiyet", "created_at", "il", "ilce"}

async def nl_to_sql_filter(query: str, tenant_id: int) -> str:
    prompt = f"""Convert this natural language query to a PostgreSQL WHERE clause.

Table: vatandas
Available columns: {', '.join(ALLOWABLE_COLUMNS)}
tenant_id is always {tenant_id} — always include it.

Query: "{query}"

Return only the WHERE clause starting with 'WHERE'. No explanation."""

    response = await anthropic_client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=300,
        messages=[{"role": "user", "content": prompt}]
    )
    
    clause = response.content[0].text.strip()
    validate_sql_clause(clause)  # raises on dangerous patterns
    return clause

The critical part — validate_sql_clause:

BLOCKED = ["drop", "delete", "update", "insert", "alter", "exec", 
           "--", ";", "union", "sleep", "pg_"]

def validate_sql_clause(clause: str) -> None:
    lower = clause.lower()
    if not lower.startswith("where"):
        raise ValueError("Invalid clause")
    for pattern in BLOCKED:
        if pattern in lower:
            raise ValueError(f"Blocked pattern: {pattern}")

Never trust LLM output with SQL. The allowed columns list is in the prompt — if Claude generates a column not in that list, SQLAlchemy will raise a column-not-found error before anything executes. Defense in depth.

Feature 3: Anomaly Detection

The problem: Bad data accumulates. Future birthdates, 1890-born citizens, missing gender on 40% of records. Users need to find it without writing reports.

The implementation: I run five deterministic checks in Python — no AI involved in detecting the anomalies:

def detect_anomalies(citizens: list[dict]) -> list[dict]:
    issues = []
    today = date.today()
    
    for c in citizens:
        if c["dogum_tarihi"] and c["dogum_tarihi"] > today:
            issues.append({"id": c["id"], "type": "future_dob", "severity": "high"})
        if c["dogum_tarihi"] and c["dogum_tarihi"].year < 1900:
            issues.append({"id": c["id"], "type": "implausible_dob", "severity": "medium"})
        if not c["cinsiyet"]:
            issues.append({"id": c["id"], "type": "missing_gender", "severity": "low"})
    
    return issues

Claude’s role is to summarise the results:

summary_prompt = f"""Summarise these data quality issues in 2-3 sentences for a non-technical admin:
{json.dumps(issue_counts)}
Be specific about numbers. Suggest the highest-priority fix first."""

This is the right division of labour. Python finds the issues deterministically. Claude explains them in plain language. AI never makes the binary “is this an anomaly” judgment — it just communicates results.

Feature 4: AI Daily Briefing

The simplest feature architecturally. Every morning, the admin dashboard can show a 3-4 sentence Claude-generated briefing:

stats = {
    "total_records": 12847,
    "missing_gender": 342,
    "missing_dob": 89,
    "potential_duplicates": 23,
    "created_last_7_days": 156
}

prompt = f"""You are a data assistant. Write a 3-4 sentence briefing for an admin 
about the current state of their citizen database. Be concise and actionable.

Stats: {json.dumps(stats)}"""

Lesson: Don’t send raw PII to the API. Stats are safe. Individual citizen records are not. I aggregate before sending.

Feature 5: AI Duplicate Merge

This was the most complex feature to get right.

The problem: The duplicate detection identifies potential duplicates by similarity score. But merging them requires judgment — which record has the correct spelling? Which birthdate is more likely accurate?

The implementation:

async def ai_merge_suggest(record_a: dict, record_b: dict) -> dict:
    # Mask TC no before sending to Claude
    safe_a = {k: v for k, v in record_a.items() if k != "tc_no"}
    safe_b = {k: v for k, v in record_b.items() if k != "tc_no"}
    
    prompt = f"""Compare these two citizen records and suggest the best merged version.

Record A (id={record_a['id']}): {json.dumps(safe_a)}
Record B (id={record_b['id']}): {json.dumps(safe_b)}

Return JSON:
{{
  "keep_id": <id of the record to keep as primary>,
  "merged": {{"ad_soyad": "...", "dogum_tarihi": "YYYY-MM-DD", "cinsiyet": "E/K"}},
  "confidence": "high|medium|low",
  "reasoning": "one sentence explanation"
}}"""

TC identity number (TC kimlik no) is always masked before sending to Claude. This is non-negotiable. It’s PII, it’s regulated, and it’s not needed for the merge decision. The keep_id in the response tells the backend which record to promote — Claude never sees or decides on the TC number.

Claude’s output is never applied automatically. It’s a suggestion displayed to the admin, who clicks “Apply” to confirm. The AI is an advisor, not an actor.

Feature 6: Natural Language Reports

The problem: Admins need custom data exports. “Give me all female citizens born between 1960 and 1970 in Ankara” is not a filter the UI can anticipate.

The implementation: A dedicated /reports page accepts a natural language query, sends it to Claude for SQL filter generation (same pattern as Feature 2), shows a preview table, then allows CSV/TSV export — all client-side using Blob and URL.createObjectURL.

The backend generates the filtered dataset. The download happens in the browser. No server-side file generation, no temp file cleanup, no storage costs.

Features 7 & 8: AI Chat — Stats and Live DB Query

The floating chat widget was the most technically interesting feature.

Two modes in one endpoint:

Stats mode — Claude answers questions from aggregate statistics only. “How many records are missing a birthdate?” The answer comes from pre-computed stats, not a live query.
DB query mode — For questions that can’t be answered from stats, Claude generates a SQL SELECT statement that the backend executes directly.

The DB query mode required careful safety engineering:

SAFE_QUERY_RULES = """
Generate a PostgreSQL SELECT query. Hard rules:
- Only query table: vatandas
- Never include column: tc_no  
- Always filter: WHERE tenant_id = {tenant_id}
- Always add: LIMIT 100
- No subqueries, no JOINs to other tables
- No aggregations that could expose cross-tenant data
"""

def validate_generated_query(sql: str, tenant_id: int) -> str:
    sql_lower = sql.lower().strip()
    
    assert sql_lower.startswith("select"), "Must be SELECT"
    assert "tc_no" not in sql_lower, "tc_no is blocked"
    assert f"tenant_id = {tenant_id}" in sql_lower, "tenant_id filter required"
    assert "limit" in sql_lower, "LIMIT required"
    
    # Extract and enforce limit
    limit_match = re.search(r"limit\s+(\d+)", sql_lower)
    if limit_match and int(limit_match.group(1)) > 100:
        sql = re.sub(r"limit\s+\d+", "LIMIT 100", sql, flags=re.IGNORECASE)
    
    return sql

The query executes with a 5-second statement timeout at the database level. If Claude generates something slow, it gets cancelled.

Query result display: Results are shown in a collapsible QueryBlock with the SQL visible above the data table. Users can see exactly what ran. Transparency matters when AI is touching your database.

Feature 9: AI Audit Trail Explainer

The audit log records 30+ event types — citizen_created, bulk_import, login_failed, 2fa_enabled, etc. Non-technical admins can’t interpret raw event streams.

async def explain_audit_events(events: list[dict]) -> str:
    # Strip internal IDs and technical fields
    safe_events = [
        {"type": e["event_type"], "user": e["user_email"], "time": e["created_at"]}
        for e in events[:20]  # Last 20 events only
    ]
    
    prompt = f"""Summarise this audit log in plain language for a non-technical admin.
Note any unusual patterns (multiple failed logins, bulk operations, permission changes).
2-4 sentences maximum.

Events: {json.dumps(safe_events)}"""

Cap at 20 events. Don’t send the full log. Keep the token budget predictable.

What I’d Do Differently

1. Add structured output from the start.
Anthropic’s tool use / structured output API is cleaner than prompt engineering Return JSON only. I retrofitted it on some features. It should have been the default.

2. Cache more aggressively.
The daily briefing doesn’t need to be generated every page load. A 6-hour Redis cache with background refresh would cut AI costs significantly. Currently it regenerates on every dashboard open.

3. Separate AI service layer.
All nine features call Claude via a thin wrapper in app/services/ai.py. This is good for centralised error handling and logging. What I should have added sooner: a proper prompt registry so prompts are versioned and testable independently of the endpoint logic.

4. Test AI outputs, not just AI calls.
My test suite mocks the Anthropic client. That’s necessary but insufficient — it doesn’t catch prompt regressions. A small set of golden-output integration tests against real Claude (run weekly, not on every CI run) would give me confidence that prompt changes don’t silently break feature behaviour.

The Result

Nine features, all in production, all credit-metered, all with safety validation on AI outputs. The platform handles real data for real users.

The engineering pattern that made this possible: AI generates, Python validates, humans confirm. The LLM is never the last line of defence.

Building AI features into your product? Let’s talk about what that looks like.