d&d-t System Design Deep Dive

🎯 Design Multi-tenant System

1️⃣ Core Framework

When discussing multi-tenant system design, I frame it as:

Tenant model and isolation level
Authentication and authorization
Data partitioning strategy
Tenant-aware application design
Configuration and feature isolation
Resource quotas and noisy-neighbor control
Security, compliance, and auditing
Trade-offs: isolation vs cost vs scalability

2️⃣ Core Requirements

Functional Requirements

Support multiple tenants/customers
Each tenant has users, roles, and permissions
Each tenant’s data must be isolated
Support tenant-specific configuration
Support tenant-specific feature flags
Support tenant-level billing and usage tracking
Support tenant onboarding/offboarding
Support admin operations per tenant
Support audit logs per tenant

Non-functional Requirements

Strong data isolation
High availability
Scalable tenant growth
Secure access control
Good operational visibility
Cost-efficient shared infrastructure
Prevent noisy-neighbor impact
Support compliance and audit requirements

👉 Interview Answer

A multi-tenant system serves multiple customers on shared infrastructure.

The most important design goal is tenant isolation.

Each tenant’s data, permissions, configuration, usage, and operational visibility must be separated, even if the infrastructure is shared.

3️⃣ Core Concepts

Tenant

A tenant is a customer or organization.

Example:

tenant_id = company_123

Tenant User

A user belongs to one or more tenants.

user_id = u456
tenant_id = company_123
role = admin

Tenant Isolation

Isolation can apply to:

Data
Authentication
Authorization
Configuration
Compute resources
Storage resources
Logs and metrics
Billing

👉 Interview Answer

I would treat tenant_id as a first-class concept.

Every request, every data record, every authorization check, and every log entry should be tenant-aware.

This reduces the risk of cross-tenant data leakage.

4️⃣ Main APIs

Create Tenant

POST /api/tenants

Request:

{
  "name": "Acme Corp",
  "plan": "enterprise",
  "region": "us-east-1"
}

Add User to Tenant

POST /api/tenants/{tenantId}/users

Request:

{
  "userId": "u123",
  "role": "admin"
}

Get Tenant Configuration

GET /api/tenants/{tenantId}/config

Update Tenant Settings

PATCH /api/tenants/{tenantId}/settings

Query Tenant Usage

GET /api/tenants/{tenantId}/usage

👉 Interview Answer

I would expose APIs for tenant creation, user membership, tenant configuration, tenant settings, and usage tracking.

Every API must validate that the caller has permission to access the requested tenant.

5️⃣ Data Isolation Models

Option 1: Shared Database, Shared Tables

All tenants share the same tables.

-- tenant_id leads the primary key, so every key lookup is tenant-scoped
CREATE TABLE orders (
  tenant_id VARCHAR,
  order_id VARCHAR,
  user_id VARCHAR,
  amount DECIMAL,
  PRIMARY KEY (tenant_id, order_id)
);

Pros

Lowest cost
Easy to operate
Efficient for many small tenants

Cons

Higher risk of data leakage
Harder tenant-level backup/restore
Noisy-neighbor risk

Option 2: Shared Database, Separate Schema

Each tenant has its own schema.

tenant_a.orders
tenant_b.orders

Pros

Better isolation
Easier tenant-level migration
Some customization possible

Cons

More operational complexity
Schema management harder at scale

Option 3: Separate Database per Tenant

Each tenant has its own database.

tenant_a_db
tenant_b_db

Pros

Strong isolation
Easier backup/restore
Better compliance story
Better large-tenant customization

Cons

Higher cost
More operational complexity
Harder to manage many small tenants

Comparison

| Model | Isolation | Cost | Operational Complexity | Best For |
|---|---|---|---|---|
| Shared tables | Low/Medium | Low | Low | Many small tenants |
| Separate schema | Medium | Medium | Medium | Mid-size tenants |
| Separate DB | High | High | High | Enterprise tenants |

👉 Interview Answer

There are three common data isolation models: shared tables with tenant_id, separate schema per tenant, and separate database per tenant.

For many small tenants, shared tables are cost-efficient.

For enterprise tenants with strict compliance requirements, separate databases provide stronger isolation.

6️⃣ Recommended Hybrid Model

Practical Design

Use a hybrid model:

Small tenants → shared DB + tenant_id
Large enterprise tenants → dedicated DB or cluster
Regulated tenants → dedicated environment

Why Hybrid?

Because one model does not fit all tenants.

Small tenants need cost efficiency.

Enterprise tenants may need:

Strong isolation
Custom limits
Dedicated capacity
Separate backup/restore
Compliance controls

👉 Interview Answer

In practice, I would use a hybrid tenancy model.

Most tenants can share infrastructure using tenant_id isolation.

Large or regulated tenants can be moved to dedicated databases or dedicated clusters.

This balances cost efficiency and isolation.
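The hybrid model implies some lookup from tenant to physical placement. A minimal sketch of that lookup follows; the `TenantPlacement` type, the `PLACEMENTS` registry, and the connection strings are all hypothetical stand-ins for whatever tenant registry a real system would use.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical connection string for the shared database.
SHARED_DSN = "postgresql://shared-db.example.internal/app"


@dataclass(frozen=True)
class TenantPlacement:
    tier: str                            # "shared" or "dedicated"
    dedicated_dsn: Optional[str] = None  # set only for dedicated tenants


# Hypothetical registry; in practice this would live in the control plane.
PLACEMENTS: Dict[str, TenantPlacement] = {
    "small_co": TenantPlacement(tier="shared"),
    "big_bank": TenantPlacement(
        tier="dedicated",
        dedicated_dsn="postgresql://big-bank-db.example.internal/app",
    ),
}


def dsn_for_tenant(tenant_id: str) -> str:
    """Return the database a tenant's queries should be routed to."""
    placement = PLACEMENTS.get(tenant_id)
    if placement is None:
        # Unknown tenant: fail closed instead of defaulting to the shared DB.
        raise LookupError(f"no placement registered for tenant {tenant_id!r}")
    if placement.tier == "dedicated":
        if not placement.dedicated_dsn:
            raise LookupError(f"dedicated tenant {tenant_id!r} has no database")
        return placement.dedicated_dsn
    return SHARED_DSN
```

Keeping placement in a registry rather than in code means a tenant can be moved from shared to dedicated infrastructure by changing one record.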

7️⃣ Tenant-aware Request Flow

Request Flow

Request received
→ Authenticate user
→ Resolve tenant context
→ Authorize user for tenant
→ Apply tenant configuration
→ Query data with tenant filter
→ Record tenant-scoped logs and metrics

Tenant Context

Every request should carry:

tenant_id
user_id
role
plan
region
feature_flags
quota_limits

Important Rule

Never trust a tenant_id taken from the request body alone.

Resolve tenant from:

Auth token
User membership
Subdomain
Organization selector
API key

👉 Interview Answer

Every request should establish tenant context early.

After authentication, the system resolves which tenant the user is acting under, verifies membership, loads tenant configuration, and enforces tenant-scoped authorization.

Tenant filters should be applied automatically, not manually in every query.
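The resolution step can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: `MEMBERSHIPS` and `TENANT_PLANS` are hypothetical in-memory stand-ins for a membership store and a tenant registry, and the requested tenant is treated as a hint that is only accepted once membership confirms it.

```python
from dataclasses import dataclass
from typing import Optional


class TenantAccessError(Exception):
    """Raised when tenant context cannot be established (map to HTTP 403)."""


@dataclass(frozen=True)
class TenantContext:
    tenant_id: str
    user_id: str
    role: str
    plan: str


# Hypothetical stand-ins for a membership store and a tenant registry.
MEMBERSHIPS = {("u456", "company_123"): "admin"}
TENANT_PLANS = {"company_123": "enterprise"}


def resolve_tenant_context(user_id: str,
                           requested_tenant_id: Optional[str]) -> TenantContext:
    """Build tenant context for an already-authenticated user.

    requested_tenant_id is only a hint (subdomain, org selector, token claim);
    it is trusted only after the membership store confirms it.
    """
    if not requested_tenant_id:
        raise TenantAccessError("missing tenant context")  # fail closed
    role = MEMBERSHIPS.get((user_id, requested_tenant_id))
    if role is None:
        raise TenantAccessError("user is not a member of this tenant")
    plan = TENANT_PLANS.get(requested_tenant_id)
    if plan is None:
        raise TenantAccessError("unknown tenant")
    return TenantContext(requested_tenant_id, user_id, role, plan)
```

Downstream code then receives a `TenantContext` object instead of a raw tenant_id string, which makes it harder to skip the membership check by accident.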

8️⃣ Authentication and Authorization

Authentication

Verifies identity.

Examples:

Username/password
SSO
SAML
OAuth 2.0 / OpenID Connect
API key
Service account

Authorization

Verifies what an authenticated identity may access and do within a tenant.

Example:

Can user u123 access tenant company_123?
Can user u123 perform admin action?

RBAC

Role-based access control:

owner
admin
member
viewer
billing_admin

ABAC

Attribute-based access control:

tenant.plan = enterprise
user.department = finance
resource.region = US

👉 Interview Answer

Authentication tells us who the user is.

Authorization tells us what tenant and resources they can access.

I would use RBAC for common tenant roles and ABAC for more advanced enterprise policies.
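One way to combine the two models is to run the RBAC check first and then apply an attribute policy only to actions that have one. In the sketch below, the permission sets per role and the billing policy are illustrative assumptions; only the role names and attribute names come from the lists above.

```python
from typing import Callable, Dict, Set

# Assumed role-to-permission mapping, for illustration only.
ROLE_PERMISSIONS: Dict[str, Set[str]] = {
    "owner": {"read", "write", "manage_users", "manage_billing"},
    "admin": {"read", "write", "manage_users"},
    "member": {"read", "write"},
    "viewer": {"read"},
    "billing_admin": {"read", "manage_billing"},
}


def _billing_policy(attrs: Dict[str, str]) -> bool:
    # Hypothetical ABAC rule: billing changes need an enterprise plan
    # and a user in the finance department.
    return (attrs.get("tenant.plan") == "enterprise"
            and attrs.get("user.department") == "finance")


ABAC_POLICIES: Dict[str, Callable[[Dict[str, str]], bool]] = {
    "manage_billing": _billing_policy,
}


def is_allowed(role: str, action: str, attrs: Dict[str, str]) -> bool:
    """RBAC decides whether the role may act; ABAC can further restrict it."""
    if action not in ROLE_PERMISSIONS.get(role, set()):
        return False
    policy = ABAC_POLICIES.get(action)
    return policy is None or policy(attrs)
```

Unknown roles get an empty permission set, so the check denies by default.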

9️⃣ Data Access Layer

Problem

Developers may forget tenant filters.

Bad query:

SELECT * FROM orders WHERE order_id = 'o123';

Correct query:

SELECT * FROM orders
WHERE tenant_id = 't123'
AND order_id = 'o123';

Solution

Use a tenant-aware data access layer.

Options:

Automatically inject tenant_id
Use row-level security
Use scoped repositories
Use tenant-specific database connection
Add query linting / tests

👉 Interview Answer

Cross-tenant data leakage is one of the biggest risks.

I would enforce tenant filtering in the data access layer, not rely on every developer to remember tenant_id.

Row-level security or tenant-scoped repositories can help reduce mistakes.
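A tenant-scoped repository might look like the sketch below. It uses Python's built-in `sqlite3` module purely so the example is runnable; the `TenantScopedOrders` class and `make_connection` helper are hypothetical names, and the table is a trimmed version of the `orders` table shown earlier.

```python
import sqlite3
from typing import Optional, Tuple


def make_connection() -> sqlite3.Connection:
    """Create an in-memory database with a shared orders table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        " tenant_id TEXT NOT NULL,"
        " order_id TEXT NOT NULL,"
        " amount REAL NOT NULL,"
        " PRIMARY KEY (tenant_id, order_id))"
    )
    return conn


class TenantScopedOrders:
    """Repository bound to one tenant; callers never pass tenant_id per query."""

    def __init__(self, conn: sqlite3.Connection, tenant_id: str) -> None:
        if not tenant_id:
            raise ValueError("tenant_id is required")  # fail closed
        self._conn = conn
        self._tenant_id = tenant_id

    def add(self, order_id: str, amount: float) -> None:
        self._conn.execute(
            "INSERT INTO orders (tenant_id, order_id, amount) VALUES (?, ?, ?)",
            (self._tenant_id, order_id, amount),
        )

    def get(self, order_id: str) -> Optional[Tuple[str, float]]:
        # The tenant filter is injected here, not by the caller.
        return self._conn.execute(
            "SELECT order_id, amount FROM orders"
            " WHERE tenant_id = ? AND order_id = ?",
            (self._tenant_id, order_id),
        ).fetchone()
```

Because the repository is constructed from tenant context once per request, an individual query cannot omit the filter. Database-level row-level security can serve as a second layer behind it.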

🔟 Configuration and Feature Isolation

Tenant Configuration

Example:

{
  "tenantId": "t123",
  "timezone": "America/New_York",
  "locale": "en-US",
  "retentionDays": 365,
  "maxUsers": 1000,
  "features": {
    "advancedReporting": true,
    "ssoEnabled": true
  }
}

Feature Flags

Feature flags can be scoped by:

Tenant
Plan
Region
User group
Environment

Why Important?

Different tenants may have:

Different plans
Different compliance requirements
Different rollout schedules
Different enabled modules

👉 Interview Answer

Multi-tenant systems need tenant-specific configuration.

Feature flags, limits, retention policies, integrations, and compliance settings may differ by tenant.

These settings should be loaded as part of tenant context.
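Loading settings into tenant context often means layering tenant overrides on top of plan defaults. The sketch below assumes that layering; the `PLAN_DEFAULTS` values and the `effective_config` function are hypothetical, with field names borrowed from the configuration example above.

```python
import copy
from typing import Any, Dict

# Hypothetical plan defaults; real values would come from a config service.
PLAN_DEFAULTS: Dict[str, Dict[str, Any]] = {
    "free": {
        "retentionDays": 30,
        "maxUsers": 5,
        "features": {"advancedReporting": False, "ssoEnabled": False},
    },
    "enterprise": {
        "retentionDays": 365,
        "maxUsers": 1000,
        "features": {"advancedReporting": True, "ssoEnabled": True},
    },
}


def effective_config(tenant_id: str, plan: str,
                     overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Merge tenant-specific overrides over the defaults for the tenant's plan."""
    config = copy.deepcopy(PLAN_DEFAULTS[plan])  # never mutate shared defaults
    for key, value in overrides.items():
        if key == "features":
            config["features"].update(value)  # merge flags one by one
        else:
            config[key] = value
    config["tenantId"] = tenant_id
    return config
```

The deep copy matters: without it, one tenant's override would silently change the defaults seen by every other tenant on the same plan.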

1️⃣1️⃣ Resource Quotas and Noisy Neighbor Control

Problem

One tenant can overload shared infrastructure.

Examples:

Too many API requests
Huge reports
Large file uploads
Expensive queries
High background job volume

Controls

Per-tenant rate limits
Per-tenant quotas
Query timeouts
Background job limits
Storage limits
Usage-based throttling
Dedicated queues for large tenants

Example

tenant_free_plan: 100 requests/min
tenant_enterprise: 10,000 requests/min

👉 Interview Answer

In shared infrastructure, noisy neighbors are a major risk.

I would enforce per-tenant rate limits, quotas, query limits, and background job limits.

Large tenants can be isolated into dedicated queues or clusters.
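Per-tenant rate limiting is commonly implemented as a token bucket keyed by tenant. The following is a single-process sketch under that assumption; a shared deployment would need a shared store instead of a local dict. The `TenantRateLimiter` class is hypothetical, and the per-plan limits reuse the example figures above.

```python
import time
from typing import Callable, Dict, Tuple

# Requests per minute per plan, taken from the example above.
PLAN_LIMITS_PER_MIN: Dict[str, int] = {"free": 100, "enterprise": 10_000}


class TenantRateLimiter:
    """Token bucket per tenant; the bucket size equals the per-minute limit."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # tenant_id -> (tokens remaining, time of last refill)
        self._buckets: Dict[str, Tuple[float, float]] = {}

    def allow(self, tenant_id: str, plan: str) -> bool:
        capacity = PLAN_LIMITS_PER_MIN[plan]
        refill_per_second = capacity / 60.0
        now = self._clock()
        tokens, last = self._buckets.get(tenant_id, (float(capacity), now))
        # Refill based on elapsed time, capped at the bucket size.
        tokens = min(float(capacity), tokens + (now - last) * refill_per_second)
        if tokens < 1.0:
            self._buckets[tenant_id] = (tokens, now)
            return False
        self._buckets[tenant_id] = (tokens - 1.0, now)
        return True
```

Because each tenant has its own bucket, one tenant exhausting its allowance does not affect the requests of any other tenant.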

1️⃣2️⃣ Tenant-aware Background Jobs

Problem

Background jobs can also leak tenant data or overload shared resources.

Examples:

Report generation
Email notifications
Billing jobs
Data export
Cleanup jobs

Design

Every job should include:

{
  "tenantId": "t123",
  "jobType": "generate_report",
  "requestedBy": "u456"
}

Controls

Per-tenant job queues
Job concurrency limits
Tenant-aware retry
Tenant-level cancellation
Job audit logs

👉 Interview Answer

Background jobs must also be tenant-aware.

Every job should include tenant_id, enforce tenant permissions, and respect tenant quotas.

Otherwise, asynchronous processing can become a source of data leakage or noisy-neighbor issues.
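As a rough illustration of per-tenant queues with a concurrency cap, the scheduler below rotates between tenants and refuses jobs that lack a tenantId. The `TenantJobScheduler` class is a hypothetical in-memory sketch, not a replacement for a real job system.

```python
from collections import OrderedDict, deque
from typing import Deque, Dict, Optional


class TenantJobScheduler:
    """Round-robin over per-tenant queues with a per-tenant concurrency cap."""

    def __init__(self, max_running_per_tenant: int) -> None:
        self._limit = max_running_per_tenant
        self._queues: "OrderedDict[str, Deque[dict]]" = OrderedDict()
        self._running: Dict[str, int] = {}

    def submit(self, job: dict) -> None:
        tenant_id = job.get("tenantId")
        if not tenant_id:
            raise ValueError("job rejected: missing tenantId")  # fail closed
        self._queues.setdefault(tenant_id, deque()).append(job)

    def next_job(self) -> Optional[dict]:
        """Return the next runnable job, or None if every tenant is capped or idle."""
        for tenant_id in list(self._queues):
            queue = self._queues[tenant_id]
            if queue and self._running.get(tenant_id, 0) < self._limit:
                self._running[tenant_id] = self._running.get(tenant_id, 0) + 1
                self._queues.move_to_end(tenant_id)  # rotate for fairness
                return queue.popleft()
        return None

    def finish(self, job: dict) -> None:
        """Release the concurrency slot held by a completed job."""
        self._running[job["tenantId"]] -= 1
```

With a cap of one, a tenant that submits many report jobs still leaves room for another tenant's single job to run next.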

1️⃣3️⃣ Billing and Usage Tracking

Usage Metrics

Track per tenant:

API calls
Active users
Storage usage
Compute usage
Data exports
Emails sent
Events processed
Feature usage

Billing Flow

Usage events emitted
→ Usage aggregation
→ Tenant invoice generated
→ Payment collected
→ Billing records stored

Important Rule

Billing data must be auditable.

👉 Interview Answer

A multi-tenant system should track usage by tenant.

Usage metrics support billing, quota enforcement, capacity planning, and customer reporting.

Billing records should be auditable and not depend only on volatile counters.
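One way to keep billing auditable is to aggregate from durable usage events rather than from in-memory counters, and to deduplicate by event ID so that redelivered events are not billed twice. The event fields below (`eventId`, `tenantId`, `metric`, `quantity`) are assumed for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def aggregate_usage(events: Iterable[dict]) -> Dict[str, Dict[str, int]]:
    """Sum usage per tenant and metric, counting each eventId only once."""
    seen: Set[str] = set()
    totals: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for event in events:
        if event["eventId"] in seen:
            continue  # duplicate delivery; already counted
        seen.add(event["eventId"])
        totals[event["tenantId"]][event["metric"]] += event["quantity"]
    # Convert to plain dicts so the result is easy to serialize and compare.
    return {tenant: dict(metrics) for tenant, metrics in totals.items()}
```

Since the totals are derived from the stored events, an invoice can be recomputed and checked against the raw event log.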

1️⃣4️⃣ Tenant Onboarding and Offboarding

Onboarding Flow

Create tenant
→ Create tenant config
→ Set plan and quotas
→ Create admin user
→ Provision resources if needed
→ Enable feature flags
→ Send welcome notification

Offboarding Flow

Disable tenant access
→ Stop background jobs
→ Export data if needed
→ Apply retention policy
→ Delete or archive tenant data
→ Remove integrations

Enterprise Onboarding

May include:

SSO setup
Dedicated database
Custom domain
Compliance review
Data residency configuration

👉 Interview Answer

Tenant lifecycle should be explicit.

Onboarding creates configuration, users, quotas, and resources.

Offboarding must disable access, stop jobs, handle exports, enforce retention, and delete or archive tenant data safely.
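An explicit lifecycle can be modeled as a small state machine that rejects illegal jumps, such as deleting a tenant that is still active. The state names and allowed transitions below are assumptions chosen to mirror the onboarding and offboarding flows above.

```python
from typing import Dict, List, Set

# Assumed lifecycle states and allowed transitions.
TRANSITIONS: Dict[str, Set[str]] = {
    "provisioning": {"active"},
    "active": {"suspended"},
    "suspended": {"active", "pending_deletion"},
    "pending_deletion": {"deleted"},
    "deleted": set(),
}


class Tenant:
    def __init__(self, tenant_id: str) -> None:
        self.tenant_id = tenant_id
        self.state = "provisioning"
        self.audit_log: List[str] = []

    def transition(self, target: str, actor: str) -> None:
        """Move to a new lifecycle state, recording who triggered the change."""
        if target not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {target}")
        self.audit_log.append(f"{actor}: {self.state} -> {target}")
        self.state = target
```

Requiring a pass through `suspended` before `pending_deletion` matches the offboarding order: access is disabled first, and data is removed only afterwards.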

1️⃣5️⃣ Security and Compliance

Main Risks

Cross-tenant data leakage
Privilege escalation
Misconfigured tenant access
Shared cache leakage
Logs containing sensitive data
Incorrect data deletion
Over-permissive admin tools

Controls

Tenant-aware authorization
Row-level security
Strong audit logs
Encryption at rest and in transit
Tenant-scoped admin tools
Cache keys include tenant_id
Data retention and deletion workflows
Security tests for cross-tenant access

👉 Interview Answer

Security is the most important part of multi-tenancy.

Every layer must be tenant-aware: API, authorization, data access, cache, logs, metrics, admin tools, and background jobs.

Cache keys and logs must include tenant context to avoid cross-tenant leakage.

1️⃣6️⃣ Caching Strategy

Risk

Shared cache can leak data.

Bad cache key:

user:123:orders

Better cache key:

tenant:t123:user:123:orders

Cache Rules

Include tenant_id in cache keys
Separate cache namespaces for dedicated tenants
Apply tenant-specific TTLs
Avoid caching sensitive data if unnecessary
Invalidate by tenant

👉 Interview Answer

Caching must be tenant-aware.

Every cache key should include tenant_id.

Otherwise, two tenants with identical user IDs or resource IDs could accidentally read each other’s cached data.
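A small helper can make the tenant prefix impossible to forget and give per-tenant invalidation for free. The sketch below is an in-memory illustration with hypothetical names (`tenant_cache_key`, `TenantCache`), and it assumes tenant IDs never contain the `:` separator.

```python
from typing import Any, Dict, Optional, Tuple


def tenant_cache_key(tenant_id: str, *parts: object) -> str:
    """Build a key such as tenant:t123:user:123:orders."""
    if not tenant_id:
        raise ValueError("tenant_id is required for cache keys")
    return ":".join(["tenant", tenant_id, *(str(p) for p in parts)])


class TenantCache:
    """Dict-backed cache whose public API always takes a tenant_id."""

    def __init__(self) -> None:
        self._store: Dict[str, Any] = {}

    def set(self, tenant_id: str, parts: Tuple[object, ...], value: Any) -> None:
        self._store[tenant_cache_key(tenant_id, *parts)] = value

    def get(self, tenant_id: str, parts: Tuple[object, ...]) -> Optional[Any]:
        return self._store.get(tenant_cache_key(tenant_id, *parts))

    def invalidate_tenant(self, tenant_id: str) -> int:
        """Drop every entry for one tenant; returns how many were removed."""
        prefix = tenant_cache_key(tenant_id) + ":"
        stale = [key for key in self._store if key.startswith(prefix)]
        for key in stale:
            del self._store[key]
        return len(stale)
```

The trailing separator in the invalidation prefix keeps tenant `t1` from matching keys that belong to tenant `t12`.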

1️⃣7️⃣ Observability

Tenant-level Metrics

Track:

API latency by tenant
Error rate by tenant
Request volume by tenant
Storage usage by tenant
Background job lag by tenant
Rate limit hits by tenant
Feature usage by tenant
Cost by tenant

Why Important?

Tenant-level observability helps with:

Debugging customer issues
Detecting noisy tenants
Billing
Capacity planning
SLA reporting
Security investigations

👉 Interview Answer

Observability should be tenant-aware.

I would tag logs, metrics, traces, and audit events with tenant_id.

This allows us to debug tenant-specific issues, detect noisy neighbors, calculate cost, and support enterprise SLAs.
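Tagging can be done centrally so that individual log calls do not have to pass tenant_id. The sketch below uses the standard library's `contextvars` and `logging` modules; the logger name, the log format, and the `build_logger` helper are illustrative choices.

```python
import contextvars
import logging
from typing import TextIO

# Set once per request, after tenant context has been resolved.
current_tenant: contextvars.ContextVar = contextvars.ContextVar(
    "current_tenant", default="unknown"
)


class TenantFilter(logging.Filter):
    """Attach the current tenant_id to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.tenant_id = current_tenant.get()
        return True  # never drop records, only enrich them


def build_logger(stream: TextIO) -> logging.Logger:
    """Return a logger whose output lines carry tenant=<id>."""
    logger = logging.getLogger("tenant_demo")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s tenant=%(tenant_id)s %(message)s")
    )
    handler.addFilter(TenantFilter())
    logger.handlers.clear()
    logger.addHandler(handler)
    return logger
```

A request handler would call `current_tenant.set(...)` after resolving tenant context and reset it when the request ends; every log line emitted in between is then attributable to a tenant.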

1️⃣8️⃣ Scaling Patterns

Pattern 1: Shared Infrastructure for Small Tenants

Cost-efficient.

Pattern 2: Dedicated Infrastructure for Large Tenants

Better isolation and SLA.

Pattern 3: Shard by Tenant ID

shard = hash(tenant_id) % N

Pattern 4: Tenant-aware Rate Limiting

Protect shared resources.

Pattern 5: Control Plane / Data Plane Separation

Control plane: tenant config, billing, provisioning
Data plane: tenant requests, data processing

👉 Interview Answer

To scale multi-tenancy, I would shard by tenant_id, use shared infrastructure for small tenants, dedicated infrastructure for enterprise tenants, and enforce tenant-level rate limits.

Separating control plane from data plane also helps manage provisioning and runtime traffic cleanly.
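The `hash(tenant_id) % N` formula needs a hash that is stable across processes. Python's built-in `hash()` is salted per process for strings, so the sketch below uses a SHA-256 digest instead. The `PINNED_SHARDS` override map is a hypothetical addition that shows how a large tenant could be placed on a dedicated shard.

```python
import hashlib
from typing import Dict

# Hypothetical overrides: tenants pinned to a specific shard.
PINNED_SHARDS: Dict[str, int] = {"big_bank": 7}


def shard_for_tenant(tenant_id: str, num_shards: int) -> int:
    """Map a tenant to a shard using a hash that is stable across processes."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    pinned = PINNED_SHARDS.get(tenant_id)
    if pinned is not None:
        return pinned
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce modulo N.
    return int.from_bytes(digest[:8], "big") % num_shards
```

Plain modulo placement moves most tenants when N changes, so systems that expect to add shards often keep an explicit tenant-to-shard directory or use consistent hashing instead.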

1️⃣9️⃣ Failure Handling

Common Failures

Tenant config unavailable
Wrong tenant context
Cross-tenant query bug
Noisy tenant overloads shared DB
Background job processes wrong tenant
Cache key missing tenant_id
Tenant migration fails
Dedicated tenant database unavailable

Strategies

Fail closed if tenant context is missing
Tenant-aware tests
Row-level security
Per-tenant circuit breakers
Per-tenant rate limits
Tenant migration rollback
Audit logs for all admin actions
Dedicated tenant failover strategy

👉 Interview Answer

If tenant context is missing, the system should fail closed rather than guess.

Cross-tenant data access is a severe security incident.

I would use tenant-aware tests, row-level security, audit logs, and per-tenant isolation controls to reduce risk.
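A per-tenant circuit breaker can be sketched as below. It is deliberately simplified: after the cooldown the breaker closes fully instead of entering a proper half-open trial state, and the class name, threshold, and cooldown are hypothetical parameters.

```python
import time
from typing import Callable, Dict


class TenantCircuitBreaker:
    """Tracks failures per tenant so one failing tenant does not trip others."""

    def __init__(self, failure_threshold: int, reset_after_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures: Dict[str, int] = {}
        self._opened_at: Dict[str, float] = {}

    def allow(self, tenant_id: str) -> bool:
        opened = self._opened_at.get(tenant_id)
        if opened is None:
            return True
        if self._clock() - opened >= self._reset_after_s:
            # Cooldown elapsed: close the breaker and start counting again.
            del self._opened_at[tenant_id]
            self._failures[tenant_id] = 0
            return True
        return False

    def record_failure(self, tenant_id: str) -> None:
        count = self._failures.get(tenant_id, 0) + 1
        self._failures[tenant_id] = count
        if count >= self._threshold:
            self._opened_at[tenant_id] = self._clock()

    def record_success(self, tenant_id: str) -> None:
        self._failures[tenant_id] = 0
```

Keying the breaker by tenant means that an unavailable dedicated database blocks only that tenant's requests, while tenants on healthy infrastructure continue to be served.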

2️⃣0️⃣ Consistency Model

Stronger Consistency Needed For

Tenant membership
Authorization
Tenant configuration
Billing records
Data deletion
Admin actions
Security policies

Eventual Consistency Acceptable For

Usage dashboards
Analytics
Feature rollout propagation
Search indexing
Logs and metrics aggregation
Non-critical notifications

👉 Interview Answer

Tenant membership, authorization, billing, data deletion, and security policies need strong consistency.

Usage dashboards, analytics, search indexing, and feature rollout propagation can usually be eventually consistent.

2️⃣1️⃣ End-to-End Flow

Tenant Request Flow

Request arrives
→ Authenticate user
→ Resolve tenant context
→ Authorize user against tenant
→ Load tenant config and feature flags
→ Enforce quotas/rate limits
→ Query data with tenant isolation
→ Return response
→ Emit tenant-scoped logs and metrics

Tenant Onboarding Flow

Create tenant
→ Provision config and quotas
→ Create admin membership
→ Configure features
→ Provision dedicated resources if needed
→ Start usage tracking

Tenant Offboarding Flow

Disable tenant access
→ Stop jobs
→ Export/archive data
→ Apply retention policy
→ Delete tenant data
→ Record audit trail

Key Insight

A multi-tenant system is not just a matter of adding tenant_id to tables; it is an end-to-end isolation model across data, access, config, compute, cache, logs, and billing.

🧠 Staff-Level Answer (Final)

👉 Interview Answer (Full Version)

When designing a multi-tenant system, I think of it as a shared platform that serves many customers while preserving strong tenant isolation.

The most important principle is that tenant_id must be a first-class concept.

Every request should resolve tenant context after authentication, verify that the user belongs to that tenant, load tenant-specific configuration and feature flags, enforce quotas, and access data only within that tenant boundary.

For data isolation, there are three common models: shared tables with tenant_id, separate schema per tenant, and separate database per tenant.

Shared tables are cost-efficient for many small tenants, while separate databases provide stronger isolation for enterprise or regulated tenants.

In practice, I would use a hybrid model: shared infrastructure for most tenants, and dedicated databases or clusters for large or regulated tenants.

Authorization should be tenant-aware. Authentication tells us who the user is; authorization tells us which tenant and resources they can access.

I would use RBAC for common roles and ABAC for more advanced enterprise policies.

The data access layer should enforce tenant filters automatically, using tenant-scoped repositories, row-level security, or tenant-specific database connections.

I would not rely on developers manually adding tenant_id filters in every query.

Configuration, feature flags, rate limits, quotas, background jobs, caches, logs, metrics, and admin tools must all include tenant context.

To prevent noisy-neighbor problems, I would enforce per-tenant rate limits, storage limits, background job limits, query timeouts, and dedicated queues for large tenants.

Security is critical. Cache keys must include tenant_id, logs must be tenant-scoped, admin actions must be audited, and any request without a valid tenant context should fail closed.

The main trade-offs are between isolation and compliance on one side and cost, operational complexity, and scalability on the other.

Ultimately, the goal is to let many tenants share the same platform efficiently while ensuring that each tenant experiences the system as secure, isolated, reliable, and configurable.

⭐ Final Insight

The core of a multi-tenant system is not simply adding tenant_id to tables; it is making the full chain of data, auth, config, cache, compute, logs, and billing tenant-isolated.