🎯 Design Multi-tenant System
1️⃣ Core Framework
When discussing multi-tenant system design, I frame it around:
- Tenant model and isolation level
- Authentication and authorization
- Data partitioning strategy
- Tenant-aware application design
- Configuration and feature isolation
- Resource quotas and noisy-neighbor control
- Security, compliance, and auditing
- Trade-offs: isolation vs cost vs scalability
2️⃣ Core Requirements
Functional Requirements
- Support multiple tenants/customers
- Each tenant has users, roles, and permissions
- Each tenant’s data must be isolated
- Support tenant-specific configuration
- Support tenant-specific feature flags
- Support tenant-level billing and usage tracking
- Support tenant onboarding/offboarding
- Support admin operations per tenant
- Support audit logs per tenant
Non-functional Requirements
- Strong data isolation
- High availability
- Scalable tenant growth
- Secure access control
- Good operational visibility
- Cost-efficient shared infrastructure
- Prevent noisy-neighbor impact
- Support compliance and audit requirements
👉 Interview Answer
A multi-tenant system serves multiple customers on shared infrastructure.
The most important design goal is tenant isolation.
Each tenant’s data, permissions, configuration, usage, and operational visibility must be separated, even if the infrastructure is shared.
3️⃣ Core Concepts
Tenant
A tenant is a customer or organization.
Example:
tenant_id = company_123
Tenant User
A user belongs to one or more tenants.
user_id = u456
tenant_id = company_123
role = admin
Tenant Isolation
Isolation can apply to:
- Data
- Authentication
- Authorization
- Configuration
- Compute resources
- Storage resources
- Logs and metrics
- Billing
👉 Interview Answer
I would treat tenant_id as a first-class concept.
Every request, every data record, every authorization check, and every log entry should be tenant-aware.
This reduces the risk of cross-tenant data leakage.
4️⃣ Main APIs
Create Tenant
POST /api/tenants
Request:
{
"name": "Acme Corp",
"plan": "enterprise",
"region": "us-east-1"
}
Add User to Tenant
POST /api/tenants/{tenantId}/users
Request:
{
"userId": "u123",
"role": "admin"
}
Get Tenant Configuration
GET /api/tenants/{tenantId}/config
Update Tenant Settings
PATCH /api/tenants/{tenantId}/settings
Query Tenant Usage
GET /api/tenants/{tenantId}/usage
👉 Interview Answer
I would expose APIs for tenant creation, user membership, tenant configuration, tenant settings, and usage tracking.
Every API must validate that the caller has permission to access the requested tenant.
5️⃣ Data Isolation Models
Option 1: Shared Database, Shared Tables
All tenants share the same tables.
CREATE TABLE orders (
tenant_id VARCHAR NOT NULL,
order_id VARCHAR NOT NULL,
user_id VARCHAR,
amount DECIMAL,
PRIMARY KEY (tenant_id, order_id)
);
Pros
- Lowest cost
- Easy to operate
- Efficient for many small tenants
Cons
- Higher risk of data leakage
- Harder tenant-level backup/restore
- Noisy-neighbor risk
Option 2: Shared Database, Separate Schema
Each tenant has its own schema.
tenant_a.orders
tenant_b.orders
Pros
- Better isolation
- Easier tenant-level migration
- Some customization possible
Cons
- More operational complexity
- Schema management harder at scale
Option 3: Separate Database per Tenant
Each tenant has its own database.
tenant_a_db
tenant_b_db
Pros
- Strong isolation
- Easier backup/restore
- Better compliance story
- Better large-tenant customization
Cons
- Higher cost
- More operational complexity
- Harder to manage many small tenants
Comparison
| Model | Isolation | Cost | Operational Complexity | Best For |
|---|---|---|---|---|
| Shared tables | Low/Medium | Low | Low | Many small tenants |
| Separate schema | Medium | Medium | Medium | Mid-size tenants |
| Separate DB | High | High | High | Enterprise tenants |
👉 Interview Answer
There are three common data isolation models: shared tables with tenant_id, separate schema per tenant, and separate database per tenant.
For many small tenants, shared tables are cost-efficient.
For enterprise tenants with strict compliance requirements, separate databases provide stronger isolation.
6️⃣ Recommended Hybrid Model
Practical Design
Use a hybrid model:
Small tenants → shared DB + tenant_id
Large enterprise tenants → dedicated DB or cluster
Regulated tenants → dedicated environment
Why Hybrid?
Because one model does not fit all tenants.
Small tenants need cost efficiency.
Enterprise tenants may need:
- Strong isolation
- Custom limits
- Dedicated capacity
- Separate backup/restore
- Compliance controls
👉 Interview Answer
In practice, I would use a hybrid tenancy model.
Most tenants can share infrastructure using tenant_id isolation.
Large or regulated tenants can be moved to dedicated databases or dedicated clusters.
This balances cost efficiency and isolation.
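A minimal sketch of how the hybrid model could be expressed as a placement lookup. The tier names (`standard`, `enterprise`, `regulated`), the `Placement` type, and the connection strings are all hypothetical, chosen only to illustrate the routing idea.

```python
from dataclasses import dataclass

# Hypothetical connection targets and tier names, for illustration only.
SHARED_DSN = "postgres://shared-cluster/app"


@dataclass(frozen=True)
class Tenant:
    tenant_id: str
    tier: str  # "standard", "enterprise", or "regulated"


@dataclass(frozen=True)
class Placement:
    dsn: str
    dedicated: bool


def resolve_placement(tenant: Tenant) -> Placement:
    """Map a tenant to the database it should use under the hybrid model."""
    if tenant.tier == "standard":
        # Small tenants share one database and rely on tenant_id filtering.
        return Placement(dsn=SHARED_DSN, dedicated=False)
    if tenant.tier == "enterprise":
        return Placement(dsn=f"postgres://dedicated-{tenant.tenant_id}/app", dedicated=True)
    if tenant.tier == "regulated":
        # A regulated tenant gets its own environment, not just its own database.
        return Placement(dsn=f"postgres://regulated-env-{tenant.tenant_id}/app", dedicated=True)
    # Fail closed on unknown tiers instead of defaulting to shared storage.
    raise ValueError(f"unknown tier: {tenant.tier!r}")
```

Keeping placement behind one function means a tenant can later be moved from shared to dedicated storage by changing its tier, without touching request-handling code.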
7️⃣ Tenant-aware Request Flow
Request Flow
Request received
→ Authenticate user
→ Resolve tenant context
→ Authorize user for tenant
→ Apply tenant configuration
→ Query data with tenant filter
→ Record tenant-scoped logs and metrics
Tenant Context
Every request should carry:
tenant_id
user_id
role
plan
region
feature_flags
quota_limits
Important Rule
Never trust a tenant_id taken from the request body alone.
Resolve the tenant from one of the following, and verify it against the user's membership:
- Auth token
- User membership
- Subdomain
- Organization selector
- API key
👉 Interview Answer
Every request should establish tenant context early.
After authentication, the system resolves which tenant the user is acting under, verifies membership, loads tenant configuration, and enforces tenant-scoped authorization.
Tenant filters should be applied automatically, not manually in every query.
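The resolution step can be sketched roughly as follows. This assumes token claims arrive as a plain dict with a `sub` field and that memberships live in an in-memory map; `MEMBERSHIPS`, `TenantAccessError`, and `resolve_tenant_context` are hypothetical names, and a real system would look memberships up in a database or identity service.

```python
from dataclasses import dataclass
from typing import Optional


class TenantAccessError(Exception):
    """Raised when tenant context cannot be established safely."""


@dataclass(frozen=True)
class TenantContext:
    tenant_id: str
    user_id: str
    role: str


# Hypothetical membership store: (user_id, tenant_id) -> role.
MEMBERSHIPS = {("u456", "company_123"): "admin"}


def resolve_tenant_context(token_claims: dict, tenant_hint: Optional[str]) -> TenantContext:
    """Build tenant context from verified claims plus an untrusted hint.

    tenant_hint may come from a subdomain or organization selector. It is
    only a hint: access is granted solely on the membership lookup.
    """
    user_id = token_claims.get("sub")
    if not user_id:
        raise TenantAccessError("unauthenticated request")
    tenant_id = tenant_hint or token_claims.get("tenant_id")
    if not tenant_id:
        # Fail closed: never guess a tenant.
        raise TenantAccessError("missing tenant context")
    role = MEMBERSHIPS.get((user_id, tenant_id))
    if role is None:
        raise TenantAccessError("user is not a member of this tenant")
    return TenantContext(tenant_id=tenant_id, user_id=user_id, role=role)
```

Running this once at the edge (for example in middleware) and passing the resulting `TenantContext` downstream keeps later layers from re-deriving the tenant from request input.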
8️⃣ Authentication and Authorization
Authentication
Verifies identity.
Examples:
- Username/password
- SSO
- SAML
- OAuth
- API key
- Service account
Authorization
Verifies which tenant and resources an authenticated identity may access.
Example:
Can user u123 access tenant company_123?
Can user u123 perform admin action?
RBAC
Role-based access control:
owner
admin
member
viewer
billing_admin
ABAC
Attribute-based access control:
tenant.plan = enterprise
user.department = finance
resource.region = US
👉 Interview Answer
Authentication tells us who the user is.
Authorization tells us what tenant and resources they can access.
I would use RBAC for common tenant roles and ABAC for more advanced enterprise policies.
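One way to combine the two models is to run RBAC as the baseline check and ABAC as an extra condition. The sketch below is illustrative only: the role names come from the list above, but the permission sets and the finance-data policy are assumptions, not a prescribed scheme.

```python
# Hypothetical role-to-permission mapping; only the role names come from the notes.
ROLE_PERMISSIONS = {
    "owner": {"read", "write", "manage_users", "manage_billing"},
    "admin": {"read", "write", "manage_users"},
    "member": {"read", "write"},
    "viewer": {"read"},
    "billing_admin": {"read", "manage_billing"},
}


def rbac_allows(role: str, action: str) -> bool:
    """Baseline check: does the role grant this action at all?"""
    return action in ROLE_PERMISSIONS.get(role, set())


def abac_allows(tenant: dict, user: dict, resource: dict) -> bool:
    """Hypothetical attribute policy for finance resources.

    Finance data is visible only to finance users of enterprise tenants,
    and only when the resource region matches the user's region.
    """
    if resource.get("category") != "finance":
        return True  # this policy restricts finance resources only
    return (
        tenant.get("plan") == "enterprise"
        and user.get("department") == "finance"
        and resource.get("region") == user.get("region")
    )


def is_allowed(role: str, action: str, tenant: dict, user: dict, resource: dict) -> bool:
    """Both layers must agree before access is granted."""
    return rbac_allows(role, action) and abac_allows(tenant, user, resource)
```

An unknown role maps to an empty permission set, so the check denies by default.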
9️⃣ Data Access Layer
Problem
Developers may forget tenant filters.
Bad query:
SELECT * FROM orders WHERE order_id = 'o123';
Correct query:
SELECT * FROM orders
WHERE tenant_id = 't123'
AND order_id = 'o123';
Solution
Use a tenant-aware data access layer.
Options:
- Automatically inject tenant_id
- Use row-level security
- Use scoped repositories
- Use tenant-specific database connection
- Add query linting / tests
👉 Interview Answer
Cross-tenant data leakage is one of the biggest risks.
I would enforce tenant filtering in the data access layer, not rely on every developer to remember tenant_id.
Row-level security or tenant-scoped repositories can help reduce mistakes.
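A tenant-scoped repository could look something like the sketch below. It uses an in-memory SQLite database so it runs standalone; `TenantScopedOrders` and `open_demo_db` are hypothetical names, and a production system would more likely combine this pattern with database-level row security.

```python
import sqlite3


def open_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with the shared orders table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        " tenant_id TEXT NOT NULL,"
        " order_id TEXT NOT NULL,"
        " amount REAL,"
        " PRIMARY KEY (tenant_id, order_id))"
    )
    return conn


class TenantScopedOrders:
    """Repository bound to one tenant; callers cannot omit the tenant filter."""

    def __init__(self, conn: sqlite3.Connection, tenant_id: str):
        if not tenant_id:
            raise ValueError("tenant_id is required")
        self._conn = conn
        self._tenant_id = tenant_id

    def add(self, order_id: str, amount: float) -> None:
        self._conn.execute(
            "INSERT INTO orders (tenant_id, order_id, amount) VALUES (?, ?, ?)",
            (self._tenant_id, order_id, amount),
        )

    def get(self, order_id: str):
        # The tenant filter is injected here, not supplied by the caller.
        return self._conn.execute(
            "SELECT order_id, amount FROM orders WHERE tenant_id = ? AND order_id = ?",
            (self._tenant_id, order_id),
        ).fetchone()
```

Because the tenant is fixed at construction time, two tenants can hold the same `order_id` and each repository still returns only its own row.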
🔟 Configuration and Feature Isolation
Tenant Configuration
Examples:
{
"tenantId": "t123",
"timezone": "America/New_York",
"locale": "en-US",
"retentionDays": 365,
"maxUsers": 1000,
"features": {
"advancedReporting": true,
"ssoEnabled": true
}
}
Feature Flags
Feature flags can be scoped by:
- Tenant
- Plan
- Region
- User group
- Environment
Why Important?
Different tenants may have:
- Different plans
- Different compliance requirements
- Different rollout schedules
- Different enabled modules
👉 Interview Answer
Multi-tenant systems need tenant-specific configuration.
Feature flags, limits, retention policies, integrations, and compliance settings may differ by tenant.
These settings should be loaded as part of tenant context.
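As a rough sketch, effective configuration can be computed by layering tenant overrides on top of plan defaults. The plan names and default values below are invented for illustration; only the field names mirror the example configuration above.

```python
from copy import deepcopy

# Hypothetical plan defaults; field names follow the sample tenant configuration.
PLAN_DEFAULTS = {
    "free": {
        "maxUsers": 5,
        "retentionDays": 30,
        "features": {"advancedReporting": False, "ssoEnabled": False},
    },
    "enterprise": {
        "maxUsers": 1000,
        "retentionDays": 365,
        "features": {"advancedReporting": True, "ssoEnabled": True},
    },
}


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where override wins, merging nested dicts key by key."""
    result = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = deep_merge(result[key], value)
        else:
            result[key] = deepcopy(value)
    return result


def effective_config(plan: str, tenant_overrides: dict) -> dict:
    """Plan defaults first, then tenant-specific overrides on top."""
    return deep_merge(PLAN_DEFAULTS[plan], tenant_overrides)
```

For example, `effective_config("free", {"features": {"ssoEnabled": True}})` enables SSO for one tenant while leaving every other free-plan default in place.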
1️⃣1️⃣ Resource Quotas and Noisy Neighbor Control
Problem
One tenant can overload shared infrastructure.
Examples:
- Too many API requests
- Huge reports
- Large file uploads
- Expensive queries
- High background job volume
Controls
- Per-tenant rate limits
- Per-tenant quotas
- Query timeouts
- Background job limits
- Storage limits
- Usage-based throttling
- Dedicated queues for large tenants
Example
tenant_free_plan: 100 requests/min
tenant_enterprise: 10,000 requests/min
👉 Interview Answer
In shared infrastructure, noisy neighbors are a major risk.
I would enforce per-tenant rate limits, quotas, query limits, and background job limits.
Large tenants can be isolated into dedicated queues or clusters.
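Per-tenant rate limiting is often done with a token bucket per tenant. The sketch below is a single-process, in-memory version under that assumption; `TenantRateLimiter` is a hypothetical name, the limits reuse the example numbers above, and a shared deployment would need a distributed store instead of a local dict.

```python
import time

# Requests per minute by plan, taken from the example above.
PLAN_LIMITS_PER_MIN = {"free": 100, "enterprise": 10_000}


class TenantRateLimiter:
    """In-memory token bucket keyed by tenant_id (single process only)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so the limiter can be tested deterministically
        self._buckets = {}   # tenant_id -> (tokens, last_refill_time)

    def allow(self, tenant_id: str, plan: str) -> bool:
        capacity = PLAN_LIMITS_PER_MIN[plan]
        refill_per_second = capacity / 60.0
        now = self._clock()
        tokens, last = self._buckets.get(tenant_id, (float(capacity), now))
        # Refill for the elapsed time, never exceeding the bucket capacity.
        tokens = min(float(capacity), tokens + (now - last) * refill_per_second)
        if tokens >= 1.0:
            self._buckets[tenant_id] = (tokens - 1.0, now)
            return True
        self._buckets[tenant_id] = (tokens, now)
        return False
```

Each tenant drains only its own bucket, so one tenant exhausting its quota does not affect the requests of another.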
1️⃣2️⃣ Tenant-aware Background Jobs
Problem
Background jobs can also leak tenant data or overload shared resources.
Examples:
- Report generation
- Email notifications
- Billing jobs
- Data export
- Cleanup jobs
Design
Every job should include:
{
"tenantId": "t123",
"jobType": "generate_report",
"requestedBy": "u456"
}
Controls
- Per-tenant job queues
- Job concurrency limits
- Tenant-aware retry
- Tenant-level cancellation
- Job audit logs
👉 Interview Answer
Background jobs must also be tenant-aware.
Every job should include tenant_id, enforce tenant permissions, and respect tenant quotas.
Otherwise, asynchronous processing can become a source of data leakage or noisy-neighbor issues.
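A simple way to picture per-tenant job queues with concurrency limits is the in-memory scheduler below. It is a sketch only: `TenantJobScheduler` is a hypothetical name, jobs are plain dicts shaped like the example above, and a real system would sit on a durable queue.

```python
from collections import defaultdict, deque


class TenantJobScheduler:
    """Per-tenant FIFO queues with a cap on concurrently running jobs per tenant."""

    def __init__(self, max_running_per_tenant: int):
        self._max = max_running_per_tenant
        self._queues = defaultdict(deque)  # tenant_id -> pending jobs
        self._running = defaultdict(int)   # tenant_id -> jobs in flight

    def submit(self, job: dict) -> None:
        tenant_id = job.get("tenantId")
        if not tenant_id:
            # Reject jobs without tenant context instead of guessing.
            raise ValueError("job is missing tenantId")
        self._queues[tenant_id].append(job)

    def next_job(self):
        """Return the next runnable job, skipping tenants that are at their cap."""
        for tenant_id, queue in self._queues.items():
            if queue and self._running[tenant_id] < self._max:
                self._running[tenant_id] += 1
                return queue.popleft()
        return None

    def finish(self, job: dict) -> None:
        """Release the concurrency slot held by a completed job."""
        self._running[job["tenantId"]] -= 1
```

With a cap of 2, a tenant that submits hundreds of reports can occupy at most two workers, so other tenants' jobs still get picked up. This version scans tenants in insertion order; a fairer scheduler would rotate the starting tenant.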
1️⃣3️⃣ Billing and Usage Tracking
Usage Metrics
Track per tenant:
- API calls
- Active users
- Storage usage
- Compute usage
- Data exports
- Emails sent
- Events processed
- Feature usage
Billing Flow
Usage events emitted
→ Usage aggregation
→ Tenant invoice generated
→ Payment collected
→ Billing records stored
Important Rule
Billing data must be auditable.
👉 Interview Answer
A multi-tenant system should track usage by tenant.
Usage metrics support billing, quota enforcement, capacity planning, and customer reporting.
Billing records should be auditable and not depend only on volatile counters.
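To make "auditable, not volatile counters" concrete: one approach is to treat usage as an append-only list of events and derive totals from it. The sketch below assumes each event carries a unique `event_id` so replays can be deduplicated; `UsageEvent` and `aggregate_usage` are hypothetical names.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class UsageEvent:
    event_id: str   # unique per event, used to drop duplicate deliveries
    tenant_id: str
    metric: str     # e.g. "api_calls", "emails_sent"
    quantity: int


def aggregate_usage(events):
    """Sum usage per tenant and metric, counting each event_id at most once."""
    seen = set()
    totals = defaultdict(lambda: defaultdict(int))
    for event in events:
        if event.event_id in seen:
            continue  # duplicate delivery; already counted
        seen.add(event.event_id)
        totals[event.tenant_id][event.metric] += event.quantity
    return {tenant: dict(metrics) for tenant, metrics in totals.items()}
```

Because totals are recomputed from stored events, an invoice line can be traced back to the exact events behind it, and re-running the aggregation gives the same result.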
1️⃣4️⃣ Tenant Onboarding and Offboarding
Onboarding Flow
Create tenant
→ Create tenant config
→ Set plan and quotas
→ Create admin user
→ Provision resources if needed
→ Enable feature flags
→ Send welcome notification
Offboarding Flow
Disable tenant access
→ Stop background jobs
→ Export data if needed
→ Apply retention policy
→ Delete or archive tenant data
→ Remove integrations
Enterprise Onboarding
May include:
- SSO setup
- Dedicated database
- Custom domain
- Compliance review
- Data residency configuration
👉 Interview Answer
Tenant lifecycle should be explicit.
Onboarding creates configuration, users, quotas, and resources.
Offboarding must disable access, stop jobs, handle exports, enforce retention, and delete or archive tenant data safely.
1️⃣5️⃣ Security and Compliance
Main Risks
- Cross-tenant data leakage
- Privilege escalation
- Misconfigured tenant access
- Shared cache leakage
- Logs containing sensitive data
- Incorrect data deletion
- Over-permissive admin tools
Controls
- Tenant-aware authorization
- Row-level security
- Strong audit logs
- Encryption at rest and in transit
- Tenant-scoped admin tools
- Cache keys include tenant_id
- Data retention and deletion workflows
- Security tests for cross-tenant access
👉 Interview Answer
Security is the most important part of multi-tenancy.
Every layer must be tenant-aware: API, authorization, data access, cache, logs, metrics, admin tools, and background jobs.
Cache keys and logs must include tenant context to avoid cross-tenant leakage.
1️⃣6️⃣ Caching Strategy
Risk
Shared cache can leak data.
Bad cache key:
user:123:orders
Better cache key:
tenant:t123:user:123:orders
Cache Rules
- Include tenant_id in cache keys
- Separate cache namespaces for dedicated tenants
- Apply tenant-specific TTLs
- Avoid caching sensitive data if unnecessary
- Invalidate by tenant
👉 Interview Answer
Caching must be tenant-aware.
Every cache key should include tenant_id.
Otherwise, two tenants with identical user IDs or resource IDs could accidentally read each other’s cached data.
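The cache rules can be sketched with a small wrapper that builds every key from tenant_id. `TenantCache` is a hypothetical in-memory stand-in for a real cache, and it assumes tenant IDs never contain the `:` separator.

```python
class TenantCache:
    """Dict-backed cache whose keys always start with the tenant namespace."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def key(tenant_id, *parts) -> str:
        if not tenant_id:
            # Refuse to build a key that could be shared across tenants.
            raise ValueError("tenant_id is required for cache keys")
        return ":".join(["tenant", str(tenant_id), *(str(p) for p in parts)])

    def set(self, tenant_id, parts, value) -> None:
        self._store[self.key(tenant_id, *parts)] = value

    def get(self, tenant_id, parts):
        return self._store.get(self.key(tenant_id, *parts))

    def invalidate_tenant(self, tenant_id) -> int:
        """Drop every entry for one tenant; returns how many were removed."""
        prefix = self.key(tenant_id) + ":"
        stale = [k for k in self._store if k.startswith(prefix)]
        for k in stale:
            del self._store[k]
        return len(stale)
```

`TenantCache.key("t123", "user", 123, "orders")` produces `tenant:t123:user:123:orders`, matching the "better cache key" shape above. The trailing `:` in the invalidation prefix keeps tenant `t123` from also clearing entries for a tenant named `t1234`.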
1️⃣7️⃣ Observability
Tenant-level Metrics
Track:
- API latency by tenant
- Error rate by tenant
- Request volume by tenant
- Storage usage by tenant
- Background job lag by tenant
- Rate limit hits by tenant
- Feature usage by tenant
- Cost by tenant
Why Important?
Tenant-level observability helps with:
- Debugging customer issues
- Detecting noisy tenants
- Billing
- Capacity planning
- SLA reporting
- Security investigations
👉 Interview Answer
Observability should be tenant-aware.
I would tag logs, metrics, traces, and audit events with tenant_id.
This allows us to debug tenant-specific issues, detect noisy neighbors, calculate cost, and support enterprise SLAs.
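Tagging logs with tenant_id can be done without changing every log call. The sketch below uses Python's standard `logging` and `contextvars` modules; the logger name, the log format, and the `current_tenant` variable are assumptions made for illustration.

```python
import contextvars
import logging

# Set once per request after tenant resolution; "unknown" marks untagged code paths.
current_tenant = contextvars.ContextVar("current_tenant", default="unknown")


class TenantFilter(logging.Filter):
    """Attach the current tenant_id to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.tenant_id = current_tenant.get()
        return True  # never drop records, only annotate them


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes tenant-tagged lines to the given stream."""
    logger = logging.getLogger("tenant_app")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.addFilter(TenantFilter())
    handler.setFormatter(logging.Formatter("%(levelname)s tenant=%(tenant_id)s %(message)s"))
    logger.handlers = [handler]
    return logger
```

A request handler would call `current_tenant.set(...)` after resolving tenant context and reset it when the request ends. The same context variable could feed metric labels and trace attributes.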
1️⃣8️⃣ Scaling Patterns
Pattern 1: Shared Infrastructure for Small Tenants
Cost-efficient.
Pattern 2: Dedicated Infrastructure for Large Tenants
Better isolation and SLA.
Pattern 3: Shard by Tenant ID
shard = hash(tenant_id) % N
Pattern 4: Tenant-aware Rate Limiting
Protect shared resources.
Pattern 5: Control Plane / Data Plane Separation
Control plane: tenant config, billing, provisioning
Data plane: tenant requests, data processing
👉 Interview Answer
To scale multi-tenancy, I would shard by tenant_id, use shared infrastructure for small tenants, dedicated infrastructure for enterprise tenants, and enforce tenant-level rate limits.
Separating control plane from data plane also helps manage provisioning and runtime traffic cleanly.
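The sharding formula can be sketched as below. One detail worth calling out: Python's built-in `hash()` is randomized per process for strings, so this sketch uses SHA-256 to get a stable value. The `DEDICATED_SHARDS` map and shard names are hypothetical.

```python
import hashlib

# Hypothetical pinning of large tenants to dedicated infrastructure.
DEDICATED_SHARDS = {"big_enterprise": "dedicated-1"}


def shard_for_tenant(tenant_id: str, num_shards: int) -> str:
    """Pick a shard: pinned tenants first, otherwise stable hash modulo N."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    if tenant_id in DEDICATED_SHARDS:
        return DEDICATED_SHARDS[tenant_id]
    # SHA-256 gives the same result in every process, unlike built-in hash().
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % num_shards
    return f"shared-{index}"
```

Plain modulo hashing remaps most tenants when N changes. If shards are added often, a lookup table or consistent hashing is a common alternative.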
1️⃣9️⃣ Failure Handling
Common Failures
- Tenant config unavailable
- Wrong tenant context
- Cross-tenant query bug
- Noisy tenant overloads shared DB
- Background job processes wrong tenant
- Cache key missing tenant_id
- Tenant migration fails
- Dedicated tenant database unavailable
Strategies
- Fail closed if tenant context is missing
- Tenant-aware tests
- Row-level security
- Per-tenant circuit breakers
- Per-tenant rate limits
- Tenant migration rollback
- Audit logs for all admin actions
- Dedicated tenant failover strategy
👉 Interview Answer
If tenant context is missing, the system should fail closed rather than guess.
Cross-tenant data access is a severe security incident.
I would use tenant-aware tests, row-level security, audit logs, and per-tenant isolation controls to reduce risk.
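A per-tenant circuit breaker, combined with failing closed on missing context, might look like this simplified sketch. `TenantCircuitBreaker` is a hypothetical name; after the cooldown it lets traffic through again with the failure count primed so that a single further failure reopens the circuit.

```python
import time


class TenantCircuitBreaker:
    """Tracks failures per tenant so one failing tenant does not trip the others."""

    def __init__(self, failure_threshold: int, cooldown_seconds: float, clock=time.monotonic):
        self._threshold = failure_threshold
        self._cooldown = cooldown_seconds
        self._clock = clock      # injectable for deterministic tests
        self._failures = {}      # tenant_id -> consecutive failure count
        self._opened_at = {}     # tenant_id -> time the circuit opened

    def allow(self, tenant_id: str) -> bool:
        if not tenant_id:
            return False  # fail closed when tenant context is missing
        opened_at = self._opened_at.get(tenant_id)
        if opened_at is None:
            return True
        if self._clock() - opened_at >= self._cooldown:
            # Cooldown elapsed: allow traffic, but one more failure reopens.
            del self._opened_at[tenant_id]
            self._failures[tenant_id] = self._threshold - 1
            return True
        return False

    def record_success(self, tenant_id: str) -> None:
        self._failures[tenant_id] = 0

    def record_failure(self, tenant_id: str) -> None:
        count = self._failures.get(tenant_id, 0) + 1
        self._failures[tenant_id] = count
        if count >= self._threshold:
            self._opened_at[tenant_id] = self._clock()
```

If a dedicated tenant database goes down, only that tenant's circuit opens while requests for other tenants continue normally.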
2️⃣0️⃣ Consistency Model
Stronger Consistency Needed For
- Tenant membership
- Authorization
- Tenant configuration
- Billing records
- Data deletion
- Admin actions
- Security policies
Eventual Consistency Acceptable For
- Usage dashboards
- Analytics
- Feature rollout propagation
- Search indexing
- Logs and metrics aggregation
- Non-critical notifications
👉 Interview Answer
Tenant membership, authorization, billing, data deletion, and security policies need strong consistency.
Usage dashboards, analytics, search indexing, and feature rollout propagation can usually be eventually consistent.
2️⃣1️⃣ End-to-End Flow
Tenant Request Flow
Request arrives
→ Authenticate user
→ Resolve tenant context
→ Authorize user against tenant
→ Load tenant config and feature flags
→ Enforce quotas/rate limits
→ Query data with tenant isolation
→ Return response
→ Emit tenant-scoped logs and metrics
Tenant Onboarding Flow
Create tenant
→ Provision config and quotas
→ Create admin membership
→ Configure features
→ Provision dedicated resources if needed
→ Start usage tracking
Tenant Offboarding Flow
Disable tenant access
→ Stop jobs
→ Export/archive data
→ Apply retention policy
→ Delete tenant data
→ Record audit trail
Key Insight
A multi-tenant system is more than a tenant_id column on every table. It is an end-to-end isolation model across data, access, config, compute, cache, logs, and billing.
🧠 Staff-Level Answer (Final)
👉 Interview Answer (Full Version)
When designing a multi-tenant system, I think of it as a shared platform that serves many customers while preserving strong tenant isolation.
The most important principle is that tenant_id must be a first-class concept.
Every request should resolve tenant context after authentication, verify that the user belongs to that tenant, load tenant-specific configuration and feature flags, enforce quotas, and access data only within that tenant boundary.
For data isolation, there are three common models: shared tables with tenant_id, separate schema per tenant, and separate database per tenant.
Shared tables are cost-efficient for many small tenants, while separate databases provide stronger isolation for enterprise or regulated tenants.
In practice, I would use a hybrid model: shared infrastructure for most tenants, and dedicated databases or clusters for large or regulated tenants.
Authorization should be tenant-aware. Authentication tells us who the user is; authorization tells us which tenant and resources they can access.
I would use RBAC for common roles and ABAC for more advanced enterprise policies.
The data access layer should enforce tenant filters automatically, using tenant-scoped repositories, row-level security, or tenant-specific database connections.
I would not rely on developers manually adding tenant_id filters in every query.
Configuration, feature flags, rate limits, quotas, background jobs, caches, logs, metrics, and admin tools must all include tenant context.
To prevent noisy-neighbor problems, I would enforce per-tenant rate limits, storage limits, background job limits, query timeouts, and dedicated queues for large tenants.
Security is critical. Cache keys must include tenant_id, logs must be tenant-scoped, admin actions must be audited, and requests with missing or mismatched tenant context should fail closed.
The main trade-offs are among isolation, cost, operational complexity, scalability, and compliance.
Ultimately, the goal is to let many tenants share the same platform efficiently while ensuring that each tenant experiences the system as secure, isolated, reliable, and configurable.
⭐ Final Insight
The core of a multi-tenant system is not simply adding tenant_id to tables, but making the entire chain (data, auth, config, cache, compute, logs, and billing) tenant-isolated.