Performance & Scalability
Spec Source: Document 13 — Performance & Scalability | Last Updated: February 2026
Overview
This document defines the performance targets, caching strategy, database optimization approach, media handling pipeline, monitoring architecture, and scalability plan for the DoCurious platform. These are engineering benchmarks — they inform infrastructure decisions, set contractual SLAs, and provide the goalposts that every backend and frontend change is measured against.
DoCurious serves a school-heavy user base with predictable traffic patterns: weekday mornings ramp up with school logins, afternoons peak with post-school personal usage, and September/January onboarding windows create seasonal spikes. The architecture is designed to handle these patterns efficiently at every growth phase from 500-user beta through 500K+ mature platform.
STATUS: PARTIAL
The frontend implements route-level code splitting via React.lazy (50+ lazy-loaded page components), Suspense-based loading states with PageSkeleton, tree shaking through Vite 7's production build, and Tailwind CSS purging. No backend, CDN, caching layer, monitoring infrastructure, or load testing pipeline exists yet — the platform runs entirely against an in-memory mock API. Performance targets from the spec are documented here as the contracts the future backend must meet.
How It Works
Load Projections
DoCurious plans for five growth phases. Each phase determines infrastructure sizing, auto-scaling thresholds, and cost projections.
| Phase | Timeline | Total Users | DAU | Schools | Vendors | Avg req/s | Peak req/s |
|---|---|---|---|---|---|---|---|
| Beta | Months 1--3 | 500 | 100 | 5 | 10 | 5 | 20 |
| Launch | Months 3--6 | 5,000 | 1,000 | 25 | 50 | 50 | 200 |
| Growth | Months 6--12 | 25,000 | 5,000 | 100 | 200 | 250 | 1,000 |
| Scale | Year 2 | 100,000 | 20,000 | 500 | 500 | 1,000 | 5,000 |
| Mature | Year 3+ | 500,000+ | 100,000+ | 2,000+ | 2,000+ | 5,000 | 25,000 |
Traffic follows a school-driven daily pattern: near-zero overnight, morning ramp at 7--9 AM (school logins, teacher assignments), midday plateau at 9 AM--3 PM (steady school usage), afternoon peak at 3--6 PM (post-school personal usage, TR uploads), moderate evening at 6--10 PM (parent dashboards), then wind-down. Weekdays carry higher traffic than weekends. September and January see onboarding spikes from new school-year starts.
STATUS: PLANNED
Load projections are spec targets. No production traffic data exists yet.
Response Time Targets
Every API endpoint and page transition has latency contracts at three percentiles. These are the numbers the backend must meet and the monitoring system must enforce.
| Operation | p50 Target | p95 Target | p99 Target |
|---|---|---|---|
| Page load (initial) | < 1.5s | < 3.0s | < 5.0s |
| Page load (subsequent / SPA navigation) | < 300ms | < 800ms | < 1.5s |
| API response (simple read) | < 50ms | < 150ms | < 300ms |
| API response (complex query) | < 150ms | < 500ms | < 1.0s |
| Search query | < 100ms | < 200ms | < 500ms |
| Recommendation feed | < 150ms | < 300ms | < 600ms |
| Image upload (acknowledgment) | < 200ms | < 500ms | < 1.0s |
| Image processing (async) | < 5s | < 15s | < 30s |
| Notification delivery (in-app) | < 500ms | < 1.0s | < 2.0s |
| Email delivery (to ESP) | < 2s | < 5s | < 10s |
STATUS: PLANNED
These targets are engineering contracts. No backend exists to measure against them.
Web Vitals Targets
Frontend performance is measured against Core Web Vitals. These targets align with Google's "Good" thresholds and ensure the app feels fast on school Chromebooks and budget phones alike.
| Metric | Target | What It Measures |
|---|---|---|
| LCP (Largest Contentful Paint) | < 2.5s | When main content loads |
| FID (First Input Delay) | < 100ms | Responsiveness to first interaction |
| CLS (Cumulative Layout Shift) | < 0.1 | Visual stability — no layout jank |
| TTFB (Time to First Byte) | < 600ms | Server response time |
| FCP (First Contentful Paint) | < 1.8s | When first content appears on screen |
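Real-user measurement is how these targets get enforced once the app is in production. A minimal reporting sketch, assuming the web-vitals npm package (v3, which still exports onFID; v4 replaces FID with INP) and a hypothetical /api/v1/rum collection endpoint:

```ts
// Client-side Web Vitals reporter sketch. The /api/v1/rum endpoint does not
// exist yet and is an assumption; metric names map to the table above.
import { onLCP, onFID, onCLS, onTTFB, onFCP, type Metric } from 'web-vitals';

function report(metric: Metric): void {
  const body = JSON.stringify({
    name: metric.name,     // "LCP" | "FID" | "CLS" | "TTFB" | "FCP"
    value: metric.value,   // milliseconds (CLS is unitless)
    rating: metric.rating, // "good" | "needs-improvement" | "poor"
    id: metric.id,
  });
  // sendBeacon survives page unloads; fall back to fetch with keepalive.
  if (!navigator.sendBeacon?.('/api/v1/rum', body)) {
    fetch('/api/v1/rum', { method: 'POST', body, keepalive: true });
  }
}

onLCP(report);
onFID(report);
onCLS(report);
onTTFB(report);
onFCP(report);
```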
Availability Targets
| Metric | Target |
|---|---|
| Uptime | 99.9% (8.76 hours downtime/year max) |
| Planned maintenance window | < 2 hours/month, scheduled outside school hours |
| Mean time to detect (MTTD) | < 5 minutes |
| Mean time to recover (MTTR) | < 1 hour (critical), < 4 hours (non-critical) |
Throughput Targets
| Resource | Target |
|---|---|
| Concurrent users (per server) | 500 |
| WebSocket connections (if used) | 10,000 per node |
| File uploads (concurrent) | 100 per server |
| Background job processing | 1,000 jobs/minute |
| Email sends | 10,000/hour |
| Push notifications | 50,000/hour |
Caching Strategy
Cache Layers
The caching architecture has four tiers. A request is served by the first tier that can answer it; only a miss at every tier reaches the database.
[Browser Cache] → [CDN] → [Application Cache (Redis)] → [Database]

Browser cache. Static assets (JS, CSS, images) get Cache-Control: max-age=31536000 (one year) with content-hash filenames — Vite already generates hashed filenames in production builds. API responses use no-cache for dynamic data and short TTLs for semi-static content (categories, featured challenges). A Service Worker caches offline-capable assets.
CDN. All static assets are served via CDN (CloudFront, Cloudflare, or equivalent). User-uploaded media is served via CDN with signed URLs for access control. Geographic distribution reduces latency for users across different school districts. Cache invalidation uses versioned filenames for assets and explicit purge for media.
Application cache (Redis). Hot data lives in Redis with TTL-based expiration. This layer absorbs the vast majority of read traffic.
| Data | TTL | Invalidation Trigger |
|---|---|---|
| User session | 30 days | Logout, password change |
| User profile | 15 minutes | Profile update |
| Challenge detail | 1 hour | Challenge edit |
| Challenge quality scores | 24 hours | Daily recompute |
| Recommendation feeds | 1 hour | Recompute |
| Popular/trending feeds | 6 hours | Daily recompute |
| Category list | 24 hours | Category change |
| Search autocomplete | 6 hours | Index update |
| Notification count (unread) | Real-time | New notification / read |
| Feature flags | 5 minutes | Flag change |
| Rate limit counters | Per window | Automatic expiry |
Database query cache. Prepared statement caching, connection pooling via PgBouncer, and materialized views for expensive aggregations. Analytics views refresh every 15 minutes; leaderboards refresh every hour.
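A minimal read-through sketch of the Redis tier, assuming ioredis; key names, the REDIS_URL fallback, and the loader callback are illustrative, while the TTLs mirror the table above:

```ts
// Read-through cache sketch with ioredis. Key naming and connection details
// are assumptions, not settled conventions.
import Redis from 'ioredis';

const redis = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');

const TTL_SECONDS = {
  userProfile: 15 * 60,       // 15 minutes
  challengeDetail: 60 * 60,   // 1 hour
  categoryList: 24 * 60 * 60, // 24 hours
} as const;

// Return the cached value if present; otherwise load from the database,
// cache it with a TTL, and return it.
async function cached<T>(key: string, ttlSeconds: number, load: () => Promise<T>): Promise<T> {
  const hit = await redis.get(key);
  if (hit !== null) return JSON.parse(hit) as T;

  const value = await load();
  await redis.set(key, JSON.stringify(value), 'EX', ttlSeconds);
  return value;
}

// Write-through invalidation: drop the key when the underlying row changes.
async function invalidateChallenge(challengeId: string): Promise<void> {
  await redis.del(`challenge:${challengeId}`);
}

// Usage (loadChallengeFromDb is a placeholder for the real query):
// const challenge = await cached(`challenge:${id}`, TTL_SECONDS.challengeDetail,
//                                () => loadChallengeFromDb(id));
```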
STATUS: PLANNED
No caching infrastructure exists. The frontend runs against in-memory mock data. Vite's content-hashed filenames are the only cache-related behavior currently in place.
Cache Warming
On deployment or cache flush, the system pre-warms critical data to avoid cold-cache latency spikes (a staggered warming sketch follows this list):
- Pre-warm category list
- Pre-warm featured content
- Pre-warm popular/trending feeds
- Pre-warm top 1,000 challenge details
- Stagger warming requests to avoid thundering herd
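A staggered warm-up sketch; the endpoint paths and the 50 ms spacing are assumptions rather than spec values:

```ts
// Cache warming sketch: hit a handful of hot endpoints after deploy so the
// normal read-through path repopulates Redis, spacing requests out to avoid
// a thundering herd against the origin.
const WARM_PATHS = [
  '/api/v1/categories',
  '/api/v1/challenges/featured',
  '/api/v1/feeds/popular',
  '/api/v1/feeds/trending',
];

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function warmCache(baseUrl: string): Promise<void> {
  for (const path of WARM_PATHS) {
    try {
      await fetch(`${baseUrl}${path}`); // populates the cache via the normal read path
    } catch (err) {
      console.warn(`cache warm failed for ${path}`, err); // log and continue
    }
    await sleep(50); // stagger requests instead of firing them all at once
  }
}
```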
Cache Invalidation Patterns
| Pattern | When Used | Example |
|---|---|---|
| Write-through | Update cache on write | User profile, challenge detail |
| TTL-based | Let cache expire naturally | Feeds, scores |
| Event-driven | Invalidate on event | Notification count on new notification |
| Never cache | Sensitive data | Authentication tokens, PII lookups |
Database Optimization
Database Selection
Primary: PostgreSQL. ACID compliance for transactional data, full-text search for initial search implementation, JSON/JSONB support for flexible schemas, row-level security for data isolation, and mature replication and backup tooling. The local development database (docurious_prod) runs PostgreSQL 17.
Cache/Queue: Redis. Session storage, application cache, rate limiting, job queues, and real-time notification counts.
Future (at scale): Elasticsearch or Meilisearch for search when PostgreSQL FTS is insufficient, a data warehouse for analytics separate from the transactional DB, and read replicas for query distribution.
Indexing Strategy
Every table gets primary key, created_at, and foreign key indexes. Critical query indexes are designed around the actual access patterns:
| Table | Index | Supports |
|---|---|---|
| challenges | (status, category_id, created_at) | Browse by category |
| challenges | (vendor_id, status) | Vendor dashboard |
| challenges | GIN index on search vector | Full-text search |
| track_records | (user_id, status, created_at DESC) | My Track Records |
| track_records | (challenge_id, status) | Challenge TR gallery |
| track_records | (status, created_at) | Verification queue |
| notifications | (user_id, read, created_at DESC) | Notification center |
| assignments | (class_id, due_date, status) | Student assignments |
| users | (email) UNIQUE | Login lookup |
| community_members | (community_id, user_id) UNIQUE | Membership check |
| events | (challenge_id, start_time) | Event calendar |
Partial indexes improve performance for common filtered queries: track_records WHERE status = 'pending_review' (verification queue), challenges WHERE status = 'active' (exclude archived from browse), notifications WHERE read = false (unread count).
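A migration-style sketch of a few of these indexes using node-postgres and raw DDL; index names are assumptions and the eventual migration tooling is undecided:

```ts
// Index creation sketch with the pg Pool. In a real setup this would live in
// the migration tool's up() step rather than a standalone script.
import { Pool } from 'pg';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

async function createIndexes(): Promise<void> {
  // Composite index serving "browse by category".
  await pool.query(`
    CREATE INDEX IF NOT EXISTS idx_challenges_browse
      ON challenges (status, category_id, created_at)
  `);

  // Partial index: only pending submissions, so verification-queue scans
  // never touch approved or rejected rows.
  await pool.query(`
    CREATE INDEX IF NOT EXISTS idx_track_records_pending
      ON track_records (created_at)
      WHERE status = 'pending_review'
  `);

  // Partial index backing the unread-notification badge count.
  await pool.query(`
    CREATE INDEX IF NOT EXISTS idx_notifications_unread
      ON notifications (user_id)
      WHERE read = false
  `);
}

createIndexes()
  .catch((err) => console.error(err))
  .finally(() => pool.end());
```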
Query Optimization Guidelines
- No N+1 queries — always use eager loading or joins for related data
- Cursor-based pagination for infinite scroll, offset-based for admin tables
- Maximum query time — 500ms hard limit, alert on > 200ms
- EXPLAIN ANALYZE on all new queries touching large tables
- No SELECT * — always specify columns
- Limit result sets — maximum 100 per page, 50 default
Connection Management
| Setting | Value |
|---|---|
| Connection pooling | PgBouncer (recommended) |
| Pool size | 20--50 connections per instance |
| Connection timeout | 5 seconds |
| Query timeout | 30 seconds (hard kill) |
| Idle connection cleanup | 10 minutes |
Data Partitioning (At Scale)
When tables exceed approximately 50 million rows, partition by month on created_at: audit logs, notifications, analytics events, search analytics. Track records may be partitioned by year if volume warrants it.
Image & Media Optimization
Image Pipeline
[User Upload] → [Validation] → [Virus Scan] → [EXIF Strip]
→ [Store Original] → [Generate Variants] → [CDN]

Five image variants are generated asynchronously for every upload:
| Variant | Dimensions | Quality | Use |
|---|---|---|---|
| Thumbnail | 150 x 150 (crop) | 80% | Grid views, lists |
| Small | 400px wide | 85% | Card previews |
| Medium | 800px wide | 85% | Detail views |
| Large | 1600px wide | 90% | Full-screen view |
| Original | As uploaded | 100% | Download, export |
Processing uses WebP format with JPEG fallback. All images below the fold use lazy loading. Responsive images are served via srcset.
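A variant-generation sketch assuming the sharp library and an already-validated upload buffer; storage and CDN upload are out of scope here:

```ts
// Generates the WebP variants from the table above. The Variant shape and
// function names are illustrative; a JPEG-fallback pass would mirror this.
import sharp from 'sharp';

interface Variant { name: string; width: number; height?: number; quality: number; }

// Thumbnail is a centre crop; the other sizes scale by width without enlarging.
const VARIANTS: Variant[] = [
  { name: 'thumbnail', width: 150, height: 150, quality: 80 },
  { name: 'small',  width: 400,  quality: 85 },
  { name: 'medium', width: 800,  quality: 85 },
  { name: 'large',  width: 1600, quality: 90 },
];

async function generateVariants(original: Buffer): Promise<Map<string, Buffer>> {
  const out = new Map<string, Buffer>();
  for (const v of VARIANTS) {
    const buffer = await sharp(original)
      .rotate() // honour EXIF orientation before metadata is dropped
      .resize(v.width, v.height, {
        fit: v.height ? 'cover' : 'inside',
        withoutEnlargement: true,
      })
      .webp({ quality: v.quality })
      .toBuffer(); // sharp strips EXIF by default, covering the EXIF-strip step
    out.set(v.name, buffer);
  }
  return out;
}
```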
Upload Limits
| User Type | Max File Size | Max Files per TR | Daily Upload Limit |
|---|---|---|---|
| General user | 50 MB | 20 | 500 MB |
| Student (under 13) | 25 MB | 10 | 200 MB |
| Teacher | 50 MB | 20 | 500 MB |
| Vendor | 100 MB | 50 | 2 GB |
| Admin | 100 MB | Unlimited | Unlimited |
Video Handling
Videos are not hosted on DoCurious. Users provide links (YouTube, Vimeo, etc.) and the platform embeds via oEmbed or iframe. Thumbnails are extracted from the embed provider. Direct video upload with transcoding is a planned enhancement.
Storage Architecture
Cloud object storage (S3 or equivalent) with bucket structure: /{environment}/{content-type}/{user-id}/{file-id}. Content types: avatars, challenge-covers, track-record-media, exports. Lifecycle policies delete orphaned files after 30 days. Cross-region replication provides disaster recovery.
STATUS: PLANNED
The frontend MediaUpload component handles client-side file selection and preview. No server-side image processing, CDN delivery, or storage backend exists.
Frontend Performance Budget
Bundle Strategy
STATUS: BUILT
The frontend implements route-level code splitting with React.lazy for 50+ page components, Suspense boundaries with PageSkeleton fallbacks, and Vite 7 production builds with tree shaking and Tailwind CSS purging. Auth pages (Login, Register, ForgotPassword, ResetPassword, VerifyEmail) are eagerly loaded for fast initial render; everything else is lazy.
Code splitting by route ensures each page loads only its own dependencies. The router at src/routes/index.tsx lazy-loads every page component except the five auth pages, which stay eager for a fast initial login.
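A condensed sketch of that pattern, assuming React Router; the real file wires 50+ routes and the import paths shown are illustrative:

```tsx
// Auth pages are imported eagerly; every other page goes through React.lazy
// behind a Suspense boundary that renders PageSkeleton while the chunk loads.
import { lazy, Suspense } from 'react';
import { Routes, Route } from 'react-router-dom';
import Login from '../pages/Login'; // eager: part of the initial bundle
import PageSkeleton from '../components/common/PageSkeleton';

// Each lazy() call becomes its own chunk in the Vite production build.
const Dashboard = lazy(() => import('../pages/Dashboard'));
const Explore = lazy(() => import('../pages/Explore'));

export function AppRoutes() {
  return (
    <Suspense fallback={<PageSkeleton />}>
      <Routes>
        <Route path="/login" element={<Login />} />
        <Route path="/dashboard" element={<Dashboard />} />
        <Route path="/explore" element={<Explore />} />
      </Routes>
    </Suspense>
  );
}
```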
| Resource | Budget |
|---|---|
| HTML document | < 50 KB |
| CSS (total) | < 100 KB gzipped |
| JS (initial route) | < 200 KB gzipped |
| Images (above fold) | < 500 KB total |
| Web fonts | < 100 KB total |
| Total page weight (initial) | < 1 MB |
| HTTP requests (initial) | < 30 |
Target total JS across all routes: < 1 MB gzipped.
Asset Optimization
- Images: WebP with JPEG fallback, responsive srcset
- Fonts: System font stack preferred; custom fonts use font-display: swap
- CSS: Critical CSS inlined, remainder async loaded
- Icons: SVG sprite or icon font (no individual image requests)
- Preload: Critical resources via <link rel="preload">
- Prefetch: Likely next navigation via <link rel="prefetch">
Rendering Strategy
| Page Type | Strategy | Examples |
|---|---|---|
| Landing, About, Privacy, Terms | Static generation | Landing.tsx, PrivacyPolicy.tsx, TermsOfService.tsx |
| Challenge browse, category pages | Incremental Static Regeneration | Explore.tsx, ExploreCategoryView.tsx |
| Dashboard, notifications, admin | Client-side rendering | Dashboard.tsx, Notifications.tsx |
| All pages (initial load) | Server-side rendering for SEO | Full SSR pass |
STATUS: PARTIAL
The current build is a pure client-side SPA (Vite + React). SSR, static generation, and ISR are planned architectural changes that will require a framework migration (e.g., to Next.js or Remix) or a custom SSR layer.
Background Job Architecture
Job Categories
| Category | Queue | Priority | Concurrency |
|---|---|---|---|
| Email delivery | email | Medium | 10 workers |
| Push notification | push | Medium | 5 workers |
| Image processing | media | Low | 5 workers |
| Search index update | search | Low | 3 workers |
| Recommendation recompute | recommendations | Low | 2 workers |
| Analytics aggregation | analytics | Low | 2 workers |
| Data export generation | exports | Low | 2 workers |
| Scheduled notifications | scheduler | Medium | 3 workers |
| Cleanup (expired tokens, etc.) | maintenance | Low | 1 worker |
Processing Requirements
- At-least-once delivery with idempotent job design
- Dead letter queue for failed jobs after 3 retries
- Exponential backoff on retry: 1s, 10s, 60s
- Job timeout: 5 minutes default, 30 minutes for exports and analytics
- Alert on: queue depth > 1,000, failure rate > 5%, processing latency > 5 minutes
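A queue/worker sketch assuming BullMQ over Redis (one of the candidates named later in this document); the 1s/10s/60s backoff and 10-worker concurrency come from this spec, while the queue payload and the deliverEmail call are placeholders:

```ts
// BullMQ sketch: enqueue with an idempotency key, retry three times with the
// spec's custom backoff schedule, and surface failures for the DLQ sweep.
import { Queue, Worker } from 'bullmq';

const connection = { host: 'localhost', port: 6379 };
const BACKOFF_MS = [1_000, 10_000, 60_000];

export const emailQueue = new Queue('email', { connection });

// At-least-once semantics: the job id doubles as an idempotency key.
export async function enqueueEmail(to: string, template: string): Promise<void> {
  await emailQueue.add('send', { to, template }, {
    jobId: `email:${template}:${to}`,
    attempts: 3,
    backoff: { type: 'custom' },
  });
}

const worker = new Worker('email', async (job) => {
  // deliverEmail is a placeholder for the real ESP call; it must be idempotent.
  // await deliverEmail(job.data.to, job.data.template);
}, {
  connection,
  concurrency: 10, // matches the email queue's 10 workers
  settings: {
    // 1s, 10s, 60s instead of BullMQ's default exponential curve.
    backoffStrategy: (attemptsMade: number) => BACKOFF_MS[attemptsMade - 1] ?? 60_000,
  },
});

// After the third failed attempt the job stays in the failed set; a separate
// sweep would move it to a dead-letter queue and raise the failure-rate alert.
worker.on('failed', (job, err) => console.error(`email job ${job?.id} failed`, err));
```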
Scheduled Jobs
| Job | Schedule | Purpose |
|---|---|---|
| Daily analytics aggregation | 2:00 AM UTC | Compute daily metrics |
| Challenge quality score recompute | 3:00 AM UTC | Update quality/trending scores |
| Recommendation feed generation | Every hour | Refresh user feeds |
| Popular/trending feed update | 4:00 AM UTC | Update global feeds |
| Digest email generation | 6:00 AM per timezone | Weekly/daily digests |
| Streak check | 12:01 AM per timezone | Evaluate streak status |
| Expired token cleanup | 1:00 AM UTC | Remove expired sessions, reset tokens |
| Orphaned file cleanup | 5:00 AM UTC (weekly) | Remove unattached uploads |
| Backup verification | 6:00 AM UTC (weekly) | Verify backup integrity |
| Leaderboard recompute | Every hour | Update leaderboard rankings |
| School health score update | 4:00 AM UTC (daily) | Recompute school health |
| Data retention cleanup | 3:00 AM UTC (monthly) | Remove expired data per retention policy |
STATUS: PLANNED
No background job infrastructure exists. All scheduled computations will need a queue system (SQS, BullMQ, or equivalent) and worker processes.
Monitoring & Alerting
Application Monitoring
Metrics to track per endpoint: request rate, response time (p50, p95, p99), and error rate (4xx, 5xx). Platform-wide: active users (real-time), database query time per query pattern, cache hit/miss ratio, and background job queue depth and latency.
Recommended tooling: APM via Datadog, New Relic, or open-source Grafana + Prometheus. Error tracking via Sentry. Log aggregation via ELK stack, Datadog Logs, or CloudWatch.
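A per-endpoint instrumentation sketch assuming prom-client with an Express-style API server and a Prometheus scrape of /metrics; the bucket boundaries are chosen around the API latency targets above:

```ts
// Express middleware recording a latency histogram labelled by route, method,
// and status, plus the default process metrics (CPU, memory, event-loop lag).
import express from 'express';
import client from 'prom-client';

const app = express();
client.collectDefaultMetrics();

const httpDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Request latency by route, method, and status',
  labelNames: ['route', 'method', 'status'],
  buckets: [0.05, 0.1, 0.15, 0.3, 0.5, 1, 2, 5], // seconds, around the p50/p95/p99 targets
});

app.use((req, res, next) => {
  const end = httpDuration.startTimer();
  res.on('finish', () => {
    end({
      route: req.route?.path ?? req.path,
      method: req.method,
      status: String(res.statusCode),
    });
  });
  next();
});

// Prometheus scrape endpoint.
app.get('/metrics', async (_req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.send(await client.register.metrics());
});
```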
Infrastructure Monitoring
| Metric | Alert Threshold |
|---|---|
| CPU utilization | 80% |
| Memory utilization | 85% |
| Disk utilization | 80% |
| Database connections (active/idle/waiting) | 90% pool used |
| Redis memory usage | 80% |
| SSL certificate expiry | < 14 days |
Alerting Rules
| Alert | Condition | Severity | Channel |
|---|---|---|---|
| High error rate | 5xx rate > 1% for 5 min | Critical | Pager + Slack |
| Slow responses | p95 > 2s for 10 min | High | Slack |
| Database connection exhaustion | > 90% pool used | Critical | Pager + Slack |
| Disk space low | > 80% used | High | Slack |
| Job queue backing up | Depth > 1,000 for 15 min | High | Slack |
| Certificate expiring | < 14 days | Medium | |
| Memory pressure | > 90% for 5 min | High | Slack |
| Zero traffic | No requests for 5 min | Critical | Pager |
Operations Dashboards
Three dashboards are specified:
Operations Dashboard: Request rate, error rate, response time (real-time), active users, system health (CPU, memory, disk), recent deployments marked on timeline.
Database Dashboard: Query rate and latency, connection pool status, slow query log, replication lag.
Background Jobs Dashboard: Queue depths by category, processing rate, failure rate, recent failures with details.
STATUS: PLANNED
No monitoring infrastructure exists. The frontend has a dev-only Debug Panel (Ctrl+Shift+D) with a Network Log tab that records mock API calls — this is a development tool, not production monitoring.
Scalability Architecture
Horizontal Scaling
| Component | Scaling Approach | Trigger |
|---|---|---|
| Web/API servers | Auto-scale based on CPU / request rate | CPU > 70% or requests > 80% capacity |
| Background workers | Scale by queue depth | Queue depth > 500 |
| Database (reads) | Add read replicas | Read latency > 100ms or CPU > 70% |
| Database (writes) | Vertical scale first, partition later | Write latency > 50ms |
| Redis | Cluster mode at scale | Memory > 80% |
| File storage | Managed service (auto-scales) | N/A |
| CDN | Managed service (auto-scales) | N/A |
Architecture Scaling Tiers
Small (up to 5K users): Single application server (or 2 for redundancy), single PostgreSQL instance, single Redis instance, managed file storage (S3), CDN for static assets.
Medium (5K--50K users): Auto-scaling application servers (2--5 instances), PostgreSQL with read replica, Redis with persistence, dedicated background worker instances, Elasticsearch for search replacing PostgreSQL FTS.
Large (50K--500K users): 5--20 application server instances, PostgreSQL primary + multiple read replicas, Redis cluster, dedicated analytics data warehouse, dedicated search cluster, microservice extraction for high-traffic paths (notifications, recommendations).
Database Read/Write Split
At the Medium tier and above: write operations go to the primary database, read operations go to read replica(s). Application-level routing handles the split. Replication lag is monitored with alerts if it exceeds 1 second. Critical reads (authentication, authorization) always hit the primary.
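A sketch of the application-level routing, assuming two node-postgres pools; the environment variable names and the forcePrimary flag are illustrative:

```ts
// Read/write split sketch: writes and auth-critical reads go to the primary,
// everything else may be served by a replica that can lag by up to ~1 second.
import { Pool, QueryResult } from 'pg';

const primary = new Pool({ connectionString: process.env.PRIMARY_DATABASE_URL });
const replica = new Pool({ connectionString: process.env.REPLICA_DATABASE_URL });

// Simple heuristic; CTE-wrapped writes would need an explicit forcePrimary.
const WRITE_PREFIX = /^\s*(insert|update|delete)\b/i;

export function query(
  sql: string,
  params: unknown[] = [],
  opts: { forcePrimary?: boolean } = {},
): Promise<QueryResult> {
  const pool = opts.forcePrimary || WRITE_PREFIX.test(sql) ? primary : replica;
  return pool.query(sql, params);
}

// Usage: authentication always reads from the primary.
// const user = await query(
//   'SELECT id, password_hash FROM users WHERE email = $1',
//   [email],
//   { forcePrimary: true },
// );
```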
API Design Standards
Response Format
All API responses follow a consistent envelope:
Success:
```json
{
"data": { ... },
"meta": {
"page": 1,
"per_page": 20,
"total": 150,
"total_pages": 8
}
}
```

Error:

```json
{
"error": {
"code": "VALIDATION_ERROR",
"message": "Human-readable description",
"details": [
{ "field": "email", "message": "Invalid email format" }
]
}
}
```

Pagination
Cursor-based (for feeds, infinite scroll): ?cursor={opaque_token}&limit=20. Response includes next_cursor (null if no more results). Default limit 20, max 100.
Offset-based (for admin tables, finite lists): ?page=1&per_page=20. Response includes total count and total pages. Default per_page 20, max 100.
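A keyset-pagination sketch of the cursor style, assuming node-postgres; table and column names (and the uuid id type) are illustrative:

```ts
// Cursor pagination: the cursor is an opaque base64url wrapper around the
// last row's (created_at, id) pair, giving stable ordering even when new
// rows are inserted between requests.
import { Pool } from 'pg';

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

const encodeCursor = (createdAt: string, id: string): string =>
  Buffer.from(JSON.stringify({ createdAt, id })).toString('base64url');
const decodeCursor = (cursor: string): { createdAt: string; id: string } =>
  JSON.parse(Buffer.from(cursor, 'base64url').toString('utf8'));

export async function feedPage(cursor: string | null, limit = 20) {
  const capped = Math.min(limit, 100); // default 20, max 100
  const after = cursor ? decodeCursor(cursor) : null;

  const { rows } = await pool.query(
    `SELECT id, title, created_at
       FROM challenges
      WHERE status = 'active'
        AND ($1::timestamptz IS NULL OR (created_at, id) < ($1::timestamptz, $2::uuid))
      ORDER BY created_at DESC, id DESC
      LIMIT $3`,
    [after?.createdAt ?? null, after?.id ?? null, capped],
  );

  const last = rows.at(-1);
  return {
    data: rows,
    meta: {
      next_cursor:
        rows.length === capped && last ? encodeCursor(last.created_at, last.id) : null,
    },
  };
}
```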
Rate Limiting
Every response includes rate limit headers:
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 95
X-RateLimit-Reset: 1704067200
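A fixed-window limiter sketch that emits these headers, assuming Express and ioredis; the 100-requests-per-hour window is taken from the example values above, not from a settled policy:

```ts
// Fixed-window rate limiter: one Redis counter per client per window, with
// the three X-RateLimit-* headers set on every response.
import type { Request, Response, NextFunction } from 'express';
import Redis from 'ioredis';

const redis = new Redis(process.env.REDIS_URL ?? 'redis://localhost:6379');
const LIMIT = 100;
const WINDOW_SECONDS = 3600;

export async function rateLimit(req: Request, res: Response, next: NextFunction): Promise<void> {
  const windowStart = Math.floor(Date.now() / 1000 / WINDOW_SECONDS) * WINDOW_SECONDS;
  const key = `ratelimit:${req.ip}:${windowStart}`;

  const count = await redis.incr(key);
  if (count === 1) await redis.expire(key, WINDOW_SECONDS); // counter expires with the window

  res.set({
    'X-RateLimit-Limit': String(LIMIT),
    'X-RateLimit-Remaining': String(Math.max(0, LIMIT - count)),
    'X-RateLimit-Reset': String(windowStart + WINDOW_SECONDS), // unix timestamp, as in the example
  });

  if (count > LIMIT) {
    res.status(429).json({ error: { code: 'RATE_LIMITED', message: 'Too many requests' } });
    return;
  }
  next();
}
```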
Load Testing
Test Scenarios
| Scenario | Target | Tool |
|---|---|---|
| Sustained load | Handle expected daily traffic for 1 hour | k6, Locust, or Artillery |
| Peak load | Handle 3x average load for 15 min | Same |
| Spike test | Handle sudden 10x traffic for 5 min | Same |
| Soak test | Handle average load for 24 hours (memory leaks) | Same |
| School onboarding surge | 500 students login simultaneously | Same |
| Media upload burst | 100 concurrent uploads | Same |
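A k6 sketch of the sustained-load scenario; the staging URL and endpoint are illustrative, and 50 virtual users with a one-second think time approximates the Launch-phase average request rate:

```ts
// k6 script: ramp to 50 VUs, hold for an hour, and fail the run if p95
// latency or the error rate breaches the targets defined above.
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '5m', target: 50 },  // ramp up
    { duration: '60m', target: 50 }, // sustain expected daily traffic
    { duration: '5m', target: 0 },   // ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<500'], // complex-query p95 target
    http_req_failed: ['rate<0.01'],   // < 1% errors
  },
};

export default function () {
  const res = http.get('https://staging.docurious.example/api/v1/challenges?page=1');
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(1); // think time between requests per virtual user
}
```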
Test Frequency
- Before major releases: full suite
- Weekly: sustained load test (automated)
- Monthly: peak and spike tests
- Quarterly: soak test
Test Environment
Staging environment matching production topology, synthetic data at 2x expected production volume, production-like database size. Results are compared against the response time and throughput targets defined above.
STATUS: PLANNED
No load testing infrastructure or test scripts exist.
Design Decisions
Why cursor-based pagination for feeds? Offset-based pagination breaks when new content is inserted — users see duplicates or miss items. Cursor-based pagination provides stable, consistent results for infinite scroll feeds. Offset-based is kept for admin tables where total count and page navigation are needed.
Why no direct video hosting? Video transcoding, storage, and delivery are expensive and complex. YouTube and Vimeo handle it well. By embedding rather than hosting, DoCurious avoids significant infrastructure cost and complexity while still allowing video evidence in Track Records. Direct upload is a future enhancement.
Why Redis for sessions instead of JWTs? Server-side sessions in Redis allow immediate invalidation on logout or password change. JWTs cannot be revoked before expiry without a blocklist — which is effectively reimplementing server-side sessions. Redis sessions are simpler and more secure for the security model DoCurious needs.
Why PostgreSQL FTS before Elasticsearch? PostgreSQL's built-in full-text search is sufficient for the Launch and early Growth phases. It avoids the operational overhead of a separate search cluster. The migration to Elasticsearch or Meilisearch is planned for when FTS query performance degrades at scale.
Why eager-load auth pages but lazy-load everything else? Login, registration, and password reset are the first screens users see. Lazy-loading them would add a visible loading spinner to the very first interaction. All other pages benefit from code splitting because users only visit a subset of the 55+ pages in any session.
Technical Implementation
Current Frontend Performance Stack
| Layer | Implementation | File |
|---|---|---|
| Route code splitting | React.lazy + Suspense for 50+ page components | src/routes/index.tsx |
| Loading fallbacks | PageSkeleton component during lazy load | src/components/common/PageSkeleton.tsx |
| Error boundaries | ErrorBoundary wrapping route segments | src/components/common/ErrorBoundary.tsx |
| Build tooling | Vite 7 with tree shaking, minification, content hashing | vite.config.ts |
| CSS optimization | Tailwind CSS 4 with automatic purging of unused classes | @tailwindcss/vite plugin |
| Component memoization | useMemo / useCallback / memo used across 52 files | Various components |
| Dev performance inspection | Debug Panel with Network Log tab (dev-only) | src/components/debug/ |
| Type checking | TypeScript strict mode for compile-time safety | tsconfig.json |
| Testing | Vitest with jsdom for unit/component tests | vitest config |
What Needs Building
| Feature | Priority | Dependencies |
|---|---|---|
| Backend API with response time enforcement | Critical | Express/Fastify server, PostgreSQL |
| Redis caching layer with TTL strategy | Critical | Redis instance, backend API |
| CDN configuration for static + media assets | Critical | Cloud provider (AWS/Cloudflare) |
| Image processing pipeline (variant generation) | High | Sharp/Pillow, object storage, background workers |
| Connection pooling (PgBouncer) | High | PostgreSQL deployment |
| Database indexing per spec strategy | High | Backend ORM / migration tooling |
| Background job queue (BullMQ, SQS) | High | Redis or SQS, worker processes |
| APM + error tracking (Sentry, Datadog) | High | Production deployment |
| Rate limiting middleware | Medium | Redis, API gateway |
| Load testing scripts and CI pipeline | Medium | k6 or Locust, staging environment |
| SSR / static generation for SEO pages | Medium | Framework migration or custom SSR |
| Service Worker for offline assets | Low | Frontend PWA setup |
| Elasticsearch migration for search | Low | Scale-phase trigger |
| Data partitioning for large tables | Low | 50M+ row threshold |
Related Features
- Explore & Discovery — Search and recommendation performance targets (< 100ms search, < 150ms recommendations) drive database indexing and caching strategy
- Challenges — Challenge browse, detail, and gallery pages are the highest-traffic read paths that caching and CDN must optimize
- Track Records — TR media uploads drive the image processing pipeline, storage architecture, and upload limit enforcement
- Notifications — Notification delivery infrastructure (50K push/hour, 10K email/hour) requires background job queues and real-time cache invalidation
- Gamification — Leaderboard recomputation, XP aggregation, and streak checks are scheduled background jobs with materialized view dependencies
- School Administration — School onboarding surges (500 simultaneous student logins) are a key load testing scenario; school health scores are daily scheduled recomputes
- Vendor — Vendor analytics dashboards rely on materialized views refreshed every 15 minutes
- Accounts — Authentication, session management, and rate limiting are performance-critical paths that always hit the primary database