
Performance & Scalability

Spec Source: Document 13 — Performance & Scalability | Last Updated: February 2026

Overview

This document defines the performance targets, caching strategy, database optimization approach, media handling pipeline, monitoring architecture, and scalability plan for the DoCurious platform. These are engineering benchmarks — they inform infrastructure decisions, set contractual SLAs, and provide the goalposts that every backend and frontend change is measured against.

DoCurious serves a school-heavy user base with predictable traffic patterns: weekday mornings ramp up with school logins, afternoons peak with post-school personal usage, and September/January onboarding windows create seasonal spikes. The architecture is designed to handle these patterns efficiently at every growth phase from 500-user beta through 500K+ mature platform.

STATUS: PARTIAL

The frontend implements route-level code splitting via React.lazy (50+ lazy-loaded page components), Suspense-based loading states with PageSkeleton, tree shaking through Vite 7's production build, and Tailwind CSS purging. No backend, CDN, caching layer, monitoring infrastructure, or load testing pipeline exists yet — the platform runs entirely against an in-memory mock API. Performance targets from the spec are documented here as the contracts the future backend must meet.


How It Works

Load Projections

DoCurious plans for five growth phases. Each phase determines infrastructure sizing, auto-scaling thresholds, and cost projections.

Phase | Timeline | Total Users | DAU | Schools | Vendors | Avg req/s | Peak req/s
Beta | Months 1--3 | 500 | 100 | 5 | 10 | 5 | 20
Launch | Months 3--6 | 5,000 | 1,000 | 25 | 50 | 50 | 200
Growth | Months 6--12 | 25,000 | 5,000 | 100 | 200 | 250 | 1,000
Scale | Year 2 | 100,000 | 20,000 | 500 | 500 | 1,000 | 5,000
Mature | Year 3+ | 500,000+ | 100,000+ | 2,000+ | 2,000+ | 5,000 | 25,000

Traffic follows a school-driven daily pattern: near-zero overnight, morning ramp at 7--9 AM (school logins, teacher assignments), midday plateau at 9 AM--3 PM (steady school usage), afternoon peak at 3--6 PM (post-school personal usage, TR uploads), moderate evening at 6--10 PM (parent dashboards), then wind-down. Weekdays carry higher traffic than weekends. September and January see onboarding spikes from new school-year starts.

STATUS: PLANNED

Load projections are spec targets. No production traffic data exists yet.

Response Time Targets

Every API endpoint and page transition has latency contracts at three percentiles. These are the numbers the backend must meet and the monitoring system must enforce.

Operation | p50 Target | p95 Target | p99 Target
Page load (initial) | < 1.5s | < 3.0s | < 5.0s
Page load (subsequent / SPA navigation) | < 300ms | < 800ms | < 1.5s
API response (simple read) | < 50ms | < 150ms | < 300ms
API response (complex query) | < 150ms | < 500ms | < 1.0s
Search query | < 100ms | < 200ms | < 500ms
Recommendation feed | < 150ms | < 300ms | < 600ms
Image upload (acknowledgment) | < 200ms | < 500ms | < 1.0s
Image processing (async) | < 5s | < 15s | < 30s
Notification delivery (in-app) | < 500ms | < 1.0s | < 2.0s
Email delivery (to ESP) | < 2s | < 5s | < 10s

STATUS: PLANNED

These targets are engineering contracts. No backend exists to measure against them.

Web Vitals Targets

Frontend performance is measured against Core Web Vitals. These targets align with Google's "Good" thresholds and ensure the app feels fast on school Chromebooks and budget phones alike.

Metric | Target | What It Measures
LCP (Largest Contentful Paint) | < 2.5s | When main content loads
FID (First Input Delay) | < 100ms | Responsiveness to first interaction
CLS (Cumulative Layout Shift) | < 0.1 | Visual stability — no layout jank
TTFB (Time to First Byte) | < 600ms | Server response time
FCP (First Contentful Paint) | < 1.8s | When first content appears on screen

Availability Targets

Metric | Target
Uptime | 99.9% (8.7 hours downtime/year max)
Planned maintenance window | < 2 hours/month, scheduled outside school hours
Mean time to detect (MTTD) | < 5 minutes
Mean time to recover (MTTR) | < 1 hour (critical), < 4 hours (non-critical)

Throughput Targets

Resource | Target
Concurrent users (per server) | 500
WebSocket connections (if used) | 10,000 per node
File uploads (concurrent) | 100 per server
Background job processing | 1,000 jobs/minute
Email sends | 10,000/hour
Push notifications | 50,000/hour

Caching Strategy

Cache Layers

The caching architecture has four tiers. A request only reaches the database after missing at every cache layer above it.

[Browser Cache] → [CDN] → [Application Cache (Redis)] → [Database]

Browser cache. Static assets (JS, CSS, images) get Cache-Control: max-age=31536000 (one year) with content-hash filenames — Vite already generates hashed filenames in production builds. API responses use no-cache for dynamic data and short TTLs for semi-static content (categories, featured challenges). A Service Worker caches offline-capable assets.
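
As an illustration of this header policy, here is a minimal sketch assuming an Express backend (the spec names Express/Fastify as candidates); route paths and the 5-minute semi-static TTL are illustrative values, not decisions from the spec.

typescript
import express from "express";

const app = express();
const ONE_YEAR_SECONDS = 60 * 60 * 24 * 365;

// Hashed static assets: safe to cache "forever" because the filename changes on every build.
app.use(
  "/assets",
  express.static("dist/assets", { immutable: true, maxAge: ONE_YEAR_SECONDS * 1000 })
);

// Dynamic API responses: always revalidate.
app.use("/api", (_req, res, next) => {
  res.set("Cache-Control", "no-cache");
  next();
});

// Semi-static content gets a short, explicit TTL (example: 5 minutes).
app.get("/api/categories", (_req, res) => {
  res.set("Cache-Control", "public, max-age=300");
  res.json({ data: [] /* category list */ });
});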

CDN. All static assets are served via CDN (CloudFront, Cloudflare, or equivalent). User-uploaded media is served via CDN with signed URLs for access control. Geographic distribution reduces latency for users across different school districts. Cache invalidation uses versioned filenames for assets and explicit purge for media.

Application cache (Redis). Hot data lives in Redis with TTL-based expiration. This layer absorbs the vast majority of read traffic.

Data | TTL | Invalidation Trigger
User session | 30 days | Logout, password change
User profile | 15 minutes | Profile update
Challenge detail | 1 hour | Challenge edit
Challenge quality scores | 24 hours | Daily recompute
Recommendation feeds | 1 hour | Recompute
Popular/trending feeds | 6 hours | Daily recompute
Category list | 24 hours | Category change
Search autocomplete | 6 hours | Index update
Notification count (unread) | Real-time | New notification / read
Feature flags | 5 minutes | Flag change
Rate limit counters | Per window | Automatic expiry
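
A minimal cache-aside sketch for this layer, assuming ioredis and JSON-serializable values; the key names and the loader in the usage comment are illustrative only.

typescript
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL);

// Cache-aside read: return the cached value if present, otherwise load from the
// database, store it with the TTL from the table above, and return it.
async function cached<T>(key: string, ttlSeconds: number, load: () => Promise<T>): Promise<T> {
  const hit = await redis.get(key);
  if (hit !== null) return JSON.parse(hit) as T;

  const value = await load();
  await redis.set(key, JSON.stringify(value), "EX", ttlSeconds);
  return value;
}

// Write-through invalidation: drop the cached entry when the source row changes.
async function invalidate(key: string): Promise<void> {
  await redis.del(key);
}

// Usage (hypothetical loader): challenge detail cached for 1 hour.
// const challenge = await cached(`challenge:${id}`, 3600, () => db.loadChallenge(id));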

Database query cache. Prepared statement caching, connection pooling via PgBouncer, and materialized views for expensive aggregations. Analytics views refresh every 15 minutes; leaderboards refresh every hour.

STATUS: PLANNED

No caching infrastructure exists. The frontend runs against in-memory mock data. Vite's content-hashed filenames are the only cache-related behavior currently in place.

Cache Warming

On deployment or cache flush, the system pre-warms critical data to avoid cold-cache latency spikes:

  1. Pre-warm category list
  2. Pre-warm featured content
  3. Pre-warm popular/trending feeds
  4. Pre-warm top 1,000 challenge details
  5. Stagger warming requests to avoid thundering herd
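
A sketch of that warming sequence, reusing the cached() helper from the previous sketch; the WarmupSource interface is a hypothetical data-access placeholder, and the 50 ms stagger is an illustrative value.

typescript
// Hypothetical data-access interface used by the warming routine.
interface WarmupSource {
  listCategories(): Promise<unknown>;
  listFeaturedChallenges(): Promise<unknown>;
  computePopularFeed(): Promise<unknown>;
  loadChallenge(id: string): Promise<unknown>;
  topChallengeIds(limit: number): Promise<string[]>;
}

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function warmCache(db: WarmupSource): Promise<void> {
  await cached("categories", 86_400, () => db.listCategories());       // 24h TTL
  await cached("featured", 3_600, () => db.listFeaturedChallenges());  // 1h TTL
  await cached("feed:popular", 21_600, () => db.computePopularFeed()); // 6h TTL

  // Top 1,000 challenge details, staggered so the database is not hit by a thundering herd.
  for (const id of await db.topChallengeIds(1000)) {
    await cached(`challenge:${id}`, 3_600, () => db.loadChallenge(id));
    await sleep(50);
  }
}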

Cache Invalidation Patterns

Pattern | When Used | Example
Write-through | Update cache on write | User profile, challenge detail
TTL-based | Let cache expire naturally | Feeds, scores
Event-driven | Invalidate on event | Notification count on new notification
Never cache | Sensitive data | Authentication tokens, PII lookups

Database Optimization

Database Selection

Primary: PostgreSQL. ACID compliance for transactional data, full-text search for initial search implementation, JSON/JSONB support for flexible schemas, row-level security for data isolation, and mature replication and backup tooling. The local development database (docurious_prod) runs PostgreSQL 17.

Cache/Queue: Redis. Session storage, application cache, rate limiting, job queues, and real-time notification counts.

Future (at scale): Elasticsearch or Meilisearch for search when PostgreSQL FTS is insufficient, a data warehouse for analytics separate from the transactional DB, and read replicas for query distribution.

Indexing Strategy

Every table gets primary key, created_at, and foreign key indexes. Critical query indexes are designed around the actual access patterns:

Table | Index | Supports
challenges | (status, category_id, created_at) | Browse by category
challenges | (vendor_id, status) | Vendor dashboard
challenges | GIN index on search vector | Full-text search
track_records | (user_id, status, created_at DESC) | My Track Records
track_records | (challenge_id, status) | Challenge TR gallery
track_records | (status, created_at) | Verification queue
notifications | (user_id, read, created_at DESC) | Notification center
assignments | (class_id, due_date, status) | Student assignments
users | (email) UNIQUE | Login lookup
community_members | (community_id, user_id) UNIQUE | Membership check
events | (challenge_id, start_time) | Event calendar

Partial indexes improve performance for common filtered queries: track_records WHERE status = 'pending_review' (verification queue), challenges WHERE status = 'active' (exclude archived from browse), notifications WHERE read = false (unread count).
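
A sketch of how a few of these composite, GIN, and partial indexes might be created in a migration step using raw SQL through node-postgres; the search_vector column name and index names are assumptions for illustration.

typescript
import { Client } from "pg";

// Illustrative migration step covering composite, GIN, and partial indexes from the strategy above.
async function createPerformanceIndexes(client: Client): Promise<void> {
  await client.query(`
    CREATE INDEX IF NOT EXISTS idx_challenges_browse
      ON challenges (status, category_id, created_at);

    CREATE INDEX IF NOT EXISTS idx_challenges_search
      ON challenges USING gin (search_vector);

    CREATE INDEX IF NOT EXISTS idx_track_records_mine
      ON track_records (user_id, status, created_at DESC);

    -- Partial indexes keep hot filtered queries small.
    CREATE INDEX IF NOT EXISTS idx_tr_pending_review
      ON track_records (created_at) WHERE status = 'pending_review';

    CREATE INDEX IF NOT EXISTS idx_notifications_unread
      ON notifications (user_id) WHERE read = false;
  `);
}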

Query Optimization Guidelines

  • No N+1 queries — always use eager loading or joins for related data
  • Cursor-based pagination for infinite scroll, offset-based for admin tables
  • Maximum query time — 500ms hard limit, alert on > 200ms
  • EXPLAIN ANALYZE on all new queries touching large tables
  • No SELECT * — always specify columns
  • Limit result sets — maximum 100 per page, 50 default

Connection Management

Setting | Value
Connection pooling | PgBouncer (recommended)
Pool size | 20--50 connections per instance
Connection timeout | 5 seconds
Query timeout | 30 seconds (hard kill)
Idle connection cleanup | 10 minutes

Data Partitioning (At Scale)

When tables exceed approximately 50 million rows, partition by month on created_at: audit logs, notifications, analytics events, search analytics. Track records may be partitioned by year if volume warrants it.


Image & Media Optimization

Image Pipeline

[User Upload] → [Validation] → [Virus Scan] → [EXIF Strip]
             → [Store Original] → [Generate Variants] → [CDN]

Five image variants are generated asynchronously for every upload:

Variant | Dimensions | Quality | Use
Thumbnail | 150 x 150 (crop) | 80% | Grid views, lists
Small | 400px wide | 85% | Card previews
Medium | 800px wide | 85% | Detail views
Large | 1600px wide | 90% | Full-screen view
Original | As uploaded | 100% | Download, export

Processing uses WebP format with JPEG fallback. All images below the fold use lazy loading. Responsive images are served via srcset.
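
A sketch of the asynchronous variant step, assuming Sharp (one of the libraries listed under "What Needs Building"); dimensions and quality values come from the table above, and MediaStore is a hypothetical object-storage client.

typescript
import sharp from "sharp";

interface MediaStore {
  put(key: string, body: Buffer): Promise<void>; // hypothetical object-storage client
}

const VARIANTS = [
  { name: "thumbnail", width: 150, height: 150, quality: 80 }, // cropped square
  { name: "small", width: 400, quality: 85 },
  { name: "medium", width: 800, quality: 85 },
  { name: "large", width: 1600, quality: 90 },
] as const;

async function generateVariants(original: Buffer, fileId: string, storage: MediaStore): Promise<void> {
  for (const v of VARIANTS) {
    const base = sharp(original).rotate(); // apply EXIF orientation before metadata is stripped
    const resized =
      "height" in v
        ? base.resize(v.width, v.height, { fit: "cover" })          // thumbnail: crop to square
        : base.resize({ width: v.width, withoutEnlargement: true }); // never upscale small originals

    const webp = await resized.webp({ quality: v.quality }).toBuffer();
    await storage.put(`track-record-media/${fileId}/${v.name}.webp`, webp);
  }
}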

Upload Limits

User Type | Max File Size | Max Files per TR | Daily Upload Limit
General user | 50 MB | 20 | 500 MB
Student (under 13) | 25 MB | 10 | 200 MB
Teacher | 50 MB | 20 | 500 MB
Vendor | 100 MB | 50 | 2 GB
Admin | 100 MB | Unlimited | Unlimited

Video Handling

Videos are not hosted on DoCurious. Users provide links (YouTube, Vimeo, etc.) and the platform embeds via oEmbed or iframe. Thumbnails are extracted from the embed provider. Direct video upload with transcoding is a planned enhancement.

Storage Architecture

Cloud object storage (S3 or equivalent) with bucket structure: /{environment}/{content-type}/{user-id}/{file-id}. Content types: avatars, challenge-covers, track-record-media, exports. Lifecycle policies delete orphaned files after 30 days. Cross-region replication provides disaster recovery.

STATUS: PLANNED

The frontend MediaUpload component handles client-side file selection and preview. No server-side image processing, CDN delivery, or storage backend exists.


Frontend Performance Budget

Bundle Strategy

STATUS: BUILT

The frontend implements route-level code splitting with React.lazy for 50+ page components, Suspense boundaries with PageSkeleton fallbacks, and Vite 7 production builds with tree shaking and Tailwind CSS purging. Auth pages (Login, Register, ForgotPassword, ResetPassword, VerifyEmail) are eagerly loaded for fast initial render; everything else is lazy.

Code splitting by route ensures each page loads only its own dependencies. The router at src/routes/index.tsx lazy-loads every page component except the five auth pages, which are kept eager for fast initial login.
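
The pattern in src/routes/index.tsx looks roughly like the simplified sketch below; the specific routes and import paths shown here are illustrative, not the actual file contents.

typescript
import { lazy, Suspense } from "react";
import { Routes, Route } from "react-router-dom";
import PageSkeleton from "../components/common/PageSkeleton";

// Auth pages are imported eagerly so the first screen renders without a lazy-load delay.
import Login from "../pages/Login";
import Register from "../pages/Register";

// Everything else is split into its own chunk and fetched on first navigation.
const Dashboard = lazy(() => import("../pages/Dashboard"));
const Explore = lazy(() => import("../pages/Explore"));

export default function AppRoutes() {
  return (
    <Suspense fallback={<PageSkeleton />}>
      <Routes>
        <Route path="/login" element={<Login />} />
        <Route path="/register" element={<Register />} />
        <Route path="/dashboard" element={<Dashboard />} />
        <Route path="/explore" element={<Explore />} />
      </Routes>
    </Suspense>
  );
}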

Resource | Budget
HTML document | < 50 KB
CSS (total) | < 100 KB gzipped
JS (initial route) | < 200 KB gzipped
Images (above fold) | < 500 KB total
Web fonts | < 100 KB total
Total page weight (initial) | < 1 MB
HTTP requests (initial) | < 30

Target total JS across all routes: < 1 MB gzipped.

Asset Optimization

  • Images: WebP with JPEG fallback, responsive srcset
  • Fonts: System font stack preferred; custom fonts use font-display: swap
  • CSS: Critical CSS inlined, remainder async loaded
  • Icons: SVG sprite or icon font (no individual image requests)
  • Preload: Critical resources via <link rel="preload">
  • Prefetch: Likely next navigation via <link rel="prefetch">

Rendering Strategy

Page Type | Strategy | Examples
Landing, About, Privacy, Terms | Static generation | Landing.tsx, PrivacyPolicy.tsx, TermsOfService.tsx
Challenge browse, category pages | Incremental Static Regeneration | Explore.tsx, ExploreCategoryView.tsx
Dashboard, notifications, admin | Client-side rendering | Dashboard.tsx, Notifications.tsx
All pages (initial load) | Server-side rendering for SEO | Full SSR pass

STATUS: PARTIAL

The current build is a pure client-side SPA (Vite + React). SSR, static generation, and ISR are planned architectural changes that will require a framework migration (e.g., to Next.js or Remix) or a custom SSR layer.


Background Job Architecture

Job Categories

Category | Queue | Priority | Concurrency
Email delivery | email | Medium | 10 workers
Push notification | push | Medium | 5 workers
Image processing | media | Low | 5 workers
Search index update | search | Low | 3 workers
Recommendation recompute | recommendations | Low | 2 workers
Analytics aggregation | analytics | Low | 2 workers
Data export generation | exports | Low | 2 workers
Scheduled notifications | scheduler | Medium | 3 workers
Cleanup (expired tokens, etc.) | maintenance | Low | 1 worker

Processing Requirements

  • At-least-once delivery with idempotent job design
  • Dead letter queue for failed jobs after 3 retries
  • Exponential backoff on retry: 1s, 10s, 60s
  • Job timeout: 5 minutes default, 30 minutes for exports and analytics
  • Alert on: queue depth > 1,000, failure rate > 5%, processing latency > 5 minutes
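
A sketch of one queue wired to these requirements, assuming BullMQ (named as a candidate under "What Needs Building"). Note that BullMQ's built-in exponential backoff doubles the delay on each retry, so matching the exact 1s/10s/60s schedule would need a custom backoff strategy; the ESP call is a hypothetical placeholder.

typescript
import { Queue, Worker } from "bullmq";

const connection = { host: "localhost", port: 6379 };

// Producer side: enqueue with the retry policy. Failed jobs keep their data so a
// dead-letter consumer can inspect them after the final attempt.
const emailQueue = new Queue("email", { connection });

export async function enqueueEmail(to: string, template: string): Promise<void> {
  await emailQueue.add(
    "send",
    { to, template },
    { attempts: 3, backoff: { type: "exponential", delay: 1000 }, removeOnFail: false }
  );
}

// Consumer side: 10 concurrent email workers, idempotent by design (a dedupe key
// should make at-least-once re-delivery safe).
const emailWorker = new Worker(
  "email",
  async (job) => {
    // await esp.send(job.data.to, job.data.template); // hypothetical ESP client
  },
  { connection, concurrency: 10 }
);

emailWorker.on("failed", (job, err) => {
  // After the 3rd attempt the job remains in the failed set, which acts as the dead letter queue.
  console.error(`email job ${job?.id} failed:`, err.message);
});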

Scheduled Jobs

Job | Schedule | Purpose
Daily analytics aggregation | 2:00 AM UTC | Compute daily metrics
Challenge quality score recompute | 3:00 AM UTC | Update quality/trending scores
Recommendation feed generation | Every hour | Refresh user feeds
Popular/trending feed update | 4:00 AM UTC | Update global feeds
Digest email generation | 6:00 AM per timezone | Weekly/daily digests
Streak check | 12:01 AM per timezone | Evaluate streak status
Expired token cleanup | 1:00 AM UTC | Remove expired sessions, reset tokens
Orphaned file cleanup | 5:00 AM UTC (weekly) | Remove unattached uploads
Backup verification | 6:00 AM UTC (weekly) | Verify backup integrity
Leaderboard recompute | Every hour | Update leaderboard rankings
School health score update | 4:00 AM UTC (daily) | Recompute school health
Data retention cleanup | 3:00 AM UTC (monthly) | Remove expired data per retention policy

STATUS: PLANNED

No background job infrastructure exists. All scheduled computations will need a queue system (SQS, BullMQ, or equivalent) and worker processes.


Monitoring & Alerting

Application Monitoring

Metrics to track per endpoint: request rate, response time (p50, p95, p99), error rate (4xx, 5xx), active users (real-time), database query time per query pattern, cache hit/miss ratio, and background job queue depth and latency.

Recommended tooling: APM via Datadog, New Relic, or open-source Grafana + Prometheus. Error tracking via Sentry. Log aggregation via ELK stack, Datadog Logs, or CloudWatch.
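
If the open-source route (Grafana + Prometheus) is chosen, per-endpoint latency percentiles could be collected with prom-client roughly as sketched below; the bucket boundaries mirror the response time targets and are otherwise arbitrary.

typescript
import client from "prom-client";
import type { Request, Response, NextFunction } from "express";

client.collectDefaultMetrics(); // process CPU, memory, event loop lag, etc.

const httpDuration = new client.Histogram({
  name: "http_request_duration_seconds",
  help: "Request latency by route and status code",
  labelNames: ["method", "route", "status"],
  buckets: [0.05, 0.15, 0.3, 0.5, 1, 2, 5], // aligned with the p50/p95/p99 targets
});

// Express-style middleware: observe duration for every request.
export function metricsMiddleware(req: Request, res: Response, next: NextFunction) {
  const end = httpDuration.startTimer();
  res.on("finish", () => {
    end({ method: req.method, route: req.route?.path ?? req.path, status: String(res.statusCode) });
  });
  next();
}

// Expose /metrics for Prometheus to scrape.
export async function metricsHandler(_req: Request, res: Response) {
  res.set("Content-Type", client.register.contentType);
  res.end(await client.register.metrics());
}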

Infrastructure Monitoring

Metric | Alert Threshold
CPU utilization | 80%
Memory utilization | 85%
Disk utilization | 80%
Database connections (active/idle/waiting) | 90% pool used
Redis memory usage | 80%
SSL certificate expiry | < 14 days

Alerting Rules

Alert | Condition | Severity | Channel
High error rate | 5xx rate > 1% for 5 min | Critical | Pager + Slack
Slow responses | p95 > 2s for 10 min | High | Slack
Database connection exhaustion | > 90% pool used | Critical | Pager + Slack
Disk space low | > 80% used | High | Slack
Job queue backing up | Depth > 1,000 for 15 min | High | Slack
Certificate expiring | < 14 days | Medium | Email
Memory pressure | > 90% for 5 min | High | Slack
Zero traffic | No requests for 5 min | Critical | Pager

Operations Dashboards

Three dashboards are specified:

Operations Dashboard: Request rate, error rate, response time (real-time), active users, system health (CPU, memory, disk), recent deployments marked on timeline.

Database Dashboard: Query rate and latency, connection pool status, slow query log, replication lag.

Background Jobs Dashboard: Queue depths by category, processing rate, failure rate, recent failures with details.

STATUS: PLANNED

No monitoring infrastructure exists. The frontend has a dev-only Debug Panel (Ctrl+Shift+D) with a Network Log tab that records mock API calls — this is a development tool, not production monitoring.


Scalability Architecture

Horizontal Scaling

Component | Scaling Approach | Trigger
Web/API servers | Auto-scale based on CPU / request rate | CPU > 70% or requests > 80% capacity
Background workers | Scale by queue depth | Queue depth > 500
Database (reads) | Add read replicas | Read latency > 100ms or CPU > 70%
Database (writes) | Vertical scale first, partition later | Write latency > 50ms
Redis | Cluster mode at scale | Memory > 80%
File storage | Managed service (auto-scales) | N/A
CDN | Managed service (auto-scales) | N/A

Architecture Scaling Tiers

Small (up to 5K users): Single application server (or 2 for redundancy), single PostgreSQL instance, single Redis instance, managed file storage (S3), CDN for static assets.

Medium (5K--50K users): Auto-scaling application servers (2--5 instances), PostgreSQL with read replica, Redis with persistence, dedicated background worker instances, Elasticsearch for search replacing PostgreSQL FTS.

Large (50K--500K users): 5--20 application server instances, PostgreSQL primary + multiple read replicas, Redis cluster, dedicated analytics data warehouse, dedicated search cluster, microservice extraction for high-traffic paths (notifications, recommendations).

Database Read/Write Split

At the Medium tier and above: write operations go to the primary database, read operations go to read replica(s). Application-level routing handles the split. Replication lag is monitored with alerts if it exceeds 1 second. Critical reads (authentication, authorization) always hit the primary.
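
A sketch of application-level routing with node-postgres, assuming separate primary and replica connection strings; the criticalRead flag reflects the rule that authentication and authorization reads always hit the primary.

typescript
import { Pool } from "pg";

const primary = new Pool({ connectionString: process.env.DATABASE_URL });
const replica = new Pool({ connectionString: process.env.DATABASE_REPLICA_URL });

type QueryOptions = { criticalRead?: boolean };

// Writes and critical reads (authentication, authorization) go to the primary;
// everything else is served by the replica to spread read load.
export function readQuery(text: string, params: unknown[], opts: QueryOptions = {}) {
  const pool = opts.criticalRead ? primary : replica;
  return pool.query(text, params);
}

export function writeQuery(text: string, params: unknown[]) {
  return primary.query(text, params);
}

// Usage:
// await writeQuery("UPDATE users SET name = $1 WHERE id = $2", [name, id]);
// await readQuery("SELECT user_id FROM sessions WHERE token = $1", [token], { criticalRead: true });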


API Design Standards

Response Format

All API responses follow a consistent envelope:

Success:

json
{
  "data": { ... },
  "meta": {
    "page": 1,
    "per_page": 20,
    "total": 150,
    "total_pages": 8
  }
}

Error:

json
{
  "error": {
    "code": "VALIDATION_ERROR",
    "message": "Human-readable description",
    "details": [
      { "field": "email", "message": "Invalid email format" }
    ]
  }
}
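
The envelope maps naturally onto shared TypeScript types that both the backend and the frontend API client could consume; a sketch (type names are illustrative):

typescript
// Shared response envelope types (illustrative names).
export interface PaginationMeta {
  page: number;
  per_page: number;
  total: number;
  total_pages: number;
}

export interface ApiSuccess<T> {
  data: T;
  meta?: PaginationMeta; // present on paginated list responses
}

export interface FieldError {
  field: string;
  message: string;
}

export interface ApiError {
  error: {
    code: string;           // e.g. "VALIDATION_ERROR"
    message: string;        // human-readable description
    details?: FieldError[]; // per-field validation failures
  };
}

export type ApiResponse<T> = ApiSuccess<T> | ApiError;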

Pagination

Cursor-based (for feeds, infinite scroll): ?cursor={opaque_token}&limit=20. Response includes next_cursor (null if no more results). Default limit 20, max 100.

Offset-based (for admin tables, finite lists): ?page=1&per_page=20. Response includes total count and total pages. Default per_page 20, max 100.
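
A sketch of cursor handling for a created_at feed, assuming an opaque base64url cursor that encodes the last row's timestamp and id (the id tiebreaker keeps ordering stable when timestamps collide); table and column names follow the indexing section, the rest is illustrative.

typescript
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Opaque cursor = base64url("<ISO timestamp>|<id>") of the last row on the previous page.
function encodeCursor(createdAt: string, id: string): string {
  return Buffer.from(`${createdAt}|${id}`).toString("base64url");
}

function decodeCursor(cursor: string): { createdAt: string; id: string } {
  const [createdAt, id] = Buffer.from(cursor, "base64url").toString("utf8").split("|");
  return { createdAt, id };
}

// Keyset pagination: fetch one extra row to know whether a next_cursor exists.
async function challengeFeedPage(cursor: string | null, limit = 20) {
  const pageSize = Math.min(limit, 100);
  const base = `SELECT id, title, created_at FROM challenges WHERE status = 'active'`;

  let result;
  if (cursor) {
    const { createdAt, id } = decodeCursor(cursor);
    result = await pool.query(
      `${base} AND (created_at, id) < ($1, $2) ORDER BY created_at DESC, id DESC LIMIT $3`,
      [createdAt, id, pageSize + 1]
    );
  } else {
    result = await pool.query(`${base} ORDER BY created_at DESC, id DESC LIMIT $1`, [pageSize + 1]);
  }

  const page = result.rows.slice(0, pageSize);
  const hasMore = result.rows.length > pageSize;
  const last = page[page.length - 1];
  return {
    data: page,
    next_cursor: hasMore ? encodeCursor(new Date(last.created_at).toISOString(), last.id) : null,
  };
}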

Rate Limiting

Every response includes rate limit headers:

X-RateLimit-Limit: 100
X-RateLimit-Remaining: 95
X-RateLimit-Reset: 1704067200
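
A sketch of a fixed-window limiter that emits these headers, assuming ioredis and a 100-requests-per-hour window keyed by client IP; production may prefer keying by authenticated user and a sliding-window or token-bucket algorithm.

typescript
import type { Request, Response, NextFunction } from "express";
import Redis from "ioredis";

const redis = new Redis(process.env.REDIS_URL);
const LIMIT = 100;
const WINDOW_SECONDS = 3600;

export async function rateLimit(req: Request, res: Response, next: NextFunction) {
  const windowStart = Math.floor(Date.now() / 1000 / WINDOW_SECONDS) * WINDOW_SECONDS;
  const key = `ratelimit:${req.ip}:${windowStart}`; // in practice, key by authenticated user id when available

  const count = await redis.incr(key);
  if (count === 1) await redis.expire(key, WINDOW_SECONDS); // counter expires with its window

  res.set({
    "X-RateLimit-Limit": String(LIMIT),
    "X-RateLimit-Remaining": String(Math.max(0, LIMIT - count)),
    "X-RateLimit-Reset": String(windowStart + WINDOW_SECONDS),
  });

  if (count > LIMIT) {
    return res.status(429).json({ error: { code: "RATE_LIMITED", message: "Too many requests" } });
  }
  next();
}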

Load Testing

Test Scenarios

Scenario | Target | Tool
Sustained load | Handle expected daily traffic for 1 hour | k6, Locust, or Artillery
Peak load | Handle 3x average load for 15 min | Same
Spike test | Handle sudden 10x traffic for 5 min | Same
Soak test | Handle average load for 24 hours (memory leaks) | Same
School onboarding surge | 500 students log in simultaneously | Same
Media upload burst | 100 concurrent uploads | Same
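
A minimal k6 sketch combining the sustained-load and spike scenarios; stage durations, virtual-user counts, and the staging URL are illustrative, and the thresholds encode the p95 and error-rate targets from above.

typescript
import http from "k6/http";
import { check, sleep } from "k6";

export const options = {
  stages: [
    { duration: "5m", target: 250 },  // ramp to Growth-phase average load
    { duration: "1h", target: 250 },  // sustained load
    { duration: "2m", target: 2500 }, // spike to roughly 10x
    { duration: "5m", target: 2500 },
    { duration: "5m", target: 0 },    // ramp down
  ],
  thresholds: {
    http_req_duration: ["p(95)<500"], // complex-query p95 target, in ms
    http_req_failed: ["rate<0.01"],   // under 1% errors
  },
};

export default function () {
  const res = http.get("https://staging.docurious.example/api/challenges?limit=20"); // hypothetical staging URL
  check(res, { "status is 200": (r) => r.status === 200 });
  sleep(1);
}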

Test Frequency

  • Before major releases: full suite
  • Weekly: sustained load test (automated)
  • Monthly: peak and spike tests
  • Quarterly: soak test

Test Environment

Staging environment matching production topology, synthetic data at 2x expected production volume, production-like database size. Results are compared against the response time and throughput targets defined above.

STATUS: PLANNED

No load testing infrastructure or test scripts exist.


Design Decisions

Why cursor-based pagination for feeds? Offset-based pagination breaks when new content is inserted — users see duplicates or miss items. Cursor-based pagination provides stable, consistent results for infinite scroll feeds. Offset-based is kept for admin tables where total count and page navigation are needed.

Why no direct video hosting? Video transcoding, storage, and delivery are expensive and complex. YouTube and Vimeo handle it well. By embedding rather than hosting, DoCurious avoids significant infrastructure cost and complexity while still allowing video evidence in Track Records. Direct upload is a future enhancement.

Why Redis for sessions instead of JWTs? Server-side sessions in Redis allow immediate invalidation on logout or password change. JWTs cannot be revoked before expiry without a blocklist — which is effectively reimplementing server-side sessions. Redis sessions are simpler and more secure for the security model DoCurious needs.

Why PostgreSQL FTS before Elasticsearch? PostgreSQL's built-in full-text search is sufficient for the Launch and early Growth phases. It avoids the operational overhead of a separate search cluster. The migration to Elasticsearch or Meilisearch is planned for when FTS query performance degrades at scale.

Why eager-load auth pages but lazy-load everything else? Login, registration, and password reset are the first screens users see. Lazy-loading them would add a visible loading spinner to the very first interaction. All other pages benefit from code splitting because users only visit a subset of the 55+ pages in any session.


Technical Implementation

Current Frontend Performance Stack

Layer | Implementation | File
Route code splitting | React.lazy + Suspense for 50+ page components | src/routes/index.tsx
Loading fallbacks | PageSkeleton component during lazy load | src/components/common/PageSkeleton.tsx
Error boundaries | ErrorBoundary wrapping route segments | src/components/common/ErrorBoundary.tsx
Build tooling | Vite 7 with tree shaking, minification, content hashing | vite.config.ts
CSS optimization | Tailwind CSS 4 with automatic purging of unused classes | @tailwindcss/vite plugin
Component memoization | useMemo / useCallback / memo used across 52 files | Various components
Dev performance inspection | Debug Panel with Network Log tab (dev-only) | src/components/debug/
Type checking | TypeScript strict mode for compile-time safety | tsconfig.json
Testing | Vitest with jsdom for unit/component tests | vitest config
What Needs Building

Feature | Priority | Dependencies
Backend API with response time enforcement | Critical | Express/Fastify server, PostgreSQL
Redis caching layer with TTL strategy | Critical | Redis instance, backend API
CDN configuration for static + media assets | Critical | Cloud provider (AWS/Cloudflare)
Image processing pipeline (variant generation) | High | Sharp/Pillow, object storage, background workers
Connection pooling (PgBouncer) | High | PostgreSQL deployment
Database indexing per spec strategy | High | Backend ORM / migration tooling
Background job queue (BullMQ, SQS) | High | Redis or SQS, worker processes
APM + error tracking (Sentry, Datadog) | High | Production deployment
Rate limiting middleware | Medium | Redis, API gateway
Load testing scripts and CI pipeline | Medium | k6 or Locust, staging environment
SSR / static generation for SEO pages | Medium | Framework migration or custom SSR
Service Worker for offline assets | Low | Frontend PWA setup
Elasticsearch migration for search | Low | Scale-phase trigger
Data partitioning for large tables | Low | 50M+ row threshold

Related Feature Areas

  • Explore & Discovery — Search and recommendation performance targets (< 100ms search, < 150ms recommendations) drive database indexing and caching strategy
  • Challenges — Challenge browse, detail, and gallery pages are the highest-traffic read paths that caching and CDN must optimize
  • Track Records — TR media uploads drive the image processing pipeline, storage architecture, and upload limit enforcement
  • Notifications — Notification delivery infrastructure (50K push/hour, 10K email/hour) requires background job queues and real-time cache invalidation
  • Gamification — Leaderboard recomputation, XP aggregation, and streak checks are scheduled background jobs with materialized view dependencies
  • School Administration — School onboarding surges (500 simultaneous student logins) are a key load testing scenario; school health scores are daily scheduled recomputes
  • Vendor — Vendor analytics dashboards rely on materialized views refreshed every 15 minutes
  • Accounts — Authentication, session management, and rate limiting are performance-critical paths that always hit the primary database
