Design 3: Designing a Scalable File Upload System

In this design, we’ll explore how to build a robust file upload system that supports large CSV/Excel files, validates records against business rules, recovers gracefully from errors, and scales using event-driven and big data components.

Step 1: Define Functional Requirements

  • Upload large CSV/Excel files through a web UI
  • Validate each record with business rules
  • Store valid records in a database or data lake
  • Track and log failed records with error reasons
  • Allow users to download failed records
  • Support reprocessing or reupload of failed data
  • Provide real-time status updates for uploads

Step 2: Define Non-Functional Requirements

  • Scalability to handle millions of rows
  • Resilient and fault-tolerant processing
  • Secure file handling and access control
  • Low latency for upload acknowledgment
  • Monitoring and retry capabilities

Step 3: Define API Services

  • POST /upload – Accepts file and metadata
  • GET /upload-status/{uploadId} – Checks processing progress
  • GET /failed-records/{uploadId} – Downloads failed records
  • POST /reprocess – Accepts corrected records for retry (see the controller sketch below)
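
To make these contracts concrete, here is a minimal sketch of the endpoints as a Spring Boot controller. UploadService, UploadStatus, and ReprocessRequest are hypothetical placeholders for the real service layer and DTOs, not part of the design above.

// A minimal sketch of the endpoints above as a Spring Boot controller.
// UploadService, UploadStatus, and ReprocessRequest are hypothetical
// placeholders for the real service layer and DTOs.
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;
import org.springframework.web.multipart.MultipartFile;

import java.util.List;

@RestController
public class UploadController {

    private final UploadService uploadService; // hypothetical service layer

    public UploadController(UploadService uploadService) {
        this.uploadService = uploadService;
    }

    // POST /upload – store the file, enqueue processing, return an uploadId at once
    @PostMapping("/upload")
    public ResponseEntity<String> upload(@RequestParam("file") MultipartFile file,
                                         @RequestParam("source") String source) {
        String uploadId = uploadService.storeAndEnqueue(file, source);
        return ResponseEntity.accepted().body(uploadId); // 202: processing is async
    }

    // GET /upload-status/{uploadId} – current processing progress
    @GetMapping("/upload-status/{uploadId}")
    public ResponseEntity<UploadStatus> status(@PathVariable String uploadId) {
        return ResponseEntity.ok(uploadService.getStatus(uploadId));
    }

    // GET /failed-records/{uploadId} – CSV of failed rows with error reasons
    @GetMapping("/failed-records/{uploadId}")
    public ResponseEntity<byte[]> failedRecords(@PathVariable String uploadId) {
        return ResponseEntity.ok(uploadService.failedRecordsCsv(uploadId));
    }

    // POST /reprocess – corrected records submitted for another validation pass
    @PostMapping("/reprocess")
    public ResponseEntity<Void> reprocess(@RequestBody ReprocessRequest request) {
        uploadService.reprocess(request);
        return ResponseEntity.accepted().build();
    }
}

interface UploadService {
    String storeAndEnqueue(MultipartFile file, String source);
    UploadStatus getStatus(String uploadId);
    byte[] failedRecordsCsv(String uploadId);
    void reprocess(ReprocessRequest request);
}

record UploadStatus(String uploadId, String state, long processedRows, long failedRows) {}

record ReprocessRequest(String uploadId, List<String> correctedRows) {}

Returning 202 Accepted with an uploadId keeps upload acknowledgment fast; all heavy validation happens asynchronously, and the client polls /upload-status/{uploadId} for progress.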

Step 4: High-Level Architecture

  • Frontend App: Uploads file via UI, shows progress
  • Backend API (Java): Accepts the upload and stores the file in S3 (see the ingest sketch after this list)
  • Amazon S3: Stores raw files and failed records
  • Event Bus (Amazon SNS/SQS or EventBridge): Triggers async processing
  • Worker/Processor (AWS Lambda / ECS Fargate): Validates and ingests records
  • Amazon RDS / Redshift: Stores valid data for downstream use
  • Status Tracker (DynamoDB + CloudWatch): Tracks job state and record stats
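
A minimal sketch of the ingest path, assuming the AWS SDK for Java v2: the API durably stores the raw file in S3, emits an event for the workers, and returns an uploadId without waiting for processing. The bucket name and queue URL are illustrative assumptions.

// Sketch of the ingest path: persist the raw file to S3, then emit an event
// so workers pick it up asynchronously. Bucket and queue names are assumptions.
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;
import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.SendMessageRequest;

import java.nio.file.Path;
import java.util.UUID;

public class IngestService {
    private static final String RAW_BUCKET = "uploads-raw"; // assumed bucket name
    private static final String QUEUE_URL =
            "https://sqs.us-east-1.amazonaws.com/123456789012/upload-events"; // assumed queue

    private final S3Client s3 = S3Client.create();
    private final SqsClient sqs = SqsClient.create();

    public String ingest(Path file) {
        String uploadId = UUID.randomUUID().toString();
        String key = "raw/" + uploadId + ".csv";

        // 1. Durably store the raw file before acknowledging the upload
        s3.putObject(PutObjectRequest.builder()
                        .bucket(RAW_BUCKET)
                        .key(key)
                        .build(),
                RequestBody.fromFile(file));

        // 2. Publish an event; processing is fully decoupled from the API request
        sqs.sendMessage(SendMessageRequest.builder()
                .queueUrl(QUEUE_URL)
                .messageBody("{\"uploadId\":\"" + uploadId + "\",\"s3Key\":\"" + key + "\"}")
                .build());

        return uploadId; // returned to the client for status polling
    }
}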

Step 5: Key Architectural Decisions

  • Use S3 to decouple file ingestion and processing
  • Adopt event-driven flow using EventBridge or SQS for scalable async processing
  • Chunk large files for parallel processing and fault isolation
  • Store failed records separately in S3 with retry metadata
  • Design workers to be idempotent and support partial reprocessing (a validation worker is sketched below)
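
As a sketch of how a worker might validate one chunk and isolate failures, the snippet below assumes a simplified three-column record format (id,email,amount); real business rules, and an idempotency check such as a processed-chunk marker in DynamoDB, would replace these placeholders.

// Sketch of a worker that validates one chunk of rows and isolates failures.
// The record shape and rules are simplified assumptions; a real worker would
// also check a processed-chunk marker so retried messages stay idempotent.
import java.util.ArrayList;
import java.util.List;

public class ChunkProcessor {

    public record Result(List<String> valid, List<String> failed) {}

    public Result process(List<String> rows) {
        List<String> valid = new ArrayList<>();
        List<String> failed = new ArrayList<>();
        for (String row : rows) {
            String error = validate(row);
            if (error == null) {
                valid.add(row);
            } else {
                // Keep the original row plus the reason so users can fix and re-upload
                failed.add(row + ",ERROR=" + error);
            }
        }
        return new Result(valid, failed);
    }

    // Assumed business rule: rows are "id,email,amount" and amount must be numeric
    private String validate(String row) {
        String[] cols = row.split(",", -1);
        if (cols.length != 3) return "expected 3 columns";
        if (!cols[1].contains("@")) return "invalid email";
        try {
            Double.parseDouble(cols[2]);
        } catch (NumberFormatException e) {
            return "amount is not numeric";
        }
        return null;
    }
}

Because each chunk produces its own valid/failed partition, one bad chunk never blocks the rest of the file, and reprocessing can target only the failed rows.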

Step 6: Additional Considerations

  • Support for resumable or chunked uploads (see the multipart upload sketch below)
  • Role-based access control for file uploads
  • Client-side validations before file submission
  • Observability: Upload dashboards, error metrics
  • Potential future monetization of the record data processing pipeline
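
One way to support resumable, chunked uploads is S3 multipart upload, sketched below with the AWS SDK for Java v2. Each part can be retried independently, and a crashed client can resume by listing already-uploaded parts; note that every part except the last must be at least 5 MB.

// Sketch of chunked (resumable) uploads via S3 multipart upload.
// If a part fails, only that part is retried; bucket and key are assumptions.
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.*;

import java.util.ArrayList;
import java.util.List;

public class MultipartUploader {

    public void upload(S3Client s3, String bucket, String key, List<byte[]> chunks) {
        String uploadId = s3.createMultipartUpload(
                CreateMultipartUploadRequest.builder().bucket(bucket).key(key).build()
        ).uploadId();

        List<CompletedPart> parts = new ArrayList<>();
        for (int i = 0; i < chunks.size(); i++) {
            int partNumber = i + 1; // S3 part numbers are 1-based
            UploadPartResponse resp = s3.uploadPart(
                    UploadPartRequest.builder()
                            .bucket(bucket).key(key)
                            .uploadId(uploadId)
                            .partNumber(partNumber)
                            .build(),
                    RequestBody.fromBytes(chunks.get(i)));
            parts.add(CompletedPart.builder()
                    .partNumber(partNumber)
                    .eTag(resp.eTag())
                    .build());
        }

        // Complete only after all parts succeed; an interrupted client can
        // resume by listing already-uploaded parts and continuing from there
        s3.completeMultipartUpload(CompleteMultipartUploadRequest.builder()
                .bucket(bucket).key(key)
                .uploadId(uploadId)
                .multipartUpload(CompletedMultipartUpload.builder().parts(parts).build())
                .build());
    }
}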

Conclusion

A scalable file upload system requires asynchronous processing, robust validations, and strong observability. By leveraging event-driven architecture and AWS services like S3 (for file storage), Lambda or ECS (for processing), SNS/SQS (for messaging), and Redshift or RDS (for downstream storage), we can build a system that handles millions of records efficiently while ensuring a seamless user experience and reliable failure recovery.