Design 4: Designing a Search Autocomplete System
In this design, we’ll explore how to build a real-time search autocomplete system that provides instant suggestions as users type, similar to those on Google, Amazon, and YouTube. The goal is to achieve sub-50ms p95 latency, handle millions of queries per second, and deliver relevant, ranked results.
Step 1: Define Functional Requirements
- Provide per-keystroke suggestions as users type
- Support prefix search with typo tolerance (fuzzy matching)
- Rank results by popularity, relevance, or personalization
- Continuously update index with new data
- Ensure high availability across regions
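Prefix search is the core primitive behind per-keystroke suggestions. A minimal sketch (a toy in-memory trie in Python, standing in for a real OpenSearch completion index; all names here are illustrative) shows how lookup cost stays proportional to the prefix length, with top results precomputed at each node:

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.phrases = []  # best (popularity, phrase) completions under this prefix

class PrefixIndex:
    """Toy prefix index; production systems would use OpenSearch
    completion suggesters or edge n-grams instead."""
    def __init__(self, top_k=5):
        self.root = TrieNode()
        self.top_k = top_k

    def add(self, phrase, popularity):
        node = self.root
        for ch in phrase:
            node = node.children.setdefault(ch, TrieNode())
            node.phrases.append((popularity, phrase))
            node.phrases.sort(reverse=True)          # keep best-first
            node.phrases = node.phrases[:self.top_k] # cap per-node list

    def suggest(self, prefix):
        node = self.root
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        return [phrase for _, phrase in node.phrases]

index = PrefixIndex()
for phrase, popularity in [("iphone", 100), ("ipad", 80), ("ipod", 40)]:
    index.add(phrase, popularity)
print(index.suggest("ip"))  # ['iphone', 'ipad', 'ipod']
```

Storing the top-k completions at every node trades memory for query speed, which is the same trade-off the caching layer below makes at system scale.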
Step 2: Define Non-Functional Requirements
- Latency: sub-50ms response time for p95 queries
- Scalability to millions of queries per second
- Fault-tolerant and highly available design
- Eventual consistency acceptable for data ingestion
- Low infra cost via caching hot queries
Step 3: Define API Services
- GET /autocomplete?q={query} – Returns ranked suggestions
- POST /index – Ingests new data into the search index
- GET /health – Health check for the autocomplete service
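A hypothetical handler for the GET /autocomplete endpoint might validate the query and delegate to a lookup function. This is a framework-agnostic sketch; `autocomplete_handler` and `suggest_fn` are illustrative names, not part of any real API:

```python
import json

def autocomplete_handler(query_params, suggest_fn):
    """Sketch of GET /autocomplete?q={query}: normalize the input,
    reject empty queries, and return ranked suggestions as JSON.
    `suggest_fn` stands in for the cache/index lookup."""
    q = (query_params.get("q") or "").strip().lower()
    if not q:
        return 400, json.dumps({"error": "missing q parameter"})
    suggestions = suggest_fn(q)
    return 200, json.dumps({"query": q, "suggestions": suggestions})

# Usage with a stubbed lookup function
status, body = autocomplete_handler({"q": "Iph"}, lambda p: ["iphone", "ipad"])
```

Normalizing the query (trim, lowercase) before the lookup also improves the cache hit ratio, since "Iph" and "iph" share one cache entry.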
Step 4: High-Level Architecture
- Frontend App: Captures keystrokes, applies debouncing, calls API
- Backend API (Java / Node): Handles autocomplete requests
- Amazon API Gateway + ALB: Entry point for queries
- In-Memory Cache (Redis / ElastiCache): Stores hot queries for sub-ms lookup
- Search Engine (Elasticsearch / OpenSearch): Prefix-based text index
- Ranking Engine: Orders results using popularity and personalization signals
- Data Pipeline (Kinesis / Kafka + Lambda): Streams logs and updates index
- Amazon S3 + Redshift: Stores query logs and training data for ML ranking
- Monitoring (CloudWatch + X-Ray): Tracks latency and errors
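The frontend debouncing mentioned above can be sketched as follows. Python stands in for client-side JavaScript here, and timestamps are passed explicitly to keep the example deterministic; the quiet window is the same 200 ms used in the query flow below:

```python
import time

class Debouncer:
    """Toy debouncer: a request fires only after a quiet window with no
    new keystrokes, cutting per-keystroke traffic to the backend."""
    def __init__(self, quiet_ms=200):
        self.quiet = quiet_ms / 1000.0
        self.last_keystroke = None

    def keystroke(self, now=None):
        # Record the latest keystroke time; each keystroke resets the window.
        self.last_keystroke = time.monotonic() if now is None else now

    def ready(self, now=None):
        # Fire only once the quiet window has elapsed since the last keystroke.
        now = time.monotonic() if now is None else now
        return (self.last_keystroke is not None
                and now - self.last_keystroke >= self.quiet)

d = Debouncer(quiet_ms=200)
d.keystroke(now=0.00)  # user types "i"
d.keystroke(now=0.10)  # user types "p" — window resets
```

A user typing at normal speed resets the window on every keystroke, so only the final prefix triggers a request.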
Step 5: Query Flow Example
- User types "iph" into the search bar.
- Frontend waits 200ms (debounce) before sending the request.
- Request hits API Gateway → Autocomplete Service.
- Service checks the Redis cache for the prefix "iph".
- If cached, results are returned instantly (e.g., iPhone, iPad).
- If not cached, the service queries the OpenSearch/Elasticsearch prefix index.
- Results pass through the Ranking Engine (sorted by popularity/personalization).
- Results are cached in Redis for subsequent requests.
- Suggestions are returned to the frontend in under 50ms.
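This flow is the classic cache-aside pattern. A minimal sketch, using a plain dict and stubbed lookup/ranking functions where a real deployment would use a Redis client with TTLs, an OpenSearch query, and the ranking engine:

```python
def get_suggestions(prefix, cache, search_index, ranker):
    """Cache-aside sketch of the query flow: cache hit -> return;
    miss -> query the index, rank, backfill the cache."""
    if prefix in cache:
        return cache[prefix]        # hot path: sub-millisecond lookup
    raw = search_index(prefix)      # e.g., an OpenSearch prefix query
    ranked = ranker(raw)            # popularity/personalization ordering
    cache[prefix] = ranked          # backfill for subsequent requests
    return ranked

# Hypothetical stand-ins for the search index and ranking engine
cache = {}
fake_index = lambda p: [("ipad", 80), ("iphone", 100)]
by_popularity = lambda results: [name for name, pop in sorted(results, key=lambda r: -r[1])]

suggestions = get_suggestions("iph", cache, fake_index, by_popularity)
```

Because popular prefixes repeat heavily across users, even a modest cache TTL yields a high hit ratio and keeps the index out of the hot path.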
Step 6: Key Architectural Decisions
- Use debouncing on frontend to reduce unnecessary requests
- Cache hot queries in Redis for ultra-low latency
- Use OpenSearch/Elasticsearch for prefix-based indexes
- Shard the index for scalability across servers
- Implement ranking logic using ML or heuristics
- Stream search logs into Kinesis + S3 for analytics
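Sharding the index can be as simple as hashing each prefix to pick a shard. A sketch using modulo hashing for clarity; production systems often prefer consistent hashing so that adding a shard does not remap most keys:

```python
import hashlib

def shard_for_prefix(prefix, num_shards):
    """Route a prefix to one of N index shards via a stable hash, so
    each server holds only a slice of the index and every node routes
    the same prefix to the same shard."""
    digest = hashlib.md5(prefix.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

shard = shard_for_prefix("iph", 4)
```

Hashing the prefix (rather than, say, its first letter) avoids hot shards from skewed letter distributions.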
Step 7: Additional Considerations
- Support multi-language indexing and suggestions
- Personalize based on user history
- Ensure failover: fallback to cached results if index unavailable
- Protect APIs from abuse using rate limiting
- Track metrics: p95 latency, cache hit ratio, query volume
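Rate limiting is typically enforced at API Gateway, but the underlying idea is a token bucket. A toy per-client sketch with explicit timestamps (real deployments would use gateway throttling or Redis counters shared across instances):

```python
class TokenBucket:
    """Toy token-bucket limiter: each request spends one token; tokens
    refill at a steady rate up to a burst capacity."""
    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec      # refill rate (tokens per second)
        self.capacity = burst         # maximum burst size
        self.tokens = float(burst)
        self.last = 0.0

    def allow(self, now):
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(rate_per_sec=1, burst=2)
```

A small burst capacity tolerates legitimate fast typists while still capping sustained abusive traffic.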
Conclusion
A scalable search autocomplete system must balance latency, scalability, and relevance. By leveraging AWS services like API Gateway, ElastiCache (for caching), OpenSearch (for indexing), Kinesis (for streaming), and S3/Redshift (for analytics), we can deliver highly relevant suggestions in real time, even at internet scale, while ensuring strong fault tolerance and observability.
