**Note:** See [[Statuses#Books|here]] for valid statuses

> [!META]- Inline Metadata
> [author:: [[Ali Aminian]] [[Alex Xu]]]
> [recommended by:: ]
> [status:: reading]
> [related goals:: ]
> [link:: ]
> [tags:: #source/books #concepts/programming/machine-learning/ml-engineering ]
> [rating:: ]
> [up:: [[Machine Learning MOC]]]

# Summary
What are the key ideas?

How can I apply this knowledge that I learned?

How do these ideas relate to what I already know?

# 💡Ideas
Note: These should become Project Idea fleeting notes with a link back to this source.

# Fleeting Notes
Note: Conceptual notes that aren't as tied to the source material. These are fodder for processing into evergreen notes.

# Literature Notes
Note: Commentary on the book or things you don't want to forget. Maybe for use in a public mind garden. Should have a visible source reference. These are "weakly evergreen" because they're tied heavily to a single piece. Must be in your own words. These should be processed into Evergreen notes, or used as fodder for them.

> “Literature notes”, titled after a single work and meant primarily as linkages to other more durable notes, and as targets for backlinks. I write these roughly as “outline notes,” except for someone else’s ideas.

These should be regularly refactored into individual notes in `Sources/! Literature Notes/`

```dataview
TABLE rows.file.link AS "Literature Note", rows.file.cday AS "Date"
FROM #note/literature AND [[Machine Learning System Design Interview]]
GROUP BY file.link
SORT date ASCENDING
```

# Summaries

## Chapter 1: Introduction and Overview

### Process for Answering an ML System Design Question
1. Clarify requirements. This should include:
    - Business objectives
    - The system's features
    - Data
    - Constraints
    - Scale
    - Performance needs (e.g. real-time vs. batch inference, online vs. offline training, etc.)
2. Frame the problem as a machine learning task:
    - e.g. the interviewer asks you to increase viewership of a streaming service. This isn't inherently an ML problem, but you can assume ML could be helpful here.
    1. Define the ML objective (e.g. maximize the time users spend watching videos)
    2. Specify inputs and outputs
    3. Choose the right ML category

### Components of a Machine Learning System

#### Data Preparation
This component includes data engineering (loading and storing data, etc.) as well as **feature engineering** - using domain knowledge to select and extract predictive features from raw data, then transforming those features into something usable by the model.

There are also feature engineering operations, which include handling missing values (deletion vs. imputation), feature scaling (*normalization* -> doesn't change the distribution; *standardization* -> changes the distribution to have a mean of 0 and a std. dev. of 1; log scaling), and **discretization** (*bucketing*) - converting continuous features into discrete ones - among others.

Another common requirement is encoding categorical features as numbers. There are 3 methods to do this (see the sketch below):
1. Integer encoding - each category gets an integer value
2. One-hot encoding - a new binary feature for each category
3. Embedding learning - a learned mapping of a categorical feature into an N-dimensional vector. This is useful when the number of unique values the feature can take is very large.
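To make the scaling, bucketing, and encoding options concrete, here is a minimal numpy-only sketch. The feature values, bucket boundaries, and embedding dimension are invented for illustration; embedding tables would be learned during training, not random.

```python
import numpy as np

# Toy features: a continuous "price" column and a categorical "city" column.
prices = np.array([100_000.0, 250_000.0, 175_000.0, 900_000.0])
cities = np.array(["nyc", "sf", "nyc", "austin"])

# Normalization (min-max): rescales to [0, 1]; distribution shape unchanged.
normalized = (prices - prices.min()) / (prices.max() - prices.min())

# Standardization (z-score): mean 0, std. dev. 1.
standardized = (prices - prices.mean()) / prices.std()

# Log scaling: compresses long-tailed features.
log_scaled = np.log1p(prices)

# Discretization (bucketing): continuous -> discrete buckets 0, 1, 2.
buckets = np.digitize(prices, bins=[150_000, 300_000])

# 1. Integer encoding: each category gets an integer.
vocab = {c: i for i, c in enumerate(sorted(set(cities)))}
int_encoded = np.array([vocab[c] for c in cities])

# 2. One-hot encoding: one binary feature per category.
one_hot = np.eye(len(vocab))[int_encoded]

# 3. Embedding learning: each category maps to an N-dimensional vector;
# the table here is random only to show the lookup mechanics (N = 4).
embedding_table = np.random.randn(len(vocab), 4)
embedded = embedding_table[int_encoded]
```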
#### Model Development
Select a model and train it to solve a task.
1. Selection
    - Establish a simple baseline
    - Experiment with simple models, then more complex ones if needed, then ensemble models
2. Training
    - Construct the dataset:
        - Get raw data
        - Identify features and labels (for traditional ML), either by hand or naturally
        - Select a sampling strategy
        - Split the data (training, eval/validation, test)
        - Address any class imbalance via resampling or by altering the loss function
    - Choose an (existing) loss function - the interviewer won't expect you to create one
    - Address training from scratch vs. fine-tuning
    - Distributed training ([[Data Parallelism]] vs. [[Model Parallelism]])

#### Evaluation
- Online - subjective choice of metrics. You pick metrics that are relevant to the problem you're trying to solve, such as click-through rate for ad click prediction.
- Offline - evaluate the performance of the model during model development, using objective metrics based on the task, such as precision and recall for classification or MSE for regression.

#### Deployment and Serving
- Cloud vs. on-device
- Model compression
    - **Knowledge distillation** - train a smaller model to approximate the bigger one
    - Pruning
    - Quantization - represent numbers with fewer bits
- Testing in prod
    - A/B testing
    - **Shadow deployment** - requests go to both models, but only the old one serves inferences to users
- Prediction pipeline
    - Batch predictions - the model makes predictions on a regular schedule and stores them, so requests are returned faster.
        - Downside: you need advance knowledge of what needs to be computed
    - Online predictions - requests trigger predictions, which are then returned to the user.
        - Downside: this can be slow

#### Monitoring
- Be prepared to discuss:
    - Why a system fails in production
        - Typically data drift. Can be addressed by training on large datasets or by regular retraining
    - What to monitor
        - Ops metrics
            - Throughput
            - Average serving times
            - CPU/GPU utilization
        - ML metrics
            - Inputs/outputs
            - Drift
            - Accuracy
            - Versions

## Chapter 2: Visual Similarity Search

### 1: Clarify Requirements
- Will there be rankings?
    - Yes, rank by closest match to the query
- Support image crop and search?
    - Yes
- Can users interact with images?
    - Assume "like" only

### 2: Frame as ML Task
- ML objective: retrieve visually similar images
- Input: query image
- Output: visually similar images, ranked
- Visual search is a ranking problem: use representation learning to learn embeddings, then use distance from the query image in the embedding space to rank candidates by similarity.

### 3: Data Preparation

#### Data engineering
- Available data:
    - Images
        - ID
        - Owner ID
        - Upload time
        - Manual tags
    - Users
        - ID
        - Age
        - Email
        - etc.
    - User-image interactions
        - User ID
        - Query image ID
        - Interaction type

#### Feature engineering
- Image preprocessing should include:
    - Resizing
    - Scaling
    - [[Z-Score Normalization]] - scale all pixel values to a mean of 0 and variance of 1
    - Consistent color mode (e.g. RGB across all images)

### 4: Model Development

#### Model selection
- Neural networks -> good at handling unstructured data
- CNNs like ResNet or transformers like ViT can be used here

#### Model training
- Use contrastive training
    - Provide the model with several images dissimilar to the query, plus 1 similar image. It will learn to produce representations that separate similar images from dissimilar ones
- I. Options for constructing the dataset (labeling):
    - Human judgment - slow, expensive
    - User interactions - noisy
    - Artificially create similar images from the originals - chosen for this problem, though the constructed dataset may not look like real data
- II. Choose a loss function
    - Use a contrastive loss function, which generally follows this pattern (see the sketch below):
        - 1: Compute similarities (embedding distances)
        - 2: Softmax
        - 3: Cross-entropy - compare the predicted probabilities to the ground truth
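A minimal numpy sketch of the three-step pattern above, in the style of an InfoNCE loss. The cosine-similarity choice, temperature value, and "positive at index 0" convention are assumptions for illustration, not the book's exact formulation.

```python
import numpy as np

def contrastive_loss(query, positive, negatives, temperature=0.1):
    """Three-step contrastive loss: similarities -> softmax -> cross-entropy.

    query: (d,) embedding of the query image
    positive: (d,) embedding of the one similar image
    negatives: (n, d) embeddings of the dissimilar images
    """
    candidates = np.vstack([positive, negatives])  # (n + 1, d)

    # 1: Compute similarities (cosine similarity between embeddings).
    candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    query = query / np.linalg.norm(query)
    sims = candidates @ query / temperature        # (n + 1,)

    # 2: Softmax over all candidates.
    exp = np.exp(sims - sims.max())
    probs = exp / exp.sum()

    # 3: Cross-entropy vs. ground truth: the positive sits at index 0.
    return -np.log(probs[0])

rng = np.random.default_rng(0)
loss = contrastive_loss(rng.normal(size=64),
                        rng.normal(size=64),
                        rng.normal(size=(7, 64)))
```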
### 5: Evaluation
- We have a dataset that enables offline evaluation
- Assume we have query images, candidates, and similarity scores. Offline metrics that can be used (a worked sketch of these appears at the end of this note):
    - Mean reciprocal rank (MRR) - the reciprocal of the rank of the first relevant image in each output list, averaged across lists. Doesn't measure precision or overall ranking quality
    - Recall@k - (# of relevant items among the top k in the output) / (total relevant items)
        - If there are lots of relevant images (as in a search engine), this may skew low
    - Precision@k - the proportion of the top k items in the output that are relevant. Doesn't consider ranking quality
        - (# of relevant items among the top k items in the output list) / k
    - mAP - calculate the average precision (AP) for each output list, then average them together. Overall ranking quality is considered, but it works best for binary relevance
        - $AP@k = \frac{\sum_{i=1}^{k} \text{Precision@}i \cdot \text{rel}(i)}{\text{total relevant items}}$, where $\text{rel}(i) = 1$ if the $i$'th item is relevant to the user, else $0$
    - nDCG - compute the cumulative gain of the ranked items by summing their relevance scores while discounting lower-ranked items (**DCG**, discounted cumulative gain), then normalize against the "ideal" DCG
- Online metrics
    - Click-through rate: (# of clicked images) / (# of images presented)
    - Average time spent on images
    - Remember, these are noisy

### 6: Serving

#### Prediction Pipeline
- Embedding service: preprocesses the query image and gets its embedding
- Nearest neighbor service: gets the most similar images
- Re-ranking service: applies business logic, e.g. removing NSFW results before they're sent to the user

#### Indexing Pipeline
- Indexing service: indexes image embeddings

### 7: Misc
- Nearest neighbor algorithms
    - Exact NN - linear time -> too slow at scale because it searches the whole index table
    - ANN (approximate nearest neighbor)
        - Tree-based ANN: partition the space into a tree of nodes based on some criteria
        - Locality-sensitive hashing: reduce dimensions with a hashing algorithm and use it to group items into buckets, then only search the bucket the query image belongs to
        - Clustering: form clusters based on similarity, then only check the cluster the query image is in

# What Does the Book Say?
What are the author's arguments?

What problems did the author solve, and what did they fail to?
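**Worked sketch (Chapter 2, §5 offline metrics):** a minimal Python sketch of Precision@k, Recall@k, AP, and MRR as defined above. The relevance lists and relevant-item counts are invented for illustration.

```python
import numpy as np

def precision_at_k(ranked_relevance, k):
    """(# of relevant items among the top k) / k."""
    return sum(ranked_relevance[:k]) / k

def recall_at_k(ranked_relevance, total_relevant, k):
    """(# of relevant items among the top k) / (total relevant items)."""
    return sum(ranked_relevance[:k]) / total_relevant

def average_precision(ranked_relevance, total_relevant, k):
    """Sum of Precision@i at each relevant position i, over total relevant."""
    hits = [precision_at_k(ranked_relevance, i + 1)
            for i, rel in enumerate(ranked_relevance[:k]) if rel]
    return sum(hits) / total_relevant

def mean_reciprocal_rank(output_lists):
    """Average of 1 / (rank of first relevant item) across output lists."""
    ranks = [next(i + 1 for i, rel in enumerate(lst) if rel)
             for lst in output_lists]
    return float(np.mean([1.0 / r for r in ranks]))

# One output list: 1 = relevant, 0 = not; assume 3 relevant items exist total.
ranked = [1, 0, 1, 0, 1]
print(precision_at_k(ranked, 3))                  # 2/3
print(recall_at_k(ranked, 3, 3))                  # 2/3
print(average_precision(ranked, 3, 5))            # (1/1 + 2/3 + 3/5) / 3
print(mean_reciprocal_rank([ranked, [0, 1, 0]]))  # (1/1 + 1/2) / 2 = 0.75
```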