Designing Machine Learning Systems

> [!META]- Inline Metadata
> [author:: [[Chip Huyen]]]
> [recommended by:: ]
> [status:: reading]
> [related goals:: ]
> [link:: ]
> [tags:: #source/books]
> [rating:: ]

# Summary
What are the key ideas?
How can I apply this knowledge that I learned?
How do these ideas relate to what I already know?

# 💡Ideas
Note: These should become Project Idea fleeting notes with a link back to this source.

# Fleeting Notes
Note: Conceptual notes that aren't as tied to the source material. These are fodder for processing into evergreen notes.

# Literature Notes
Note: Commentary on the book or things you don't want to forget. Maybe for use in a public mind garden. Should have a visible source reference. These are "weakly evergreen" because they're tied heavily to a single piece. Must be in your own words. These should be processed into Evergreen notes, or used as fodder for them.

"“Literature notes”, titled after a single work and meant primarily as linkages to other more durable notes, and as targets for backlinks. I write these roughly as “outline notes,” except for someone else’s ideas."

These should be regularly refactored into individual notes in `Sources/! Literature Notes/`

```dataview
TABLE rows.file.link AS "Literature Note", rows.file.cday AS "Date"
FROM #note/literature AND \[\[title]]
GROUP BY file.link
sort date ASCENDING
```

# Highlights/Summaries
Note: Go through these to restructure, and bold and highlight important pieces. These should be under an outline with chapters

## Chapter 1: Overview of Machine Learning Systems
> [!NOTE]
> This isn't a typical summary, since this is kind of a review chapter, but it had good nuggets I wanted to recover.

Machine learning is an approach to learn complex patterns from existing data and use these patterns to make predictions on unseen data.

When to use ML
1. There is data to learn a pattern from
2. There are complex patterns to learn
3. Existing data is available or collectible
4. The problem is predictive
5. There is unseen data that shares patterns with training data
6. The problem is repetitive
7. The cost of wrong predictions is cheap
8. It's at scale
9. The patterns are constantly changing

**zero-shot learning** - where an ML system can make good predictions for a task without having been trained specifically on data for that task. It still has learned from some dataset, however.

**continual learning** - continual learning systems can be deployed with no data and train on incoming production data while deployed.

Higher latency can sometimes mean higher throughput, for example when queries are processed in batches.

**When ML is deployed at scale, it can discriminate against people at scale**

## Chapter 2: Introduction to Machine Learning Systems Design
ML systems design takes a holistic systems approach to MLOps.

Four major requirements: ^cn63hj
1. [[Designing Machine Learning Systems#Reliability|Reliability]]
2. [[Designing Machine Learning Systems#Scalability|Scalability]]
3. [[Designing Machine Learning Systems#Maintainability|Maintainability]]
4. [[Designing Machine Learning Systems#Adaptability|Adaptability]]

### Business and ML Objectives
Most businesses don't care about ML metrics unless improving them advances a business goal. So it is important to communicate the business gains of an ML project.

Some companies create metrics that map business metrics to ML metrics, like Netflix's *take rate*, which is the number of quality plays divided by the number of recommendations a user sees. This helps evaluate their recommender system - the higher the take rate, the better the recommender.

Experiments are often needed to establish a link between ML metrics and business outcomes.

### Requirements for ML Systems
Most systems should have [[Designing Machine Learning Systems#^cn63hj|these 4 characteristics]].
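A quick aside on the *take rate* metric from Business and ML Objectives: a minimal sketch of how it could be computed when comparing two recommender variants. The function name, the zero-recommendation handling, and the counts are all my own assumptions for illustration, not Netflix's actual implementation.

```python
def take_rate(quality_plays: int, recommendations_seen: int) -> float:
    """Quality plays divided by the number of recommendations a user saw."""
    if recommendations_seen <= 0:
        # Assumption: with nothing shown, report 0.0 instead of dividing by zero.
        return 0.0
    return quality_plays / recommendations_seen


# Hypothetical counts for two recommender variants in an experiment.
variant_a = take_rate(quality_plays=30, recommendations_seen=120)
variant_b = take_rate(quality_plays=45, recommendations_seen=150)
print(f"A: {variant_a:.2f}, B: {variant_b:.2f}")  # prints "A: 0.25, B: 0.30"
```

Under this definition, variant B would be judged the better recommender, but an experiment is still needed to show that a higher take rate actually moves the business outcome.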
#### Reliability
The system should continue to perform the correct function at the desired level of performance even in the face of adversity such as hardware faults, software faults, and human error.

#### Scalability
An ML system can grow in many ways: complexity (e.g., from a linear regression model to a large neural network), traffic volume, model count, and more. This can be dealt with by scaling resources, but artifact management also has to scale (where are 8000 models stored? How is their performance measured?).

#### Maintainability
Workloads and infrastructure should be set up so that different contributors can work using tools that they are comfortable with. Code should be documented, and code, data, and artifacts should all be versioned.

#### Adaptability
The system should be able to be updated or evolved without disruption to users.

### Iterative Process
ML systems development is an iterative process that often loops back on itself and occurs in multiple cycles.
1. Project scoping: Laying out goals, objectives, constraints, ID'ing stakeholders, etc.
2. Data engineering
3. Model development
4. Deployment
5. Monitoring and continual learning
6. Business analysis

### Framing ML Problems
#### Types of ML Tasks
##### Classification vs. Regression
Classification models classify inputs into categories, while regression models output a continuous value. Each type can be reframed as the other: a regression task becomes a classification task if you bin the possible values, and a classification task becomes a regression task if the model outputs a value between 0 and 1 and you decide on a threshold/decision point.

##### Binary vs. Multiclass Classification
Binary classification - two possible classes
Multiclass classification - more than two possible classes

If there are a lot of classes in a problem, it has *high cardinality*. In cases like this, a hierarchical classification might be useful, where one model classifies inputs into some larger grouping, and then another one classifies into subgroups, etc.

##### Multiclass vs. Multilabel Classification
Multilabel classification - when an example can belong to multiple classes

Approaches to multilabel classification:
1. Treat it as a multiclass classification where more than one class can be true
2. Treat it as a set of binary classification problems: train one binary model per available class and run the same example against them all.

Multilabel classification (MLC) has two major issues, both related to the fact that examples can have a varying number of labels:
1. It makes label annotation difficult because it increases label multiplicity issues - more possible classes mean more possible uncertainty in labeling.
2. The varying number of classes per example makes it hard to extract predictions from raw probabilities, since it is unclear how many of the top-scoring classes to select.

##### Multiple Ways to Frame a Problem

#### Objective Functions
Also known as a loss function. In supervised machine learning, the loss can be computed by comparing the model's outputs with ground truth labels using a measurement like root mean squared error (RMSE) or cross entropy.

##### Decoupling Objectives
A system may have multiple objectives (such as rank posts by quality, rank by engagement, filter misinformation, filter spam, etc.), which can lead to multiple loss functions. There are a few ways to handle this:
1. Combine losses into a single one, e.g. $new\_loss = \alpha \ quality\_loss + \beta \ engagement\_loss$
    1. You can randomly test different values for $\alpha$ and $\beta$ or use Pareto optimization
    2. Each time you tune $\alpha$ or $\beta$ you'll have to retrain your model
2. Train two different models, each optimizing one loss
    1. Then, you can combine their outputs and rank posts by their combined scores; the weights used to combine them can be tuned without retraining either model

# What Does the Book Say?
What are the author's arguments?
What problems did the author solve, and what did they fail to?

# Blank Page Mindmap
Note: After determining what the book is about, write everything you know about the subject on a blank page. After each reading session, write anything you've learned in a different color ink.

# What is the Book About?
**Note**: See [[How To Read a Book]]
What kind of book is it?
What is the book about in 1-2 sentences at most?
Outline the book.
What problems does the author try to solve?