# FourthBrain - ML Model Drift and Decay
## Metadata
URL:: https://www.youtube.com/watch?v=C1yiOonTYqM
Author:: FourthBrain
<!--ID: 1677121825250-->
## Literature Notes
```dataview
TABLE rows.file.link AS "Literature Note", rows.file.cday AS "Date"
FROM #note/literature AND [[FourthBrain - ML Model Drift and Decay]]
GROUP BY file.link
SORT rows.file.cday ASCENDING
```
Note: there may not be many literature notes for this, because it's a how-to type of talk; the notes will be about what it teaches rather than about the piece itself.
## Fleeting Notes
Supervised ML is interpolation geometry. "All training data exists as points in a Euclidean space...we can create an enclosure. Then, inside that enclosure, we can interpolate." ^jea50m
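A quick sketch of the "enclosure" idea (mine, not from the talk): treat the training set as points in Euclidean space, build its convex hull, and check whether new observations land inside it (interpolation) or outside it (extrapolation). The data here is synthetic and only for illustration.
```python
import numpy as np
from scipy.spatial import Delaunay

rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))  # the training domain
hull = Delaunay(X_train)                                  # triangulation of the enclosure

X_new = np.array([
    [0.1, -0.3],   # near the centre of the training data
    [6.0, 6.0],    # far outside the training data
])

# find_simplex returns -1 for points outside the convex hull
inside = hull.find_simplex(X_new) >= 0
for point, ok in zip(X_new, inside):
    status = "inside training domain (interpolation)" if ok else "outside training domain (extrapolation)"
    print(point, "->", status)
```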
Global problem space -> regional problem space -> training domain -> model. When we collect data, quantity and quality gaps mean the actual training domain is smaller than the ideal one. As a result, a model won't predict well on observations that fall outside the training domain, because **machine learning models for non-linear problems are interpolation machines**. They cannot extrapolate. ^hhdfps
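A small demonstration of the "no extrapolation" point (my own example, not from the talk): a random forest fit on x in [0, 10] with y = 2x tracks the target inside the training range but flattens near the boundary value for inputs far outside it.
```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X_train = np.linspace(0, 10, 200).reshape(-1, 1)
y_train = 2 * X_train.ravel()                      # a simple linear target

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

X_test = np.array([[5.0], [10.0], [15.0], [25.0]])
print(model.predict(X_test))
# Inside the training range the predictions track 2x (~10, ~20);
# outside it they stay near the boundary value (~20) instead of 30 or 50.
```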
Active learning can close gaps between the training domain and the regional problem space. It flags and prioritizes confusing inputs for human ground-truthing, then augments the training set with the newly labeled observations to fill gaps in the training domain. ^tkrs18
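A minimal sketch of one common active-learning heuristic, uncertainty sampling (the talk describes active learning in general; this particular selection rule and the synthetic data are my own illustration):
```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
labeled = np.zeros(len(X), dtype=bool)
labeled[:50] = True                      # start with a small labeled seed set

model = LogisticRegression(max_iter=1000)
model.fit(X[labeled], y[labeled])

# Score the unlabeled pool by prediction confidence and flag the most
# "confusing" observations (lowest max class probability) for human labeling.
pool_idx = np.flatnonzero(~labeled)
proba = model.predict_proba(X[pool_idx])
uncertainty = 1.0 - proba.max(axis=1)
to_label = pool_idx[np.argsort(uncertainty)[-20:]]   # 20 most uncertain points

# In a real workflow a human would supply these labels; here we reuse y.
labeled[to_label] = True
model.fit(X[labeled], y[labeled])        # refit on the augmented training set
```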
Most machine learning methods are based on the assumption that the problem space does not change over time. In the real world, though, the problem space does indeed change - this is called **non-stationarity**. In other words, the problem space *drifts* out of alignment with the existing training domain and model (**data drift**). ^es3vfo
Addressing data drift can take the form of filling gaps with active learning, or of pruning or down-weighting old and obsolete observations with a weighting scheme (among other strategies). ^1w22m3
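A minimal sketch of such a weighting scheme: exponentially decaying sample weights passed to `fit()`, so recent observations dominate. The half-life and the synthetic data are illustrative assumptions, not values from the talk.
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=n) > 0).astype(int)
age_days = np.linspace(365, 0, n)        # oldest observation first, newest last

half_life_days = 90.0
weights = 0.5 ** (age_days / half_life_days)   # weight halves every 90 days

model = LogisticRegression(max_iter=1000)
model.fit(X, y, sample_weight=weights)   # recent data dominates the fit
```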
MLOps and pipelines help monitor models for appreciable drift and refit the model on augmented training data when drift is detected. ^tcwz2j
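A minimal sketch of such a monitoring step (my own, not FourthBrain's pipeline): compare each feature's live distribution against the training distribution with a Kolmogorov-Smirnov test, and trigger a refit when drift is detected.
```python
import numpy as np
from scipy.stats import ks_2samp

def detect_drift(X_train, X_live, alpha=0.01):
    """Return indices of features whose live distribution differs
    significantly from the training distribution."""
    drifted = []
    for j in range(X_train.shape[1]):
        _, p_value = ks_2samp(X_train[:, j], X_live[:, j])
        if p_value < alpha:
            drifted.append(j)
    return drifted

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(1000, 3))
X_live = X_train.copy()
X_live[:, 1] += 1.5                      # simulate drift in feature 1

drifted = detect_drift(X_train, X_live)
if drifted:
    print("Drift detected in features:", drifted)
    # ...refit the model on augmented training data here...
```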
Two kinds of drift (formalized briefly after this list):
- **concept drift** - or real drift; a shift in the relationship between model inputs and outputs (changes p(y | x)). This always causes model decay.
- **data drift** - a change in the statistical distribution of the input data p(x) (also known as *input drift*, *feature drift*, or *covariate shift*). This **can** cause model decay by driving a change in p(y | x); if it doesn't cause that change, it's called *virtual drift*. ^q8fzfd
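A compact way to state the distinction above (my notation, not from the talk), with $p_t$ the distribution at time $t$:
$$
\text{concept drift: } p_t(y \mid x) \neq p_{t+\Delta}(y \mid x)
\qquad
\text{data drift: } p_t(x) \neq p_{t+\Delta}(x)
$$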
Both kinds of drift can manifest in 4 modes:
1. Abrupt
2. Incremental
3. Gradual
4. Recurring (like seasonality) ^q52i2i