# FourthBrain - ML Model Drift and Decay
## Metadata
URL:: https://www.youtube.com/watch?v=C1yiOonTYqM
Author:: FourthBrain
<!--ID: 1677121825250-->
## Literature Notes
```dataview
TABLE rows.file.link AS "Literature Note", rows.file.cday AS "Date"
FROM #note/literature AND [[FourthBrain - ML Model Drift and Decay]]
GROUP BY file.link
SORT rows.file.cday ASCENDING
```
Note: there may not be many literature notes for this, because it's a how-to type of talk; the notes will be about what it teaches rather than about the piece itself.
## Fleeting Notes
Supervised ML is interpolation geometry. "All training data exists as points in a Euclidean space...we can create an enclosure. Then, inside that enclosure, we can interpolate." ^jea50m
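A quick sketch of the "enclosure" idea (mine, not from the talk): treat the training set as points in Euclidean space, build its convex hull, and check whether new observations land inside it (interpolation) or outside it (extrapolation). The data here is synthetic and only for illustration.
```python
import numpy as np
from scipy.spatial import Delaunay

rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))  # the training domain
hull = Delaunay(X_train)                                  # triangulation of the enclosure

X_new = np.array([
    [0.1, -0.3],   # near the centre of the training data
    [6.0, 6.0],    # far outside the training data
])

# find_simplex returns -1 for points outside the convex hull
inside = hull.find_simplex(X_new) >= 0
for point, ok in zip(X_new, inside):
    status = "inside training domain (interpolation)" if ok else "outside training domain (extrapolation)"
    print(point, "->", status)
```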
Global problem space -> regional problem space -> training domain -> model. When we collect data, quantity and quality gaps mean the actual training domain is smaller than the ideal one. As a result, a model won't predict well on observations that fall outside the training domain, because **machine learning models for non-linear problems are interpolation machines**. They cannot extrapolate. ^hhdfps
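A small demonstration of the "no extrapolation" point (my own example, not from the talk): a random forest fit on x in [0, 10] with y = 2x tracks the target inside the training range but flattens near the boundary value for inputs far outside it.
```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X_train = np.linspace(0, 10, 200).reshape(-1, 1)
y_train = 2 * X_train.ravel()                      # a simple linear target

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

X_test = np.array([[5.0], [10.0], [15.0], [25.0]])
print(model.predict(X_test))
# Inside the training range the predictions track 2x (~10, ~20);
# outside it they stay near the boundary value (~20) instead of 30 or 50.
```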
Active learning can close gaps between the training domain and the regional problem space. It flags and prioritizes confusing inputs for human ground-truthing, then augments the training set with the newly labeled observations to fill gaps in the training domain. ^tkrs18
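A minimal sketch of one common active-learning heuristic, uncertainty sampling (the talk describes active learning in general; this particular selection rule and the synthetic data are my own illustration):
```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
labeled = np.zeros(len(X), dtype=bool)
labeled[:50] = True                      # start with a small labeled seed set

model = LogisticRegression(max_iter=1000)
model.fit(X[labeled], y[labeled])

# Score the unlabeled pool by prediction confidence and flag the most
# "confusing" observations (lowest max class probability) for human labeling.
pool_idx = np.flatnonzero(~labeled)
proba = model.predict_proba(X[pool_idx])
uncertainty = 1.0 - proba.max(axis=1)
to_label = pool_idx[np.argsort(uncertainty)[-20:]]   # 20 most uncertain points

# In a real workflow a human would supply these labels; here we reuse y.
labeled[to_label] = True
model.fit(X[labeled], y[labeled])        # refit on the augmented training set
```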
Most machine learning methods are based on the assumption that the problem space does not change over time. In the real world, though, the problem space does indeed change - this is called **non-stationarity**. In other words, the problem space *drifts* out of alignment with the existing training domain and model (**data drift**). ^es3vfo
Addressing data drift can take the form of filling gaps with active learning, or of pruning or down-weighting old and obsolete observations with a weighting scheme (among other strategies). ^1w22m3
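A minimal sketch of such a weighting scheme: exponentially decaying sample weights passed to `fit()`, so recent observations dominate. The half-life and the synthetic data are illustrative assumptions, not values from the talk.
```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=n) > 0).astype(int)
age_days = np.linspace(365, 0, n)        # oldest observation first, newest last

half_life_days = 90.0
weights = 0.5 ** (age_days / half_life_days)   # weight halves every 90 days

model = LogisticRegression(max_iter=1000)
model.fit(X, y, sample_weight=weights)   # recent data dominates the fit
```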
MLOps and pipelines help monitor models for appreciable drift and refit the model on augmented training data when drift is detected. ^tcwz2j
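A minimal sketch of such a monitoring step (my own, not FourthBrain's pipeline): compare each feature's live distribution against the training distribution with a Kolmogorov-Smirnov test, and trigger a refit when drift is detected.
```python
import numpy as np
from scipy.stats import ks_2samp

def detect_drift(X_train, X_live, alpha=0.01):
    """Return indices of features whose live distribution differs
    significantly from the training distribution."""
    drifted = []
    for j in range(X_train.shape[1]):
        _, p_value = ks_2samp(X_train[:, j], X_live[:, j])
        if p_value < alpha:
            drifted.append(j)
    return drifted

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(1000, 3))
X_live = X_train.copy()
X_live[:, 1] += 1.5                      # simulate drift in feature 1

drifted = detect_drift(X_train, X_live)
if drifted:
    print("Drift detected in features:", drifted)
    # ...refit the model on augmented training data here...
```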
Two kinds of drift (formalized briefly after this list):
- **concept drift** - or real drift; a shift in the relationship between model inputs and outputs (changes p(y | x)). This always causes model decay.
- **data drift** - a change in the statistical distribution of the input data p(x) (also known as *input drift*, *feature drift*, or *covariate shift*). This **can** cause model decay by driving a change in p(y | x); if it doesn't cause that change, it's called *virtual drift*. ^q8fzfd
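A compact way to state the distinction above (my notation, not from the talk), with $p_t$ the distribution at time $t$:
$$
\text{concept drift: } p_t(y \mid x) \neq p_{t+\Delta}(y \mid x)
\qquad
\text{data drift: } p_t(x) \neq p_{t+\Delta}(x)
$$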
Both kinds of drift can manifest in 4 modes:
1. Abrupt
2. Incremental
3. Gradual
4. Recurring (like seasonality) ^q52i2i