
Regularization (mathematics)
The green and blue functions both incur zero loss on the given data points. A learned model can be induced to prefer the green function, which may generalize better to more points drawn from the underlying unknown distribution, by adjusting $\lambda$, the weight of the regularization term.

In mathematics, statistics, finance,[1] and computer science, particularly in machine learning and inverse problems, regularization is a process that converts the answer of a problem to a simpler one. It is often used to obtain results for ill-posed problems or to prevent overfitting.[2]

Although regularization procedures can be divided in many ways, the following delineation is particularly helpful:

  • Explicit regularization is regularization whenever one explicitly adds a term to the optimization problem. These terms could be priors, penalties, or constraints. Explicit regularization is commonly employed with ill-posed optimization problems. The regularization term, or penalty, imposes a cost on the optimization function to make the optimal solution unique.
  • Implicit regularization is all other forms of regularization. This includes, for example, early stopping, using a robust loss function, and discarding outliers. Implicit regularization is essentially ubiquitous in modern machine learning approaches, including stochastic gradient descent for training deep neural networks, and ensemble methods (such as random forests and gradient boosted trees).

In explicit regularization, independent of the problem or model, there is always a data term, which corresponds to a likelihood of the measurement, and a regularization term, which corresponds to a prior. By combining both using Bayesian statistics, one can compute a posterior that includes both information sources and therefore stabilizes the estimation process. By trading off both objectives, one chooses either to adhere more closely to the data or to enforce generalization (to prevent overfitting). There is a whole research branch dealing with all possible regularizations. In practice, one usually tries a specific regularization and then figures out the probability density that corresponds to that regularization to justify the choice. It can also be physically motivated by common sense or intuition.
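As a sketch of this Bayesian reading (the linear-Gaussian model below is an illustrative assumption, not an example from the article): if the measurements follow $y = Xw + \varepsilon$ with Gaussian noise $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$ and the parameters carry a Gaussian prior $w \sim \mathcal{N}(0, \tau^2 I)$, then maximizing the posterior is the same as minimizing a data term plus an L2 regularization term:

$$\hat w_{\mathrm{MAP}} = \arg\max_w \, p(y \mid w)\, p(w) = \arg\min_w \, \frac{1}{2\sigma^2}\lVert y - Xw\rVert_2^2 + \frac{1}{2\tau^2}\lVert w\rVert_2^2,$$

which is ridge regression with regularization weight $\lambda = \sigma^2/\tau^2$: the more the prior is trusted relative to the data (small $\tau$, large $\sigma$), the stronger the regularization.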

In machine learning, the data term corresponds to the training data and the regularization is either the choice of the model or modifications to the algorithm. It is always intended to reduce the generalization error, i.e. the error of the trained model on the evaluation set rather than on the training data.[3]

One of the earliest uses of regularization is Tikhonov regularization (ridge regression), related to the method of least squares.
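A minimal numerical sketch of Tikhonov regularization for least squares (the synthetic data, the variable names, and the value of $\lambda$ are illustrative assumptions): instead of the unstable ordinary least-squares solution of an ill-conditioned system, one solves $(A^\top A + \lambda I)\,w = A^\top b$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Ill-conditioned design matrix: two nearly collinear columns.
A = rng.normal(size=(50, 2))
A[:, 1] = A[:, 0] + 1e-6 * rng.normal(size=50)
b = A @ np.array([1.0, 1.0]) + 0.1 * rng.normal(size=50)

lam = 1e-2  # regularization weight (lambda)

# Ordinary least squares: sensitive to noise because A^T A is nearly singular.
w_ols = np.linalg.lstsq(A, b, rcond=None)[0]

# Tikhonov regularization / ridge regression: solve (A^T A + lam*I) w = A^T b.
w_ridge = np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ b)

print("OLS coefficients:  ", w_ols)    # often large and unstable
print("Ridge coefficients:", w_ridge)  # close to the underlying (1, 1)
```

Adding $\lambda I$ bounds the norm of the solution, trading a small amount of bias for a large reduction in variance.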

Regularization in machine learning

In machine learning, a key challenge is enabling models to accurately predict outcomes on unseen data, not just on familiar training data. Regularization is crucial for addressing overfitting (where a model memorizes training data details but cannot generalize to new data) and underfitting (where the model is too simple to capture the training data's complexity). This concept mirrors teaching students to apply learned concepts to new problems rather than just recalling memorized answers.[4] The goal of regularization is to encourage models to learn the broader patterns within the data rather than memorizing it. Techniques like Early Stopping, L1 and L2 regularization, and Dropout are designed to prevent overfitting and underfitting, thereby improving the model's ability to generalize to new data.[4]

Early Stopping

Stops training when validation performance deteriorates, preventing overfitting by halting before the model memorizes training data.[4]
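A self-contained sketch of early stopping (the model, data split, learning rate, and patience value are illustrative assumptions): training halts once the validation loss has stopped improving for a fixed number of epochs, and the weights from the best validation epoch are kept.

```python
import numpy as np

rng = np.random.default_rng(1)

# Noisy linear data, split into a training and a validation set.
X = rng.normal(size=(60, 5))
y = X @ rng.normal(size=5) + 1.0 * rng.normal(size=60)
X_tr, y_tr, X_va, y_va = X[:30], y[:30], X[30:], y[30:]

w = np.zeros(5)
lr, patience = 0.01, 10
best_loss, best_w, bad_epochs = np.inf, w.copy(), 0

for epoch in range(1000):
    # One full-batch gradient step on the training mean squared error.
    grad = 2 * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)
    w -= lr * grad

    # Early stopping: track validation loss and remember the best weights.
    val_loss = np.mean((X_va @ w - y_va) ** 2)
    if val_loss < best_loss:
        best_loss, best_w, bad_epochs = val_loss, w.copy(), 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # validation loss has not improved for `patience` epochs

w = best_w  # keep the weights from the epoch with the lowest validation loss
print(f"best validation MSE: {best_loss:.3f}")
```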

L1 and L2 Regularization

Adds penalty terms to the cost function to discourage complex models (a brief sketch follows the list):

  • L1 regularization (also called LASSO) leads to sparse models by adding a penalty based on the absolute value of coefficients.
  • L2 regularization (also called ridge regression) encourages smaller, more evenly distributed weights by adding a penalty based on the square of the coefficients.[4]
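A brief sketch of how these penalties enter the cost function (the synthetic data and the weight $\lambda$ are illustrative assumptions): the same squared-error data term is combined with either the sum of absolute coefficients (L1) or the sum of squared coefficients (L2).

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 10))
w_true = np.array([3.0, -2.0] + [0.0] * 8)      # sparse ground-truth weights
y = X @ w_true + 0.1 * rng.normal(size=100)

lam = 0.1  # regularization weight (lambda)


def l1_cost(w):
    # LASSO objective: mean squared error + lambda * sum |w_j|.
    return np.mean((X @ w - y) ** 2) + lam * np.sum(np.abs(w))


def l2_cost(w):
    # Ridge objective: mean squared error + lambda * sum w_j^2.
    return np.mean((X @ w - y) ** 2) + lam * np.sum(w ** 2)


# The L2-penalized problem has a closed-form solution.
w_ridge = np.linalg.solve(X.T @ X + lam * len(y) * np.eye(10), X.T @ y)
print("ridge cost:", l2_cost(w_ridge))
print("L1 cost at the ridge solution:", l1_cost(w_ridge))
```

The L1-penalized problem has no closed form; it is usually minimized iteratively, for example by coordinate descent (as in scikit-learn's Lasso), and it drives many coefficients exactly to zero, which is what makes the resulting models sparse.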

Dropout

Randomly ignores a subset of neurons during training, simulating training multiple neural network architectures to improve generalization.[4]
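A minimal sketch of (inverted) dropout applied to one layer's activations (the dropout rate and array shapes are illustrative assumptions): during training each unit is zeroed with probability $p$ and the survivors are rescaled by $1/(1-p)$, so no rescaling is needed at inference time.

```python
import numpy as np

rng = np.random.default_rng(3)


def dropout(activations, p=0.5, training=True):
    """Inverted dropout: zero each unit with probability p during training."""
    if not training or p == 0.0:
        return activations                      # inference: full network, no scaling
    mask = rng.random(activations.shape) >= p   # keep each unit with probability 1 - p
    return activations * mask / (1.0 - p)       # rescale survivors to preserve expectations


h = rng.normal(size=(4, 8))               # a batch of hidden-layer activations
print(dropout(h, p=0.5, training=True))   # roughly half the entries are zeroed
print(dropout(h, training=False))         # unchanged at inference time
```

Because a different mask is drawn on every forward pass, training effectively samples a different thinned network each time, which is the "multiple architectures" intuition above.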

Classification

Empirical learning of classifiers (from a finite data set) is always an underdetermined problem, because it attempts to infer a function of any $x$ given only examples $x_1, x_2, \ldots, x_n$.

A regularization term (or regularizer) $R(f)$ is added to a loss function:

$$\min_f \sum_{i=1}^{n} V(f(x_i), y_i) + \lambda R(f),$$

where $V$ is an underlying loss function that describes the cost of predicting $f(x)$ when the label is $y$, such as the square loss or hinge loss; and $\lambda$ is a parameter which controls the importance of the regularization term. $R(f)$ is typically chosen to impose a penalty on the complexity of $f$. Concrete notions of complexity used include restrictions for smoothness and bounds on the vector space norm.[5][page needed]
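To make the objective concrete, here is a sketch that evaluates it for a linear classifier $f(x) = w^\top x$, using the hinge loss as $V$ and the squared L2 norm as $R$ (the synthetic data and the value of $\lambda$ are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3))                    # inputs x_i
y = np.sign(X @ np.array([1.0, -1.0, 0.5]))      # labels y_i in {-1, +1}
lam = 0.1                                        # regularization weight (lambda)


def objective(w):
    # sum_i V(f(x_i), y_i) + lambda * R(f), with V = hinge loss and R(f) = ||w||^2.
    hinge = np.maximum(0.0, 1.0 - y * (X @ w)).sum()
    return hinge + lam * np.dot(w, w)


print(objective(np.zeros(3)))                    # trivial classifier: loss = n
print(objective(np.array([1.0, -1.0, 0.5])))     # separating direction: smaller loss
```

With this choice of $V$ and $R$ the minimization is essentially a soft-margin support vector machine (without a bias term); other choices of loss and regularizer yield other familiar estimators.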

A theoretical justification for regularization is that it attempts to impose Occam's razor on the solution (as depicted in the figure above, where the green function, the simpler one, may be preferred). From a Bayesian point of view, many regularization techniques correspond to imposing certain prior distributions on model parameters.[6]

Regularization can serve multiple purposes, including learning simpler models, inducing models to be sparse and introducing group structure[clarification needed] into the learning problem.

The same idea arose in many fields of science. A simple form of regularization applied to integral equations (Tikhonov regularization) is essentially a trade-off between fitting the data and reducing a norm of the solution. More recently, non-linear regularization methods, including total variation regularization, have become popular.

Generalization

Regularization can be motivated as a technique to improve the generalizability of a learned model.

The goal of a learning problem is to find a function $f$ that fits or predicts the outcome (label) $y$ and minimizes the expected error over all possible inputs and labels. The expected error of a function $f$ is:

$$I[f] = \int_{X \times Y} V(f(x), y)\, \rho(x, y)\, dx\, dy,$$

where $X$ and $Y$ are the domains of the input data $x$ and their labels $y$ respectively, and $\rho(x, y)$ is their (unknown) joint distribution.

Typically in learning problems, only a subset of input data and labels is available, measured with some noise. Therefore, the expected error is unmeasurable, and the best surrogate available is the empirical error over the $n$ available samples $(\hat x_i, \hat y_i)$:

$$I_S[f] = \frac{1}{n} \sum_{i=1}^{n} V(f(\hat x_i), \hat y_i).$$
Without bounds on the complexity of the function space (formally, the reproducing kernel Hilbert space) available, a model will be learned that incurs zero loss on the surrogate empirical error. If the measurements (e.g. of the $\hat x_i$) were made with noise, this model may suffer from overfitting and display poor expected error. Regularization introduces a penalty for exploring certain regions of the function space used to build the model, which can improve generalization.
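A small numerical illustration of this point (the polynomial degrees, sample sizes, and noise level are assumptions made for the sketch): a model flexible enough to interpolate the noisy training samples attains (near-)zero empirical error yet a much larger error on fresh samples from the same distribution, while a more constrained fit generalizes better.

```python
import numpy as np

rng = np.random.default_rng(5)


def sample(n):
    # Noisy observations of an underlying smooth function.
    x = rng.uniform(-1, 1, n)
    return x, np.sin(3 * x) + 0.2 * rng.normal(size=n)


x_tr, y_tr = sample(10)       # the few available, noisy samples
x_te, y_te = sample(1000)     # stand-in for the underlying distribution

# Degree-9 polynomial interpolates the 10 training points: empirical error ~ 0.
flexible = np.polyfit(x_tr, y_tr, deg=9)
# Degree-3 polynomial cannot interpolate: small but nonzero empirical error.
constrained = np.polyfit(x_tr, y_tr, deg=3)

for name, coef in [("degree 9", flexible), ("degree 3", constrained)]:
    train_err = np.mean((np.polyval(coef, x_tr) - y_tr) ** 2)
    test_err = np.mean((np.polyval(coef, x_te) - y_te) ** 2)
    print(f"{name}: empirical error {train_err:.4f}, estimated expected error {test_err:.4f}")
```

Restricting the degree plays the same role here as an explicit penalty on complexity: it keeps the learned function from chasing the noise in the measurements.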






