A model's **representational capacity** is its ability to fit a wide variety of functions. Models with low capacity may struggle to fit the training set (high training error). Models with high capacity can overfit by memorizing properties of the training set that do not serve them well on the test set.

**Overfitting** is the situation where a learning algorithm achieves low training error but high test error. Overfitting is a sign of poor **generalization**.

Simpler models (smaller hypothesis space and smaller capacity) are more likely to **generalize** (small gap between training and test error); however, complex models are more likely to achieve low training error.
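The trade-off between capacity and generalization can be sketched numerically. The example below is only an illustration under assumed settings: the model family (a piecewise-constant regressor whose capacity is its number of bins), the sine target, the noise level, and all function names are hypothetical choices, not taken from the book.

```python
import math
import random


def fit_bins(xs, ys, n_bins):
    """Fit a piecewise-constant model on [0, 1): one mean per bin."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, y in zip(xs, ys):
        b = min(int(x * n_bins), n_bins - 1)
        sums[b] += y
        counts[b] += 1
    fallback = sum(ys) / len(ys)  # prediction for bins with no training points
    return [s / c if c else fallback for s, c in zip(sums, counts)]


def mse(model, xs, ys):
    """Mean squared error of a fitted bin model on (xs, ys)."""
    n_bins = len(model)
    total = 0.0
    for x, y in zip(xs, ys):
        b = min(int(x * n_bins), n_bins - 1)
        total += (model[b] - y) ** 2
    return total / len(xs)


def sample(n, rng, noise=0.3):
    """Draw n points from an assumed target: sin(2*pi*x) plus Gaussian noise."""
    xs = [rng.random() for _ in range(n)]
    ys = [math.sin(2 * math.pi * x) + rng.gauss(0.0, noise) for x in xs]
    return xs, ys


def capacity_sweep(seed=0, n_train=40, n_test=2000,
                   bin_counts=(1, 2, 4, 8, 16, 32, 64)):
    """Return (n_bins, training MSE, test MSE) for increasing capacity."""
    rng = random.Random(seed)
    train = sample(n_train, rng)
    test = sample(n_test, rng)
    rows = []
    for k in bin_counts:
        model = fit_bins(train[0], train[1], k)
        rows.append((k, mse(model, *train), mse(model, *test)))
    return rows


if __name__ == "__main__":
    for k, train_err, test_err in capacity_sweep():
        print(f"bins={k:3d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")
```

Because the bin counts are powers of two, each partition refines the previous one, so the training error can only go down as capacity grows. The test error carries no such guarantee, and the gap between the two is what signals overfitting.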

In practice, the learning algorithm may not be able to find the best model in the model's hypothesis space. Additional limitations, such as the imperfection of the optimization algorithm, mean that the learning algorithm's **effective capacity** may be less than the representational capacity of the model family.

Statistical learning theory provides a way to quantify a model's capacity. The **Vapnik-Chervonenkis dimension**, or **VC dimension**, measures the capacity of a binary classifier. It is defined as the largest possible value of m for which there exists a training set of m different x points that the classifier can label arbitrarily.
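As a small worked example of "labeling arbitrarily" (often called shattering), consider a hypothetical classifier family that is not from the book: closed intervals on the real line, predicting 1 inside the interval and 0 outside. The brute-force check below is a sketch with assumed function names. It tries every labeling of a point set and asks whether some interval produces it.

```python
from itertools import product


def interval_can_realize(points, labels):
    """True if some interval labels exactly the 1-labeled points as positive."""
    positives = [x for x, y in zip(points, labels) if y == 1]
    if not positives:
        return True  # an interval placed away from all points labels everything 0
    a, b = min(positives), max(positives)
    # The tightest interval around the positives must exclude every negative.
    return all(not (a <= x <= b) for x, y in zip(points, labels) if y == 0)


def is_shattered(points):
    """True if intervals can produce all 2**m labelings of the m points."""
    return all(
        interval_can_realize(points, labels)
        for labels in product((0, 1), repeat=len(points))
    )


if __name__ == "__main__":
    print(is_shattered([0.3, 0.7]))       # two points
    print(is_shattered([0.1, 0.5, 0.9]))  # three points
```

Two distinct points can be shattered, but for any three distinct points the labeling (1, 0, 1) in sorted order is impossible, since an interval containing the outer points also contains the middle one. Under these assumptions the VC dimension of this family is 2.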

Results of this kind show that the discrepancy between training error and generalization error is bounded from above by a quantity that grows as the model capacity grows but shrinks as the number of training examples increases.
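The notes do not give a formula for this bound. One commonly quoted form, attributed to Vapnik, says that with probability at least 1 - delta the gap is at most sqrt((d (ln(2m/d) + 1) + ln(4/delta)) / m), where d is the VC dimension and m the number of training examples. The sketch below assumes that form (the function name and the default delta are illustrative choices) and evaluates how the bound moves with d and m.

```python
import math


def vc_gap_bound(vc_dim, m, delta=0.05):
    """Assumed Vapnik-style bound on (test error - training error).

    Holds with probability at least 1 - delta and is only meaningful
    when vc_dim is much smaller than m.
    """
    if not 0 < vc_dim < m:
        raise ValueError("expected 0 < vc_dim < m")
    if not 0 < delta < 1:
        raise ValueError("expected 0 < delta < 1")
    complexity = vc_dim * (math.log(2 * m / vc_dim) + 1)
    confidence = math.log(4 / delta)
    return math.sqrt((complexity + confidence) / m)


if __name__ == "__main__":
    for d in (5, 50, 500):
        for m in (1_000, 100_000):
            print(f"d={d:4d}  m={m:7d}  bound={vc_gap_bound(d, m):.3f}")
```

Bounds like this are usually far too loose to predict the actual gap of a deep network. The example is meant only to show the direction of the dependence on capacity and on training set size.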

**Regularization** is any modification we make to a learning algorithm that is intended to reduce its generalization error but not its training error. Without regularization, a hyperparameter search that minimizes training error alone would favor the settings that maximize the model's capacity, resulting in overfitting.

Bias and variance measure two different sources of error in an estimator.

**Bias** measures the expected deviation from the true value of the function or parameter. The bias is error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).

**Variance** provides a measure of the deviation from the expected estimator value that any particular sampling of the data is likely to cause. The variance is error from sensitivity to small fluctuations in the training set. High variance can cause overfitting: modeling the random noise in the training data, rather than the intended outputs.

The relationship between bias and variance is tightly linked to the machine learning concepts of capacity, underfitting and overfitting. When generalization error is measured by mean squared error (where bias and variance are meaningful components of generalization error), increasing capacity tends to increase variance and decrease bias.
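The sense in which bias and variance are components of the mean squared error can be written out explicitly. The notation here is an assumption introduced for illustration: the estimator computed from m samples is written as \hat{\theta}_m and the true value as \theta.

```latex
\mathrm{bias}(\hat{\theta}_m) = \mathbb{E}[\hat{\theta}_m] - \theta,
\qquad
\mathrm{Var}(\hat{\theta}_m) = \mathbb{E}\big[(\hat{\theta}_m - \mathbb{E}[\hat{\theta}_m])^2\big]

\mathrm{MSE}
  = \mathbb{E}\big[(\hat{\theta}_m - \theta)^2\big]
  = \mathrm{bias}(\hat{\theta}_m)^2 + \mathrm{Var}(\hat{\theta}_m)
```

The second equality follows by adding and subtracting \mathbb{E}[\hat{\theta}_m] inside the square. The cross term vanishes because \hat{\theta}_m - \mathbb{E}[\hat{\theta}_m] has zero mean.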

In the context of deep learning, most regularization strategies are based on regularizing estimators. Regularization of an estimator works by trading increased bias for reduced variance. An effective regularizer is one that makes a profitable trade, reducing variance significantly while not overly increasing the bias.
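The trade can be made concrete with a small simulation. This is a sketch under assumptions that are not in the book: a one-dimensional linear model without intercept, an L2 (ridge) penalty as the regularizer, a fixed set of inputs, and hypothetical function names. It estimates the bias and variance of the fitted slope over many resampled training sets.

```python
import random


def ridge_slope(xs, ys, lam):
    """Closed-form minimizer of sum((y - w*x)**2) + lam * w**2 over w."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def bias_variance(lam, true_w=1.0, noise=1.0, n=10, trials=5000, seed=0):
    """Monte Carlo estimate of (bias, variance, MSE) of the ridge slope."""
    rng = random.Random(seed)
    xs = [i / (n - 1) for i in range(n)]  # fixed inputs; only the noise is resampled
    estimates = []
    for _ in range(trials):
        ys = [true_w * x + rng.gauss(0.0, noise) for x in xs]
        estimates.append(ridge_slope(xs, ys, lam))
    mean = sum(estimates) / trials
    variance = sum((w - mean) ** 2 for w in estimates) / trials
    bias = mean - true_w
    return bias, variance, bias ** 2 + variance


if __name__ == "__main__":
    for lam in (0.0, 1.0, 10.0):
        bias, variance, total = bias_variance(lam)
        print(f"lambda={lam:5.1f}  bias={bias:+.3f}  "
              f"variance={variance:.3f}  MSE={total:.3f}")
```

In this setup a moderate penalty lowers the variance by more than it raises the squared bias, so the error of the estimate drops. A much larger penalty shrinks the slope so hard that the added bias outweighs the variance reduction, which is the unprofitable side of the trade.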

* from the book Deep Learning