K-nearest neighbour

In statistics, the k-nearest neighbors algorithm (k-NN) is a non-parametric supervised learning method first developed by Evelyn Fix and Joseph Hodges in 1951,^[1] and later expanded by Thomas Cover.^[2] It is used for classification and regression. In both cases, the input consists of the k closest training examples in a data set. The output depends on whether k-NN is used for classification or regression:

In k-NN classification, the output is a class membership. An object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.
In k-NN regression, the output is the property value for the object. This value is the average of the values of k nearest neighbors. If k = 1, then the output is simply assigned to the value of that single nearest neighbor.

k-NN is a type of classification where the function is only approximated locally and all computation is deferred until function evaluation. Since this algorithm relies on distance for classification, if the features represent different physical units or come in vastly different scales then normalizing the training data can improve its accuracy dramatically.^[3]

Both for classification and regression, a useful technique can be to assign weights to the contributions of the neighbors, so that the nearer neighbors contribute more to the average than the more distant ones. For example, a common weighting scheme consists in giving each neighbor a weight of 1/d, where d is the distance to the neighbor.^[4]

The neighbors are taken from a set of objects for which the class (for k-NN classification) or the object property value (for k-NN regression) is known. This can be thought of as the training set for the algorithm, though no explicit training step is required.

A peculiarity of the k-NN algorithm is that it is sensitive to the local structure of the data.

Statistical setting

Suppose we have pairs $(X_{1},Y_{1}),(X_{2},Y_{2}),\dots ,(X_{n},Y_{n})$ taking values in $\mathbb {R} ^{d}\times \{1,2\}$ , where $Y$ is the class label of $X$ , so that $X|Y=r\sim P_{r}$ for $r=1,2$ (and probability distributions $P_{r}$ ). Given some norm $\|\cdot \|$ on $\mathbb {R} ^{d}$ and a point $x\in \mathbb {R} ^{d}$ , let $(X_{(1)},Y_{(1)}),\dots ,(X_{(n)},Y_{(n)})$ be a reordering of the training data such that $\|X_{(1)}-x\|\leq \dots \leq \|X_{(n)}-x\|$ .

Algorithm

Example of k-NN classification. The test sample (green dot) should be classified either to blue squares or to red triangles. If *k = 3* (solid line circle) it is assigned to the red triangles because there are 2 triangles and only 1 square inside the inner circle. If *k = 5* (dashed line circle) it is assigned to the blue squares (3 squares vs. 2 triangles inside the outer circle).

The training examples are vectors in a multidimensional feature space, each with a class label. The training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples.

In the classification phase, k is a user-defined constant, and an unlabeled vector (a query or test point) is classified by assigning the label which is most frequent among the k training samples nearest to that query point.

A commonly used distance metric for continuous variables is Euclidean distance. For discrete variables, such as for text classification, another metric can be used, such as the overlap metric (or Hamming distance). In the context of gene expression microarray data, for example, k-NN has been employed with correlation coefficients, such as Pearson and Spearman, as a metric.^[5] Often, the classification accuracy of k-NN can be improved significantly if the distance metric is learned with specialized algorithms such as Large Margin Nearest Neighbor or Neighbourhood components analysis.

A drawback of the basic "majority voting" classification occurs when the class distribution is skewed. That is, examples of a more frequent class tend to dominate the prediction of the new example, because they tend to be common among the k nearest neighbors due to their large number.^[6] One way to overcome this problem is to weight the classification, taking into account the distance from the test point to each of its k nearest neighbors. The class (or value, in regression problems) of each of the k nearest points is multiplied by a weight proportional to the inverse of the distance from that point to the test point. Another way to overcome skew is by abstraction in data representation. For example, in a self-organizing map (SOM), each node is a representative (a center) of a cluster of similar points, regardless of their density in the original training data. K-NN can then be applied to the SOM.

Parameter selection

The best choice of k depends upon the data; generally, larger values of k reduces effect of the noise on the classification,^[7] but make boundaries between classes less distinct. A good k can be selected by various heuristic techniques (see hyperparameter optimization). The special case where the class is predicted to be the class of the closest training sample (i.e. when k = 1) is called the nearest neighbor algorithm.

The accuracy of the k-NN algorithm can be severely degraded by the presence of noisy or irrelevant features, or if the feature scales are not consistent with their importance. Much research effort has been put into selecting or scaling features to improve classification. A particularly popular^{[citation needed]} approach is the use of evolutionary algorithms to optimize feature scaling.^[8] Another popular approach is to scale features by the mutual information of the training data with the training classes.^{[citation needed]}

In binary (two class) classification problems, it is helpful to choose k to be an odd number as this avoids tied votes. One popular way of choosing the empirically optimal k in this setting is via bootstrap method.^[9]

The $1$ -nearest neighbor classifier

The most intuitive nearest neighbour type classifier is the one nearest neighbour classifier that assigns a point $x$ to the class of its closest neighbour in the feature space, that is $C_{n}^{1nn}(x)=Y_{(1)}$ .

As the size of training data set approaches infinity, the one nearest neighbour classifier guarantees an error rate of no worse than twice the Bayes error rate (the minimum achievable error rate given the distribution of the data).

The weighted nearest neighbour classifier

The $k$ -nearest neighbour classifier can be viewed as assigning the $k$ nearest neighbours a weight $1/k$ and all others $0$ weight. This can be generalised to weighted nearest neighbour classifiers. That is, where the $i$ th nearest neighbour is assigned a weight $w_{ni}$ , with ${\textstyle \sum _{i=1}^{n}w_{ni}=1}$ . An analogous result on the strong consistency of weighted nearest neighbour classifiers also holds.^[10]

Let $C_{n}^{wnn}$ denote the weighted nearest classifier with weights $\{w_{ni}\}_{i=1}^{n}$ . Subject to regularity conditions, which in asymptotic theory are conditional variables which require assumptions to differentiate among parameters with some criteria. On the class distributions the excess risk has the following asymptotic expansion^[11]