DataLab is a compact statistics package aimed at exploratory data analysis. Please visit the DataLab Web site for more information.

## KNN

 Command: Math -> K-Nearest Neighbors...

The Math/KNN command builds KNN models and applies them to unknown data. Several methods are available for estimating results from the nearest neighbors. Although KNN is normally used only for classification, DataLab also applies the ideas behind KNN to the estimation of continuous properties.

The basic approach to KNN modelling has three steps: compile the data that will serve as the model, build the model from those data, and apply the model to unknown data.

The number of neighbors can be set between 1 and 50 with the scrollbar. For majority voting the number of neighbors should be odd, which avoids tied votes when there are two classes.

The weighting mode determines how the estimated value is calculated from the nearest neighbors. DataLab provides three methods for estimating the unknown values: (1) calculating the average, (2) taking a majority vote among the nearest neighbors, and (3) building a local linear regression model.
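
The averaging mode can be illustrated with a minimal sketch. It is not DataLab's implementation: the Euclidean distance, the unweighted mean, and the names `nearest_neighbors` and `knn_average` are all assumptions made for this example.

```python
import math

def nearest_neighbors(model_x, query, k):
    """Return the indices of the k model points closest to `query`.

    Assumption: Euclidean distance; the distance used by DataLab is not stated.
    """
    order = sorted(range(len(model_x)), key=lambda i: math.dist(model_x[i], query))
    return order[:k]

def knn_average(model_x, model_y, query, k):
    """Estimate the target as the unweighted mean over the k nearest neighbors."""
    idx = nearest_neighbors(model_x, query, k)
    return sum(model_y[i] for i in idx) / len(idx)

# Hypothetical model data: two input variables and one continuous target.
model_x = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
model_y = [1.0, 2.0, 3.0, 10.0]

estimate = knn_average(model_x, model_y, (0.2, 0.2), k=3)
print(estimate)
```

With `k=3` the distant point `(5.0, 5.0)` is ignored, so the estimate is the mean of the three remaining target values.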

Majority voting is normally used only for classification. The unknown is assigned the class that holds the majority among the classes of its k nearest neighbors. The concept can be extended to continuous approximation by introducing density distribution estimators of the target values among the nearest neighbors, and DataLab provides such an estimation. Majority voting can thus also be applied to continuous data, although local regression models are a better way to estimate continuous data from KNN models.
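
For the classification case, a majority vote might look like the following sketch. It covers class labels only; the density-based extension to continuous targets is not shown because its details are not given here. The Euclidean distance and the function name `knn_majority_vote` are assumptions.

```python
import math
from collections import Counter

def knn_majority_vote(model_x, model_classes, query, k):
    """Assign the class that occurs most often among the k nearest neighbors.

    Assumption: Euclidean distance. An odd k avoids ties for two classes.
    """
    order = sorted(range(len(model_x)), key=lambda i: math.dist(model_x[i], query))
    votes = Counter(model_classes[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical model data with two classes.
model_x = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.0), (4.2, 3.9)]
model_classes = ["A", "A", "A", "B", "B"]

predicted = knn_majority_vote(model_x, model_classes, (3.0, 3.0), k=3)
print(predicted)
```

Here the three nearest neighbors of `(3.0, 3.0)` are the two class-B points and one class-A point, so the vote goes to class B.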

Local linear regression is a simple way to estimate non-linear functional dependencies by combining KNN with multiple linear regression. The nearest neighbors found for a given unknown data point are used to fit a linear model by multiple linear regression, and this model then predicts the target value of the unknown. The method therefore requires that the number of nearest neighbors exceed the number of input variables of the model.
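
The following sketch shows one way such a local model could be fitted. It is an illustration under stated assumptions, not DataLab's algorithm: it uses Euclidean distance, an ordinary least-squares fit with an intercept solved through the normal equations, and the hypothetical names `solve_linear_system` and `knn_local_regression`.

```python
import math

def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("neighbors do not determine a unique linear model")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def knn_local_regression(model_x, model_y, query, k):
    """Fit a linear model to the k nearest neighbors and evaluate it at `query`."""
    n_inputs = len(query)
    if k <= n_inputs:
        raise ValueError("k must exceed the number of input variables")
    order = sorted(range(len(model_x)), key=lambda i: math.dist(model_x[i], query))
    rows = [[1.0, *model_x[i]] for i in order[:k]]  # intercept column + inputs
    targets = [model_y[i] for i in order[:k]]
    p = n_inputs + 1
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(p)]
    beta = solve_linear_system(xtx, xty)
    return beta[0] + sum(b * q for b, q in zip(beta[1:], query))

# Hypothetical non-linear data: y = x^2 sampled at x = 0.0, 0.5, ..., 5.0.
model_x = [(i / 2,) for i in range(11)]
model_y = [x[0] ** 2 for x in model_x]

estimate = knn_local_regression(model_x, model_y, (2.25,), k=4)
print(estimate)
```

The check on `k` mirrors the prerequisite stated above. Because only the neighborhood of the query is fitted, the straight line approximates the curve locally even though the overall dependency is quadratic.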

### Building a model

Setting up a KNN model is straightforward. First, compile the data to be used for the model, for example by selecting a random sample from a given data set. Next, click Build New Model and select first the input variables and then the target variable. DataLab then transfers the selected variables to a file, which serves as the KNN model.
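
Conceptually, building the model amounts to storing the selected columns. The sketch below illustrates this idea only; the CSV layout, the function name `build_knn_model`, and the variable names are hypothetical and do not describe DataLab's actual model file format.

```python
import csv
import io

def build_knn_model(rows, input_names, target_name, out):
    """Write the selected input columns and the target column to `out` as CSV.

    Assumption: a plain CSV layout, chosen for illustration only.
    """
    columns = list(input_names) + [target_name]
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(columns)
    for row in rows:
        writer.writerow([row[name] for name in columns])

# Hypothetical sample drawn from a larger data set.
data = [
    {"pH": 6.8, "temp": 21.0, "conc": 0.4, "yield": 12.5},
    {"pH": 7.1, "temp": 23.5, "conc": 0.6, "yield": 14.1},
]

buffer = io.StringIO()  # stands in for the model file
build_knn_model(data, ["pH", "temp"], "yield", buffer)
model_text = buffer.getvalue()
print(model_text)
```

Only the chosen inputs and the target are kept; the unselected column `conc` does not become part of the model.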

### Applying a model

Some prerequisites must be met before a KNN model can be applied to unknown data. DataLab assumes that the variables used in the model match the variables of the unknown data, and checks this by comparing the variable names of the model with those of the unknown data set. If no match can be established, DataLab issues a warning.
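
The name comparison can be pictured as in the sketch below. The exact, case-sensitive comparison and the function name `missing_model_variables` are assumptions; how DataLab compares names in detail is not specified here.

```python
def missing_model_variables(model_variables, data_variables):
    """Return the model input names that have no same-named variable in the data.

    Assumption: names must match exactly, including case.
    """
    available = set(data_variables)
    return [name for name in model_variables if name not in available]

# Hypothetical variable names of a model and of an unknown data set.
missing = missing_model_variables(["pH", "temp", "conc"], ["temp", "pH", "yield"])
if missing:
    print("Warning: no matching variable for", ", ".join(missing))
```

The order of the variables in the data set does not matter in this sketch, only whether each model variable has a counterpart with the same name.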

After clicking Apply Model, select the model. DataLab then displays the most important parameters of the chosen model, and the target variable that should receive the results of the model application has to be selected. Take care not to overwrite an input variable of the KNN model, since this would prevent any further application of the model to that data set.

Last Update: 2012-Aug-27