DataLab is a compact statistics package aimed at exploratory data analysis. Please visit the DataLab Web site for more information.

## Guided Tour: KNN classification

The k-nearest neighbor (kNN) method is commonly used for classification. An unknown object, for which only the input variables are available, is compared to a reference data set that contains both the input and the target variables. The k reference points closest to the unknown determine its class assignment, either by averaging their class numbers or by taking a majority vote among them. DataLab extends this concept to continuous-valued problems by combining kNN with multiple linear regression: the target value of the unknown is estimated by setting up a multiple regression using the k nearest neighbors.
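The two class assignment rules can be sketched in a few lines of Python. This is only an illustration of the general kNN idea, not DataLab's implementation: the Euclidean distance, the function names, and the toy reference values are all assumptions made for this example.

```python
import math
from collections import Counter

def k_nearest_labels(reference, unknown, k):
    """Return the class numbers of the k reference points closest to `unknown`.

    `reference` is a list of (inputs, class_number) pairs.
    Euclidean distance is an assumption of this sketch.
    """
    by_distance = sorted(reference, key=lambda row: math.dist(row[0], unknown))
    return [label for _, label in by_distance[:k]]

def classify_majority(reference, unknown, k):
    """Assign the class that occurs most often among the k nearest neighbors."""
    return Counter(k_nearest_labels(reference, unknown, k)).most_common(1)[0][0]

def classify_average(reference, unknown, k):
    """Assign the mean of the class numbers of the k nearest neighbors."""
    labels = k_nearest_labels(reference, unknown, k)
    return sum(labels) / len(labels)

# Toy reference set (invented values): two inputs per object, classes 1 and 2.
reference = [
    ((0.0, 0.1), 1), ((0.2, 0.0), 1), ((0.1, 0.3), 1),
    ((1.0, 1.1), 2), ((0.9, 1.2), 2), ((1.2, 0.9), 2),
]
unknown = (0.3, 0.3)

vote = classify_majority(reference, unknown, k=3)    # three class-1 neighbors
average = classify_average(reference, unknown, k=5)  # three 1s and two 2s
print(vote, average)
```

With k = 3 all neighbors belong to class 1, so the vote is unanimous. With k = 5 two class-2 points enter the neighborhood and the average moves away from an integer value.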

In order to show you how to perform a classification with the kNN scheme, we first load another data set (command File/Load): FLURIEDW.IDT. This data set comprises geometric measures of 100 authentic and 100 counterfeit bank notes (the data have been supplied by courtesy of H. Riedwyl, University of Berne, Switzerland, see [Flury 83]). Now let's try to develop a classifier which relies on the kNN classification rule in order to discriminate between authentic and counterfeit notes.

When you have a look at the numeric values of this data set, you immediately see that some of the variables (e.g. 'Left' or 'Diagonal') exhibit quite a high mean compared to their standard deviation. So the first step is to standardize the data matrix (mean = 0.0, standard deviation = 1.0) by using the command Math/Scaling/Standardize/Columns. This removes the offsets and puts all variables on a common scale, so that no single variable dominates the distances calculated during the kNN procedure.
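The effect of column-wise standardization can be sketched as follows. The function name and the sample values are invented for illustration, and the use of the sample standard deviation (division by n - 1) is an assumption, since the text does not say which variant DataLab applies.

```python
import statistics

def standardize_columns(rows):
    """Scale every column to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.mean(col) for col in columns]
    # Sample standard deviation (n - 1); an assumption of this sketch.
    stdevs = [statistics.stdev(col) for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stdevs)]
        for row in rows
    ]

# Invented values: the first column has a high mean and a small spread.
rows = [[130.0, 9.0], [129.5, 11.0], [130.5, 10.0], [130.0, 10.0]]
scaled = standardize_columns(rows)
for row in scaled:
    print(row)
```

After scaling, both columns contribute to a distance on the same footing, regardless of their original units.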

Next, we have to create a variable which holds the class information. To do so, we extend the data matrix by one column (command Edit/Resize Data Matrix) and copy the class information vector to this additional empty variable (command Edit/Data/Copy/from ClInf). You should end up with seven variables, the seventh holding the class information.

In order to have some means of checking the quality of our classifier, we split our data set into a training set and a test set. The training set is used to set up the classifier, whereas the test set is used later on to test it. The command Edit/Split Data... splits the data set into two equally sized, randomly sampled, disjoint subsets. Its default parameters are quite suitable for our purpose, so you only have to start the subset creation by clicking Do It. This creates two disjoint data sets, FLURIEDW01.ASC and FLURIEDW02.ASC (note that the subsets are always stored using the ASC format).
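What such a split amounts to can be sketched in Python. The function name, the fixed seed, and the integer stand-ins for the bank notes are assumptions of this example; how DataLab draws its random sample is not described here.

```python
import random

def split_in_half(rows, seed=0):
    """Return two disjoint, randomly sampled subsets of (nearly) equal size."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split repeatable
    half = len(indices) // 2
    # Sorting the indices preserves the original object order in each subset.
    first = [rows[i] for i in sorted(indices[:half])]
    second = [rows[i] for i in sorted(indices[half:])]
    return first, second

objects = list(range(200))  # stand-ins for the 200 bank notes
training, test = split_in_half(objects)
print(len(training), len(test))
```

Because every object lands in exactly one of the two subsets, the test set contains no object that was used to build the classifier.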

Next, we import the training data set FLURIEDW01.ASC (command File/Load/ASC Format) and create a kNN model by using the variables 'Bottom', 'Top', and 'Diagonal'. To do so, click the command Math/KNN. This brings up the kNN window. Click the button Build New Model and select these three variables as input variables. When done, click OK and select the target variable, which is the class information (variable 'CLASSINFO'). DataLab now builds the kNN model and writes it to the disk after asking for a file name. This kNN model can now be used as a classifier for unknown data, provided that the unknown data have been scaled in the same way as the model data. Since this is true for our 'unknown' data set FLURIEDW02.ASC, we can immediately apply our classifier to it.

Now load the data set FLURIEDW02.ASC (command File/Load/ASC Format). Before applying the classifier to these data with the command Math/KNN, increase the number of variables of the data matrix by five (command Edit/Resize Data Matrix...) in order to get some empty columns for several classification runs.

The classification can be carried out with several different system parameters (number of neighbors, weighting scheme). For the first trial let's classify the unknowns by using a majority vote and 10 nearest neighbors. Set the parameters Weighting Mode and No. Neighbors accordingly. Then click Select Model and select the model FLURIEDW01.KNN; next click the button Apply Model. The system now indicates the input variables defined within the model and asks for the column in which to store the classification result. Select one of the empty columns, and DataLab will perform the classification and write the result to that column. You immediately see that most or all of the data are classified correctly. You can now, for example, try to find out whether the number of neighbors has an influence on the result. Set the number of neighbors to 3 and repeat the classification procedure.
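Outside of DataLab, the same experiment (apply the classifier to a test set and compare different numbers of neighbors) could be sketched as below. The data are synthetic Gaussian clusters, not the bank note measurements, and all function names are invented for this example, so the printed rates say nothing about the FLURIEDW data.

```python
import math
import random
from collections import Counter

def classify_majority(reference, unknown, k):
    """Majority vote among the k nearest reference points (Euclidean distance)."""
    nearest = sorted(reference, key=lambda row: math.dist(row[0], unknown))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(reference, test_set, k):
    """Fraction of test objects whose predicted class equals the known class."""
    hits = sum(classify_majority(reference, x, k) == label for x, label in test_set)
    return hits / len(test_set)

def make_toy_set(rng, n_per_class):
    """Two synthetic Gaussian clusters with three inputs each, labeled 1 and 2."""
    data = []
    for label, center in ((1, 0.0), (2, 3.0)):
        for _ in range(n_per_class):
            inputs = tuple(rng.gauss(center, 1.0) for _ in range(3))
            data.append((inputs, label))
    return data

rng = random.Random(42)
training = make_toy_set(rng, 50)
test_set = make_toy_set(rng, 50)

for k in (10, 3):
    print(f"k = {k}: {accuracy(training, test_set, k):.2f} classified correctly")
```

Comparing the rates for the two values of k shows directly whether the number of neighbors matters for a given data set.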

Majority voting has the advantage that it results in an unambiguous class assignment of the unknown. However, this can also be a drawback, since you don't know whether the majority vote was clear or just at the break point (say 4:5 in the case of 9 nearest neighbors). In order to get some insight into the reliability of the kNN classification, you can alternatively use averaging as the weighting scheme. This reveals the uncertainties of the vote, since the result of the classification is not an integer class number but the average of all votes. So try again with 10 neighbors, setting the weighting mode to averaging. Now plot the classification result against the object number and mark all data values which do not exhibit integer class numbers (use the shortcut box of the window for marking the data). Afterwards you can scroll through the OBJECTS window, where all marked data are indicated by inverse video. You could now (if the 'unknown' data were really unknown) sit back and carry out some further investigations to get an idea of the true class assignment of the uncertain objects.
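Marking the objects with non-integer results corresponds to a simple filter, sketched here in Python. The function names and the one-dimensional toy values are assumptions made for this illustration.

```python
import math

def classify_average(reference, unknown, k):
    """Mean class number of the k nearest reference points (Euclidean distance)."""
    nearest = sorted(reference, key=lambda row: math.dist(row[0], unknown))[:k]
    return sum(label for _, label in nearest) / len(nearest)

def uncertain_objects(reference, unknowns, k):
    """Return (object number, averaged class) for every non-integer result."""
    flagged = []
    for number, unknown in enumerate(unknowns, start=1):
        average = classify_average(reference, unknown, k)
        if not average.is_integer():  # the neighbors disagreed on the class
            flagged.append((number, average))
    return flagged

# Invented one-dimensional reference set with classes 1 and 2.
reference = [
    ((0.0,), 1), ((0.2,), 1), ((0.4,), 1),
    ((1.0,), 2), ((1.3,), 2), ((1.6,), 2),
]
unknowns = [(0.1,), (0.8,), (1.5,)]

for number, average in uncertain_objects(reference, unknowns, k=3):
    print(f"object {number}: averaged class {average:.2f}")
```

Only the second object lies between the two groups, so it is the only one whose three nearest neighbors come from both classes.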

Last Update: 2012-Jul-25