DataLab is a compact statistics package aimed at exploratory data analysis. Please visit the DataLab Web site for more information....



Guided Tour: Principal component analysis

Next, we'll have a look at principal component analysis (PCA). PCA is based on the assumption that the directions of largest variance in the data carry most of the information. The n-dimensional data space is therefore rotated so that the directions of largest variance become the coordinate axes of the data space. The resulting new axes (principal components) are orthogonal to each other and are sorted by decreasing variance, so the first principal component shows the maximum variance in the data. The mathematical details are beyond the scope of this manual; the interested reader should refer to the standard textbooks, e.g. [Jolliffe 86] or [Flury 83].

To start, let's load another data set: WINE.IDT. This data set is derived from a chemical analysis of three different kinds of Italian wine (Barolo, Grignolino, Barbera). A total of 178 samples have been analyzed for 13 chemical or physical parameters [Forina 82].

In order to get an overview of multivariate data, it is normally not sufficient simply to plot any two variables against each other. The first question is which variables to select for the plot: the number of possible combinations becomes quite large when the number of variables is large. Another problem commonly met with multivariate data is that the variables are usually correlated with each other, which reduces the amount of information that can be drawn from such a plot. Principal component analysis offers one solution to these problems.
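
To get a feeling for how quickly the number of two-variable plots grows, the following short Python sketch counts the distinct pairs of variables. It is not part of DataLab; the function name pairwise_plot_count is made up for this illustration.

```python
from math import comb


def pairwise_plot_count(n_variables: int) -> int:
    """Number of distinct two-variable scatter plots for n variables."""
    return comb(n_variables, 2)


# Print the count for a few data set sizes, including the 13 wine variables.
for n in (5, 13, 50):
    print(n, "variables:", pairwise_plot_count(n), "possible plots")
```

For the 13 variables of the wine data set this already gives 78 possible scatter plots.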

So let's calculate the principal components and look at the data using the PCA scores. To do so, click the command Math/PCA and select the autoscaling version (command m=0.0 s=1.0). The principal components (PCs) are calculated within a few seconds. You are then asked whether you wish to replace the original data by the PCA scores (click No) and whether you wish to create a result file on the disk (click No again). At this point you see a plot and a table of the eigenvalues normalized to a sum of 100%.

This gives you a first impression of the structure of the data. The more eigenvectors you need to describe the variance in the data, the higher the intrinsic dimensionality of your data. In our case we need eight eigenvectors to describe more than 90% of the variance, which is quite a high number of principal components compared to the number of original variables (13).
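
The way such a table can be read is sketched below in Python: the eigenvalues are normalized to a sum of 100%, accumulated, and the number of components needed to exceed a threshold is counted. The eigenvalues in the example are invented for illustration and are not those of the wine data; the function names are likewise assumptions and not DataLab commands.

```python
from itertools import accumulate


def explained_variance_percent(eigenvalues):
    """Normalize the eigenvalues so that they sum to 100 percent."""
    total = sum(eigenvalues)
    return [100.0 * ev / total for ev in eigenvalues]


def components_needed(eigenvalues, threshold=90.0):
    """Smallest number of leading PCs whose cumulative variance exceeds threshold."""
    percents = explained_variance_percent(sorted(eigenvalues, reverse=True))
    for k, cumulative in enumerate(accumulate(percents), start=1):
        if cumulative > threshold:
            return k
    return len(percents)


# Hypothetical eigenvalues (NOT taken from the wine data set)
example = [4.0, 2.0, 1.0, 0.5, 0.5]
print(explained_variance_percent(example))
print(components_needed(example))
```

With these invented eigenvalues the cumulative percentages are 50, 75, 87.5, 93.75 and 100, so four components are needed to exceed 90%.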

Now that the PCs have been calculated, you can display the principal component scores and loadings. The scores are basically the projection of the data onto the new coordinate system, which is spanned by the eigenvectors.
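
The projection can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the two-variable data are invented, and the eigenvectors are written down by hand for this data (they are not computed here and do not come from DataLab).

```python
def center(rows):
    """Subtract the column means from every row."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def project(rows, eigenvectors):
    """Scores: dot product of each centered sample with each eigenvector."""
    return [
        [sum(x * v for x, v in zip(row, vec)) for vec in eigenvectors]
        for row in center(rows)
    ]


# Hypothetical two-variable data lying exactly on the diagonal y = x
data = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]

# Assumed orthonormal eigenvectors: PC1 along the diagonal, PC2 perpendicular
r = 2 ** -0.5
eigenvectors = [[r, r], [r, -r]]

scores = project(data, eigenvectors)
for row in scores:
    print(row)
```

Because all samples lie on the diagonal, the whole spread ends up in the first score column and the second column is zero: one principal component is enough to describe this data.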

The loadings define the size of the contribution of each original variable to the PCs. To get an overview of the data, you should look at the scores/scores plot of the two most important principal components. In addition, the loadings/loadings plot will give you an overview of the importance of your original variables. To do so, activate the command Chart/PC Scr-Scr and select one of the lower windows to be used for that plot. Then select the first two PCs to be plotted against each other. You should also issue the command Chart/PC Load-Load for the other lower window. When you look at the scores/scores plot you immediately see that there are three clusters of data (Fig. 3.19), which is not surprising since the data describe three different kinds of wine.

Now, let's switch on the display of the class information which is supplied with the data and which corresponds to the kind of wine. To do so, activate the shortcut box of the PC scores/scores plot by clicking the small '+' sign at the upper right corner and click the command CC ON. The PC scores plot will be redrawn with the class colors turned on. The result fully supports our first impression of having three clusters in the data. You can now see that the clusters do not overlap too much, so it should be possible to create a classifier which can distinguish the three kinds of wine based on a chemical analysis. When you look at the loadings/loadings plot you see that all of the original variables have comparable importance for the description of the first two principal components (no loading vector has a near-zero element for both PCs).

Next, let's have a quick look at the influence of the scaling of the data on the results of the PCA. To do so, first replace the loadings/loadings plot by the loading vector of a single PC. Click Math/PC Load-Idx, select the window with the loadings/loadings plot and select the first PC for the plot. You will now see the loading vector with the elements of the vector drawn as lines. The first PC consists of a combination of at least 12 variables (12 non-zero vector elements). Note that you can switch between the principal components by simply clicking the window (the right mouse button increases the index of the PC, the left mouse button decreases it).

Next you should recalculate the principal components using a different scaling option. Select the command Math/PCA/m=0.0 (mean centering only) and you will immediately see that in this case the first PC is made up of only one variable. The same is true for the second PC. The reason is that the variance of these two variables far exceeds the variance of the other variables (use the numeric editor command Edit/Data/Numerical to inspect the data matrix). Since we have chosen mean-centered data as a basis for the PCA (covariance matrix), the variances of the data have not been scaled and therefore have a major influence on the PCA results. You may wonder what the results are if no scaling at all is applied before the PCA. Well, try it!
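
The effect of the scaling can be reproduced with a small Python sketch. It uses only two invented variables, so that the first eigenvector of the 2x2 covariance matrix can be obtained from a closed-form angle formula; this is a simplification for illustration and says nothing about how DataLab computes the PCA internally. All names and numbers are assumptions.

```python
import math


def covariance_2x2(x, y):
    """Sample variances and covariance of two variables (means are removed)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x) / (n - 1)
    syy = sum((b - my) ** 2 for b in y) / (n - 1)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    return sxx, sxy, syy


def first_pc_loadings(x, y):
    """Loadings of the first PC: unit eigenvector of the largest eigenvalue."""
    sxx, sxy, syy = covariance_2x2(x, y)
    # Principal-axis angle of a symmetric 2x2 matrix
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return math.cos(theta), math.sin(theta)


def autoscale(values):
    """Scale a variable to mean 0 and standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / sd for v in values]


# Hypothetical data: x varies by hundreds, y only by fractions of a unit
x = [1000.0, 1200.0, 900.0, 1500.0, 1100.0]
y = [2.0, 2.5, 1.5, 3.5, 2.0]

centered_only = first_pc_loadings(x, y)                    # mean centering only
autoscaled = first_pc_loadings(autoscale(x), autoscale(y)) # mean 0, std. dev. 1

print("mean-centered:", centered_only)
print("autoscaled:   ", autoscaled)
```

With mean centering only, the first PC is almost identical to the high-variance variable x (the loading of y is close to zero). After autoscaling, both variables contribute equally to the first PC.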

As you might already know, PCA creates an orthogonal coordinate system with linearly independent axes. It is therefore sometimes advantageous to replace the original data by their principal component scores (which means a rotation of the original data set) and to continue working with these scores. You may, for example, perform a cluster analysis based on the PCs, or look at the first three PCs by 3D rotation. Using PCs can relieve you of selecting the 'best' variables for your problem, since the PCs are ordered according to the variance they exhibit. But you should be aware that PCs are not necessarily the best way to deal with multivariate data.

To conclude this section, let's take a simultaneous look at the first three PCs using the 3D-rotation plot. To do so, recalculate the PCs with standardized data (command Math/PCA/m=0.0 s=1.0) and replace the original data by their principal components (click Yes when you are asked whether to replace the original data). Now you can start the 3D rotation to look at the first three principal components (command Chart/3D-rotation/Auto Rot). The three clusters should be even more clearly visible than in the scores/scores plot alone.

Hint: Please keep in mind that the evaluation copy of DataLab does not allow loading data sets with more than 500 elements. Bigger data sets are also available as partial data sets, which are indicated by the string "_500" in the filename; e.g. "wine.idt" (complete data set) and "wine_500.idt" (partial data set).


Last Update: 2012-Jul-25