DataLab is a compact statistics package aimed at exploratory data analysis.

## Guided Tour: Multiple Linear Regression

To start with multiple linear regression (MLR), let's first load the data file BOILPTS.IDT. You may remember from a previous section that this file contains 185 objects with 13 variables each. The data describe the normal boiling points of 185 chemical compounds and some structural descriptors of these compounds. Let's find out whether the boiling points can be estimated from the structural descriptors by using MLR.

For a first trial, select the command Math/Multiple Linear Regression/Calculate Model... (also available as a toolbar button). The window that appears provides several command buttons. One of them, Calculate, is only enabled once you have selected both the independent variables and the target variable.

Let's make a first attempt using variables 4, 6, and 8 (nHetAt, toporad, and n-branch) as input variables and variable 13 (the boiling point) as the target variable. First click in the list of descriptors and select the desired variables. Next, click the "Dependent Variable" field and select the boiling point as the target variable. To calculate the regression, press the "Calculate" button.
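
For readers who want to see what such a calculation involves, here is a minimal sketch of an ordinary least-squares fit with three input variables and one target. It is not DataLab's implementation. The helper names `solve` and `fit_mlr` are hypothetical, and the data are made up for illustration (they are not taken from BOILPTS.IDT).

```python
def solve(a, b):
    """Solve the square system a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_mlr(rows, y):
    """Return [intercept, b1, ..., bk] minimising the sum of squared residuals."""
    design = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    k = len(design[0])
    # Normal equations: (X^T X) b = X^T y
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)


# Made-up data (NOT from BOILPTS.IDT): y = 10 + 2*x1 - 3*x2 + 0.5*x3 exactly.
rows = [(1, 0, 2), (2, 1, 0), (3, 1, 4), (4, 3, 1), (5, 2, 6), (6, 5, 3)]
y = [10 + 2 * a - 3 * b + 0.5 * c for a, b, c in rows]

coeffs = fit_mlr(rows, y)
estimated = [coeffs[0] + sum(b * v for b, v in zip(coeffs[1:], r)) for r in rows]
print(coeffs)
```

Because the made-up target is an exact linear function of the inputs, the fit recovers the generating coefficients up to rounding. With real data the estimated values would scatter around the actual ones.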

The results are displayed in three switchable windows:

• actual vs. estimated values
• distribution of the residuals
• residuals

More details on the calculated MLR model can be found in the protocol, which is opened by its own button. In the ideal case, the data points in the diagram "actual vs. estimated values" lie close to the inserted straight line. In our example, however, the result is far from ideal, as can be seen from the screenshot below. Try another combination of variables! Do you get better results?

You might wonder how to find the best combination of variables, since the number of possible combinations is quite large in our example. In general there are 2^p - 1 combinations for p independent variables, which gives 4095 combinations for the p = 12 descriptors in our case. In principle, there are several ways of selecting a more or less adequate combination of variables: stepwise regression, backward elimination, forward selection, or simply trying all possible combinations. DataLab provides all of these methods; use the command Math/Multiple Linear Regression/Variable Selection or the corresponding toolbar button to start the variable selection process.
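
The count of 4095 can be checked with a few lines of Python. The sketch below assumes p = 12, i.e. the 13 variables of the data set minus the target variable, and counts the non-empty subsets in three equivalent ways.

```python
from itertools import combinations
from math import comb

p = 12  # independent variables: 13 variables minus the target

# Closed form: each variable is either in or out, minus the empty set.
closed_form = 2**p - 1

# The same number, summed over subset sizes k = 1..p.
by_size = sum(comb(p, k) for k in range(1, p + 1))

# Explicit enumeration of the subsets themselves (still feasible for p = 12).
subsets = [c for k in range(1, p + 1) for c in combinations(range(1, p + 1), k)]

print(closed_form, by_size, len(subsets))  # prints: 4095 4095 4095
```

The count doubles with every additional variable, which is why exhaustive search quickly becomes impractical and selection heuristics are used instead.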

Now start the forward selection. First specify the target variable by ticking variable 13 (boil. point) in the third column. After you click the start button, a list of sub-models is displayed. The best model is indicated by a black bar. This model uses the variables 10, 2, 8, 12, and 5 as independent variables.
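
The idea behind forward selection can be sketched as a greedy loop: start with an intercept-only model and repeatedly add the variable that reduces the residual sum of squares (RSS) the most. The text does not state which criterion or stopping rule DataLab uses, so the RSS criterion, the `max_vars` limit, and the function name `forward_selection` below are assumptions. The data are made up.

```python
def forward_selection(columns, y, max_vars):
    """Greedy forward selection using the RSS reduction as criterion (an assumption).

    columns maps a variable label to its list of values.
    Returns a list of (label, rss_after_adding) in order of selection.
    """
    def centred(v):
        m = sum(v) / len(v)
        return [x - m for x in v]

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    resid = centred(y)  # residuals of the intercept-only model
    pool = {k: centred(v) for k, v in columns.items()}
    rss = dot(resid, resid)
    path = []
    while pool and len(path) < max_vars:
        # Candidates are kept orthogonal to the chosen columns, so the RSS
        # reduction from adding x is (x . resid)^2 / (x . x).
        gain = {k: dot(x, resid) ** 2 / dot(x, x)
                for k, x in pool.items() if dot(x, x) > 1e-12}
        if not gain:
            break
        best = max(gain, key=gain.get)
        x = pool.pop(best)
        xx = dot(x, x)
        coef = dot(x, resid) / xx
        resid = [r - coef * a for r, a in zip(resid, x)]
        rss -= gain[best]
        # Orthogonalise the remaining candidates against the chosen column.
        pool = {k: [b - (dot(v, x) / xx) * a for b, a in zip(v, x)]
                for k, v in pool.items()}
        path.append((best, rss))
    return path


# Made-up data: y depends exactly on x1 and x3; x2 is unrelated.
columns = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8],
    "x2": [3, 1, 4, 1, 5, 9, 2, 6],
    "x3": [1, 0, 1, 0, 1, 0, 1, 0],
}
y = [a + 5 * c for a, c in zip(columns["x1"], columns["x3"])]

path = forward_selection(columns, y, max_vars=2)
print(path)
```

In this toy case the loop picks x3 first and x1 second, after which the RSS drops to zero. A greedy search of this kind evaluates far fewer models than trying all combinations, but it is not guaranteed to find the best subset.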

Now click the button that copies the selected variables into the MLR window, and start the regression calculation once again. The new model delivers much improved results, with a standard deviation of the residuals of 7.45 °C and a coefficient of determination of 0.9767.
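
The two figures of merit quoted above can be computed from the actual and estimated values as sketched below. The text does not say which divisor DataLab uses for the standard deviation of the residuals; this sketch assumes n - k - 1 degrees of freedom for k independent variables. The function name `residual_stats` is hypothetical, and the numbers are made up.

```python
from math import sqrt


def residual_stats(actual, estimated, n_predictors):
    """Return (standard deviation of residuals, coefficient of determination)."""
    residuals = [a - e for a, e in zip(actual, estimated)]
    n = len(actual)
    mean_y = sum(actual) / n
    ss_res = sum(r * r for r in residuals)            # residual sum of squares
    ss_tot = sum((a - mean_y) ** 2 for a in actual)   # total sum of squares
    # Assumed convention: n - k - 1 degrees of freedom (k predictors + intercept).
    sd = sqrt(ss_res / (n - n_predictors - 1))
    r2 = 1.0 - ss_res / ss_tot
    return sd, r2


# Made-up values for illustration only.
actual = [1.0, 2.0, 3.0, 4.0, 5.0]
estimated = [1.1, 1.9, 3.2, 3.8, 5.0]

sd, r2 = residual_stats(actual, estimated, n_predictors=1)
print(round(sd, 4), round(r2, 4))
```

A coefficient of determination close to 1 together with a small residual standard deviation indicates that the estimated values track the actual ones closely.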

Last Update: 2012-Jul-25