DataLab is a compact statistics package aimed at exploratory data analysis. Please visit the DataLab Web site for more information.
See also: Class Attributes
Guided Tour: Class Information
Exploratory data analysis is often guided by additional categorical information on the data. This additional information can be handled by assigning class numbers to the data samples. DataLab allows up to 127 different categories to be assigned to the data. This class information can be visualized either by using different colors or different symbols.
A short example will show how to use class information during data interpretation. First, let's load another data set into DataLab: BOILPTS_NC.IDT, which consists of 55 objects and 9 variables. These data describe the normal boiling points of 55 chemical compounds together with structural descriptors of these compounds. Some descriptors are simple, such as the number of oxygen or sulfur atoms; others are more sophisticated, such as topological descriptors derived from graph-theoretical considerations of the molecular structures. Now let's have a look at the data and investigate whether there are any relationships between the structural parameters and the boiling points of these substances.
After loading the file BOILPTS_NC.IDT (command File/Load/IDT-Format or the corresponding shortcut button), DataLab displays two diagram windows showing the boiling point as a function of the variables 'Randic-Ix' and 'C-Atoms'.
The plot of the boiling point vs. the number of carbon atoms shows that there is some relationship between them (the more carbon atoms, the higher the boiling point), although the correlation is not very strong. Another interesting relationship, between the boiling point and the Randic index (variable 'Randic-Ix'), becomes evident in the other plot. Here you can see three bands, each of which indicates a strong correlation between the boiling points and the Randic index. One question arises immediately: what property is responsible for the three bands? To find out, you may start by marking one of these bands and then look at all the other variables to see whether the markings give any hints as to the origin of the bands.
Let's mark, for example, the middle band. We then use the other window to browse through all the variables and look for any apparent dependencies. When doing this, you will certainly be struck by the fact that nearly all of the marked objects of the middle band have exactly one sulfur atom (all corresponding markings appear in the 1-sulfur-atom region of the plot of sulfur atoms vs. boiling point).
Thus, we arrive at the hypothesis that the bands in the plot of boiling point vs. Randic index are caused by the number of sulfur atoms. To verify this, let's now use the concept of class information. We copy the number of sulfur atoms to the class information vector by using the command Edit/Data/Copy/to Class Information and selecting the variable entitled 'S-Atoms'. The class information is now taken from the number of sulfur atoms, meaning that all substances with no sulfur atoms belong to class 0, compounds with one sulfur atom belong to class 1, and so on. The only thing left to do is to activate the color coding of the data in the plot of Randic index vs. boiling point. To do so, open the setup dialog of this window and select 'Class Colors' in the 'Attributes' field. Now the classes are displayed as colored data points, and the plot supports our hypothesis: each of the three bands corresponds to a particular number of sulfur atoms.
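The effect of copying a variable to the class information can also be sketched outside DataLab. The following Python snippet is only an illustration and not DataLab's implementation: the function names `copy_to_class_info` and `group_by_class` are hypothetical, the sulfur-atom counts are made up rather than taken from BOILPTS_NC.IDT, and reading the 127-category limit as a cap on the number of distinct class numbers is an assumption.

```python
from collections import defaultdict

# Assumption: the limit of 127 categories is treated as a cap on distinct classes.
MAX_CLASSES = 127


def copy_to_class_info(values):
    """Use an integer-valued variable as the class number of each sample."""
    classes = [int(v) for v in values]
    if len(set(classes)) > MAX_CLASSES:
        raise ValueError("more than %d distinct classes" % MAX_CLASSES)
    return classes


def group_by_class(classes):
    """Map each class number to the indices of the samples that carry it."""
    groups = defaultdict(list)
    for index, cls in enumerate(classes):
        groups[cls].append(index)
    return dict(groups)


# Made-up sulfur-atom counts for six hypothetical compounds.
s_atoms = [0, 1, 0, 2, 1, 0]
classes = copy_to_class_info(s_atoms)
groups = group_by_class(classes)
print(groups)  # {0: [0, 2, 5], 1: [1, 4], 2: [3]}
```

Each group of indices could then be drawn in its own color or symbol, which is what the 'Class Colors' attribute does for the plot in DataLab.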
And indeed, a far better model for the boiling points of these substances can be obtained by combining the number of sulfur atoms and the Randic index.
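One possible way to combine a class variable with a continuous descriptor is to fit a separate straight line for each class instead of a single line for all samples. The tour does not say which model DataLab users should build, so the sketch below is an assumption chosen for simplicity. The data are synthetic (three exactly parallel bands), not the boiling-point data, and the helper names `fit_line` and `sum_sq_resid` are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def sum_sq_resid(xs, ys, a, b):
    """Sum of squared residuals of the line y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Synthetic data: three parallel bands, response = 10*x + 40*class.
x_index = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
classes = [0, 0, 0, 1, 1, 1, 2, 2, 2]
response = [10 * x + 40 * c for x, c in zip(x_index, classes)]

# Model 1: one line for all samples, ignoring the class information.
pooled = sum_sq_resid(x_index, response, *fit_line(x_index, response))

# Model 2: one line per class.
per_class = 0.0
for c in sorted(set(classes)):
    xs = [x for x, k in zip(x_index, classes) if k == c]
    ys = [y for y, k in zip(response, classes) if k == c]
    per_class += sum_sq_resid(xs, ys, *fit_line(xs, ys))

print(pooled, per_class)
```

On this synthetic example the per-class fit leaves essentially no residual, whereas the pooled fit cannot account for the offset between the bands. Real data would of course show scatter within each band.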
Some remarks still need to be made on the handling of class information: (1) You can indicate class information not only by different colors but also by different symbols. This may be important when creating black-and-white hardcopies (use the plot setup command to switch to character-based coding of the class information). (2) You can assign any color or symbol to any of your classes by using the commands Setup/Class Assignment/Colors or Setup/Class Assignment/Symbols. (3) Class numbers can be edited in a batch by using the command Edit/Classes.