


AgglomClustering


Unit: SDL_math2
Class: none
Declaration: function AgglomClustering (Sender: TObject; InMat: TMatrix; DistanceMeasure: TDistMode; ClusterMethod: TClusterMethod; alpha: double; var ClustResult: TIntMatrix; var ClustDist: TVector; var DendroCoords: TVector; Feedback: TFeedbackProc; OnDistCalc: TOnCalcDistanceEvent): integer;

The function AgglomClustering performs an agglomerative hierarchical cluster analysis on the data contained in the matrix InMat. Each data object is represented by one row of the matrix; the columns represent the variables. The parameter DistanceMeasure specifies the type of distance measure used.

The parameter ClusterMethod specifies the clustering method used. If cmFlexLink is used as the clustering method, the parameter alpha must also be specified. Alpha may take any value between 0.5 and 1.0. A value of 0.5 corresponds to average linkage clustering (cmAvgLink). Higher values increase the divisive effects of the clustering process. Usually a value between 0.6 and 0.7 is preferred.
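
The effect of alpha can be illustrated with the Lance-Williams update rule for flexible linkage. The source does not state which formula cmFlexLink uses, so the rule below is an assumption, chosen because it reproduces the behaviour described above. When clusters i and j are joined, the distance from any other cluster k to the new cluster would be

$$
d(k,\, i \cup j) \;=\; \alpha\, d(k,i) \;+\; \alpha\, d(k,j) \;+\; (1 - 2\alpha)\, d(i,j)
$$

Under this assumption, alpha = 0.5 makes the last term vanish, so the new distance is the mean of d(k,i) and d(k,j), which is an average linkage. For alpha > 0.5 the coefficient (1 - 2*alpha) becomes negative while the weights of d(k,i) and d(k,j) grow, so the distances to newly formed clusters are stretched and the groups separate more strongly.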

The result of the clustering process is returned in the parameters ClustResult, ClustDist, and DendroCoords. The integer matrix ClustResult contains the clustering information, describing which clusters (or objects) are joined to form a new cluster. It consists of InMat.NrOfRows-1 rows and three columns: the numbers of the two joined clusters and the number of the newly formed cluster. The rows are ordered by increasing cluster distance, which is stored in the parameter ClustDist.

The parameter Sender contains the object which called AgglomClustering; it is used by the callback routine specified by the parameter Feedback. For simple applications (and small data sets) both Sender and Feedback may be set to NIL.

The parameter OnDistCalc can be used to pass an event routine to the subroutine Matrix.CalcDist, which is called internally to calculate the object distances. The OnDistCalc event is triggered only if the parameter DistanceMeasure is set to dmUserDef.

An example should clarify the situation. The results of the cluster analysis shown below have been obtained from a set of 20 observations (objects) with four variables by applying Ward's algorithm (ClusterMethod = cmWard) to it.

   ------------- ClustResult --------------        ClustDist 
   number of       number of       number of 
   cluster 1       cluster 2       new cluster     distance 
   ----------------------------------------------------------- 
   2               19              21              5.0945 
   1               16              22              5.3573 
   3               6               23              7.2815 
   9               10              24              10.2774 
   8               14              25              10.6847 
   12              18              26              13.0239 
   4               25              27              13.5628 
   24              15              28              16.0441 
   5               13              29              16.5704 
   7               17              30              19.2583 
   23              27              31              24.1079 
   11              29              32              24.2236 
   26              20              33              24.6635 
   22              21              34              26.9456 
   31              34              35              39.2175 
   32              28              36              52.7880 
   36              30              37              90.4147 
   35              33              38              109.4378 
   37              38              39              315.1660 
   ----------------------------------------------------------- 

The table above is to be interpreted as follows: clusters (objects) 2 and 19 are joined to form the new cluster 21; the distance between the two original clusters is 5.09. Next, clusters 1 and 16 are joined to form cluster 22 at a distance of 5.36, and so on. Note that cluster numbers less than or equal to InMat.NrOfRows designate the original objects, whereas higher numbers designate clusters built up of other objects and/or clusters. The results of a cluster analysis are normally displayed as a dendrogram:

[Figure: dendrogram (DENDRO.gif)]
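
The merge table above can also be traversed programmatically, for example to find out which original objects belong to a given cluster. The following console program is a minimal sketch of that idea. It does not use the SDL types: a plain constant array (named Merge here, a hypothetical name) stands in for the first two columns of ClustResult, and it relies on the numbering visible in the table, where the k-th row creates cluster number NrOfRows+k.

```pascal
program ClusterMembers;
{$APPTYPE CONSOLE}

const
  NObj   = 20;           // number of original objects (rows of InMat)
  NMerge = NObj - 1;     // number of rows in the merge table

  // First two columns of the ClustResult table shown above, one row per
  // merge. A plain array replaces TIntMatrix (assumption for illustration).
  Merge: array[1..NMerge, 1..2] of integer = (
    (2, 19), (1, 16), (3, 6), (9, 10), (8, 14), (12, 18), (4, 25),
    (24, 15), (5, 13), (7, 17), (23, 27), (11, 29), (26, 20), (22, 21),
    (31, 34), (32, 28), (36, 30), (35, 33), (37, 38));

// Writes the original objects contained in the object or cluster Id.
procedure ListMembers(Id: integer);
begin
  if Id <= NObj then
    Write(Id, ' ')                        // Id is an original object
  else
  begin
    // cluster NObj+k was created by the k-th merge: expand both parts
    ListMembers(Merge[Id - NObj, 1]);
    ListMembers(Merge[Id - NObj, 2]);
  end;
end;

begin
  Write('Members of cluster 35: ');
  ListMembers(35);
  WriteLn;
end.
```

For the table above, cluster 35 expands to the objects 3, 6, 4, 8, 14, 1, 16, 2 and 19. The same recursion, started at every cluster that still exists below a chosen distance threshold, would yield a flat partition of the data.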

To facilitate drawing a dendrogram, the parameter DendroCoords (a vector of 2*InMat.NrOfRows-1 elements) contains the coordinates of the lines of the corresponding dendrogram. The first InMat.NrOfRows coordinates are those of the objects; the rest refer to the clusters as numbered in the matrix ClustResult (see the example program CLUSTER for details on how to use the array DendroCoords).
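
As a rough illustration of how these three results could be combined into line segments, consider the sketch below. It is not taken from the CLUSTER example program and makes several assumptions: each entry of DendroCoords is taken to be the position of an object or cluster along the object axis, the height of a junction is the corresponding value of ClustDist, and the arrays Merge, Dist and Coord are hypothetical stand-ins filled with invented values for four objects.

```pascal
program DendroSegments;
{$APPTYPE CONSOLE}

const
  NObj   = 4;            // hypothetical number of objects
  NMerge = NObj - 1;     // rows of the merge table

  // Hypothetical stand-ins for ClustResult (columns 1 and 2),
  // ClustDist and DendroCoords; all indices are 1-based here.
  Merge: array[1..NMerge, 1..2] of integer = ((1, 2), (3, 4), (5, 6));
  Dist:  array[1..NMerge] of double = (1.0, 2.0, 4.0);
  Coord: array[1..2 * NObj - 1] of double =
    (1.0, 2.0, 3.0, 4.0, 1.5, 3.5, 2.5);

// Height at which the branch of object or cluster Id starts:
// 0 for an original object, the merge distance for a cluster.
function BaseHeight(Id: integer): double;
begin
  if Id <= NObj then
    BaseHeight := 0.0
  else
    BaseHeight := Dist[Id - NObj];
end;

// Prints one line segment; a real program would draw it instead.
procedure Segment(x1, y1, x2, y2: double);
begin
  WriteLn('  (', x1:0:2, ', ', y1:0:2, ') -> (', x2:0:2, ', ', y2:0:2, ')');
end;

var
  k, a, b: integer;
begin
  for k := 1 to NMerge do
  begin
    a := Merge[k, 1];
    b := Merge[k, 2];
    WriteLn('merge ', k, ' creates cluster ', NObj + k, ':');
    Segment(Coord[a], BaseHeight(a), Coord[a], Dist[k]);  // stem of first part
    Segment(Coord[b], BaseHeight(b), Coord[b], Dist[k]);  // stem of second part
    Segment(Coord[a], Dist[k], Coord[b], Dist[k]);        // bar joining both
  end;
end.
```

Each merge contributes two stems and one joining bar. With the real results, the three arrays would be read from ClustResult, ClustDist and DendroCoords, whose exact indexing conventions should be checked against the CLUSTER example program.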



Last Update: 2008-Oct-29