Seedline Clustering of Genetic Similarity Matrix
This function creates a genetic similarity matrix and a two-dimensional plot of the genotypes, then performs clustering on the plot.
- The data sets are stored on disk; below, you can choose among several.
- The molecular markers used are RFLPs, AFLPs, microsatellites (SSRs), and RAPDs.
- A brief explanation of seedline clustering is found below. Methods for genetic similarity and ordination (conversion to two-dimensional plots) are described here.
- The progress of the clustering process is shown by lines drawn on the 2-D images between the genotypes. Each plot is numbered, with increasing numbers of clustering lines shown.
- As a default, approximately 50 plots are drawn, starting at 0 and going through about 25% of the total number of plot lines that can be drawn. In practice this gives a good visualization of clusters for most data sets.
- The actual clustering process is separate from the plotting: results are saved as an internal file. All of the data sets have had default clustering done on them, so you need to do further clustering only for special purposes.
- Clustering takes time: it is an O(N²M) process, where N is the number of genotypes and M is the number of clustering repetitions. Empirically, 300 reps (the default) usually gives stable results.
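The default plotting schedule described above can be illustrated with a little arithmetic. The following is only a sketch: the function name `plot_schedule` and its parameters are hypothetical, and it assumes that the "total number of plot lines that can be drawn" is one line per pair of genotypes, N(N-1)/2, which the text does not state explicitly.

```python
import numpy as np

def plot_schedule(n_genotypes, n_plots=50, max_fraction=0.25):
    """Return the number of clustering lines shown on each successive plot."""
    # Assumption: one possible line per pair of genotypes.
    total_lines = n_genotypes * (n_genotypes - 1) // 2
    last = int(max_fraction * total_lines)
    # Evenly spaced line counts from 0 up to about 25% of all lines.
    counts = np.linspace(0, last, n_plots).round().astype(int)
    # Small data sets may yield repeated counts; keep each count once.
    return sorted(set(counts.tolist()))

schedule = plot_schedule(100)
print(len(schedule), schedule[0], schedule[-1])
```

For small data sets the schedule contains fewer than 50 plots, because there are not enough distinct line counts to fill it.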
Show Clustering Results
Perform Additional Clusterings
Brief explanation of seedline clustering
- The most commonly used clustering method, UPGMA, starts by combining the two closest (most similar) genotypes into a cluster, then calculating the distance from this group to all other genotypes (and groups) as the average distance from each of the group's members. Then, the next closest genotypes are combined, and the process is repeated until all genotypes have been joined into a tree. This is a hierarchical, agglomerative clustering process. UPGMA clustering produces a tree diagram, showing the order in which genotypes join various clusters and the average similarity between members of the cluster.
- Another useful approach to clustering is ordination, producing a two (or three) dimensional plot in which the positions of the genotypes reflect their approximate distances from each other. Such plots are usually only approximate, because genetic similarity among a group of genotypes does not generally fit into two-dimensional Euclidean space. Also, it is difficult to recognize groups on most such plots: the spaces between groups are usually not clearly larger than the spaces within the groups. However, two-dimensional plots are much easier to comprehend than tree diagrams.
- Seedline clustering tries to combine a formal clustering algorithm with a two-dimensional plot, to allow the clustering process to be readily visualized. Seedline clustering is a method to find areas of high density in the genetic similarity matrix and highlight these areas on the plot diagram.
- The seedline clustering algorithm is a generalization of UPGMA clustering. As in UPGMA, groups are formed by the combination of individual genotypes and/or pre-existing groups. The distance between a group and another group or genotype is the average distance between all the members of the two groups.
- Seedline clustering begins with picking a random collection of genotypes to act as "seeds" for clusters. It is more efficient to use seeds that are relatively distant from each other, and in its first trial the algorithm attempts to choose a set of seeds that are separated by at least the average distance between all genotypes. However, this constraint is automatically relaxed if necessary. The number of seeds is arbitrary: a number equal to the square root of the number of genotypes being clustered works well; this is a "stringency" of 1.0. Increasing the stringency (increasing the number of seed lines) increases the sharpness of the clusters, at the expense of leaving more lines unclustered.
- The algorithm proceeds through a series of cycles in which the unclustered genotype nearest to a seed line is added to that seed line's group. Only individual genotypes are added to clusters; clusters are never combined. Eventually, all genotypes have been added to one of the clusters. The number of clusters is equal to the number of seed lines. Once all genotypes have been clustered, a record is made of which genotypes fell into the same cluster.
- The clustering cycle is then repeated, using a new set of seeds and a new bootstrapped version of the distance matrix. Between 100 and 1000 repetitions of the clustering give stable results. After every repetition, the record of which genotypes fell into the same groups is updated.
- The result of these repeated clustering cycles is the clustering list, a list of the number of times each pair of genotypes fell into the same cluster. Very similar genotypes usually cluster together, and very dissimilar genotypes rarely or never cluster together.
- The sorted clustering list can be used in conjunction with the plot diagram. Lines between genotypes that cluster together are drawn on the plot, in order of how frequently the genotypes clustered together. This process produces a series of plots, with lines drawn between the genotypes. By examining these plots in order, the development of clusters can be seen. At some point, after between 10% and 20% of the plot lines have been drawn, most genotypes appear to be part of a cluster that is not connected with any other cluster. It is quite easy at this point to group the lines in each cluster. These groups are analogous to groups seen in tree plots. In maize, the groups seen with seedline clustering correspond well to heterotic groups identified by maize breeders.
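The steps described above can be sketched in Python with NumPy. This is an illustrative reading of the description, not the program's actual code. All function names are hypothetical. The sketch also makes several assumptions that the text does not confirm: markers are scored as a 0/1 matrix (genotypes in rows), distance is the fraction of mismatched markers, the bootstrap resamples markers with replacement, and the seed-separation constraint is relaxed in ten equal steps.

```python
import numpy as np

def mismatch_distance(markers):
    """Assumed metric: fraction of markers at which two genotypes differ."""
    m = markers.astype(float)
    agree = m @ m.T + (1 - m) @ (1 - m).T
    return 1.0 - agree / m.shape[1]

def pick_seeds(dist, n_seeds, rng):
    """Pick random seeds, preferring ones at least the mean distance apart."""
    n = dist.shape[0]
    mean_dist = dist[np.triu_indices(n, k=1)].mean()
    # Relax the separation threshold step by step; at 0.0 any seeds qualify.
    for threshold in np.linspace(mean_dist, 0.0, 11):
        seeds = []
        for g in rng.permutation(n):
            if all(dist[g, s] >= threshold for s in seeds):
                seeds.append(int(g))
                if len(seeds) == n_seeds:
                    return seeds
    raise ValueError("n_seeds must not exceed the number of genotypes")

def one_repetition(dist, n_seeds, rng):
    """Grow one cluster per seed by adding single genotypes; never merge."""
    n = dist.shape[0]
    seeds = pick_seeds(dist, n_seeds, rng)
    labels = np.full(n, -1)
    labels[seeds] = np.arange(n_seeds)
    sums = dist[seeds, :].copy()   # sums[c, g]: total distance, cluster c to g
    sizes = np.ones(n_seeds)
    unassigned = labels < 0
    while unassigned.any():
        avg = sums / sizes[:, None]          # average distance, as in UPGMA
        avg[:, ~unassigned] = np.inf         # ignore clustered genotypes
        c, g = np.unravel_index(np.argmin(avg), avg.shape)
        labels[g] = c
        sums[c] += dist[g]
        sizes[c] += 1
        unassigned[g] = False
    return labels

def seedline_clustering(markers, reps=300, stringency=1.0, seed=0):
    """Count how often each pair of genotypes falls into the same cluster."""
    rng = np.random.default_rng(seed)
    n, n_markers = markers.shape
    n_seeds = max(1, min(n, int(round(stringency * np.sqrt(n)))))
    together = np.zeros((n, n), dtype=int)
    for _ in range(reps):
        cols = rng.integers(0, n_markers, size=n_markers)  # bootstrap markers
        dist = mismatch_distance(markers[:, cols])
        labels = one_repetition(dist, n_seeds, rng)
        together += labels[:, None] == labels[None, :]
    return together

def sorted_pairs(together):
    """Clustering list: (i, j, count) sorted by decreasing co-clustering count."""
    i, j = np.triu_indices(together.shape[0], k=1)
    order = np.argsort(-together[i, j], kind="stable")
    return [(int(i[k]), int(j[k]), int(together[i[k], j[k]])) for k in order]
```

Drawing the first 10% to 20% of the pairs returned by `sorted_pairs` as lines on the two-dimensional plot would correspond to the stage at which clusters become visible. A random 0/1 matrix such as `np.random.default_rng(1).integers(0, 2, size=(12, 40))` is enough to try the functions, although random data should not be expected to show meaningful groups.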