evalclusters
eva = evalclusters (x, clust, criterion, Name, Value)
Create a clustering evaluation object to find the optimal number of clusters.
evalclusters creates a clustering evaluation object to evaluate the
optimal number of clusters for data x, using criterion criterion.
The input data x is a matrix with n observations of p
variables.
The evaluation criterion criterion is one of the following:
CalinskiHarabasz to create a CalinskiHarabaszEvaluation object.
DaviesBouldin to create a DaviesBouldinEvaluation object.
gap to create a GapEvaluation object.
silhouette to create a SilhouetteEvaluation object.
The clustering algorithm clust is one of the following:
kmeans to cluster the data using kmeans with EmptyAction set to
singleton and Replicates set to 5.
linkage to cluster the data using clusterdata with linkage set to
Ward.
gmdistribution to cluster the data using fitgmdist with SharedCov set to
true and Replicates set to 5.
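A minimal sketch of a call that uses one of the built-in algorithm names might look as follows. The synthetic data is an assumption made for illustration, and the OptimalK property is assumed to hold the selected number of clusters; KList is the option described further down.

```octave
pkg load statistics   # assumes the Octave statistics package is installed

## Synthetic data (illustration only): three well-separated blobs in 2-D.
x = [randn(30, 2); randn(30, 2) + 8; randn(30, 2) - 8];

## Evaluate 2 to 6 clusters with kmeans and the CalinskiHarabasz criterion.
eva = evalclusters (x, "kmeans", "CalinskiHarabasz", "KList", 2:6);

## OptimalK is assumed to be the property holding the selected number.
disp (eva.OptimalK)
```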
If the criterion is CalinskiHarabasz, DaviesBouldin, or
silhouette, clust can also be a function handle to a function
of the form c = clust(x, k), where x is the input data,
k the number of clusters to evaluate and c the clustering result.
The clustering result can be either a vector of length n with k
different integer values, or a matrix of size n by k with a
likelihood value assigned to each one of the n observations for each
one of the k clusters. In the latter case, each observation is assigned
to the cluster with the highest value.
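As a hedged illustration of the function handle form, the sketch below wraps kmeans in a handle of the form c = clust (x, k) and passes it to evalclusters. The second part shows, with made-up likelihood values, how a matrix result would map to one cluster per observation by taking the row-wise maximum; this is an illustration of the rule described above, not the package's internal code.

```octave
pkg load statistics   # assumes the Octave statistics package is installed

## Synthetic data (illustration only): two blobs in 2-D.
x = [randn(20, 2); randn(20, 2) + 6];

## A handle of the form c = clust (x, k) returning one label per observation.
clust = @(x, k) kmeans (x, k);
eva = evalclusters (x, clust, "silhouette", "KList", 2:4);

## Hypothetical likelihood matrix for n = 4 observations and k = 3 clusters.
P = [0.7 0.2 0.1;
     0.1 0.3 0.6;
     0.2 0.5 0.3;
     0.8 0.1 0.1];

## Each observation goes to the column (cluster) with the highest value.
[~, c] = max (P, [], 2);
disp (c')   # 1 3 2 1
```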
If the criterion is CalinskiHarabasz, DaviesBouldin, or
silhouette, clust can also be a matrix of size n by
k, where k is the number of proposed clustering solutions, so
that each column of clust is a clustering solution.
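A matrix of proposed solutions could be built as sketched below, with one column per solution. The data and the choice of kmeans to produce the columns are assumptions for illustration; any method that yields one integer label per observation should fit the description above.

```octave
pkg load statistics   # assumes the Octave statistics package is installed

## Synthetic data (illustration only): two blobs in 2-D.
x = [randn(20, 2); randn(20, 2) + 6];

## One column per proposed solution, here for 2, 3 and 4 clusters.
solutions = zeros (rows (x), 3);
for j = 1:3
  solutions(:, j) = kmeans (x, j + 1);
endfor

## No KList is given because clust is a matrix of solutions.
eva = evalclusters (x, solutions, "DaviesBouldin");
```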
In addition to the required x, clust and criterion inputs,
there are a number of optional arguments, specified as pairs of Name
and Value options. The known Name arguments are:
KList a vector of positive integers giving the numbers of clusters to evaluate. This option is required unless clust is a matrix of proposed clustering solutions.
Distance a distance metric as accepted by the chosen clust. It can be the
name of the distance metric as a string or a function handle. When
criterion is silhouette, it can be a vector as created by
function pdist. Valid distance metric strings are: sqEuclidean
(default), Euclidean, cityblock, cosine,
correlation, Hamming, Jaccard.
Only used by silhouette and gap evaluation.
ClusterPriors the prior probabilities of each cluster, which can be either empirical
(default), or equal. When empirical the silhouette value is
the average of the silhouette values of all points; when equal the
silhouette value is the average of the average silhouette value of each
cluster. Only used by silhouette evaluation.
B the number of reference datasets generated from the reference distribution.
Only used by gap evaluation.
ReferenceDistribution the reference distribution used to create the reference data. It can be
PCA (default) for a distribution based on the principal components of
x, or uniform for a uniform distribution based on the range of
the observed data. PCA is currently not implemented.
Only used by gap evaluation.
SearchMethod the method for selecting the optimal value with a gap evaluation. It
can be either globalMaxSE (default) for selecting the smallest number
of clusters whose gap value is within the standard error of the maximum gap value, or
firstMaxSE for selecting the first number of clusters whose gap value is within
the standard error of the gap value of the following number of clusters.
Only used by gap evaluation.
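Two of the options above describe small computations that can be sketched in plain Octave. The code below is one reading of the ClusterPriors and SearchMethod descriptions, using made-up silhouette and gap values; it is not the package's implementation, and the variable names are hypothetical.

```octave
## --- ClusterPriors: two ways of averaging silhouette values ---
s   = [0.9; 0.8; 0.7; 0.1];   # hypothetical silhouette value per point
idx = [1; 1; 1; 2];           # hypothetical cluster index per point

## empirical: average over all points (about 0.625 here).
sil_empirical = mean (s);

## equal: average of the per-cluster averages (about 0.45 here).
sil_equal = mean (accumarray (idx, s, [], @mean));

## --- SearchMethod: choosing a number of clusters from gap values ---
k   = 1:5;
gap = [0.10 0.30 0.32 0.50 0.48];   # hypothetical gap values
se  = [0.02 0.03 0.04 0.03 0.03];   # hypothetical standard errors

## globalMaxSE: smallest k whose gap is within the standard error
## of the maximum gap value.
[gmax, imax] = max (gap);
k_global = k(find (gap >= gmax - se(imax), 1));

## firstMaxSE: first k whose gap is within the standard error of the
## gap value for k + 1. Returns empty if no k qualifies.
ok = gap(1:end-1) >= gap(2:end) - se(2:end);
k_first = k(find (ok, 1));

printf ("globalMaxSE: %d, firstMaxSE: %d\n", k_global, k_first);
```

With these made-up numbers the two search methods disagree (4 for globalMaxSE, 2 for firstMaxSE), which shows why the choice of SearchMethod can matter.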
Output eva is a clustering evaluation object.
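A gap evaluation that sets several of the Name and Value options at once might look like the following sketch. The data and the option values are assumptions for illustration; uniform is chosen for ReferenceDistribution because the text above states that PCA is not implemented.

```octave
pkg load statistics   # assumes the Octave statistics package is installed

## Synthetic data (illustration only): two blobs in 2-D.
x = [randn(25, 2); randn(25, 2) + 7];

## Gap evaluation over 1 to 4 clusters with 20 reference datasets.
eva = evalclusters (x, "kmeans", "gap", "KList", 1:4, "B", 20, ...
                    "ReferenceDistribution", "uniform", ...
                    "SearchMethod", "firstMaxSE");

disp (class (eva))
```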
See also: CalinskiHarabaszEvaluation, DaviesBouldinEvaluation, GapEvaluation, SilhouetteEvaluation