| Title: | Minimization of the Convex Clustering Loss Function |
|---|---|
| Description: | Implements the convex clustering through majorization-minimization (CCMM) algorithm described in Touw, Groenen, and Terada (2022) <doi:10.48550/arXiv.2211.01877> to perform minimization of the convex clustering loss function. |
| Authors: | Daniel Touw [aut, cre] (ORCID: <https://orcid.org/0000-0003-3074-5401>), Patrick Groenen [aut] (ORCID: <https://orcid.org/0000-0001-6683-8971>), Yoshikazu Terada [aut] |
| Maintainer: | Daniel Touw <[email protected]> |
| License: | GPL (>= 3) |
| Version: | 0.2.1 |
| Built: | 2026-05-13 06:11:39 UTC |
| Source: | https://github.com/djwtouw/ccmmr |
cvxclust object into an hclust objectConverts the output of convex_clustering or convex_clusterpath into a hclust object. Note that a step in the clusterpath from one value for lambda to the next may cause the number of clusters to decrease by more than one. It is a hard requirement that the clusterpath ends in a single cluster, as standard dendrogram plotting methods fail if this is not the case.
## S3 method for class 'cvxclust' as.hclust(x, ...)## S3 method for class 'cvxclust' as.hclust(x, ...)
x |
A |
... |
Unused. |
A hclust object.
# Demonstration of converting a clusterpath into a dendrogram, first generate # data set.seed(6) X = matrix(rnorm(14), ncol = 2) y = rep(1, nrow(X)) # Get sparse distances in dictionary of keys format with k = 3 W = sparse_weights(X, 3, 4.0) # Sequence for lambda lambdas = seq(0, 45, 0.02) # Compute results res = convex_clusterpath(X, W, lambdas) # Generate hclust object hcl = as.hclust(res) hcl$height = sqrt(hcl$height) # Plot clusterpath and dendrogram oldpar = par(mfrow=c(1, 2)) plot(res, y, label = c(1:7)) plot(hcl, ylab = expression(sqrt(lambda)), xlab = NA, sub = NA, main = NA, hang = -1) par(oldpar)# Demonstration of converting a clusterpath into a dendrogram, first generate # data set.seed(6) X = matrix(rnorm(14), ncol = 2) y = rep(1, nrow(X)) # Get sparse distances in dictionary of keys format with k = 3 W = sparse_weights(X, 3, 4.0) # Sequence for lambda lambdas = seq(0, 45, 0.02) # Compute results res = convex_clusterpath(X, W, lambdas) # Generate hclust object hcl = as.hclust(res) hcl$height = sqrt(hcl$height) # Plot clusterpath and dendrogram oldpar = par(mfrow=c(1, 2)) plot(res, y, label = c(1:7)) plot(hcl, ylab = expression(sqrt(lambda)), xlab = NA, sub = NA, main = NA, hang = -1) par(oldpar)
Get a particular clustering of the data. If there is a
clustering for n_clusters, it is returned, otherwise the function will
stop with a message. To know whether a query is going to be successful
beforehand, check the num_clusters attribute of the cvxclust
object, this lists all possible options for the number of clusters.
clusters(obj, n_clusters)clusters(obj, n_clusters)
obj |
A |
n_clusters |
An integer that specifies the number of clusters that should be returned. |
A vector with the cluster labels for each object in the data.
# Load data data(two_half_moons) data = as.matrix(two_half_moons) X = data[, -3] y = data[, 3] # Get sparse distances in dictionary of keys format with k = 5 and phi = 8 W = sparse_weights(X, 5, 8.0) # Set a sequence for lambda lambdas = seq(0, 2400, 1) # Compute results CMM res = convex_clusterpath(X, W, lambdas) # Get labels for three clusters labels = clusters(res, 3)# Load data data(two_half_moons) data = as.matrix(two_half_moons) X = data[, -3] y = data[, 3] # Get sparse distances in dictionary of keys format with k = 5 and phi = 8 W = sparse_weights(X, 5, 8.0) # Set a sequence for lambda lambdas = seq(0, 2400, 1) # Compute results CMM res = convex_clusterpath(X, W, lambdas) # Get labels for three clusters labels = clusters(res, 3)
convex_clustering attempts to find the number of clusters
specified by the user by means of convex clustering. The algorithm looks for
each number of clusters between target_low and target_high. If
target_low = target_high, the algorithm searches for a single
clustering. It is recommended to specify a range around the desired number of
clusters, as not each number of clusters between 1 and nrow(X) may be
attainable due to numerical inaccuracies.
convex_clustering( X, W, target_low, target_high = NULL, max_iter_phase_1 = 2000, max_iter_phase_2 = 20, lambda_init = 0.01, factor = 0.025, tau = 0.001, center = TRUE, scale = TRUE, eps_conv = 1e-06, burnin_iter = 25, max_iter_conv = 5000, save_clusterpath = FALSE, verbose = 0 )convex_clustering( X, W, target_low, target_high = NULL, max_iter_phase_1 = 2000, max_iter_phase_2 = 20, lambda_init = 0.01, factor = 0.025, tau = 0.001, center = TRUE, scale = TRUE, eps_conv = 1e-06, burnin_iter = 25, max_iter_conv = 5000, save_clusterpath = FALSE, verbose = 0 )
X |
An |
W |
A |
target_low |
Lower bound on the number of clusters that should be
searched for. If |
target_high |
Upper bound on the number of clusters that should be
searched for. Default is |
max_iter_phase_1 |
Maximum number of iterations to find an upper and lower bound for the value for lambda for which the desired number of clusters is attained. Default is 2000. |
max_iter_phase_2 |
Maximum number of iterations to to refine the upper and lower bounds for lambda. Default is 20. |
lambda_init |
The first value for lambda other than 0 to use for convex clustering. Default is 0.01. |
factor |
The percentage by which to increase lambda in each step. Default is 0.025. |
tau |
Parameter to compute the threshold to fuse clusters. Default is 0.001. |
center |
If |
scale |
If |
eps_conv |
Parameter for determining convergence of the minimization. Default is 1e-6. |
burnin_iter |
Number of updates of the loss function that are done without step doubling. Default is 25. |
max_iter_conv |
Maximum number of iterations for minimizing the loss function. Default is 5000. |
save_clusterpath |
If |
verbose |
Verbosity of the information printed during clustering. Default is 0, no output. |
A cvxclust object containing the following
info |
A dataframe containing for each value for lambda: the number of different clusters, and the value of the loss function at the minimum. |
merge |
The merge table containing the order at which the
observations in |
height |
The value for lambda at which each reduction in the number of clusters occurs. |
order |
The order of the observations in |
elapsed_time |
The number of seconds that elapsed while
running the code. Note that this does not include the time required for
input checking and possibly scaling and centering |
coordinates |
The clusterpath coordinates. Only part of the
output in case that |
lambdas |
The values for lambda for which a clustering was found. |
eps_fusions |
The threshold for cluster fusions that was used by the algorithm. |
phase_1_instances |
The number of instances of the loss function
that were minimized while finding an upper and lower bound for lambda. The
sum |
phase_2_instances |
The number of instances of the loss function
that were minimized while refining the value for lambda. The sum
|
num_clusters |
The different numbers of clusters that have been found. |
n |
The number of observations in |
convex_clusterpath, sparse_weights
# Load data data(two_half_moons) data = as.matrix(two_half_moons) X = data[, -3] y = data[, 3] # Get sparse weights in dictionary of keys format with k = 5 and phi = 8 W = sparse_weights(X, 5, 8.0) # Perform convex clustering with a target number of clusters res1 = convex_clustering(X, W, target_low = 2, target_high = 5) # Plot the clustering for 2 to 5 clusters oldpar = par(mfrow=c(2, 2)) plot(X, col = clusters(res1, 2), main = "2 clusters", pch = 19) plot(X, col = clusters(res1, 3), main = "3 clusters", pch = 19) plot(X, col = clusters(res1, 4), main = "4 clusters", pch = 19) plot(X, col = clusters(res1, 5), main = "5 clusters", pch = 19) # A more generalized approach to plotting the results of a range of clusters res2 = convex_clustering(X, W, target_low = 2, target_high = 7) # Plot the clusterings k = length(res2$num_clusters) par(mfrow=c(ceiling(k / ceiling(sqrt(k))), ceiling(sqrt(k)))) for (i in 1:k) { labels = clusters(res2, res2$num_clusters[i]) c = length(unique(labels)) plot(X, col = labels, main = paste(c, "clusters"), pch = 19) } par(oldpar)# Load data data(two_half_moons) data = as.matrix(two_half_moons) X = data[, -3] y = data[, 3] # Get sparse weights in dictionary of keys format with k = 5 and phi = 8 W = sparse_weights(X, 5, 8.0) # Perform convex clustering with a target number of clusters res1 = convex_clustering(X, W, target_low = 2, target_high = 5) # Plot the clustering for 2 to 5 clusters oldpar = par(mfrow=c(2, 2)) plot(X, col = clusters(res1, 2), main = "2 clusters", pch = 19) plot(X, col = clusters(res1, 3), main = "3 clusters", pch = 19) plot(X, col = clusters(res1, 4), main = "4 clusters", pch = 19) plot(X, col = clusters(res1, 5), main = "5 clusters", pch = 19) # A more generalized approach to plotting the results of a range of clusters res2 = convex_clustering(X, W, target_low = 2, target_high = 7) # Plot the clusterings k = length(res2$num_clusters) par(mfrow=c(ceiling(k / ceiling(sqrt(k))), ceiling(sqrt(k)))) for (i in 1:k) { labels = clusters(res2, res2$num_clusters[i]) c = length(unique(labels)) plot(X, col = labels, main = paste(c, "clusters"), pch = 19) } par(oldpar)
Minimizes the convex clustering loss function for a given set of values for lambda.
convex_clusterpath( X, W, lambdas, tau = 0.001, center = TRUE, scale = TRUE, eps_conv = 1e-06, burnin_iter = 25, max_iter_conv = 5000, save_clusterpath = TRUE, target_losses = NULL, save_losses = FALSE, save_convergence_norms = FALSE )convex_clusterpath( X, W, lambdas, tau = 0.001, center = TRUE, scale = TRUE, eps_conv = 1e-06, burnin_iter = 25, max_iter_conv = 5000, save_clusterpath = TRUE, target_losses = NULL, save_losses = FALSE, save_convergence_norms = FALSE )
X |
An |
W |
A |
lambdas |
A vector containing the values for the penalty parameter. |
tau |
Parameter to compute the threshold to fuse clusters. Default is 0.001. |
center |
If |
scale |
If |
eps_conv |
Parameter for determining convergence of the minimization. Default is 1e-6. |
burnin_iter |
Number of updates of the loss function that are done without step doubling. Default is 25. |
max_iter_conv |
Maximum number of iterations for minimizing the loss function. Default is 5000. |
save_clusterpath |
If |
target_losses |
The values of the loss function that are used to
determine convergence of the algorithm (tested as: loss - target <=
|
save_losses |
If |
save_convergence_norms |
If |
A cvxclust object containing the following
info |
A dataframe containing for each value for lambda: the number of different clusters, and the value of the loss function at the minimum. |
merge |
The merge table containing the order at which the
observations in |
height |
The value for lambda at which each reduction in the number of clusters occurs. |
order |
The order of the observations in |
elapsed_time |
The number of seconds that elapsed while
running the code. Note that this does not include the time required for
input checking and possibly scaling and centering |
coordinates |
The clusterpath coordinates. Only part of the
output in case that |
lambdas |
The values for lambda for which a clustering was found. |
eps_fusions |
The threshold for cluster fusions that was used by the algorithm. |
num_clusters |
The different numbers of clusters that have been found. |
n |
The number of observations in |
losses |
Optional: if |
convergence_norms |
Optional: if
|
convex_clustering, sparse_weights
# Load data data(two_half_moons) data = as.matrix(two_half_moons) X = data[, -3] y = data[, 3] # Get sparse weights in dictionary of keys format with k = 5 and phi = 8 W = sparse_weights(X, 5, 8.0) # Set a sequence for lambda lambdas = seq(0, 2400, 1) # Compute clusterpath res = convex_clusterpath(X, W, lambdas) # Get cluster labels for two clusters labels = clusters(res, 2) # Plot the clusterpath with colors based on the cluster labels plot(res, col = labels)# Load data data(two_half_moons) data = as.matrix(two_half_moons) X = data[, -3] y = data[, 3] # Get sparse weights in dictionary of keys format with k = 5 and phi = 8 W = sparse_weights(X, 5, 8.0) # Set a sequence for lambda lambdas = seq(0, 2400, 1) # Compute clusterpath res = convex_clusterpath(X, W, lambdas) # Get cluster labels for two clusters labels = clusters(res, 2) # Plot the clusterpath with colors based on the cluster labels plot(res, col = labels)
Plot a clusterpath for two-dimensional data.
## S3 method for class 'cvxclust' plot(x, col = NULL, labels = NULL, ...)## S3 method for class 'cvxclust' plot(x, col = NULL, labels = NULL, ...)
x |
A |
col |
A vector containing cluster membership information. Default is
|
labels |
A vector containing labels for each object. Default is
|
... |
Further graphical parameters. |
A plot in the console.
# Load data data(two_half_moons) data = as.matrix(two_half_moons) X = data[, -3] y = data[, 3] # Get sparse distances in dictionary of keys format with k = 5 and phi = 8 W = sparse_weights(X, 5, 8.0) # Set a sequence for lambda lambdas = seq(0, 2400, 1) # Compute results CMM res = convex_clusterpath(X, W, lambdas) plot(res, y + 1)# Load data data(two_half_moons) data = as.matrix(two_half_moons) X = data[, -3] y = data[, 3] # Get sparse distances in dictionary of keys format with k = 5 and phi = 8 W = sparse_weights(X, 5, 8.0) # Set a sequence for lambda lambdas = seq(0, 2400, 1) # Compute results CMM res = convex_clusterpath(X, W, lambdas) plot(res, y + 1)
Construct a sparse weight matrix in a dictionary-of-keys format.
Each nonzero weight is computed as , where
the squared Euclidean distance may be scaled by the average squared Euclidean
distance, depending on the argument scale. Sparsity is achieved by
only setting weights to nonzero values that correspond to two objects that
are among each other's nearest neighbors.
sparse_weights( X, k, phi, connected = TRUE, scale = TRUE, connection_type = "SC" )sparse_weights( X, k, phi, connected = TRUE, scale = TRUE, connection_type = "SC" )
X |
An |
k |
The number of nearest neighbors to be used for non-zero weights. |
phi |
Tuning parameter of the Gaussian weights. Input should be a nonnegative value. |
connected |
If |
scale |
If |
connection_type |
Determines the method to ensure a connected weight
matrix if |
A sparseweights object containing the nonzero weights in
dictionary-of-keys format.
# Load data data(two_half_moons) data = as.matrix(two_half_moons) X = data[, -3] y = data[, 3] # Get sparse distances in dictionary of keys format with k = 5 and phi = 8 W = sparse_weights(X, 5, 8.0)# Load data data(two_half_moons) data = as.matrix(two_half_moons) X = data[, -3] y = data[, 3] # Get sparse distances in dictionary of keys format with k = 5 and phi = 8 W = sparse_weights(X, 5, 8.0)
A dataset containing 150 observations generated according to the two interlocking half moons data generating process. The first two columns contain the x and y-coordinates and the third column contains the cluster ID. Each moon contains 75 observations.
data(two_half_moons)data(two_half_moons)
An object of class data.frame with 150 rows and 3 columns.