Compute a hierarchical or k-means cluster analysis and return the group association for each observation as a vector.

sjc.cluster(data, groupcount = NULL, method = c("hclust", "kmeans"),
  distance = c("euclidean", "maximum", "manhattan", "canberra", "binary",
  "minkowski"), agglomeration = c("ward", "ward.D", "ward.D2", "single",
  "complete", "average", "mcquitty", "median", "centroid"), iter.max = 20,
  algorithm = c("Hartigan-Wong", "Lloyd", "MacQueen"))

Arguments

data

A data frame with variables that should be used for the cluster analysis.

groupcount

Number of groups (clusters) used for the cluster solution. May also be a set of initial (distinct) cluster centres if method = "kmeans" (see kmeans for details on the centers argument). If groupcount = NULL and method = "kmeans", the optimal number of clusters is calculated using the gap statistic (see sjc.kgap). For method = "hclust", groupcount must be specified. The following functions may be helpful for estimating the number of clusters:

  • Use sjc.elbow to determine the group-count depending on the elbow-criterion.

  • If method = "kmeans", use sjc.kgap to determine the group-count according to the gap-statistic.

  • If method = "hclust" (hierarchical clustering, default), use sjc.dend to inspect different cluster group solutions.

  • Use sjc.grpdisc to inspect the goodness of grouping (accuracy of classification).
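The elbow criterion mentioned in the list can be approximated with base R alone. The following is a minimal sketch, not the implementation of sjc.elbow: it assumes the criterion is based on the total within-cluster sum of squares returned by kmeans, and the range 1:8 and nstart = 10 are arbitrary choices for illustration.

```r
# Sketch only: within-cluster sum of squares for k = 1..8
# (assumed to be the quantity behind the elbow criterion)
set.seed(1)  # kmeans picks random starting centres
k_values <- 1:8
wss <- sapply(k_values, function(k) {
  kmeans(mtcars, centers = k, nstart = 10)$tot.withinss
})

# Reduction in within-cluster sum of squares from each k to the next;
# the "elbow" is where these reductions level off
drops <- -diff(wss)
print(data.frame(k = k_values[-1], drop = round(drops, 1)))
```

A candidate value for groupcount would be the k after which the reductions become small.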

method

Method for computing the cluster analysis. By default ("hclust"), a hierarchical cluster analysis will be computed. Use "kmeans" to compute a k-means cluster analysis. You can specify the initial letters only.

distance

Distance measure to be used when method = "hclust" (for hierarchical clustering). Must be one of "euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski". See dist. If method = "kmeans", this argument will be ignored.

agglomeration

Agglomeration method to be used when method = "hclust" (for hierarchical clustering). This should be one of "ward", "ward.D", "ward.D2", "single", "complete", "average", "mcquitty", "median" or "centroid". Default is "ward" (see hclust). If method = "kmeans", this argument will be ignored. See 'Note'.
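For orientation, the hierarchical path can be reproduced with the base R functions that the argument descriptions refer to (dist and hclust). This is a sketch under the assumption that groups are obtained by cutting the tree with cutree; whether sjc.cluster does exactly this, or preprocesses the data first, is not stated here.

```r
# Sketch only: base-R steps assumed to correspond to method = "hclust"
d  <- dist(mtcars, method = "euclidean")  # 'distance' argument
hc <- hclust(d, method = "ward.D2")       # 'agglomeration' argument
groups <- cutree(hc, k = 5)               # 'groupcount' argument

# Number of observations per cluster
table(groups)
```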

iter.max

Maximum number of iterations allowed. Only applies if method = "kmeans". See kmeans for details on this argument.

algorithm

Algorithm used for computing the k-means clusters. Only applies if method = "kmeans". May be one of "Hartigan-Wong" (default), "Lloyd" (used by SPSS), or "MacQueen". See kmeans for details on this argument.
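The iter.max and algorithm arguments have the same names as in the kmeans function from the stats package. A minimal sketch of the corresponding base R call follows; it assumes that the values are passed through unchanged, which this page does not confirm.

```r
# Sketch only: base-R call assumed to correspond to method = "kmeans"
set.seed(123)  # starting centres are random when only a count is given
km <- kmeans(mtcars, centers = 5, iter.max = 20,
             algorithm = "Hartigan-Wong")

# One cluster label per observation
groups <- km$cluster
table(groups)
```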

Value

The group classification for each observation as a vector. This group classification can be passed to sjc.grpdisc to check the goodness of classification. The returned vector includes missing values, so it has one element per row of data and can be appended to the original data frame.
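The following sketch illustrates why keeping missing values matters when the result is appended to the data. It uses only base R and is an assumption about the behaviour, not the function's actual code: incomplete rows are left out of the clustering and receive NA in the result. The column name cluster and the modified rows are made up for the example.

```r
# Sketch only: align cluster labels with a data frame that has missing values
df <- mtcars
df$mpg[c(3, 10)] <- NA                 # introduce two incomplete rows

complete <- complete.cases(df)         # rows usable for clustering
hc <- hclust(dist(df[complete, ]), method = "ward.D2")

groups <- rep(NA_integer_, nrow(df))   # one slot per original row
groups[complete] <- cutree(hc, k = 3)  # fill in labels for complete rows

df$cluster <- groups                   # lengths match, so this is safe
```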

Note

In R versions later than 3.0.3, the "ward" option has been replaced by "ward.D" and "ward.D2", so you may use one of these values. If "ward" is specified, it will be replaced by "ward.D2".

To get results similar to those of the SPSS Quick Cluster function, consider the following points:

  1. Use the /PRINT INITIAL option for SPSS Quick Cluster to get a table with initial cluster centers.

  2. Create a matrix from this table by copying the values row by row from the SPSS output into a matrix, specifying the nrow and ncol arguments.

  3. Use algorithm="Lloyd".

  4. Use the same value for iter.max in both SPSS and sjc.cluster.

This ensures a fixed initial set of cluster centers (as in SPSS), whereas kmeans in R selects the initial centers randomly when only the number of clusters is given.
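The steps above can be sketched with base R as follows. The centre values are invented for illustration and do not come from any SPSS output, and the two-column subset of mtcars is an arbitrary choice. Because matrix() fills column by column by default, byrow = TRUE is needed when values are entered one row after another.

```r
# Sketch only: fixed initial centres with Lloyd's algorithm
dat <- mtcars[, c("mpg", "wt")]

# Hypothetical initial centres, typed row by row (one row per cluster,
# one column per variable, in the same order as the columns of 'dat')
initial_centers <- matrix(c(15, 4,
                            21, 3,
                            30, 2),
                          nrow = 3, ncol = 2, byrow = TRUE)

km1 <- kmeans(dat, centers = initial_centers, iter.max = 20,
              algorithm = "Lloyd")
km2 <- kmeans(dat, centers = initial_centers, iter.max = 20,
              algorithm = "Lloyd")

# With fixed starting centres, repeated runs give the same grouping
identical(km1$cluster, km2$cluster)
```

According to the description of groupcount, such a matrix of initial centres can be supplied there when method = "kmeans".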

References

Maechler M, Rousseeuw P, Struyf A, Hubert M, Hornik K (2014) cluster: Cluster Analysis Basics and Extensions. R package.

Examples

# Hierarchical clustering of mtcars-dataset
groups <- sjc.cluster(mtcars, 5)

# K-means clustering of mtcars-dataset
groups <- sjc.cluster(mtcars, 5, method="k")