
Cluster

K-means Clustering


Description

This function computes cluster centers and the time series of their occurrences using K-means clustering with Euclidean distance. The input is an array with any number of dimensions, one of which (the 'posdates'th) corresponds to time; by default the first dimension is taken as time. Specifically, it partitions the array along the time axis into K groups or clusters, so that each spatial vector/array belongs to (i.e., is a member of) the cluster with the nearest center or centroid. This function relies on the NbClust package (Charrad et al., 2014 JSS).
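The partitioning described above can be illustrated with base R alone. The following is a minimal sketch, not the source of Cluster(): it assumes the data have already been arranged as a (time x space) matrix and uses stats::kmeans() as a stand-in. The patterns, sizes and noise level are made up for illustration.

```r
# Sketch only: base R kmeans() on a (time x space) matrix; not the Cluster() source.
set.seed(1)
n_time <- 60
pattern_a <- c(1, 1, 1, 1)   # hypothetical spatial pattern 1 (4 grid points)
pattern_b <- c(3, 3, 0, 0)   # hypothetical spatial pattern 2
truth <- rep(c(1, 2), each = n_time / 2)

x <- matrix(NA_real_, nrow = n_time, ncol = 4)
for (t in seq_len(n_time)) {
  base <- if (truth[t] == 1) pattern_a else pattern_b
  x[t, ] <- base + rnorm(4, sd = 0.1)
}

fit <- kmeans(x, centers = 2, nstart = 10)

# Squared Euclidean distance from every time step to every centre
d <- sapply(seq_len(2), function(k) rowSums(sweep(x, 2, fit$centers[k, ])^2))
nearest <- apply(d, 1, which.min)

# Each time step belongs to the cluster whose centre is nearest
print(all(nearest == fit$cluster))
```

Here fit$cluster plays the role of the time series of occurrences and fit$centers the role of the cluster centers.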

Usage

Cluster(var, weights, nclusters = NULL, index = "sdindex", posdates = 1)

Arguments

var

An array with any number of dimensions, one of which (the 'posdates'th) corresponds to time, containing either area averages over a series of domains or grid-point values for any spatial grid structure: (x), (y), (z), (x,y), (x,y,z), (y,z), ...

weights

A vector/array of multiplicative weights based on the areas covered by each domain/region or grid cell of var; the dimensions of weights must equal the dimensions of 'var' with the 'posdates'th dimension removed.
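As a rough illustration of the dimension requirement, the sketch below builds weights for a hypothetical (time, lat, lon) array. The cosine-of-latitude weighting and the way the weights are multiplied into the flattened data are assumptions made for this example; they are not taken from the Cluster() source.

```r
# Sketch only: hypothetical (time, lat, lon) field with cos(latitude) area weights.
set.seed(2)
var <- array(rnorm(10 * 3 * 4), dim = c(10, 3, 4))  # time, lat, lon
posdates <- 1
lat <- c(-30, 0, 30)

# weights has the dimensions of var without the time dimension (lat x lon);
# the 3 latitude values are recycled down each longitude column
weights <- array(cos(lat * pi / 180), dim = dim(var)[-posdates])

# Flatten space into columns and apply the multiplicative weights (assumed usage)
x <- matrix(var, nrow = dim(var)[posdates])
xw <- sweep(x, 2, as.vector(weights), "*")

print(dim(weights))
print(dim(xw))
```

For a two-dimensional var, as in the Examples section, the same rule reduces to a weights vector whose length is the number of spatial points.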

nclusters

A positive integer K greater than 1: the number of clusters to be computed, or K initial cluster centers to be used in the method. The default is NULL, in which case the NbClust validity index given in 'index', together with its associated criterion, is used to select the optimal number of clusters for K-means clustering of var.
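The difference between passing a number of clusters and passing initial centers can be sketched with stats::kmeans(), whose 'centers' argument accepts both forms. This is only an analogy under the assumption that nclusters behaves similarly; the data and seed centers below are hypothetical.

```r
# Sketch only: K given as a count versus K given as seed centres, using base R kmeans().
set.seed(42)
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.2), ncol = 2),
           matrix(rnorm(40, mean = 3, sd = 0.2), ncol = 2))

# K as a number: initial centres are drawn at random from the rows of x
fit_k <- kmeans(x, centers = 2, nstart = 5)

# K as a matrix of initial centres: one row per cluster
seeds <- rbind(c(0, 0), c(3, 3))
fit_seed <- kmeans(x, centers = seeds)

print(fit_k$size)
print(fit_seed$centers)
```

With seed centers the cluster labels follow the row order of the seeds, which can make results easier to compare between runs.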

index

A validity index from the NbClust package that is used to determine the optimal K when nclusters is not specified as a positive integer greater than 1 or as initial/seed cluster centers. 'sdindex' is the default (Halkidi et al. 2001, JIIS). Other indices available in NbClust are "kl", "ch", "hartigan", "ccc", "scott", "marriot", "trcovw", "tracew", "friedman", "rubin", "cindex", "db", "silhouette", "duda", "pseudot2", "beale", "ratkowsky", "ball", "ptbiserial", "gap", "frey", "mcclain", "gamma", "gplus", "tau", "dunn", "hubert", "sdindex", and "sdbw". One can also use all of them with the option 'alllong', or all indices except gap, gamma, gplus and tau with the option 'all'; in both cases the optimal number of clusters K is determined by the majority rule (the maximum of the histogram of the results of all indices with finite solutions). Using some indices on a big and/or unstructured dataset can be computationally intensive and/or could lead to numerical singularity.
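The majority rule mentioned above can be sketched as follows. The per-index values of K are made up for illustration and do not come from an actual NbClust run; the sketch only shows the counting step, under the assumption that non-finite results are discarded first.

```r
# Sketch only: majority rule over hypothetical per-index choices of K.
votes <- c(kl = 3, ch = 3, hartigan = 4, db = 3, silhouette = 2, gap = NA)

# Keep only indices that returned a finite solution
finite_votes <- votes[is.finite(votes)]

# Histogram of the proposed K values; the most frequent one wins
counts <- table(finite_votes)
k_opt <- as.integer(names(counts)[which.max(counts)])

print(counts)
print(k_opt)
```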

posdates

The index of the dimension that corresponds to time in the provided array in the parameter 'var', the first by default.
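One way to picture the role of posdates is to move the time dimension to the front and flatten the remaining dimensions into columns. The sketch below does this with base R aperm() on a hypothetical array whose second dimension is time; it is an assumption about a reasonable preprocessing step, not a copy of what Cluster() does internally.

```r
# Sketch only: bring the 'posdates'th dimension to the front, then flatten space.
var <- array(seq_len(3 * 5 * 2), dim = c(3, 5, 2))  # time is the 2nd dimension (length 5)
posdates <- 2

# Permutation that puts time first and keeps the other dimensions in order
perm <- c(posdates, seq_along(dim(var))[-posdates])
var_t <- aperm(var, perm)                  # dimensions: 5, 3, 2

# One row per time step, one column per spatial point
x <- matrix(var_t, nrow = dim(var_t)[1])   # 5 x 6

print(dim(var_t))
print(dim(x))
```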

Value

cluster

A vector (time series) of integers indicating the occurrence of a cluster, i.e., the cluster to which each data member in time is allocated (e.g., 2 1 3 1 1 1 ...).

centers

A matrix of cluster centres or centroids (e.g. [1:K, 1:spatial degrees of freedom]).

totss

The total sum of squares.

withinss

A vector of within-cluster sum of squares, one component per cluster.

tot.withinss

Total within-cluster sum of squares, i.e., sum(withinss).

betweenss

The between-cluster sum of squares, i.e. totss-tot.withinss.

size

The number of points in each cluster.
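The relationships between the sum-of-squares components can be checked numerically. The sketch below uses stats::kmeans(), which returns components with the same names; that Cluster() passes them through unchanged is an assumption here, and the data are synthetic.

```r
# Sketch only: checking the sum-of-squares identities on a base R kmeans() fit.
set.seed(7)
x <- rbind(matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 3),
           matrix(rnorm(60, mean = 2, sd = 0.3), ncol = 3))
fit <- kmeans(x, centers = 2, nstart = 5)

# totss: squared distances of all points from the overall mean
totss_manual <- sum(scale(x, center = TRUE, scale = FALSE)^2)

print(all.equal(fit$totss, totss_manual))
print(all.equal(fit$tot.withinss, sum(fit$withinss)))
print(all.equal(fit$betweenss, fit$totss - fit$tot.withinss))
print(sum(fit$size) == nrow(x))
```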

Author(s)

History:
1.0 # 2014-10 (N.S. Fuckar, neven.fuckar@bsc.es) - Original code

References

Wilks, 2011, Statistical Methods in the Atmospheric Sciences, 3rd ed., Elsevier, 676 pp.

Examples

# Generating synthetic data
a1 <- array(dim = c(200, 4))
mean1 <- 0
sd1 <- 0.3 

c0 <- seq(1, 200)
c1 <- sort(sample(x = 1:200, size = sample(x = 50:150, size = 1), replace = FALSE))
x1 <- c(1, 1, 1, 1)
for (i1 in c1) {
 a1[i1, ] <- x1 + rnorm(4, mean = mean1, sd = sd1)
}

c1p5 <- c0[!(c0 %in% c1)]
c2 <- c1p5[seq(1, length(c1p5), 2)] 
x2 <- c(2, 2, 4, 4)
for (i2 in c2) {
 a1[i2, ] <- x2 + rnorm(4, mean = mean1, sd = sd1)
}

c3 <- c1p5[seq(2, length(c1p5), 2)]
x3 <- c(3, 3, 1, 1)
for (i3 in c3) {
 a1[i3, ] <- x3 + rnorm(4, mean = mean1, sd = sd1)
}

# Computing the clusters
res1 <- Cluster(var = a1, weights = array(1, dim = dim(a1)[2]), nclusters = 3)
print(res1$cluster)
print(res1$centers)

res2 <- Cluster(var = a1, weights = array(1, dim = dim(a1)[2]))
print(res2$cluster)
print(res2$centers)

s2dverification

Set of Common Tools for Forecast Verification

v2.10.0
Apache License 2.0
Authors
BSC-CNS [aut, cph], Virginie Guemas [aut], Nicolau Manubens [aut], An-Chi Ho [ctb, cre], Nuria Perez-Zanon [ctb], Javier Garcia-Serrano [aut], Neven Fuckar [aut], Louis-Philippe Caron [aut], Omar Bellprat [aut], Luis Rodrigues [aut], Veronica Torralba [aut], Alasdair Hunter [aut], Chloe Prodhomme [aut], Martin Menegoz [aut], Domingo Manubens [ctb], Constantin Ardilouze [ctb], Lauriane Batte [ctb], Fabian Lienert [ctb], Julia Giner [ctb], Jean-Philippe Baudouin [ctb], Nube Gonzalez [ctb], Ludovic Auger [ctb], Nicola Cortesi [ctb], Eleftheria Exarchou [ctb], Ruben Cruz [ctb], Isabel Andreu-Burillo [ctb], Ramiro Saurral [ctb]
Initial release
