Package 'Kmedians' reference manual

Title:	K-Medians
Description:	Online, semi-online, and offline K-medians algorithms are provided. For these methods, the algorithms can be initialized randomly or with the help of a robust hierarchical clustering. The number of clusters can be selected with the help of a penalized criterion. The package provides functions for robust clustering. Function gen_K() generates a sample of data following a contaminated Gaussian mixture. Functions Kmedians() and Kmeans() implement K-medians and K-means algorithms, while Kplot() produces graphs for both methods. Cardot, H., Cenac, P. and Zitt, P-A. (2013). "Efficient and fast estimation of the geometric median in Hilbert spaces with an averaged stochastic gradient algorithm". Bernoulli, 19, 18-43. <doi:10.3150/11-BEJ390>. Cardot, H. and Godichon-Baggioni, A. (2017). "Fast Estimation of the Median Covariation Matrix with Application to Online Robust Principal Components Analysis". Test, 26(3), 461-480. <doi:10.1007/s11749-016-0519-x>. Godichon-Baggioni, A. and Surendran, S. "A penalized criterion for selecting the number of clusters for K-medians". <arXiv:2209.03597>. Vardi, Y. and Zhang, C.-H. (2000). "The multivariate L1-median and associated data depth". Proc. Natl. Acad. Sci. USA, 97(4):1423-1426. <doi:10.1073/pnas.97.4.1423>.
Authors:	Antoine Godichon-Baggioni [aut, cre, cph], Sobihan Surendran [aut]
Maintainer:	Antoine Godichon-Baggioni <[email protected]>
License:	GPL (>= 2)
Version:	2.2.0
Built:	2025-03-13 04:01:14 UTC
Source:	https://github.com/cran/Kmedians

K-Medians

Description

The package provides functions for robust clustering. Function gen_K generates a sample of data following a contaminated Gaussian mixture. Functions Kmedians and Kmeans implement K-medians and K-means algorithms, while Kplot produces graphs for both methods.

Author(s)

Antoine Godichon-Baggioni [aut, cre, cph], Sobihan Surendran [aut]

Maintainer: Antoine Godichon-Baggioni <[email protected]>

References

Cardot, H., Cenac, P. and Zitt, P-A. (2013). Efficient and fast estimation of the geometric median in Hilbert spaces with an averaged stochastic gradient algorithm. Bernoulli, 19, 18-43.

Cardot, H. and Godichon-Baggioni, A. (2017). Fast Estimation of the Median Covariation Matrix with Application to Online Robust Principal Components Analysis. Test, 26(3), 461-480

Godichon-Baggioni, A. and Surendran, S. A penalized criterion for selecting the number of clusters for K-medians. arxiv.org/abs/2209.03597

Vardi, Y. and Zhang, C.-H. (2000). The multivariate L1-median and associated data depth. Proc. Natl. Acad. Sci. USA, 97(4):1423-1426.

gen_K

Description

Generate a sample from a Gaussian mixture model whose centers are generated randomly on a sphere of radius `radius`.

Usage

gen_K(n=500,d=5,K=3,pcont=0,df=1,
      cont="Student",min=-5,max=5,radius=5)

Arguments

`n`	A positive integer giving the number of data per cluster. Default is `500`.
`d`	A positive integer giving the dimension. Default is `5`.
`K`	A positive integer giving the number of clusters. Default is `3`.
`pcont`	A scalar between `0` and `1` giving the proportion of contaminated data. Default is `0`.
`df`	A positive integer giving the degrees of freedom of the law of the contaminated data if `cont='Student'`. Default is `1`.
`cont`	The law of the contaminated data. Can be `'Student'` (default) or `'Unif'`.
`min`	A scalar giving the lower bound of the uniform law if `cont='Unif'`. Default is `-5`.
`max`	A scalar giving the upper bound of the uniform law if `cont='Unif'`. Default is `5`.
`radius`	The radius of the sphere on which the centers of the classes are generated. Default is `5`.

Value

A list with:

`X`	A numerical matrix giving the generated data.
`cluster`	A character vector specifying the true classification.

Examples

n <- 500
K <- 3
pcont <- 0.2
ech <- gen_K(n=n,K=K,pcont=pcont)
X <- ech$X
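
To make the description concrete, here is a minimal base R sketch of the kind of data described above: centers drawn on a sphere, Gaussian clusters around them, and a proportion of contaminated rows. The function `sketch_gen_K` is hypothetical and is not the package's code. In particular, replacing whole rows with Student draws is an assumption about how the contamination is applied.

```r
# Illustrative stand-in for the data described above (hypothetical helper,
# not the implementation of gen_K).
sketch_gen_K <- function(n = 500, d = 5, K = 3, pcont = 0, df = 1, radius = 5) {
  # Draw K directions uniformly on the unit sphere by normalizing Gaussian
  # vectors, then scale them to the requested radius.
  dirs <- matrix(rnorm(K * d), nrow = K, ncol = d)
  centers <- radius * dirs / sqrt(rowSums(dirs^2))

  # n standard Gaussian points around each center.
  cluster <- rep(seq_len(K), each = n)
  X <- centers[cluster, , drop = FALSE] + matrix(rnorm(n * K * d), ncol = d)

  # Assumption: a fraction pcont of the rows is replaced by Student draws.
  n_cont <- floor(pcont * n * K)
  if (n_cont > 0) {
    idx <- sample(n * K, n_cont)
    X[idx, ] <- matrix(rt(n_cont * d, df = df), ncol = d)
  }
  list(X = X, cluster = as.character(cluster), centers = centers)
}

set.seed(1)
ech <- sketch_gen_K(n = 100, d = 5, K = 3, pcont = 0.2)
dim(ech$X)                      # 300 rows, 5 columns
sqrt(rowSums(ech$centers^2))    # each center lies at distance 5 from the origin
```

With `df = 1` the contaminated rows follow a Cauchy-type law, so a few of them can lie very far from every center. That is the situation in which a median-based clustering is expected to help.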

Kmeans

Description

A K-means algorithm.

Usage

Kmeans(X,nclust=1:15,ninit=1,niter=20,par=TRUE)

Arguments

`X`	A numerical matrix giving the data.
`nclust`	A vector of positive integers giving the possible numbers of clusters. Default is `1:15`.
`ninit`	A non-negative integer giving the number of random initializations. Default is `1`.
`niter`	A positive integer giving the number of iterations for the EM algorithms. Default is `20`.
`par`	A logical argument telling if the parallelization of the algorithm is allowed. Default is `TRUE`.

Value

A list with:

`bestresults`	A list giving all the results for the clustering selected by `'capushe'`.
`allresults`	A list containing all the results.
`SE`	A vector giving the Sum of Errors for each considered number of clusters.
`cap`	The results given by the function `'capushe'` if `nclust` is of length larger than `10`.
`Ksel`	An integer giving the number of clusters selected by `'capushe'` if `nclust` is of length larger than `10`.
`data`	A numerical matrix giving the data.
`nclust`	A vector of positive integers giving the considered numbers of clusters.

For the lists `bestresults` and `allresults`:

`cluster`	A vector of positive integers giving the clustering.
`centers`	A numerical matrix giving the centers of the clusters.
`SE`	An integer giving the Sum of Errors.

Examples

## Not run: 
n <- 500
K <- 3
pcont <- 0.2
ech <- gen_K(n=n,K=K,pcont=pcont)
X <- ech$X
res <- Kmeans(X,par=FALSE)
Kplot(res)

## End(Not run)
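
For comparison, a similar per-K error curve can be produced with base R's `stats::kmeans`, which is a different implementation from this package's `Kmeans`. In the sketch below, `iter.max` and `nstart` play roles similar to `niter` and `ninit`, and `tot.withinss` (a sum of squared distances) stands in for `SE`. Whether the package computes `SE` from squared or unsquared distances is not stated in this manual, so treat the correspondence as an assumption.

```r
set.seed(42)
# Toy data: three Gaussian clusters in dimension 2 (illustrative only).
X <- rbind(
  matrix(rnorm(200, mean = 0), ncol = 2),
  matrix(rnorm(200, mean = 5), ncol = 2),
  matrix(rnorm(200, mean = 10), ncol = 2)
)

nclust <- 1:6
# One fit per candidate number of clusters.
fits <- lapply(nclust, function(k) {
  stats::kmeans(X, centers = k, iter.max = 20, nstart = 5)
})

# Error curve over the candidate numbers of clusters.
SE <- vapply(fits, function(f) f$tot.withinss, numeric(1))
names(SE) <- nclust
print(round(SE, 1))

# Labels and centers of the 3-cluster fit.
table(fits[[3]]$cluster)
fits[[3]]$centers
```

The error decreases as the number of clusters grows, which is why a penalized criterion rather than the raw error is needed to select the number of clusters.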

Kmedians

Description

K-medians algorithms.

Usage

Kmedians(X,nclust=1:15,ninit=0,niter=20,
         method='Offline', init=TRUE,par=TRUE)

Arguments

`X`	A numerical matrix giving the data.
`nclust`	A vector of positive integers giving the possible numbers of clusters. Default is `1:15`.
`ninit`	A non-negative integer giving the number of random initializations. Default is `0`.
`niter`	A positive integer giving the number of iterations for the EM algorithms. Default is `20`.
`method`	The selected method for the K-medians algorithm. Can be `'Offline'` (default), `'Semi-Online'` or `'Online'`.
`init`	A logical argument telling if the function `'genie'` is used for initializing the algorithm. Default is `TRUE`.
`par`	A logical argument telling if the parallelization of the algorithm is allowed. Default is `TRUE`.
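
The `method` argument distinguishes offline from online estimation. As a rough illustration of the two flavours for a single cluster, the sketch below estimates a geometric median with a Weiszfeld-type fixed-point iteration (in the spirit of Vardi and Zhang, 2000) and with an averaged stochastic gradient recursion (in the spirit of Cardot, Cenac and Zitt, 2013). The function names, the step size `c_gamma * n^(-alpha)` and the stopping rule are assumptions made for this example, not the package's internals.

```r
# Offline flavour: Weiszfeld-type fixed-point iteration over the whole sample.
weiszfeld_median <- function(X, niter = 100, tol = 1e-8) {
  m <- colMeans(X)                          # start from the mean
  for (it in seq_len(niter)) {
    dist <- sqrt(rowSums(sweep(X, 2, m)^2))
    w <- 1 / pmax(dist, 1e-12)              # guard against division by zero
    m_new <- colSums(X * w) / sum(w)        # weighted mean of the observations
    shift <- sqrt(sum((m_new - m)^2))
    m <- m_new
    if (shift < tol) break
  }
  m
}

# Online flavour: one pass over the rows, one gradient step per observation,
# followed by averaging of the iterates.
asg_median <- function(X, c_gamma = 2, alpha = 0.75) {
  m <- X[1, ]          # current iterate
  m_bar <- m           # running average of the iterates
  for (n in seq_len(nrow(X) - 1)) {
    delta <- X[n + 1, ] - m
    nrm <- sqrt(sum(delta^2))
    if (nrm > 0) m <- m + c_gamma * n^(-alpha) * delta / nrm
    m_bar <- m_bar + (m - m_bar) / (n + 1)
  }
  m_bar
}

set.seed(123)
d <- 3
clean <- matrix(rnorm(4500 * d), ncol = d)               # centered at the origin
outliers <- matrix(rnorm(500 * d, mean = 10), ncol = d)  # 10% contamination
X <- rbind(clean, outliers)[sample(5000), ]              # shuffle the rows

colMeans(X)            # pulled towards the outliers
weiszfeld_median(X)    # stays much closer to the origin
asg_median(X)          # single-pass estimate, also much closer to the origin
```

The online recursion touches each observation once and keeps only two vectors in memory, while the offline iteration revisits the whole sample at every step.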

Value

A list with:

`bestresults`	A list giving all the results for the clustering selected by `'capushe'`.
`allresults`	A list containing all the results.
`SE`	A vector giving the Sum of Errors for each considered number of clusters.
`cap`	The results given by the function `'capushe'` if `nclust` is of length larger than `10`.
`Ksel`	An integer giving the number of clusters selected by `'capushe'` if `nclust` is of length larger than `10`.
`data`	A numerical matrix giving the data.
`nclust`	A vector of positive integers giving the considered numbers of clusters.

For the lists `bestresults` and `allresults`:

`cluster`	A vector of positive integers giving the clustering.
`centers`	A numerical matrix giving the centers of the clusters.
`SE`	An integer giving the Sum of Errors.

References

Godichon-Baggioni, A. and Surendran, S. A penalized criterion for selecting the number of clusters for K-medians. arxiv.org/abs/2209.03597

Examples

## Not run: 
n <- 500
K <- 3
pcont <- 0.2
ech <- gen_K(n=n,K=K,pcont=pcont)
X <- ech$X
res <- Kmedians(X,par=FALSE)
Kplot(res)

## End(Not run)
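
The clustering loop itself can be pictured as an alternation between assigning each point to its nearest center and re-estimating each center as the geometric median of its cluster. The sketch below does this in base R for a fixed number of clusters. `sketch_kmedians`, its random initialization and its Weiszfeld-type update are assumptions for illustration. The sketch does not reproduce the package's initialization by hierarchical clustering, its online variants, or its penalized selection of the number of clusters.

```r
# Geometric median of the rows of X by a Weiszfeld-type iteration.
geo_median <- function(X, niter = 50) {
  m <- colMeans(X)
  for (it in seq_len(niter)) {
    w <- 1 / pmax(sqrt(rowSums(sweep(X, 2, m)^2)), 1e-12)
    m <- colSums(X * w) / sum(w)
  }
  m
}

# Euclidean distances from every row of X to every row of centers (n x K).
dist_to_centers <- function(X, centers) {
  sapply(seq_len(nrow(centers)), function(k) {
    sqrt(rowSums(sweep(X, 2, centers[k, ])^2))
  })
}

sketch_kmedians <- function(X, K, niter = 20) {
  centers <- X[sample(nrow(X), K), , drop = FALSE]   # random initial centers
  for (it in seq_len(niter)) {
    D <- dist_to_centers(X, centers)
    cluster <- max.col(-D, ties.method = "first")    # index of nearest center
    for (k in seq_len(K)) {
      members <- X[cluster == k, , drop = FALSE]
      if (nrow(members) > 0) centers[k, ] <- geo_median(members)
    }
  }
  # Final assignment and sum of unsquared distances to the assigned centers.
  D <- dist_to_centers(X, centers)
  cluster <- max.col(-D, ties.method = "first")
  SE <- sum(D[cbind(seq_len(nrow(X)), cluster)])
  list(cluster = cluster, centers = centers, SE = SE)
}

set.seed(7)
X <- rbind(
  matrix(rnorm(300, mean = 0), ncol = 3),
  matrix(rnorm(300, mean = 8), ncol = 3)
)
res <- sketch_kmedians(X, K = 2)
table(res$cluster)
res$centers
res$SE
```

The returned list mirrors the `cluster`, `centers` and `SE` entries described under Value, but the numbers it produces need not match those of `Kmedians`.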

Kplot

Description

A plot function for K-medians and K-means.

Usage

Kplot(a,propplot=0.95,graph=c('Two_Dim','Capushe','Profiles','SE','Criterion'),
      bestresult=TRUE,Ksel=FALSE,bycluster=TRUE)

Arguments

`a`	Output from `Kmedians` or `Kmeans`.
`propplot`	A scalar between `0` and `1` giving the proportion of data considered for the different graphs. Default is `0.95`.
`graph`	A string specifying the type of graph requested. Default is `c('Two_Dim','Capushe','Profiles','SE','Criterion')`.
`bestresult`	A logical indicating if the graphs must be done for the result chosen by the selected criterion. Default is `TRUE`.
`Ksel`	A logical or positive integer giving the chosen number of clusters for which the graphs should be drawn. Default is `FALSE`.
`bycluster`	A logical indicating if the data selected for `'Two_Dim'` and `'Profiles'` graphs should be selected by cluster or not. Default is `TRUE`.

Value

No return value.

Examples

## Not run: 
n <- 500
K <- 3
pcont <- 0.2
ech <- gen_K(n=n,K=K,pcont=pcont)
X <- ech$X
res <- Kmedians(X,par=FALSE)
Kplot(res)

## End(Not run)
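
By default `graph` lists every graph type. The sketch below requests a subset instead, one documented value per call. It runs only when the Kmedians package is installed and otherwise does nothing. Passing a single string to `graph` follows the argument description above, but the exact rendering of each graph is not documented here, so take this as a usage sketch rather than a specification.

```r
# Two of the documented values of `graph`.
graphs <- c("Two_Dim", "SE")

# Runs only if the Kmedians package is installed; otherwise it is skipped.
if (requireNamespace("Kmedians", quietly = TRUE)) {
  set.seed(1)
  ech <- Kmedians::gen_K(n = 200, K = 3, pcont = 0.2)
  res <- Kmedians::Kmedians(ech$X, par = FALSE)
  for (g in graphs) {
    Kmedians::Kplot(res, graph = g)   # one graph type per call
  }
}
```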
