Title: | K-Medians |
---|---|
Description: | Online, Semi-online, and Offline K-medians algorithms are given. These algorithms can be initialized randomly or with the help of a robust hierarchical clustering. The number of clusters can be selected with the help of a penalized criterion. The package provides functions for robust clustering. Function gen_K() generates a sample of data following a contaminated Gaussian mixture. Functions Kmedians() and Kmeans() implement K-medians and K-means algorithms, while Kplot() produces graphs for both methods. Cardot, H., Cenac, P. and Zitt, P-A. (2013). "Efficient and fast estimation of the geometric median in Hilbert spaces with an averaged stochastic gradient algorithm". Bernoulli, 19, 18-43. <doi:10.3150/11-BEJ390>. Cardot, H. and Godichon-Baggioni, A. (2017). "Fast Estimation of the Median Covariation Matrix with Application to Online Robust Principal Components Analysis". Test, 26(3), 461-480 <doi:10.1007/s11749-016-0519-x>. Godichon-Baggioni, A. and Surendran, S. "A penalized criterion for selecting the number of clusters for K-medians" <arXiv:2209.03597>. Vardi, Y. and Zhang, C.-H. (2000). "The multivariate L1-median and associated data depth". Proc. Natl. Acad. Sci. USA, 97(4):1423-1426. <doi:10.1073/pnas.97.4.1423>. |
Authors: | Antoine Godichon-Baggioni [aut, cre, cph], Sobihan Surendran [aut] |
Maintainer: | Antoine Godichon-Baggioni <[email protected]> |
License: | GPL (>= 2) |
Version: | 2.2.0 |
Built: | 2025-03-13 04:01:14 UTC |
Source: | https://github.com/cran/Kmedians |
The package provides functions for robust clustering. Function gen_K generates a sample of data following a contaminated Gaussian mixture. Functions Kmedians and Kmeans implement K-medians and K-means algorithms, while Kplot produces graphs for both methods.
Antoine Godichon-Baggioni [aut, cre, cph], Sobihan Surendran [aut]
Maintainer: Antoine Godichon-Baggioni <[email protected]>
Cardot, H., Cenac, P. and Zitt, P-A. (2013). Efficient and fast estimation of the geometric median in Hilbert spaces with an averaged stochastic gradient algorithm. Bernoulli, 19, 18-43.
Cardot, H. and Godichon-Baggioni, A. (2017). Fast Estimation of the Median Covariation Matrix with Application to Online Robust Principal Components Analysis. Test, 26(3), 461-480.
Godichon-Baggioni, A. and Surendran, S. A penalized criterion for selecting the number of clusters for K-medians. arxiv.org/abs/2209.03597
Vardi, Y. and Zhang, C.-H. (2000). The multivariate L1-median and associated data depth. Proc. Natl. Acad. Sci. USA, 97(4):1423-1426.
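The first reference above concerns an averaged stochastic gradient estimator of the geometric median, which is the kind of recursion an online K-medians update can build on. The following is a minimal base-R sketch of that recursion, not the package's code: the function name asg_median, the step-size constants c_gamma and alpha, and the toy data are all assumptions made for illustration.

```r
# Averaged stochastic gradient estimate of the geometric median (sketch).
# c_gamma and alpha are illustrative step-size choices, not package defaults.
asg_median <- function(X, c_gamma = 5, alpha = 0.66) {
  X <- as.matrix(X)
  m <- X[1, ]      # current iterate, started at the first observation
  m_bar <- m       # running average of the iterates
  for (i in seq_len(nrow(X) - 1)) {
    diff <- X[i + 1, ] - m
    nrm <- sqrt(sum(diff^2))
    if (nrm > 0) {
      # Move a step of length c_gamma * i^(-alpha) towards the new observation.
      m <- m + c_gamma * i^(-alpha) * diff / nrm
    }
    # Update the average of all iterates seen so far.
    m_bar <- m_bar + (m - m_bar) / (i + 1)
  }
  m_bar
}

set.seed(1)
clean <- matrix(rnorm(5000 * 2), ncol = 2)               # centered at (0, 0)
outliers <- matrix(rnorm(500 * 2, mean = 50), ncol = 2)  # far-away contamination
X <- rbind(clean, outliers)[sample(5500), ]              # shuffled data stream

med <- asg_median(X)   # robust location estimate
mu <- colMeans(X)      # non-robust comparison
print(rbind(asg_median = med, mean = mu))
```

Each observation is visited once, so the cost per update does not grow with the sample size. On this toy stream the averaged estimate should stay much closer to the origin than the sample mean, which is dragged towards the contaminated points.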
Generates a sample from a Gaussian mixture model whose centers are generated randomly on a sphere of radius radius.
gen_K(n=500,d=5,K=3,pcont=0,df=1, cont="Student",min=-5,max=5,radius=5)
n |
A positive integer giving the number of data per cluster. Default is |
d |
A positive integer giving the dimension. Default is |
K |
A positive integer giving the number of clusters. Default is |
pcont |
A scalar between |
df |
A positive integer giving the degrees of freedom of the law of the contaminated data if |
cont |
The law of the contaminated data. Can be |
min |
A scalar giving the lower bound of the uniform law if |
max |
A scalar giving the upper bound of the uniform law if |
radius |
The radius of the sphere on which the centers of the classes are generated. Default is |
A list with:
X |
A numerical matrix giving the generated data. |
cluster |
A character vector specifying the true classification. |
n <- 500
K <- 3
pcont <- 0.2
ech <- gen_K(n=n,K=K,pcont=pcont)
X <- ech$X
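To make the sampling scheme described above more concrete, here is a rough base-R sketch of a contaminated Gaussian mixture with centers on a sphere. It is not the source of gen_K: the helper name sim_contaminated_mixture, the identity covariance, the way contaminated rows are drawn, and the label "0" for them are all assumptions of this sketch.

```r
# Hypothetical helper, not the package's gen_K(): samples K Gaussian clusters
# whose centers lie on a sphere, then replaces a share of rows by Student noise.
sim_contaminated_mixture <- function(n = 500, d = 5, K = 3, pcont = 0,
                                     df = 1, radius = 5) {
  # Draw K directions uniformly on the unit sphere and scale them to `radius`.
  centers <- matrix(rnorm(K * d), nrow = K)
  centers <- radius * centers / sqrt(rowSums(centers^2))
  # n points per cluster with identity covariance (an assumption).
  cluster <- rep(seq_len(K), each = n)
  X <- centers[cluster, , drop = FALSE] + matrix(rnorm(n * K * d), ncol = d)
  # Replace a fraction pcont of the rows by coordinate-wise Student draws.
  n_out <- round(pcont * n * K)
  if (n_out > 0) {
    idx <- sample(n * K, n_out)
    X[idx, ] <- matrix(rt(n_out * d, df = df), ncol = d)
    cluster[idx] <- 0   # label 0 marks contaminated rows in this sketch
  }
  list(X = X, cluster = as.character(cluster), centers = centers)
}

set.seed(42)
sim <- sim_contaminated_mixture(n = 100, d = 3, K = 3, pcont = 0.2)
print(table(sim$cluster))
```

With df = 1 the contaminated rows are Cauchy-distributed, so a few of them can land very far from every center, which is the situation robust clustering is meant to handle.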
A K-means algorithm.
Kmeans(X,nclust=1:15,ninit=1,niter=20,par=TRUE)
X |
A numerical matrix giving the data. |
nclust |
A vector of positive integers giving the possible numbers of clusters. Default is |
ninit |
A non-negative integer giving the number of random initializations. Default is |
niter |
A positive integer giving the number of iterations for the EM algorithms. Default is |
par |
A logical argument telling if the parallelization of the algorithm is allowed. Default is |
A list with:
bestresults |
A list giving all the results for the clustering selected by |
allresults |
A list containing all the results. |
SE |
A vector giving the Sum of Errors for each considered number of clusters. |
cap |
The results given by the function |
Ksel |
An integer giving the number of clusters selected by |
data |
A numerical matrix giving the data. |
nclust |
A vector of positive integers giving the considered numbers of clusters. |
For the lists bestresults and allresults:
cluster |
A vector of positive integers giving the clustering. |
centers |
A numerical matrix giving the centers of the clusters. |
SE |
An integer giving the Sum of Errors. |
See also Kmedians, Kplot and gen_K.
## Not run:
n <- 500
K <- 3
pcont <- 0.2
ech <- gen_K(n=n,K=K,pcont=pcont)
X <- ech$X
res <- Kmeans(X,par=FALSE)
Kplot(res)
## End(Not run)
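The SE component above is described as one Sum of Errors per candidate number of clusters. As an illustration of that idea only, the sketch below computes such a vector with stats::kmeans from base R rather than with this package. It assumes that, for K-means, the error is the total within-cluster sum of squared distances, which the documentation above does not state explicitly. The toy data are also an assumption.

```r
set.seed(123)
# Three well-separated 2-D Gaussian clusters (toy data for this sketch).
X <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
           matrix(rnorm(200, mean = 6), ncol = 2),
           matrix(rnorm(200, mean = -6), ncol = 2))

nclust <- 1:6
# Total within-cluster sum of squares for each candidate number of clusters.
SE <- sapply(nclust, function(k) {
  kmeans(X, centers = k, nstart = 5, iter.max = 20)$tot.withinss
})
names(SE) <- nclust
print(round(SE, 1))
```

Such a curve typically drops sharply up to the true number of clusters and flattens afterwards, which is why a penalized criterion rather than the raw error is used to pick the number of clusters.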
K-medians algorithms.
Kmedians(X,nclust=1:15,ninit=0,niter=20, method='Offline', init=TRUE,par=TRUE)
X |
A numerical matrix giving the data. |
nclust |
A vector of positive integers giving the possible numbers of clusters. Default is |
ninit |
A non-negative integer giving the number of random initializations. Default is |
niter |
A positive integer giving the number of iterations for the EM algorithms. Default is |
method |
The selected method for the K-medians algorithm. Can be |
init |
A logical argument telling if the function |
par |
A logical argument telling if the parallelization of the algorithm is allowed. Default is |
A list with:
bestresults |
A list giving all the results for the clustering selected by |
allresults |
A list containing all the results. |
SE |
A vector giving the Sum of Errors for each considered number of clusters. |
cap |
The results given by the function |
Ksel |
An integer giving the number of clusters selected by |
data |
A numerical matrix giving the data. |
nclust |
A vector of positive integers giving the considered numbers of clusters. |
For the lists bestresults and allresults:
cluster |
A vector of positive integers giving the clustering. |
centers |
A numerical matrix giving the centers of the clusters. |
SE |
An integer giving the Sum of Errors. |
Godichon-Baggioni, A. and Surendran, S. A penalized criterion for selecting the number of clusters for K-medians. arxiv.org/abs/2209.03597
See also Kmeans, Kplot and gen_K.
## Not run:
n <- 500
K <- 3
pcont <- 0.2
ech <- gen_K(n=n,K=K,pcont=pcont)
X <- ech$X
res <- Kmedians(X,par=FALSE)
Kplot(res)
## End(Not run)
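For readers who want to see what an offline K-medians iteration can look like, the following base-R sketch alternates between assigning points to the nearest center and recomputing each center as a geometric median with Weiszfeld iterations (the algorithm studied in the Vardi and Zhang reference). It is a simplified illustration, not the implementation behind Kmedians: the names weiszfeld, dist_to_centers and kmedians_sketch, the fixed iteration counts, and the use of the sum of Euclidean distances as the error are assumptions.

```r
# Weiszfeld iterations for the geometric median of the rows of X.
weiszfeld <- function(X, niter = 50, eps = 1e-8) {
  m <- colMeans(X)
  for (it in seq_len(niter)) {
    dist <- sqrt(rowSums(sweep(X, 2, m)^2))
    w <- 1 / pmax(dist, eps)        # guard against division by zero
    m <- colSums(X * w) / sum(w)    # weighted mean with weights 1 / distance
  }
  m
}

# Euclidean distance from every row of X to every center (n x K matrix).
dist_to_centers <- function(X, centers) {
  vapply(seq_len(nrow(centers)),
         function(k) sqrt(rowSums(sweep(X, 2, centers[k, ])^2)),
         numeric(nrow(X)))
}

# Hypothetical offline K-medians loop (not the package's implementation).
kmedians_sketch <- function(X, centers, niter = 20) {
  for (it in seq_len(niter)) {
    cluster <- max.col(-dist_to_centers(X, centers), ties.method = "first")
    for (k in seq_len(nrow(centers))) {
      if (any(cluster == k)) {
        centers[k, ] <- weiszfeld(X[cluster == k, , drop = FALSE])
      }
    }
  }
  D <- dist_to_centers(X, centers)
  cluster <- max.col(-D, ties.method = "first")
  # Sum of Euclidean distances to the assigned center (assumed error measure).
  SE <- sum(D[cbind(seq_len(nrow(X)), cluster)])
  list(cluster = cluster, centers = centers, SE = SE)
}

set.seed(7)
X <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
           matrix(rnorm(200, mean = 8), ncol = 2),
           matrix(rnorm(200, mean = -8), ncol = 2),
           matrix(rt(40, df = 1), ncol = 2))   # heavy-tailed contamination
res <- kmedians_sketch(X, centers = X[c(1, 101, 201), ])
print(round(res$centers, 2))
```

The initial centers are taken as one observation from each simulated cluster to keep the sketch deterministic. Because each center is a geometric median rather than a mean, the heavy-tailed rows should have little influence on the estimated centers.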
A plot function for K-medians and K-means
Kplot(a,propplot=0.95,graph=c('Two_Dim','Capushe','Profiles','SE','Criterion'), bestresult=TRUE,Ksel=FALSE,bycluster=TRUE)
a |
|
propplot |
A scalar between |
graph |
A string specifying the type of graph requested.
Default is |
bestresult |
A logical indicating if the graphs must be done for the result chosen by the selected criterion. Default is |
Ksel |
A logical or positive integer giving the chosen number of clusters for which the graphs should be drawn. |
bycluster |
A logical indicating if the data selected for |
No return value.
## Not run:
n <- 500
K <- 3
pcont <- 0.2
ech <- gen_K(n=n,K=K,pcont=pcont)
X <- ech$X
res <- Kmedians(X,par=FALSE)
Kplot(res)
## End(Not run)