Package 'VarSelLCM' reference manual

Title:	Variable Selection for Model-Based Clustering of Mixed-Type Data Set with Missing Values
Description:	Full model selection (detection of the relevant features and estimation of the number of clusters) for model-based clustering (see <doi:10.1007/s11222-016-9670-1>). The data can be continuous, categorical, integer or mixed. Missing values are allowed and require no pre-processing. A Shiny application makes the results easy to interpret.
Authors:	Matthieu Marbac and Mohammed Sedki
Maintainer:	Mohammed Sedki <[email protected]>
License:	GPL (>= 2)
Version:	2.1.4
Built:	2025-03-05 05:04:32 UTC
Source:	https://github.com/r-forge/varsellcm

Variable Selection for Model-Based Clustering of Mixed-Type Data Set with Missing Values

Description

Model-based clustering with variable selection and estimation of the number of clusters. The data can be continuous, categorical, integer or mixed. Missing values are allowed and require no pre-processing. A Shiny application makes the results easy to interpret.

Details

Package:	VarSelLCM
Type:	Package
Version:	2.1.2
Date:	2018-06-04
License:	GPL-3
LazyLoad:	yes
URL:	http://varsellcm.r-forge.r-project.org/

The main function is VarSelCluster, which carries out model selection (according to AIC, BIC or MICL) and maximum likelihood estimation.

Function VarSelShiny runs a Shiny application that makes the clustering results easy to interpret.

Function VarSelImputation imputes missing values by using the model parameters.

Standard methods (e.g., summary, print, plot, coef, fitted, predict) are available to facilitate interpretation.

Author(s)

Matthieu Marbac and Mohammed Sedki. Maintainer: Mohammed Sedki <[email protected]>

References

Marbac, M. and Sedki, M. (2017). Variable selection for model-based clustering using the integrated completed-data likelihood. Statistics and Computing, 27 (4), 1049-1063.

Marbac, M. and Patin, E. and Sedki, M. (2018). Variable selection for mixed data clustering: Application in human population genomics. Journal of Classification, to appear.

Examples

## Not run: 
# Package loading
require(VarSelLCM)

# Data loading:
# x contains the observed variables
# ztrue contains the known status (1: absence, 2: presence of heart disease)
data(heart)
ztrue <- heart[,"Class"]
x <- heart[,-13]

# Cluster analysis without variable selection
res_without <- VarSelCluster(x, 2, vbleSelec = FALSE, crit.varsel = "BIC")

# Cluster analysis with variable selection (with parallelisation)
res_with <- VarSelCluster(x, 2, nbcores = 2, initModel=40, crit.varsel = "BIC")

# Comparison of the BIC for both models:
# here, variable selection improves the BIC
BIC(res_without)
BIC(res_with)

# Comparison of the partition accuracy.
# ARI is computed between the true partition (ztrue) and its estimates.
# ARI is close to 0 when the partitions are independent and equals 1 when they are identical.
# Here, variable selection improves the ARI.
# Note that ARI cannot be used for model selection in clustering, because the true partition is unknown in practice.
ARI(ztrue, fitted(res_without))
ARI(ztrue, fitted(res_with))

# Estimated partition
fitted(res_with)

# Estimated probabilities of classification
head(fitted(res_with, type="probability"))

# Summary of the probabilities of misclassification
plot(res_with, type="probs-class")

# Confusion matrices (only possible because the "true" partition is known).
# Here, variable selection decreases the misclassification error rate.
table(ztrue, fitted(res_without))
table(ztrue, fitted(res_with))

# Summary of the best model
summary(res_with)

# Discriminative power of the variables (here, the most discriminative variable is MaxHeartRate)
plot(res_with)

# More detailed output
print(res_with)

# Print model parameter
coef(res_with)

# Boxplot for the continuous variable MaxHeartRate
plot(x=res_with, y="MaxHeartRate")

# Empirical and theoretical distributions of the most discriminative variable
# (to check that the distribution is well-fitted)
plot(res_with, y="MaxHeartRate", type="cdf")

# Summary of categorical variable
plot(res_with, y="Sex")

# Probabilities of classification for new observations 
predict(res_with, newdata = x[1:3,])

# Imputation by sampling from the full conditional distribution for the first observation
not.imputed <- x[1,]
imputed <- VarSelImputation(res_with, x[1,], method = "sampling")
rbind(not.imputed, imputed)

# Opening Shiny application to easily see the results
VarSelShiny(res_with)



## End(Not run)

AIC criterion.

Description

This function gives the AIC criterion of an instance of VSLCMresults. AIC is computed according to the formula

$\mathrm{AIC} = \text{log-likelihood} - \nu$

where $\nu$ denotes the number of parameters in the fitted model.
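
The arithmetic of this formula can be checked by hand. The sketch below uses made-up values for the log-likelihood and the number of parameters (they do not come from a fitted model), and the names loglik, nu, aic_pkg and aic_usual are illustrative only. It also compares the formula above with the more common definition -2 * log-likelihood + 2 * nu, for which smaller values are better.

```r
# Hypothetical values, for illustration only (not taken from a fitted model)
loglik <- -1250.4   # maximized log-likelihood
nu     <- 27        # number of parameters in the fitted model

# Convention of the formula above
aic_pkg <- loglik - nu

# More common convention (smaller is better)
aic_usual <- -2 * loglik + 2 * nu

# The two conventions differ only by a factor of -2
all.equal(aic_usual, -2 * aic_pkg)
```

Because the two quantities differ by a factor of -2, a larger value under the formula above corresponds to a smaller value of the usual AIC.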

Usage

## S4 method for signature 'VSLCMresults'
AIC(object)

Arguments

object

instance of VSLCMresults.

References

Akaike, H. (1974), "A new look at the statistical model identification", IEEE Transactions on Automatic Control, 19 (6): 716-723.

Examples

# Data loading:
data(heart)

# Cluster analysis without variable selection
res <- VarSelCluster(heart[,-13], 2, vbleSelec = FALSE)

# Get the AIC value
AIC(res)

Adjusted Rand Index

Description

This function computes the Adjusted Rand Index between two partitions.

Usage

ARI(x, y)

Arguments

`x`	vector defining a partition.
`y`	vector defining a partition, whose length is equal to the length of x.

Value

numeric. The value of the Adjusted Rand Index between the partitions x and y.
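
To make the index concrete, here is a minimal base-R sketch of the Hubert and Arabie formula built from the contingency table of the two partitions. It is an illustration, not the package's implementation, and the name ari_sketch is hypothetical.

```r
# Adjusted Rand Index from the contingency table of two partitions
# (illustrative sketch; not the code used by the package)
ari_sketch <- function(x, y) {
  stopifnot(length(x) == length(y))
  tab <- table(x, y)                          # contingency table
  n <- sum(tab)
  sum_cells <- sum(choose(tab, 2))            # pairs grouped together in both partitions
  sum_rows  <- sum(choose(rowSums(tab), 2))   # pairs grouped together in x
  sum_cols  <- sum(choose(colSums(tab), 2))   # pairs grouped together in y
  expected  <- sum_rows * sum_cols / choose(n, 2)
  maximum   <- (sum_rows + sum_cols) / 2
  (sum_cells - expected) / (maximum - expected)
}

# Identical partitions up to a relabelling of the clusters
ari_sketch(c(1, 1, 2, 2), c(2, 2, 1, 1))

# Partitions that disagree on every pair
ari_sketch(c(1, 1, 2, 2), c(1, 2, 1, 2))
```

The index does not depend on the cluster labels, and it can be negative when the agreement is lower than expected by chance. The sketch returns NaN when both partitions contain a single cluster, because the denominator is then zero.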

References

L. Hubert and P. Arabie (1985) Comparing Partitions, Journal of Classification, 2, pp. 193-218.

Examples

x <- sample(1:2, 20, replace=TRUE)
y <- x
y[1:5] <- sample(1:2, 5, replace=TRUE)
ARI(x, y)

BIC criterion.

Description

This function gives the BIC criterion of an instance of VSLCMresults. BIC is computed according to the formula

$\mathrm{BIC} = \text{log-likelihood} - 0.5 \, \nu \log(n)$

where $\nu$ denotes the number of parameters in the fitted model and $n$ represents the sample size.
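
The formula can be evaluated directly. The sketch below compares two hypothetical models fitted to the same sample; the log-likelihoods, parameter counts and sample size are made up, and bic_pkg is an illustrative helper, not a function of the package.

```r
# BIC under the convention of the formula above (illustrative helper)
bic_pkg <- function(loglik, nu, n) {
  loglik - 0.5 * nu * log(n)
}

n <- 270  # hypothetical sample size

# Two hypothetical models: B fits slightly better but uses more parameters
bic_a <- bic_pkg(loglik = -1300, nu = 20, n = n)
bic_b <- bic_pkg(loglik = -1290, nu = 35, n = n)

c(model_a = bic_a, model_b = bic_b)

# Model A has the larger value: the extra parameters of B are not
# compensated by its gain in log-likelihood
bic_a > bic_b
```

Since the penalty is subtracted from the log-likelihood, a larger value is better under this formula.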

Usage

## S4 method for signature 'VSLCMresults'
BIC(object)

Arguments

object

instance of VSLCMresults.

References

Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6(2), 461-464.

Examples

# Data loading:
data(heart)

# Cluster analysis without variable selection (two clusters)
res <- VarSelCluster(heart[,-13], 2, vbleSelec = FALSE)

# Get the BIC value
BIC(res)

Extract the parameters

Description

This function returns an instance of class VSLCMparam which contains the model parameters.

Usage

## S4 method for signature 'VSLCMresults'
coef(object)

Arguments

object

instance of VSLCMresults.

Examples

# Data loading:
data(heart)

# Cluster analysis without variable selection (number of clusters between 1 and 3)
res  <- VarSelCluster(heart[,-13], 1:3, vbleSelec = FALSE)

# Extract the model parameters
coef(res)

Extract the parameters

Description

This function returns an instance of class VSLCMparam which contains the model parameters.

Usage

## S4 method for signature 'VSLCMresults'
coefficients(object)

Arguments

object

instance of VSLCMresults.

Examples

# Data loading:
data(heart)

# Cluster analysis without variable selection (number of clusters between 1 and 3)
res  <- VarSelCluster(heart[,-13], 1:3, vbleSelec = FALSE)

# Extract the model parameters
coefficients(res)

Extract the partition or the probabilities of classification

Description

This function returns the probabilities of classification or the partition of the observations for an instance of VSLCMresults.

Usage

## S4 method for signature 'VSLCMresults'
fitted(object, type = "partition")

Arguments

`object`	instance of `VSLCMresults`.
`type`	character. The type of output: "probability" for the probabilities of classification or "partition" for the partition (default).
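
The two output types are closely related. The sketch below assumes that the partition assigns each observation to its most probable cluster (the MAP rule mentioned for the zMAP slot of VSLCMpartitions); whether fitted applies exactly this rule is an assumption here, and the probability matrix is made up for illustration.

```r
# Hypothetical probabilities of classification for 4 observations and 2 clusters
probs <- matrix(c(0.90, 0.10,
                  0.20, 0.80,
                  0.60, 0.40,
                  0.05, 0.95),
                ncol = 2, byrow = TRUE)

# Each row sums to one
rowSums(probs)

# MAP rule: index of the most probable cluster for each observation
partition <- apply(probs, 1, which.max)
partition
```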

Examples

# Data loading:
data(heart)

# Cluster analysis without variable selection (two clusters)
res <- VarSelCluster(heart[,-13], 2, vbleSelec = FALSE)

# Get the estimated partition
fitted(res)

Extract the partition or the probabilities of classification

Description

This function returns the probabilities of classification or the partition of the observations for an instance of VSLCMresults.

Usage

## S4 method for signature 'VSLCMresults'
fitted.values(object, type = "partition")

Arguments

`object`	instance of `VSLCMresults`.
`type`	character. The type of output: "probability" for the probabilities of classification or "partition" for the partition (default).

Examples

# Data loading:
data(heart)

# Cluster analysis without variable selection (two clusters)
res <- VarSelCluster(heart[,-13], 2, vbleSelec = FALSE)

# Get the estimated partition
fitted.values(res)

Statlog (Heart) Data Set

Description

This dataset is a heart disease database similar to a database already present in the repository (Heart Disease databases) but in a slightly different form.

Details

12 variables are used to cluster the observations:

age (integer)
sex (binary)
chest pain type (categorical with 4 levels)
resting blood pressure (continuous)
serum cholesterol in mg/dl (continuous)
fasting blood sugar > 120 mg/dl (binary)
resting electrocardiographic results (categorical with 3 levels)
maximum heart rate achieved (continuous)
exercise induced angina (binary)
the slope of the peak exercise ST segment (categorical with 3 levels)
number of major vessels colored by fluoroscopy (categorical with 4 levels)
thal: 3 = normal; 6 = fixed defect; 7 = reversible defect (categorical with 3 levels)

1 variable defines a "true" partition: absence (1) or presence (2) of heart disease

References

UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science: http://archive.ics.uci.edu/ml/datasets/statlog+(heart)

Examples

  data(heart)

ICL criterion

Description

This function gives the ICL criterion for an instance of VSLCMresults.

Usage

ICL(object)

Arguments

object

instance of VSLCMresults.

References

Biernacki, C., Celeux, G., and Govaert, G. (2000). Assessing a mixture model for clustering with the integrated completed likelihood. IEEE transactions on pattern analysis and machine intelligence, 22(7), 719-725.

Examples

# Data loading:
data(heart)

# Cluster analysis without variable selection
res <- VarSelCluster(heart[,-13], 2, vbleSelec = FALSE)

# Get the ICL value
ICL(res)

MICL criterion

Description

This function gives the MICL criterion for an instance of VSLCMresults.

Usage

MICL(object)

Arguments

object

instance of VSLCMresults.

References

Marbac, M. and Sedki, M. (2017). Variable selection for model-based clustering using the integrated completed-data likelihood. Statistics and Computing, 27 (4), 1049-1063.

Examples

## Not run: 
# Data loading:
data("heart")

# Cluster analysis with variable selection
object <- VarSelCluster(heart[,-13], 2, vbleSelec = TRUE, crit.varsel = "MICL")

# Get the MICL value
MICL(object)

## End(Not run)

Plots of an instance of `VSLCMresults`

Description

This function provides several plots for an instance of VSLCMresults. It can display:

the discriminative power of the variables (type="bar" or type="pie"). The larger the discriminative power of a variable, the more that variable explains the clusters.
the probabilities of misclassification (type="probs-overall" or type="probs-class").
the distribution of a single variable (y is the name of the variable and type="boxplot" or type="cdf").

Usage

## S4 method for signature 'VSLCMresults,character'
plot(x, y, type = "boxplot", ylim = c(1,
  x@data@d))

Arguments

`x`	instance of `VSLCMresults`.
`y`	character. The name of the variable to plot (only used if type="boxplot" or type="cdf").
`type`	character. The type of plot ("bar": barplot of the discriminative power, "pie": pie chart of the discriminative power, "probs-overall": histogram of the probabilities of misclassification, "probs-class": histogram of the probabilities of misclassification per cluster, "boxplot": boxplot of a single variable per cluster, "cdf": distribution of a single variable per cluster).
`ylim`	numeric. Defines the range of the most discriminative variables to consider (only used if type="pie" or type="bar").

Examples

## Not run: 
require(VarSelLCM)

# Data loading:
# x contains the observed variables
# ztrue contains the known status (1: absence, 2: presence of heart disease)
data(heart)
ztrue <- heart[,"Class"]
x <- heart[,-13]

# Cluster analysis with variable selection (with parallelisation)
res_with <- VarSelCluster(x, 2, nbcores = 2, initModel=40)

# Summary of the probabilities of misclassification
plot(res_with, type="probs-class")

# Discriminative power of the variables (here, the most discriminative variable is MaxHeartRate)
plot(res_with)

# Boxplot for the continuous variable MaxHeartRate
plot(res_with, y="MaxHeartRate")

# Empirical and theoretical distributions (to check that the distribution is well-fitted)
plot(res_with, y="MaxHeartRate", type="cdf")

# Summary of categorical variable
plot(res_with, y="Sex")

## End(Not run)

Prediction of the cluster memberships

Description

This function gives the probabilities of classification for new observations by using the mixture model fit with the function VarSelCluster.

Usage

## S4 method for signature 'VSLCMresults'
predict(object, newdata, type = "probability")

Arguments

`object`	instance of `VSLCMresults`.
`newdata`	data.frame of the observations to classify.
`type`	character. The type of prediction: "probability" for the probabilities of classification (default) or "partition" for the partition.

Value

Returns a matrix of the probabilities of classification.
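
To illustrate what a probability of classification is, the sketch below computes it by Bayes' rule for a new observation. It assumes a two-component model with two continuous variables that are Gaussian and independent within each component; all parameter values and the name posterior_sketch are hypothetical, and this is not the package's code. Treating a missing value by dropping the corresponding term is also an assumption of this sketch.

```r
# Hypothetical parameters of a 2-component mixture with two continuous variables
prop <- c(0.4, 0.6)                        # mixture proportions
mu   <- cbind(c(120, 200), c(150, 260))    # means: rows = variables, columns = components
sdev <- cbind(c(10, 30), c(12, 35))        # standard deviations, same layout

posterior_sketch <- function(xnew) {
  # log(proportion) + sum of the log-densities of the observed variables
  logw <- sapply(seq_along(prop), function(k) {
    log(prop[k]) +
      sum(dnorm(xnew, mean = mu[, k], sd = sdev[, k], log = TRUE), na.rm = TRUE)
  })
  w <- exp(logw - max(logw))   # subtract the maximum for numerical stability
  w / sum(w)                   # normalize so that the probabilities sum to one
}

# Fully observed observation
posterior_sketch(c(118, 210))

# Observation whose first variable is missing
posterior_sketch(c(NA, 255))
```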

Print function.

Description

This function gives the print of an instance of VSLCMresults.

Usage

## S4 method for signature 'VSLCMresults'
print(x)

Arguments

`x`	instance of `VSLCMresults`.

Summary function.

Description

This function gives the summary of an instance of VSLCMresults.

Usage

## S4 method for signature 'VSLCMresults'
summary(object)

Arguments

object

instance of VSLCMresults.

Variable selection and clustering.

Description

This function performs the model selection and the maximum likelihood estimation. It can be used for clustering only (i.e., all the variables are assumed to be discriminative). In this case, you must specify the data to cluster (arg. x) and the number of clusters (arg. gvals), and set the option vbleSelec to FALSE. This function can also be used for variable selection in clustering. In this case, you must specify the data to analyse (arg. x) and the number of clusters (arg. gvals), and set the option vbleSelec to TRUE. Variable selection can be done with BIC, MICL or AIC.

Usage

VarSelCluster(x, gvals, vbleSelec = TRUE, crit.varsel = "BIC",
  initModel = 50, nbcores = 1, discrim = rep(1, ncol(x)), nbSmall = 250,
  iterSmall = 20, nbKeep = 50, iterKeep = 1000, tolKeep = 10^(-6))

Arguments

`x`	data.frame/matrix. Rows correspond to observations and columns correspond to variables. Continuous variables must be "numeric", count variables must be "integer" and categorical variables must be "factor"
`gvals`	numeric. It defines the numbers of components to consider.
`vbleSelec`	logical. It indicates whether variable selection is performed.
`crit.varsel`	character. It defines the information criterion used for model selection. Without variable selection, you can use one of the three criteria: "AIC", "BIC" and "ICL". With variable selection, you can use "AIC", "BIC" and "MICL".
`initModel`	numeric. It gives the number of initializations of the alternated algorithm maximizing the MICL criterion (only used if crit.varsel="MICL").
`nbcores`	numeric. It defines the number of cores used by the algorithm.
`discrim`	numeric. It indicates whether each variable is discriminative (1) or irrelevant (0) (only used if vbleSelec=FALSE).
`nbSmall`	numeric. It indicates the number of SmallEM algorithms performed for the ML inference
`iterSmall`	numeric. It indicates the number of iterations for each SmallEM algorithm
`nbKeep`	numeric. It indicates the number of chains used for the final EM algorithm
`iterKeep`	numeric. It indicates the maximal number of iterations for each EM algorithm
`tolKeep`	numeric. It indicates the maximal gap between two successive iterations of EM algorithm which stops the algorithm
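
Since the type of each column of x determines how the variable is treated, it can help to set the column types explicitly before clustering. The sketch below builds a small made-up data.frame (the column names and values are hypothetical) with one continuous, one count and one categorical variable, including missing values.

```r
# Hypothetical raw data, as it might look after being read from a file
raw <- data.frame(
  pressure = c(130, 115, 124.5, NA),   # continuous measurement
  visits   = c(2, 0, 5, 1),            # count, stored as double by default
  sex      = c("F", "M", NA, "F"),     # categorical, stored as character
  stringsAsFactors = FALSE
)

# Convert each column to the type described for argument x
x <- data.frame(
  pressure = as.numeric(raw$pressure),  # continuous -> "numeric"
  visits   = as.integer(raw$visits),    # count      -> "integer"
  sex      = factor(raw$sex)            # categorical -> "factor"
)

# Check the resulting classes
sapply(x, class)
```

A data.frame prepared this way could then be passed as the first argument of VarSelCluster, with the missing values left as NA.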

Value

Returns an instance of VSLCMresults.

References

Marbac, M. and Sedki, M. (2017). Variable selection for model-based clustering using the integrated completed-data likelihood. Statistics and Computing, 27 (4), 1049-1063.

Marbac, M. and Patin, E. and Sedki, M. (2018). Variable selection for mixed data clustering: Application in human population genomics. Journal of Classification, to appear.

Examples

## Not run: 
# Package loading
require(VarSelLCM)

# Data loading:
# x contains the observed variables
# ztrue contains the known status (1: absence, 2: presence of heart disease)
data(heart)
ztrue <- heart[,"Class"]
x <- heart[,-13]

# Cluster analysis without variable selection
res_without <- VarSelCluster(x, 2, vbleSelec = FALSE, crit.varsel = "BIC")

# Cluster analysis with variable selection (with parallelisation)
res_with <- VarSelCluster(x, 2, nbcores = 2, initModel=40, crit.varsel = "BIC")

# Comparison of the BIC for both models:
# here, variable selection improves the BIC
BIC(res_without)
BIC(res_with)

# Confusion matrices and ARI (only possible because the "true" partition is known).
# ARI is computed between the true partition (ztrue) and its estimates.
# ARI is close to 0 when the partitions are independent and equals 1 when they are identical.
# Here, variable selection improves the ARI and decreases the misclassification error rate.
# Note that ARI cannot be used for model selection in clustering, because the true partition is unknown in practice.
table(ztrue, fitted(res_without))
table(ztrue, fitted(res_with))
ARI(ztrue,  fitted(res_without))
ARI(ztrue, fitted(res_with))
 
# Estimated partition
fitted(res_with)

# Estimated probabilities of classification
head(fitted(res_with, type="probability"))

# Summary of the probabilities of misclassification
plot(res_with, type="probs-class")

# Summary of the best model
summary(res_with)

# Discriminative power of the variables (here, the most discriminative variable is MaxHeartRate)
plot(res_with)

# More detailed output
print(res_with)

# Print model parameter
coef(res_with)

# Boxplot for the continuous variable MaxHeartRate
plot(x=res_with, y="MaxHeartRate")

# Empirical and theoretical distributions of the most discriminative variable 
# (to check that the distribution is well-fitted)
plot(res_with, y="MaxHeartRate", type="cdf")

# Summary of categorical variable
plot(res_with, y="Sex")

# Probabilities of classification for new observations 
predict(res_with, newdata = x[1:3,])

# Imputation by sampling from the full conditional distribution for the first observation
not.imputed <- x[1,]
imputed <- VarSelImputation(res_with, x[1,], method = "sampling")
rbind(not.imputed, imputed)

# Opening Shiny application to easily see the results
VarSelShiny(res_with)



## End(Not run)

Imputation of missing values

Description

This function imputes missing values in a dataset by using a mixture model. Two methods can be used for imputation:

posterior mean (method="postmean")
sampling from the full conditional distribution (method="sampling")
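
The difference between the two methods can be sketched for a single missing continuous value. The example assumes a two-component Gaussian mixture in which variables are independent within each component, so that the conditional distribution of the missing value is itself a mixture weighted by the class probabilities given the observed values. All numbers are hypothetical and this is not the package's code.

```r
set.seed(1)

# Hypothetical quantities for one observation with one missing continuous value
tik <- c(0.8, 0.2)   # class probabilities given the observed values
mu  <- c(140, 165)   # component means of the missing variable
sdv <- c(10, 12)     # component standard deviations of the missing variable

# Posterior mean: average of the component means weighted by tik
imp_mean <- sum(tik * mu)
imp_mean

# Sampling: draw a component according to tik, then draw a value from it
k <- sample(seq_along(tik), size = 1, prob = tik)
imp_draw <- rnorm(1, mean = mu[k], sd = sdv[k])
imp_draw
```

The posterior mean is deterministic and shrinks imputed values towards the centre of the distribution, whereas sampling preserves the variability of the variable at the price of a random result.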

Usage

VarSelImputation(obj, newdata, method = "postmean")

Arguments

`obj`	an instance of VSLCMresults which defines the model used for imputation.
`newdata`	data.frame. Dataset containing the missing values to impute.
`method`	character defining the method of imputation: "postmean" or "sampling".

Examples

# Data loading
data("heart")

# Clustering into 2 classes
results <- VarSelCluster(heart[,-13], 2)

# Data where missing values will be imputed
newdata <- heart[1:2,-13]
newdata[1,1] <- NA
newdata[2,2] <- NA

# Imputation
VarSelImputation(results, newdata)

Shiny app for analyzing results from VarSelCluster

Description

Shiny app for analyzing results from VarSelCluster

Usage

VarSelShiny(X)

Arguments

`X`	an instance of VSLCMresults returned by function VarSelCluster.

Examples

## Not run: 
# Data loading
data("heart")
# Clustering into 2 classes
results <- VarSelCluster(heart[,-13], 2)
# Opening Shiny application to easily see the results
VarSelShiny(results)

## End(Not run)

Constructor of `VSLCMcriteria` class

Description

loglikelihood: numeric. Log-likelihood
AIC: numeric. Value of the AIC criterion.
BIC: numeric. Value of the BIC criterion.
ICL: numeric. Value of the ICL criterion.
MICL: numeric. Value of the MICL criterion.
nbparam: integer. Number of parameters.
cvrate: numeric. Rate of convergence of the alternated algorithm for optimizing the MICL criterion.
degeneracyrate: numeric. Rate of degeneracy for the selected model.
discrim: numeric. Discriminative power of each variable.

Examples

  getSlots("VSLCMcriteria")

Constructor of `VSLCMdata` class

Description

n: number of observations
d: number of variables
withContinuous: logical indicating if some variables are continuous
withInteger: logical indicating if some variables are integer
withCategorical: logical indicating if some variables are categorical
dataContinuous: instance of VSLCMdataContinuous containing the continuous data
dataInteger: instance of VSLCMdataInteger containing the integer data
dataCategorical: instance of VSLCMdataCategorical containing the categorical data
var.names: labels of the variables

Examples

  getSlots("VSLCMdata")

Constructor of `VSLCMmodel` class

Description

g: numeric. Number of components.
omega: logical. Vector indicating if each variable is relevant (1) or irrelevant (0) to the clustering.
names.relevant: character. Names of the relevant variables.
names.irrelevant: character. Names of the irrelevant variables.

Examples

  getSlots("VSLCMmodel")

Constructor of `VSLCMparam` class

Description

pi: numeric. Proportions of the mixture components.
paramContinuous: VSLCMparamContinuous. Parameters of the continuous variables.
paramInteger: VSLCMparamInteger. Parameters of the integer variables.
paramCategorical: VSLCMparamCategorical. Parameters of the categorical variables.

Examples

  getSlots("VSLCMparam")

Constructor of `VSLCMparamCategorical` class

Description

pi: numeric. Proportions of the mixture components.
alpha: list. Parameters of the multinomial distributions.

Examples

  getSlots("VSLCMparamCategorical")

Constructor of `VSLCMparamContinuous` class

Description

pi: numeric. Proportions of the mixture components.
mu: matrix. Mean for each component (column) and each variable (row).
sd: matrix. Standard deviation for each component (column) and each variable (row).

Examples

  getSlots("VSLCMparamContinuous")

Constructor of `VSLCMparamInteger` class

Description

pi: numeric. Proportions of the mixture components.
lambda: matrix. Mean for each component (column) and each variable (row).

Examples

  getSlots("VSLCMparamInteger")

Constructor of `VSLCMpartitions` class

Description

zMAP: numeric. A vector indicating the class membership of each individual by using the MAP rule computed for the best model with its maximum likelihood estimates.
zOPT: numeric. Partition maximizing the integrated complete-data likelihood of the selected model.
tik: numeric. Fuzzy partition computed for the best model with its maximum likelihood estimates.

Examples

  getSlots("VSLCMpartitions")

Constructor of `VSLCMresults` class

Description

data: VSLCMdata. Results related to the data.
criteria: VSLCMcriteria. Results related to the information criteria.
partitions: VSLCMpartitions. Results related to the partitions.
model: VSLCMmodel. Results related to the selected model.
strategy: VSLCMstrategy. Results related to the tuning parameters.
param: VSLCMparam. Results related to the parameters.

Examples

  getSlots("VSLCMresults")

Constructor of `VSLCMstrategy` class

Description

initModel: numeric. Number of initialisations for the model selection algorithm.
vbleSelec: logical. It indicates if the selection of the variables is performed.
paramEstim: logical. It indicates if the parameter estimation is performed.
parallel: logical. It indicates if a parallelisation is done.
nbSmall: numeric. It indicates the number of small EM algorithms.
iterSmall: numeric. It indicates the number of iterations for each small EM algorithm.
nbKeep: numeric. It indicates the number of chains kept for the EM.
iterKeep: numeric. It indicates the maximum number of iterations for the EM.
tolKeep: numeric. It indicates the difference between two successive iterations of the EM below which the EM stops.

Examples

  getSlots("VSLCMstrategy")

