Tuesday, December 17, 2013

R packages and functions for multivariate analysis

package::function #comments
  • Visualization of multivariate data 
graphics::pairs(), stars(), mosaicplot()
graphics::coplot() #conditioning plot
lattice::xyplot(), splom()
car::scatterplotMatrix()
scatterplot3d::scatterplot3d()
aplpack::spin3R(), faces()
MASS::parcoord()
ade4::mstree()
vegan::spantree()
ellipse::plotcorr()
vcd::mosaic()
gclus:: #cluster specific graphical enhancements for scatter plots
xgobi::
rggobi::
  • Hypothesis testing
ICSNP:: #HotellingsT2 test, non-parametric
cramer::
SpatialNP::
  • Multivariate distributions
stats::cov(), cor()
ICSNP::spatial.median()
MASS::cov.rob()
covRobust::
robustbase::covMcd(), covOGK()
rrcov::
mvtnorm:: #simulation
mnormt::
sn:: #for skew normal and t
ks::rmvnorm.mixt(), dmvnorm.mixt() #comprehensive information on mixtures
bayesm::rwishart() #Wishart distribution
MCMCpack::rwish()

#multivariate normality test
MVN::HZ.test() #Henze-Zirkler’s Multivariate Normality Test
MVN::mardia.test() #Mardia’s Multivariate Normality Test
MVN::royston.test() #Royston’s Multivariate Normality Test
mvnormtest::mshapiro.test() #Shapiro-Wilk multivariate normality Test
mvoutlier::
energy::mvnorm.etest(), k.sample()
stats::mauchly.test() #Mauchly's test of sphericity

#Copulas
copula:: #generalized Archimedean copula
  • Linear models
stats::lm() #with matrix specified as dependent variable.
stats::anova.mlm(), manova()
PenLNM:: #penalized logistic normal multinomial regression.
sn::msn.mle(), mst.mle() #fit multivariate skew normal and skew t model.
pls:: #partial least squares regression, principal component regression
ppls:: #penalized partial least squares.
dr:: #dimension reduction regression, options: "sir", "save"
plsgenomics::
relaimpo:: #relative importance of regression parameters

  • Projection methods
#principal components
stats::prcomp() based on svd(), princomp() based on eigen(). #the former is preferred.
sca::
Hmisc::pc1()
paran:: #Horn's evaluation of the number of dimensions to retain
pcurve:: #principal curve analysis/visualization
gmodels::fast.prcomp(), fast.svd() #wide matrices
kernlab::kpca() #non-linear principal components.
pcaPP::acpgen(), acprob()
psy::sphpca() #maps into a sphere; fpca() #some variables as dependent; scree.plot()
#Canonical correlation:
stats::cancor()
kernlab::kcca()
concor::
#Redundancy analysis
calibrate::rda()
fso::
stats::cmdscale()
SensoMineR::MDS.indscal()

  • Classification
#unsupervised
stats::kmeans(), hclust()
cluster::
clv::
trimcluster::
clue::
clusterSim::
hybridHclust::
energy::edist(), hclust.energy()
kohonen::
clusterGeneration::
mclust::
MachineLearning:: #tree
rpart::
TWIX::
mvpart::
party::
caret:: #classification and regression training
kknn:: #k-nearest

#supervised
MASS::lda(), qda()
mda::mda(), fda() #mixture and flexible discriminant analysis
mda::mars() #multivariate adaptive regression splines; bruto() #adaptive spline backfitting
earth::
rda::
class::knn()
SensoMineR::FDA()
klaR:: #variable selection and robustness against multicollinearity and visualization
superpc:: #supervised pca
hddplot:: #cross-validated linear discriminant.
ROCR:: #assessing classifier performance

  • Correspondence analysis
MASS::mca(), corresp()
ca::ca()
ade4::mca(), hta()
FactoMineR::CA(), MCA()
homals::


  • Modeling non-Gaussian data
MNP::
polycor::
bayesm::
VGAM::
  • Matrix manipulations
Matrix::
SparseM::
matrixcalc:: #matrix differential calculus.
spam::

Monday, December 9, 2013

Test the content of two variables from different files in Shell

If you work with large datasets in a Linux shell, you may sometimes need to check whether the values in a column of file1 also appear in a certain column of file2. For example, file1 has one column listing 6,000 SNPs, and file2 has five columns with 250,000 SNPs in the second column. You want to know how many of the 6,000 SNPs are among the 250,000. awk offers an easy way to do that:

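awk 'FNR==NR {seen[$2]; next} $1 in seen' file2 file1 | wc -l

While awk reads file2 (the only time FNR==NR holds), it records each SNP from the second column as a key of the array seen. For file1, a line is printed only if its first column is one of those keys, so wc -l gives the number of SNPs from file1 that are present in file2.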
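
To try this kind of check on something small first, here is a minimal sketch on toy data. The file contents, the temporary directory, and the names seen, matched and missing are made up for illustration; only the layout (one column in file1, SNP ID in the second of five columns in file2) follows the example above. It counts the IDs of file1 that occur in file2 and also lists the ones that do not.

```shell
# Hypothetical toy data: file1 is a one-column SNP list,
# file2 has five columns with the SNP ID in column 2.
workdir=$(mktemp -d)
printf '%s\n' rs1 rs2 rs3 > "$workdir/file1"
printf '%s\n' \
  '1 rs1 100 A G' \
  '1 rs3 200 C T' \
  '2 rs9 300 G A' > "$workdir/file2"

# Pass 1 (file2, where FNR==NR): remember every SNP ID in column 2.
# Pass 2 (file1): count the IDs that were remembered.
matched=$(awk 'FNR==NR {seen[$2]; next} $1 in seen {n++} END {print n+0}' \
  "$workdir/file2" "$workdir/file1")

# Same idea, negated: print the IDs of file1 that are absent from file2.
missing=$(awk 'FNR==NR {seen[$2]; next} !($1 in seen)' \
  "$workdir/file2" "$workdir/file1")

echo "matched: $matched"
echo "missing: $missing"
rm -rf "$workdir"
```

Counting inside awk with an END block is just an alternative to piping into wc -l; either works. If the count equals the number of lines in file1, every SNP in file1 is contained in file2.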