Tuesday, December 17, 2013

R packages and functions for multivariate analysis

package::function #comments
  • Visualization of multivariate data 
graphics::pairs(), stars(), mosaicplot()
graphics::coplot() #conditioning plot
lattice::xyplot(), splom()
car::scatterplot.matrix()
scatterplot3d::scatterplot3d()
aplpack::spin3R(), faces()
MASS::parcoord
ade4::mstree()
vegan::spantree()
ellipse::plotcorr()
vcd::mosaic()
gclus:: #cluster specific graphical enhancements for scatter plots
xgobi::
rggobi::
  •  Hypothesis testing
ICSNP:: #HotellingsT2 test, non-parametric
cramer::
SpatialNP::
  •  Multivariate distributions
stats::cov(), cor()
INSNP::spatial.median()
MASS::cov.rob()
covRobust::
robustbase::covMCD(), covOGK()
rrcov::
mvtnorm:: #simulation
mnormt::
sn:: #for skew normal and t
ks::rmvnorm.mixt(), dmvnorm.mixt() #comprehensive information on mixtures;
bayesm::rwishart() #Wishart distribution
MCMCpack::rwish()

#multivariate normality test
MVN::HZ.test() #Henze-Zirkler’s Multivariate Normality Test
MVN::mardia.test() #Mardia’s Multivariate Normality Test
MVN::royston.test() #Royston’s Multivariate Normality Test
mvnormtest::mshapiro.test() #Shapiro-Wilk multivariate normality Test
mvoutlier::
energy::mvnorm.etest(), k.sample()
stats::mauchly.test

#Copulas
copula:: #generalized archimedian copula
  •  Linear models
stats::lm() #wiht matrix specified as dependent variable.
stats::anova.mlm(), manova()
PenLNM:: #penalized logistic normal multinomial regression.
sn::msn.mle(), mst.mle() #fit multivariate skew normal and skew t model.
pls:: #partial least squares regression, principle component regression
ppls:: #panelized partial least squares.
dr:: #dimension reduction regression, options: "sir", "save"
plsgenomices::
relaimpo: #relative importance of regression parameters

  • Projection methods
#principal components
stats::prcomp() based on svd(), princomp() based on eigen(). #the former is preferred.
sca::
Hmisc::pc1()
paran:: #Horn's evaluation of the number of dimensions to retain
pcurve:: #principle curve analysis/visualization
gmodels::fast.prcomp(), fast.svd #wide matrices
kernlab::kpca #non-linear principle components.
pcaPP::acpgen(), acprob()
psy::sphpca() #maps into a sphere, fpca(), scree.plot() #some variables as dependent
#Canonical correlation:
stats::cancor()
kernlab::kcca()
corcor::
#Redundancey analysis
calibrate::rda()
fso::
stats::cmdscale()
SensoMineR::MDS.indscal()

  • Classification
#unsupervised
Cluster::
kmeans::hclust()
cluster::
clv::
trimcluster::
clue::
clusterSim::
hybridHclust::
energy::edist(), hclust.energy()
kohonen::
clusterGeneration::
mclust::
MachineLearning:: #tree
rpart::
TWIX::
mvpart::
party::
caret:: #classification and regression training
kknn:: #k-nearest

#supervised
MASS::lda(), qda()
mda::mda() #mixture and flexible discriminant analysis
mars::fda() #multivariate adaptive regression splines. bruto() #adaptive spline backfitting
earth::
rda::
class:: knn()
SensoMineR::FDA()
klaR:: #variable selection and robustness against multicollinearity and visualization
superpc:: #supervised pca
hddplot:: #cross-validated linear discriminant.
ROCR:: #assessing classifier performance

  • Corresponding analysis
MASS::mca(), corresp()
ca::ca()
ade4::mca(), hta()
FactoMineR::CA(), MCA()
homals::


  • Modeling non-Gaussian data
MNP::
polycor::
bayesm::
VGAM::
  • Matrix manipulations
Matrix::
SparseM::
matrixcalc:: matrix differential calculus.
spam::

Monday, December 9, 2013

Test the content of two variables from different files in Shell

If you're working on large datasets through Linux Shell, you may sometime need to check whether a variable/column of file1 is in a certain column of file2. For example, file1 has one column with a list of 6,000 SNPs and file2 has five columns with 250,000 SNPs in the second column. You would check whether the 6000 SNPs are part of the 250,000 SNPs. The AWK function built in Shell has a easy way to realize that:

awk 'FNR==NR {a[$1]=$2; next}{print $1 a[$1]}' file2 file1 | wc -l


Thursday, November 28, 2013

Add python to Windows 7 DOS Path

Run Python from Window 7 DOS is not as straightforward as in Mac Os X.  You may meet this prompt if you haven't set the python path in DOS:

 C:\Users\dz2t >python
'python' is not recognized as an internal or external command,
operable program or batch file.

Here is a way to solve this problem:
click: Start --> Control Panel --> System and Security -->System

On the left side of the screen, you should be able to see 'Advanced system settings', and then click it.
Then at the bottom click 'Environment Variables'. In the System variables box, find and click something like 'Path    C:....'. Finally, in the pop 'Edit System Variable' box, you need to add the python path to the 'Variable value'. Usually add this text to the end of the box:

;C:\Python33;C:\Python33\scripts

Then from the DOS window you can evoke python directly.


Tuesday, November 19, 2013

Setting R Path in Mac OS for Latex

There's a nice latex() function from R package 'Hmisc' to generate LaTex tables automatically,  but the function might not work for Mac OS X platform due to the path environmental variables in R isn’t set to include the path to TeX binaries.

Here, we can see that the R path does not include the path to TeX binaries:
> Sys.getenv("PATH")
[1] "/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin"
 

So, if you try to call the latex() function, it will give an error message like that:

library(Hmisc) # Load Hmisc to load latex()
x <- matrix(1:6, nrow=2, dimnames=list(c('a','b'),c('c','d','this that'))) # From latex() examples
latex(x)
error /bin/sh: latex: command not found
error in system(cmd, intern = TRUE, wait = TRUE) :
error in running command
sh: xdvi: command not found
To add the Tex PATH into the R envirnment, following my last blog to create .Rprofile and add the following command to the file:
Sys.setenv(PATH=paste(Sys.getenv("PATH"),"/usr/texbin",sep=":"))
After reboot R, you will find the tex path was added.
> Sys.getenv("PATH")
[1] "/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/texbin"
 

Monday, November 18, 2013

Add default packages in R

Checking around seems there's no a detail description on how to add your frequent-using-packages as your default loadings in R. Here is an easy way to make it (as least works for MAC OS).

By launching R you will see this message:
----------------------------------------------------------------
...
[R.app GUI 1.61 (6492) x86_64-apple-darwin10.8.0]

[Workspace restored from /Users/dz2t/.RData]
[History restored from /Users/dz2t/.Rapp.history]


----------------------------------------------------------------

Here, the directory ''/Users/dz2t/" is the HOME folder that R will search and run when it's initiating. To load a set of default stuffs (not only packages), what you need to do is just to create a ".Rprofile" file in this directory. I use the mac terminal to create this file and I guess you can also use any text editor to do so and save it under an exact name: .Rprofile. And finally the most important and also a easy thing next to do is to list the libraries in the file and save it like that:
----------------------------------------------------------------
options(width=65, digits=5)
options(defaultPackages=c(getOption("defaultPackages"), "ggplot2", "mice", "grid", "gtools",
                        "gdata", "pscl", "zoo", "car", "DirichletReg", "plyr", "bestglm", "tree",
                        "betareg", "psych", "gee", "ggdendro", "sas7bdat", "reshape2"))
s <- base::summary
h <- utils::head
n <- base::names
.First<-function() cat("\n  Welcome to R!\n\n")
.Last<-function() cat("\n  See you next time! \n\n")

----------------------------------------------------------------
where   s <- base::summary is a shortcut for summary(), which means you can put anything you want in this file and R will load and run it before your start work.


Tuesday, November 12, 2013

Here is the Zen

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.

zz from here

Thursday, August 15, 2013

R function for Multiple Imputation of Longitudinal Binary Data Based on Propensity Score

Description 

R function time1(x, m) and timex(x, inacol, nacol, m) return the imputed data based on propensity score and beta distribution (see details in reference paper). This multiple imputation is to impute the missing values according to time series and the later imputed values are dependent on the prior imputed data, which is different from R package "MICE" and SAS PROCE MI where the imputation is performed based on regression models.

Usage

time1(x, m)
timex(x, inacol, nacol, m)
 

Arguments

x              data frame of longtitudinal binary data to impute
m             number of replicates
inacol      a list of the column numbers/names before timex without NAs
nacol       the number of time before time x, for example total times is 5, to impute the time 3, nacol=2.

Examples

Copy the full R code from HERE or run the following scripts in R Gui:

Download the function code on Mac:
download.file("https://raw.github.com/devilszhang/blog/master/timex.r", destfile="./timex.r", method="curl")


Download the function code on Windows:
download.file("https://raw.github.com/devilszhang/blog/master/timex.r", destfile="./timex.r")

Source the code:

source("./timex.r")
Access the demo data from HERE or using:
require(RCurl)
x<-read.table(textConnection(getURL("https://raw.github.com/devilszhang/blog/master/longdata.txt")), header=T)
#Impute day0
x$t1imp<-time1(polio52, 5)
#Impute day4
inacol<-c("t1imp")   
nacol<-c("day4")   
x$t2imp<-timex(x, inacol, nacol, 1)   
#Impute day11
 inacol<-c("t1imp", "t2imp")
nacol<-c("day11")
x$t3imp<-timex(x, inacol, nacol, 2)   
#Impute day18
 inacol<-c("t1imp", "t2imp", "t3imp")
nacol<-c("day18")
x$t4imp<-timex(x, inacol, nacol, 3)   
#Impute day25
 inacol<-c("t1imp", "t2imp", "t3imp", "t4imp")
nacol<-c("day25")
x$t5imp<-timex(x, inacol, nacol, 4)   
#complete data
complx<-x[, c("t1imp", "t2imp", "t3imp", "t4imp", "t5imp")]
complx    


References

Li X.M. et al. Statistics in Medicine, 2006, 25: 2107-2124.

Wednesday, August 14, 2013

R Function for Simulating the Sample Size, Type I error and Power in Adaptive Group Sequential Design

Description


The R function SAP() returns the planning sample size, simulated sample size, simulated standard error, simulated arm1/arm2 event rate and standard error, simulated type I error and statistical power  for four methods (see references) with binary outcomes.

Usage


SAP(r, type1, type2, interim, tpi1, delta,p)
tra.size(type1, type2, pi1, pi2,delta) #return Chow's sample size
Gould.size(type1, type2, opi, delta) #return Gould Sample size
Shih.size(n,r, type1,type2,tpi1,tpi2,p,interim
##return Shih's sample size, n is the sample size from either Gould or Chow.


Arguments

r                   number of replicates in simulation, such as 10000
type1           designed type1 error, such as 0.05
type2           designed type2 error, such as 0.2 (power=0.8)
interim        the proportion of total information by which interim analysis is planned, such as 0.3
tpi1             designed arm 1 event rate, such as 0.3
delta            minimum clinical effect size, such as 0.2
p                 the stratification factor for Shih's method, say 0.3 suggested by the paper.

Examples


To run the functions, you either to copy all the R codes from here or copy the following scripts to the R Gui window:

Download the function code on Mac:

download.file("https://raw.github.com/devilszhang/blog/master/SAP.r", destfile="./SAP.r", method="curl")


Download the function code on Windows:


download.file("https://raw.github.com/devilszhang/blog/master/SAP.r", destfile="./SAP.r")

Source the code:

source("./SAP.r")

Suppose you're planning a trial with the desired values in Arguments:

SAP(r=10000, type1=0.05, type2=0.2, interim=0.3, tpi1=0.3, delta=0.2, p=0.3)



References


  1. Chow S.C., et al. Sample Size Calculations in Clinical Research, Marcel Dekker Inc.: New York, 2003.
  2. Gould A. L., et al. Statistics in Medicine, 1992, 11: 55-66.
  3. Shih W. J., et al. Statistics in Medicine, 1997, 16: 1913-1923. (Shih's method based on Gould)
  4. Shih's method based on Chow.

Monday, August 12, 2013

Propensity Score Imputation for Binary Outcomes


Propensity score matching (PSM) / Bootstrap imputation for binary outcomes is not available either from R package "mice" or SAS PROC MI. Here is a R scripts for PSM method plus a pooling function for analyzing the results for m bootstraps.

An example using this demo dataset.
To do the imputation, you either need to copy all R code from here and here, Or copy the following scripts to download and source the functions.

Download the R functions on Mac:
download.file("https://raw.github.com/devilszhang/blog/master/boot.imp.r", destfile="./boot.imp.r", method="curl")

download.file("https://raw.github.com/devilszhang/blog/master/boot.pool.r", destfile="./boot.pool.r", method="curl")

Download the R functions on Windows:
download.file("https://raw.github.com/devilszhang/blog/master/boot.imp.r", destfile="./boot.imp.r")

download.file("https://raw.github.com/devilszhang/blog/master/boot.pool.r", destfile="./boot.pool.r")

Source the function:
source("./boot.imp.r")
source("./boot.pool.r") 
 
#Read demo dataset from online:
     require(RCurl)
     demo<-read.table(textConnection(getURL("https://raw.github.com/devilszhang/blog/master/demo.txt")), header=T)


#STEP1:
#Calculating propensity score and grouping:
    fit1<-glm(miss~daysbirth+sex+lepidays+lepisevrt+dvomtdays, data=demo, family=binomial(link=logit), na.action="na.exclude") 

    fitted<-fit1$fitted.values
    demo$napred<-predict(fit1, type="response", newdata=demo)
    demo<-demo[order(demo$napred), ]
    brks<-with(demo, quantile(napred, probs=c(0, 0.2, 0.4, 0.6, 0.8, 1)))
    demo<-within(demo, qgroup<-cut(napred, breaks=brks, labels=1:5, include.lowest=T))
    demo<-demo[order(demo$qgroup, demo$miss), ]

#STEP2
#do bootstrap imputation for m replicates
    set.seed(123456)
    m<-10
    boot<-boot.imp(demo, "qgroup", "rotabi", m)
    boot<-as.data.frame(boot)
    colnames(boot)<-paste("boot", 1:m, sep="")
    ####complete dataset
    cdemo<-cbind(demo, boot) ##complete dataset 

#STEP3
#Pooling analysis
     fitall<-apply(boot, 2, function(x) summary(glm(x~daysbirth+sex+lepidays+lepisevrt+dvomtdays, data=demo, family=binomial(link=logit))))
    summary(boot.pool(fitall))#Lambda is the total variance due to missing values


#PLOTTING
    plot(density(na.omit(demo$rotabi)), col=4, lwd=4, ylim=c(0,10))
    nb<-paste("boot", 1:m, sep="")
    for (i in nb) {
        lines(density(cdemo[cdemo[,"miss"]==0, i]), col=2)
    }