Monday, February 17, 2014

AWK by examples

General

 $ awk -F":" '{ print "username: " $1 "\t\tobserved:" $3 }' sam.txt   
 $ awk -F":" '$3==0.222' sam.txt  


In awk, curly braces group blocks of code together. `-F":"` sets the field separator of `sam.txt` to a colon. `\t` is a tab character. `$1` and `$3` refer to fields (columns) 1 and 3.

The BEGIN and END blocks

BEGIN { Actions }
{ Actions } # actions for every line in the file
END { Actions }

There are many programming situations where you may need to execute initialization code before awk begins processing the text from the input file. For such situations, awk allows you to define a BEGIN block. Because the BEGIN block is evaluated before awk starts processing the input file, it's an excellent place to initialize the FS (field separator) variable, print a heading, or initialize other global variables that you'll reference later in the program. We use one in the example below.

 awk -F":" 'BEGIN {print "Username\tDHAZ";}  
 {print $1, "\t", $3;}  
 END{print "Report List\n------------------";  
 }' sam.txt  



Monday, January 13, 2014

Type I/II/III ANOVA

Originally posted here.

The ANOVA Controversy

ANOVA is a statistical process for analysing the amount of variance that is contributed to a sample by different factors. It was initially derived by R. A. Fisher in 1925, for the case of balanced data (equal numbers of observations for each level of a factor).
When data is unbalanced, there are different ways to calculate the sums of squares for ANOVA. There are at least 3 approaches, commonly called Type I, II and III sums of squares (this notation seems to have been introduced into the statistics world from the SAS package but is now widespread). Which type to use has led to an ongoing controversy in the field of statistics (for an overview, see Herr [2]). However, it essentially comes down to testing different hypotheses about the data.

Type I, II and III Sums of Squares

Consider a model that includes two factors A and B; there are therefore two main effects, and an interaction, AB. The full model is represented by SS(A, B, AB).
Other models are represented similarly: SS(A, B) indicates the model with no interaction, SS(B, AB) indicates the model that does not account for effects from factor A, and so on.
The influence of particular factors (including interactions) can be tested by examining the differences between models. For example, to determine the presence of an interaction effect, an F-test of the models SS(A, B, AB) and the no-interaction model SS(A, B) would be carried out.
It is convenient to define incremental sums of squares to represent these differences. Let
SS(AB | A, B) = SS(A, B, AB) - SS(A, B)
SS(A | B, AB) = SS(A, B, AB) - SS(B, AB)
SS(B | A, AB) = SS(A, B, AB) - SS(A, AB)
SS(A | B)     = SS(A, B) - SS(B)
SS(B | A)     = SS(A, B) - SS(A)
The notation shows the incremental differences in sums of squares, for example SS(AB | A, B) represents "the sum of squares for interaction after the main effects", and SS(A | B) is "the sum of squares for the A main effect after the B main effect and ignoring interactions" [1]. The different types of sums of squares then arise depending on the stage of model reduction at which they are carried out. In particular:
  • Type I, also called "sequential" sum of squares:
    • SS(A) for factor A.
    • SS(B | A) for factor B.
    • SS(AB | B, A) for interaction AB.
    • This tests the main effect of factor A, followed by the main effect of factor B after the main effect of A, followed by the interaction effect AB after the main effects.
    • Because of the sequential nature and the fact that the two main factors are tested in a particular order, this type of sums of squares will give different results for unbalanced data depending on which main effect is considered first.
    • For unbalanced data, this approach tests for a difference in the weighted marginal means. In practical terms, this means that the results are dependent on the realized sample sizes, namely the proportions in the particular data set. In other words, it is testing the first factor without controlling for the other factor (for further discussion and a worked example, see Zahn [4]).
    • Note that this is often not the hypothesis that is of interest when dealing with unbalanced data.
  • Type II:
    • SS(A | B) for factor A.
    • SS(B | A) for factor B.
    • This type tests for each main effect after the other main effect.
    • Note that no significant interaction is assumed (in other words, you should test for interaction first (SS(AB | A, B)) and only if AB is not significant, continue with the analysis for main effects).
    • If there is indeed no interaction, then type II is statistically more powerful than type III (see Langsrud [3] for further details).
    • Computationally, this is equivalent to running a type I analysis with different orders of the factors, and taking the appropriate output (the second, where one main effect is run after the other, in the example above).
  • Type III:
    • SS(A | B, AB) for factor A.
    • SS(B | A, AB) for factor B.
    • This type tests for the presence of a main effect after the other main effect and interaction. This approach is therefore valid in the presence of significant interactions.
    • However, it is often not interesting to interpret a main effect if interactions are present (generally speaking, if a significant interaction is present, the main effects should not be further analysed).
    • If the interactions are not significant, type II gives a more powerful test.
When data is balanced, the factors are orthogonal, and types I, II and III all give the same results.
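For example, the interaction test SS(AB | A, B) described above can be carried out in R as a comparison of nested models. A minimal sketch, using a hypothetical response y, factors A and B, and data frame dat:

fit.main <- lm(y ~ A + B, data=dat)   # no-interaction model, SS(A, B)
fit.full <- lm(y ~ A * B, data=dat)   # full model, SS(A, B, AB)
anova(fit.main, fit.full)             # F-test for SS(AB | A, B)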

Summary

  • Usually the hypothesis of interest is about the significance of one factor while controlling for the level of the other factors. If the data is unbalanced, this equates to using type II or III SS.
  • In general, if there is no significant interaction effect, then type II is more powerful, and follows the principle of marginality.
  • If interaction is present, then type II is inappropriate while type III can still be used, but results need to be interpreted with caution (in the presence of interactions, main effects are rarely interpretable).

The anova and aov Functions in R

The anova and aov functions in R implement a sequential sum of squares (type I). As indicated above, for unbalanced data, this rarely tests a hypothesis of interest, since essentially the effect of one factor is calculated based on the varying levels of the other factor. In a practical sense, this means that the results are interpretable only in relation to the particular levels of observations that occur in the (unbalanced) data set. Fortunately, based on the above discussion, it should be clear that it is relatively straightforward to obtain type II SS in R.

Type II SS in R

Since type II SS tests each main effect after the other main effects, and assumes no interactions, the correct SS can be obtained using anova() and varying the order of the factors.
For example, consider a data frame (search) for which the response variable is the time that it takes users to find a relevant answer with an information retrieval system (time). Each user is assigned to one of two experimental search systems on which they run the test (sys). They are also assigned a number of different search queries (topic).
To obtain type I SS:
anova(lm(time ~ sys * topic, data=search))
If the data is unbalanced, you will obtain slightly different results if you instead use:
anova(lm(time ~ topic * sys, data=search))
The type II SS is obtained by using the second line of output from each of the above commands (since in type I SS, the second component will be the second factor, after the first factor). That is, you obtain the type II SS results for topic from the first command, and the results for sys from the second.
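In code, this amounts to reading off the row where each main effect enters second. A minimal sketch, again using the hypothetical search data frame:

a1 <- anova(lm(time ~ sys * topic, data=search))   # sys entered first
a2 <- anova(lm(time ~ topic * sys, data=search))   # topic entered first
a1["topic", ]   # type II SS for topic (topic after sys)
a2["sys", ]     # type II SS for sys (sys after topic)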

Type III SS in R

This is slightly more involved than the type II results.
First, it is necessary to set the contrasts option in R. Because the multi-way ANOVA model is over-parameterised, it is necessary to choose a contrasts setting that sums to zero, otherwise the ANOVA analysis will give incorrect results with respect to the expected hypothesis. (The default contrasts type does not satisfy this requirement.)
options(contrasts = c("contr.sum","contr.poly"))
Next, store the model:
model <- lm(time ~ topic * sys, data=search)
Finally, call the drop1 function on each model component:
drop1(model, .~., test="F")
The results give the type III SS, including the p-values from an F-test.

Type II and III SS Using the car Package

A somewhat easier way to obtain type II and III SS is through the car package. This defines a new function, Anova(), which can calculate type II and III SS directly.
Type II, using the same data set defined above:
library(car)  # load the car package, which provides Anova()
Anova(lm(time ~ topic * sys, data=search), type=2)
Type III:
Anova(lm(time ~ topic * sys, data=search, contrasts=list(topic=contr.sum, sys=contr.sum)), type=3)
Due to the way in which the SS are calculated when incorporating the interaction effect, for type III you must specify the contrasts option to obtain sensible results (an explanation is given here).

References

[1] John Fox. "Applied Regression Analysis and Generalized Linear Models", 2nd ed., Sage, 2008.
[2] David G. Herr. "On the History of ANOVA in Unbalanced, Factorial Designs: The First 30 Years", The American Statistician, Vol. 40, No. 4, pp. 265-270, 1986.
[3] Oyvind Langsrud. "ANOVA for unbalanced data: Use Type II instead of Type III sums of squares", Statistics and Computing, Volume 13, Number 2, pp. 163-167, 2003.
[4] Ista Zahn. "Working with unbalanced cell sizes in multiple regression with categorical predictors", 2009. prometheus.scp.rochester.edu/zlab/sites/default/files/InteractionsAndTypesOfSS.pdf

Tuesday, December 17, 2013

R packages and functions for multivariate analysis

package::function #comments
  • Visualization of multivariate data 
graphics::pairs(), stars(), mosaicplot()
graphics::coplot() #conditioning plot
lattice::xyplot(), splom()
car::scatterplot.matrix()
scatterplot3d::scatterplot3d()
aplpack::spin3R(), faces()
MASS::parcoord()
ade4::mstree()
vegan::spantree()
ellipse::plotcorr()
vcd::mosaic()
gclus:: #cluster specific graphical enhancements for scatter plots
xgobi::
rggobi::
  •  Hypothesis testing
ICSNP:: #HotellingsT2 test, non-parametric
cramer::
SpatialNP::
  •  Multivariate distributions
stats::cov(), cor()
ICSNP::spatial.median()
MASS::cov.rob()
covRobust::
robustbase::covMCD(), covOGK()
rrcov::
mvtnorm:: #simulation
mnormt::
sn:: #for skew normal and t
ks::rmvnorm.mixt(), dmvnorm.mixt() #comprehensive information on mixtures;
bayesm::rwishart() #Wishart distribution
MCMCpack::rwish()

#multivariate normality test
MVN::HZ.test() #Henze-Zirkler’s Multivariate Normality Test
MVN::mardia.test() #Mardia’s Multivariate Normality Test
MVN::royston.test() #Royston’s Multivariate Normality Test
mvnormtest::mshapiro.test() #Shapiro-Wilk multivariate normality Test
mvoutlier::
energy::mvnorm.etest(), k.sample()
stats::mauchly.test()

#Copulas
copula:: #generalized Archimedean copulas
  •  Linear models
stats::lm() #with a matrix specified as the dependent variable
stats::anova.mlm(), manova()
PenLNM:: #penalized logistic normal multinomial regression.
sn::msn.mle(), mst.mle() #fit multivariate skew normal and skew t model.
pls:: #partial least squares regression, principal component regression
ppls:: #penalized partial least squares
dr:: #dimension reduction regression, options: "sir", "save"
plsgenomics::
relaimpo:: #relative importance of regression parameters

  • Projection methods
#principal components
stats::prcomp(), princomp() #prcomp() is based on svd(), princomp() on eigen(); the former is preferred
sca::
Hmisc::pc1()
paran:: #Horn's evaluation of the number of dimensions to retain
pcurve:: #principal curve analysis/visualization
gmodels::fast.prcomp(), fast.svd() #for wide matrices
kernlab::kpca() #non-linear principal components
pcaPP::acpgen(), acprob()
psy::sphpca() #maps variables onto a sphere; fpca() #some variables as dependent; scree.plot()
#Canonical correlation:
stats::cancor()
kernlab::kcca()
corpcor::
#Redundancy analysis
calibrate::rda()
fso::
#Multidimensional scaling
stats::cmdscale()
SensoMineR::MDS.indscal()

  • Classification
#unsupervised
stats::kmeans(), hclust()
cluster::
clv::
trimcluster::
clue::
clusterSim::
hybridHclust::
energy::edist(), energy.hclust()
kohonen::
clusterGeneration::
mclust::
#tree-based methods (see the CRAN MachineLearning task view)
rpart::
TWIX::
mvpart::
party::
caret:: #classification and regression training
kknn:: #k-nearest

#supervised
MASS::lda(), qda()
mda::mda(), fda() #mixture and flexible discriminant analysis
mda::mars() #multivariate adaptive regression splines; bruto() #adaptive spline backfitting
earth::
rda::
class::knn()
SensoMineR::FDA()
klaR:: #variable selection and robustness against multicollinearity and visualization
superpc:: #supervised pca
hddplot:: #cross-validated linear discriminant.
ROCR:: #assessing classifier performance

  • Correspondence analysis
MASS::mca(), corresp()
ca::ca()
ade4::mca(), hta()
FactoMineR::CA(), MCA()
homals::


  • Modeling non-Gaussian data
MNP::
polycor::
bayesm::
VGAM::
  • Matrix manipulations
Matrix::
SparseM::
matrixcalc:: #matrix differential calculus
spam::

Monday, December 9, 2013

Test the content of two variables from different files in Shell

If you're working with large datasets in the Linux shell, you may sometimes need to check whether a variable/column of file1 appears in a certain column of file2. For example, file1 has one column listing 6,000 SNPs, and file2 has five columns with 250,000 SNPs in the second column; you want to check whether the 6,000 SNPs are among the 250,000 SNPs. awk makes this easy:

awk 'FNR==NR {a[$2]; next} $1 in a' file2 file1 | wc -l

FNR==NR is true only while awk reads the first file given (file2), so the SNP names in its second column are stored as keys of the array a; each line of file1 whose first field matches a stored key is then printed, and wc -l counts the matches.


Thursday, November 28, 2013

Add python to Windows 7 DOS Path

Running Python from the Windows 7 DOS prompt is not as straightforward as on Mac OS X. You may see this message if you haven't set the Python path in DOS:

 C:\Users\dz2t >python
'python' is not recognized as an internal or external command,
operable program or batch file.

Here is a way to solve this problem:
Click: Start --> Control Panel --> System and Security --> System

On the left side of the screen you should see 'Advanced system settings'; click it.
Then at the bottom click 'Environment Variables'. In the System variables box, find and select the entry that looks like 'Path    C:...'. Finally, in the pop-up 'Edit System Variable' box, append the Python path to the 'Variable value'. Usually you add this text to the end of the box:

;C:\Python33;C:\Python33\scripts

Then from the DOS window you can invoke python directly.


Tuesday, November 19, 2013

Setting R Path in Mac OS for Latex

There's a nice latex() function in the R package 'Hmisc' that generates LaTeX tables automatically, but the function may not work on the Mac OS X platform because the PATH environment variable in R isn't set to include the path to the TeX binaries.

Here, we can see that the R path does not include the path to TeX binaries:
> Sys.getenv("PATH")
[1] "/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin"
 

So, if you try to call the latex() function, it will give an error message like this:

library(Hmisc) # Load Hmisc to load latex()
x <- matrix(1:6, nrow=2, dimnames=list(c('a','b'),c('c','d','this that'))) # From latex() examples
latex(x)
error /bin/sh: latex: command not found
error in system(cmd, intern = TRUE, wait = TRUE) :
error in running command
sh: xdvi: command not found
To add the TeX path to the R environment, follow my last blog post to create an .Rprofile and add the following command to the file:
Sys.setenv(PATH=paste(Sys.getenv("PATH"),"/usr/texbin",sep=":"))
After restarting R, you will find that the TeX path has been added:
> Sys.getenv("PATH")
[1] "/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/texbin"
 

Monday, November 18, 2013

Add default packages in R

Searching around, there seems to be no detailed description of how to load your frequently used packages by default in R. Here is an easy way to do it (at least it works on Mac OS).

By launching R you will see this message:
----------------------------------------------------------------
...
[R.app GUI 1.61 (6492) x86_64-apple-darwin10.8.0]

[Workspace restored from /Users/dz2t/.RData]
[History restored from /Users/dz2t/.Rapp.history]


----------------------------------------------------------------

Here, the directory '/Users/dz2t/' is the HOME folder that R searches when it starts up. To load a set of defaults (not only packages), all you need to do is create a '.Rprofile' file in this directory. I use the Mac terminal to create this file, but you can use any text editor, as long as you save it under the exact name .Rprofile. Finally, the most important (and easy) step is to list the libraries in the file and save it, like this:
----------------------------------------------------------------
options(width=65, digits=5)
options(defaultPackages=c(getOption("defaultPackages"), "ggplot2", "mice", "grid", "gtools",
                        "gdata", "pscl", "zoo", "car", "DirichletReg", "plyr", "bestglm", "tree",
                        "betareg", "psych", "gee", "ggdendro", "sas7bdat", "reshape2"))
s <- base::summary
h <- utils::head
n <- base::names
.First<-function() cat("\n  Welcome to R!\n\n")
.Last<-function() cat("\n  See you next time! \n\n")

----------------------------------------------------------------
where s <- base::summary is a shortcut for summary(); you can put anything you want in this file, and R will load and run it before you start work.
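To confirm the packages were actually picked up, a quick generic check after restarting R:

getOption("defaultPackages")   # should now include ggplot2, mice, etc.
search()                       # lists the packages attached to the session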


Tuesday, November 12, 2013

Here is the Zen

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.

Reposted from here.

Thursday, August 15, 2013

R function for Multiple Imputation of Longitudinal Binary Data Based on Propensity Score

Description 

The R functions time1(x, m) and timex(x, inacol, nacol, m) return imputed data based on the propensity score and the beta distribution (see details in the reference paper). This multiple imputation imputes the missing values according to the time series, and later imputed values depend on the earlier imputed ones, which differs from the R package "mice" and SAS PROC MI, where the imputation is performed with regression models.

Usage

time1(x, m)
timex(x, inacol, nacol, m)
 

Arguments

x           data frame of longitudinal binary data to impute
m           number of replicates
inacol      a list of the column numbers/names before time x without NAs
nacol       the number of time points before time x; for example, with 5 time points in total, to impute time 3 use nacol=2

Examples

Copy the full R code from HERE or run the following scripts in the R GUI:

Download the function code on Mac:
download.file("https://raw.github.com/devilszhang/blog/master/timex.r", destfile="./timex.r", method="curl")


Download the function code on Windows:
download.file("https://raw.github.com/devilszhang/blog/master/timex.r", destfile="./timex.r")

Source the code:

source("./timex.r")
Access the demo data from HERE or using:
require(RCurl)
x<-read.table(textConnection(getURL("https://raw.github.com/devilszhang/blog/master/longdata.txt")), header=T)
#Impute day0
x$t1imp<-time1(x, 5)   # the demo data loaded above as x
#Impute day4
inacol<-c("t1imp")
nacol<-c("day4")
x$t2imp<-timex(x, inacol, nacol, 1)
#Impute day11
inacol<-c("t1imp", "t2imp")
nacol<-c("day11")
x$t3imp<-timex(x, inacol, nacol, 2)
#Impute day18
inacol<-c("t1imp", "t2imp", "t3imp")
nacol<-c("day18")
x$t4imp<-timex(x, inacol, nacol, 3)
#Impute day25
inacol<-c("t1imp", "t2imp", "t3imp", "t4imp")
nacol<-c("day25")
x$t5imp<-timex(x, inacol, nacol, 4)
#complete data
complx<-x[, c("t1imp", "t2imp", "t3imp", "t4imp", "t5imp")]
complx    


References

Li X.M. et al. Statistics in Medicine, 2006, 25: 2107-2124.

Wednesday, August 14, 2013

R Function for Simulating the Sample Size, Type I error and Power in Adaptive Group Sequential Design

Description


The R function SAP() returns the planned sample size, the simulated sample size and its standard error, the simulated arm 1/arm 2 event rates and standard errors, and the simulated type I error and statistical power for four methods (see References) with binary outcomes.

Usage


SAP(r, type1, type2, interim, tpi1, delta, p)
tra.size(type1, type2, pi1, pi2, delta) #returns Chow's sample size
Gould.size(type1, type2, opi, delta) #returns Gould's sample size
Shih.size(n, r, type1, type2, tpi1, tpi2, p, interim) #returns Shih's sample size; n is the sample size from either Gould or Chow


Arguments

r           number of replicates in the simulation, such as 10000
type1       designed type I error, such as 0.05
type2       designed type II error, such as 0.2 (power=0.8)
interim     the proportion of total information at which the interim analysis is planned, such as 0.3
tpi1        designed arm 1 event rate, such as 0.3
delta       minimum clinical effect size, such as 0.2
p           the stratification factor for Shih's method, say 0.3, as suggested by the paper

Examples


To run the functions, either copy all the R code from here or paste the following scripts into the R GUI window:

Download the function code on Mac:

download.file("https://raw.github.com/devilszhang/blog/master/SAP.r", destfile="./SAP.r", method="curl")


Download the function code on Windows:


download.file("https://raw.github.com/devilszhang/blog/master/SAP.r", destfile="./SAP.r")

Source the code:

source("./SAP.r")

Suppose you're planning a trial with the example values given in Arguments:

SAP(r=10000, type1=0.05, type2=0.2, interim=0.3, tpi1=0.3, delta=0.2, p=0.3)



References


  1. Chow S.C., et al. Sample Size Calculations in Clinical Research, Marcel Dekker Inc.: New York, 2003.
  2. Gould A. L., et al. Statistics in Medicine, 1992, 11: 55-66.
  3. Shih W. J., et al. Statistics in Medicine, 1997, 16: 1913-1923. (Shih's method based on Gould)
  4. Shih's method based on Chow.

Monday, August 12, 2013

Propensity Score Imputation for Binary Outcomes


Propensity score matching (PSM) / bootstrap imputation for binary outcomes is not available in either the R package "mice" or SAS PROC MI. Here are R scripts for the PSM method, plus a pooling function for analyzing the results of the m bootstraps.

An example using this demo dataset.
To do the imputation, you either need to copy all the R code from here and here, or copy the following scripts to download and source the functions.

Download the R functions on Mac:
download.file("https://raw.github.com/devilszhang/blog/master/boot.imp.r", destfile="./boot.imp.r", method="curl")

download.file("https://raw.github.com/devilszhang/blog/master/boot.pool.r", destfile="./boot.pool.r", method="curl")

Download the R functions on Windows:
download.file("https://raw.github.com/devilszhang/blog/master/boot.imp.r", destfile="./boot.imp.r")

download.file("https://raw.github.com/devilszhang/blog/master/boot.pool.r", destfile="./boot.pool.r")

Source the function:
source("./boot.imp.r")
source("./boot.pool.r") 
 
#Read demo dataset from online:
     require(RCurl)
     demo<-read.table(textConnection(getURL("https://raw.github.com/devilszhang/blog/master/demo.txt")), header=T)


#STEP1:
#Calculating propensity score and grouping:
    fit1<-glm(miss~daysbirth+sex+lepidays+lepisevrt+dvomtdays, data=demo, family=binomial(link=logit), na.action="na.exclude") 

    fitted<-fit1$fitted.values
    demo$napred<-predict(fit1, type="response", newdata=demo)
    demo<-demo[order(demo$napred), ]
    brks<-with(demo, quantile(napred, probs=c(0, 0.2, 0.4, 0.6, 0.8, 1)))
    demo<-within(demo, qgroup<-cut(napred, breaks=brks, labels=1:5, include.lowest=T))
    demo<-demo[order(demo$qgroup, demo$miss), ]

#STEP2
#do bootstrap imputation for m replicates
    set.seed(123456)
    m<-10
    boot<-boot.imp(demo, "qgroup", "rotabi", m)
    boot<-as.data.frame(boot)
    colnames(boot)<-paste("boot", 1:m, sep="")
    ####complete dataset
    cdemo<-cbind(demo, boot) ##complete dataset 

#STEP3
#Pooling analysis
     fitall<-apply(boot, 2, function(x) summary(glm(x~daysbirth+sex+lepidays+lepisevrt+dvomtdays, data=demo, family=binomial(link=logit))))
    summary(boot.pool(fitall)) #lambda is the proportion of total variance due to missing values


#PLOTTING
    plot(density(na.omit(demo$rotabi)), col=4, lwd=4, ylim=c(0,10))
    nb<-paste("boot", 1:m, sep="")
    for (i in nb) {
        lines(density(cdemo[cdemo[,"miss"]==0, i]), col=2)
    }