Exercise 3 Logistic regression and stratified analysis

In this exercise we will explore how R handles generalised linear models using the example of logistic regression. We will continue using the salex dataset. Start R and retrieve the salex dataset:

 

salex <- read.table("salex.dat", header = TRUE, na.strings = "9")

 

When we analysed this data using two-by-two tables and examining the risk ratio and 95% confidence interval associated with each exposure we found many significant positive associations:

 

Variable    RR    95% CI
EGGS        2.61  1.55, 4.38
MUSHROOM    1.41  1.03, 1.93
PEPPER      1.74  1.27, 2.38
PASTA       1.68  1.26, 2.26
RICE        1.72  1.25, 2.34
LETTUCE     2.01  1.49, 2.73
COLESLAW    1.89  1.37, 2.64
CHOCOLATE   1.39  1.05, 1.87

 

Some of these associations may be due to confounding in the data. We can use logistic regression to help us identify independent associations.

Logistic regression requires the dependent variable to be either 0 or 1. In order to perform a logistic regression we must first recode the ILL variable so that 0=no and 1=yes:

 

table(salex$ILL)
salex$ILL[salex$ILL == 2] <- 0
table(salex$ILL)
## 
##  1  2 
## 51 26
## 
##  0  1 
## 26 51

 

We could work with our data as it is, but any odds ratios and confidence intervals we calculated would be reciprocals of the ones we want (i.e. odds ratios for non-exposure rather than for exposure). This is because of the way the data has been coded (1=yes, 2=no).

In order to calculate meaningful odds ratios the exposure variables should also be coded 0=no, 1=yes. The actual codes used are not important as long as the value used for ‘exposed’ is one greater than the value used for ‘not exposed.’

We could issue a series of commands similar to the one we have just used to recode the ILL variable. This is both tedious and unnecessary, as the structure of the dataset (i.e. all variables are coded identically) allows us to recode all variables with a single command. Subtracting each value from 2 maps 1 (yes) to 1 and 2 (no) to 0:

 

salex <- read.table("salex.dat", header = TRUE, na.strings = "9")
salex[1:5, ]
##   ILL HAM BEEF EGGS MUSHROOM PEPPER PORKPIE PASTA RICE LETTUCE TOMATO COLESLAW
## 1   1   1    1    1        1      1       2     2    2       2      2        2
## 2   1   1    1    1        2      2       1     2    2       2      1        2
## 3   1   1    1    1        1      1       1     1    1       1      2        2
## 4   1   1    1    1        2      2       2     2    2       1      1        2
## 5   1   1    1    1        1      1       1     1    1       1      1        1
##   CRISPS PEACHCAKE CHOCOLATE FRUIT TRIFLE ALMONDS
## 1      2         2         2     2      2       2
## 2      2         2         2     2      2       2
## 3      1         2         1     2      2       2
## 4      2         2         1     2      2       2
## 5      2         2         1     2      1       2
salex <- 2 - salex
salex[1:5, ]
##   ILL HAM BEEF EGGS MUSHROOM PEPPER PORKPIE PASTA RICE LETTUCE TOMATO COLESLAW
## 1   1   1    1    1        1      1       0     0    0       0      0        0
## 2   1   1    1    1        0      0       1     0    0       0      1        0
## 3   1   1    1    1        1      1       1     1    1       1      0        0
## 4   1   1    1    1        0      0       0     0    0       1      1        0
## 5   1   1    1    1        1      1       1     1    1       1      1        1
##   CRISPS PEACHCAKE CHOCOLATE FRUIT TRIFLE ALMONDS
## 1      0         0         0     0      0       0
## 2      0         0         0     0      0       0
## 3      1         0         1     0      0       0
## 4      0         0         1     0      0       0
## 5      0         0         1     0      1       0

 

WARNING: The attach() function works with a copy of the data.frame rather than the original data.frame. Commands that manipulate variables in a data.frame may not work as expected if the data.frame has been attached using the attach() function.

It is better to manipulate data before attaching a data.frame. The detach() function may be used to remove an attachment prior to any data manipulation.

Many R users avoid using the attach() function altogether.
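
As an aside, a minimal sketch of two common alternatives, using the with() function and ordinary $-indexing:

with(salex, table(EGGS, ILL))   # evaluate the expression 'inside' the data.frame
table(salex$EGGS, salex$ILL)    # the same tabulation using $-indexing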

We can now use the generalised linear model glm() function to specify the logistic regression model:

 

salex.lreg <- glm(formula = ILL ~ EGGS + MUSHROOM + PEPPER + PASTA +
                  RICE + LETTUCE + COLESLAW + CHOCOLATE,
                  family = binomial(logit), data = salex)

 

The type of model fitted by the glm() function is defined by the family parameter. Here we specify binomial errors and a logit (logistic) link function.
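
As an aside, the logit link is the default for the binomial family, so the two calls sketched below (using EGGS alone for brevity) specify the same model:

glm(ILL ~ EGGS, family = binomial(link = "logit"), data = salex)
glm(ILL ~ EGGS, family = binomial, data = salex)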

We have saved the output of the glm() function in the salex.lreg object. We can examine some basic information about the specified model using the summary() function:

 

summary(salex.lreg)
## 
## Call:
## glm(formula = ILL ~ EGGS + MUSHROOM + PEPPER + PASTA + RICE + 
##     LETTUCE + COLESLAW + CHOCOLATE, family = binomial(logit), 
##     data = salex)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.92036  -0.49869   0.06877   0.40906   2.07182  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -2.021864   0.676606  -2.988  0.00281 **
## EGGS         3.579366   1.267870   2.823  0.00476 **
## MUSHROOM    -3.584345   1.728999  -2.073  0.03817 * 
## PEPPER       2.348074   1.428177   1.644  0.10015   
## PASTA        1.774818   1.162762   1.526  0.12692   
## RICE         0.114180   1.193840   0.096  0.92381   
## LETTUCE      3.401828   1.234060   2.757  0.00584 **
## COLESLAW     0.763857   1.024373   0.746  0.45586   
## CHOCOLATE    0.009782   1.314683   0.007  0.99406   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 91.246  on 69  degrees of freedom
## Residual deviance: 41.260  on 61  degrees of freedom
##   (7 observations deleted due to missingness)
## AIC: 59.26
## 
## Number of Fisher Scoring iterations: 7

 

We will use backwards elimination to remove non-significant variables from the logistic regression model. Remember that previous commands can be recalled and edited using the up and down arrow keys – they do not need to be typed out in full each time.
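
As an aside, R can automate backwards elimination with the step() function. Note that step() selects variables by AIC rather than by p-values, and it requires the same rows to be used in every candidate model, so this sketch first drops incomplete records; its final model may therefore differ from the one we reach manually below:

# step() needs a constant set of rows, so drop records with missing values
salex.complete <- na.omit(salex)
salex.full <- glm(ILL ~ EGGS + MUSHROOM + PEPPER + PASTA + RICE +
                  LETTUCE + COLESLAW + CHOCOLATE,
                  family = binomial(logit), data = salex.complete)
# automated backwards elimination guided by AIC (not p-values)
step(salex.full, direction = "backward")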

CHOCOLATE is the least significant variable in the model so we will remove this variable from the model. Storing the output of the glm() function is useful as it allows us to use the update() function to add, remove, or modify variables without having to describe the model in full:

 

salex.lreg <- update(salex.lreg, . ~ . - CHOCOLATE)
summary(salex.lreg)
## 
## Call:
## glm(formula = ILL ~ EGGS + MUSHROOM + PEPPER + PASTA + RICE + 
##     LETTUCE + COLESLAW, family = binomial(logit), data = salex)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.92561  -0.49859   0.07555   0.38723   2.07200  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  -2.0223     0.6623  -3.053  0.00226 **
## EGGS          3.5890     1.2188   2.945  0.00323 **
## MUSHROOM     -3.5992     1.6885  -2.132  0.03305 * 
## PEPPER        2.3544     1.4275   1.649  0.09910 . 
## PASTA         1.7770     1.1215   1.585  0.11308   
## RICE          0.1170     1.1388   0.103  0.91819   
## LETTUCE       3.4109     1.2316   2.770  0.00561 **
## COLESLAW      0.7630     1.0224   0.746  0.45547   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 92.122  on 70  degrees of freedom
## Residual deviance: 41.273  on 63  degrees of freedom
##   (6 observations deleted due to missingness)
## AIC: 57.273
## 
## Number of Fisher Scoring iterations: 7

RICE is now the least significant variable in the model so we will remove this variable from the model:

 

salex.lreg <- update(salex.lreg, . ~ . - RICE)
summary(salex.lreg)
## 
## Call:
## glm(formula = ILL ~ EGGS + MUSHROOM + PEPPER + PASTA + LETTUCE + 
##     COLESLAW, family = binomial(logit), data = salex)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8877  -0.4999   0.0786   0.3897   2.0697  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  -2.0169     0.6600  -3.056  0.00224 **
## EGGS          3.6142     1.1944   3.026  0.00248 **
## MUSHROOM     -3.5508     1.6134  -2.201  0.02774 * 
## PEPPER        2.3002     1.3200   1.743  0.08141 . 
## PASTA         1.8230     1.0280   1.773  0.07617 . 
## LETTUCE       3.4199     1.2273   2.787  0.00533 **
## COLESLAW      0.7611     1.0203   0.746  0.45571   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 92.122  on 70  degrees of freedom
## Residual deviance: 41.283  on 64  degrees of freedom
##   (6 observations deleted due to missingness)
## AIC: 55.283
## 
## Number of Fisher Scoring iterations: 6

COLESLAW is now the least significant variable in the model so we will remove this variable from the model:

 

salex.lreg <- update(salex.lreg, . ~ . - COLESLAW)
summary(salex.lreg)
## 
## Call:
## glm(formula = ILL ~ EGGS + MUSHROOM + PEPPER + PASTA + LETTUCE, 
##     family = binomial(logit), data = salex)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.98481  -0.50486   0.08871   0.36910   2.06065  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  -1.9957     0.6545  -3.049  0.00230 **
## EGGS          3.8152     1.1640   3.278  0.00105 **
## MUSHROOM     -3.4008     1.5922  -2.136  0.03269 * 
## PEPPER        2.3520     1.3269   1.773  0.07631 . 
## PASTA         1.9706     0.9922   1.986  0.04701 * 
## LETTUCE       3.4786     1.2246   2.841  0.00450 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 92.982  on 71  degrees of freedom
## Residual deviance: 41.895  on 66  degrees of freedom
##   (5 observations deleted due to missingness)
## AIC: 53.895
## 
## Number of Fisher Scoring iterations: 6

PEPPER is now the least significant variable in the model so we will remove this variable from the model:

 

salex.lreg <- update(salex.lreg, . ~ . - PEPPER)
summary(salex.lreg)
## 
## Call:
## glm(formula = ILL ~ EGGS + MUSHROOM + PASTA + LETTUCE, family = binomial(logit), 
##     data = salex)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0920  -0.5360   0.1109   0.4876   2.0056  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.8676     0.6128  -3.048 0.002306 ** 
## EGGS          3.7094     1.0682   3.473 0.000515 ***
## MUSHROOM     -1.6165     1.0829  -1.493 0.135524    
## PASTA         1.8440     0.9193   2.006 0.044864 *  
## LETTUCE       3.2458     1.1698   2.775 0.005527 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 94.659  on 73  degrees of freedom
## Residual deviance: 45.578  on 69  degrees of freedom
##   (3 observations deleted due to missingness)
## AIC: 55.578
## 
## Number of Fisher Scoring iterations: 6

MUSHROOM is now the least significant variable in the model so we will remove this variable from the model:

 

salex.lreg <- update(salex.lreg, . ~ . - MUSHROOM)
summary(salex.lreg)
## 
## Call:
## glm(formula = ILL ~ EGGS + PASTA + LETTUCE, family = binomial(logit), 
##     data = salex)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2024  -0.5108   0.2038   0.4304   2.0501  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.9710     0.6146  -3.207  0.00134 ** 
## EGGS          2.6391     0.7334   3.599  0.00032 ***
## PASTA         1.6646     0.8376   1.987  0.04689 *  
## LETTUCE       3.1956     1.1516   2.775  0.00552 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 97.648  on 75  degrees of freedom
## Residual deviance: 50.529  on 72  degrees of freedom
##   (1 observation deleted due to missingness)
## AIC: 58.529
## 
## Number of Fisher Scoring iterations: 6

 

There are now no non-significant variables in the model.

Unfortunately, R does not present information on the model coefficients in terms of odds ratios and confidence intervals, but we can write a function to calculate them for us.

The first step in doing this is to realise that the salex.lreg object contains essential information about the fitted model. To calculate odds ratios and confidence intervals we need the regression coefficients and their standard errors. Both:

 

summary(salex.lreg)$coefficients
##              Estimate Std. Error   z value     Pr(>|z|)
## (Intercept) -1.970967  0.6145691 -3.207071 0.0013409398
## EGGS         2.639115  0.7333899  3.598515 0.0003200388
## PASTA        1.664581  0.8375970  1.987330 0.0468858898
## LETTUCE      3.195594  1.1516159  2.774879 0.0055222320

 

and:

 

coef(summary(salex.lreg))
##              Estimate Std. Error   z value     Pr(>|z|)
## (Intercept) -1.970967  0.6145691 -3.207071 0.0013409398
## EGGS         2.639115  0.7333899  3.598515 0.0003200388
## PASTA        1.664581  0.8375970  1.987330 0.0468858898
## LETTUCE      3.195594  1.1516159  2.774879 0.0055222320

 

extract the data that we require. The preferred method is to use the coef() function. This is because some fitted models may return coefficients in a more complicated manner than (e.g.) those created by the glm() function. The coef() function provides a standard way of extracting this data from all classes of fitted objects.

We can store the coefficients data in a separate object to make it easier to work with:

 

salex.lreg.coeffs <- coef(summary(salex.lreg))
salex.lreg.coeffs
##              Estimate Std. Error   z value     Pr(>|z|)
## (Intercept) -1.970967  0.6145691 -3.207071 0.0013409398
## EGGS         2.639115  0.7333899  3.598515 0.0003200388
## PASTA        1.664581  0.8375970  1.987330 0.0468858898
## LETTUCE      3.195594  1.1516159  2.774879 0.0055222320

 

We can extract information from this object by addressing each piece of information by its row and column position in the object. For example:

 

salex.lreg.coeffs[2,1]
## [1] 2.639115

 

is the regression coefficient for EGGS, and:

 

salex.lreg.coeffs[3,2]
## [1] 0.837597

 

is the standard error of the regression coefficient for PASTA. Similarly:

 

salex.lreg.coeffs[ ,1]
## (Intercept)        EGGS       PASTA     LETTUCE 
##   -1.970967    2.639115    1.664581    3.195594

returns the regression coefficients for all of the variables in the model, and:

 

salex.lreg.coeffs[ ,2]
## (Intercept)        EGGS       PASTA     LETTUCE 
##   0.6145691   0.7333899   0.8375970   1.1516159

 

returns the standard errors of the regression coefficients.

Displaying the coefficients as a plain matrix shows the row and column indices that address each cell in the table of regression coefficients:

 

matrix(salex.lreg.coeffs, nrow = 4, ncol = 4)
##           [,1]      [,2]      [,3]         [,4]
## [1,] -1.970967 0.6145691 -3.207071 0.0013409398
## [2,]  2.639115 0.7333899  3.598515 0.0003200388
## [3,]  1.664581 0.8375970  1.987330 0.0468858898
## [4,]  3.195594 1.1516159  2.774879 0.0055222320

 

We can use this information to calculate odds ratios and 95% confidence intervals:

 

or <- exp(salex.lreg.coeffs[ ,1])
lci <- exp(salex.lreg.coeffs[ ,1] - 1.96 * salex.lreg.coeffs[ ,2])
uci <- exp(salex.lreg.coeffs[ ,1] + 1.96 * salex.lreg.coeffs[ ,2])

 

and make a single object that contains all of the required information:

 

lreg.or <- cbind(or, lci, uci)
lreg.or
##                     or       lci         uci
## (Intercept)  0.1393221 0.0417723   0.4646777
## EGGS        14.0008053 3.3256684  58.9423019
## PASTA        5.2834608 1.0231552  27.2832114
## LETTUCE     24.4246856 2.5559581 233.4018193

 

We seldom need to report estimates and confidence intervals to more than two decimal places. We can use the round() function to remove the excess digits:

 

round(lreg.or, digits = 2)
##                or  lci    uci
## (Intercept)  0.14 0.04   0.46
## EGGS        14.00 3.33  58.94
## PASTA        5.28 1.02  27.28
## LETTUCE     24.42 2.56 233.40
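
As a cross-check, the same odds ratios and Wald-type intervals can be obtained in one line with the confint.default() function (which uses 1.959964 rather than 1.96 standard errors, so the results agree to within rounding):

exp(cbind(OR = coef(salex.lreg), confint.default(salex.lreg)))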

 

We have now gone through all the necessary calculations step-by-step, but it would be nice to have a function that did it all for us, one that we could use whenever we needed to.

First we will create a template for the function:

 

lreg.or <- function(model, digits = 2) {}

 

and then use the fix() function to edit the lreg.or() function:

 

fix(lreg.or)

 

We can now edit this function to add a calculation of odds ratios and 95% confidence intervals:

 

function(model, digits = 2) {
  # extract the matrix of coefficients and standard errors
  lreg.coeffs <- coef(summary(model))
  # odds ratios and Wald 95% confidence limits
  OR <- exp(lreg.coeffs[ ,1])
  LCI <- exp(lreg.coeffs[ ,1] - 1.96 * lreg.coeffs[ ,2])
  UCI <- exp(lreg.coeffs[ ,1] + 1.96 * lreg.coeffs[ ,2])
  # assemble the results and round to the requested precision
  lreg.or <- round(cbind(OR, LCI, UCI), digits = digits)
  lreg.or
}

Once you have made the changes shown above, check your work, save the file, and quit the editor.

We can test our function:

 

lreg.or(salex.lreg)

 

Which produces the following output:

 

lreg.or(salex.lreg)
##                OR  LCI    UCI
## (Intercept)  0.14 0.04   0.46
## EGGS        14.00 3.33  58.94
## PASTA        5.28 1.02  27.28
## LETTUCE     24.42 2.56 233.40

 

The digits parameter of the lreg.or() function, which has digits = 2 as its default value, allows us to specify the precision with which the estimates and their confidence intervals are reported:

 

lreg.or(salex.lreg, digits = 4)
##                  OR    LCI      UCI
## (Intercept)  0.1393 0.0418   0.4647
## EGGS        14.0008 3.3257  58.9423
## PASTA        5.2835 1.0232  27.2832
## LETTUCE     24.4247 2.5560 233.4018

 

Before we continue, it is probably a good idea to save this function for later use:

 

save(lreg.or, file = "lregor.r")

Which can be reloaded whenever it is needed:

 

load("lregor.r")
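
Note that save() stores the function in R's binary workspace format despite the .r file extension used here. As an aside, if you prefer a plain-text copy that can be edited outside R, the dump() and source() functions can be used instead:

dump("lreg.or", file = "lregor.R")   # write the function as editable R source code
source("lregor.R")                   # read it back in a later session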

 

An alternative to logistic regression for data containing associations that may be due to confounding is stratified analysis (i.e. Mantel-Haenszel techniques). With several potential confounders, however, a stratified analysis results in the analysis of many tables, which can be difficult to interpret. For example, four potential confounders, each with two levels, would produce sixteen (i.e. 2 × 2 × 2 × 2) tables. In such situations, logistic regression might be a better approach. In order to illustrate Mantel-Haenszel techniques in R we will work with a simpler dataset.

On Saturday, 21st April 1990, a luncheon was held in the home of Jean Bateman. There was a total of forty-five guests which included thirty-five members of the Department of Epidemiology and Population Sciences at the London School of Hygiene and Tropical Medicine. On Sunday morning, 22nd April 1990, Jean awoke with symptoms of gastrointestinal illness; her husband awoke with similar symptoms. The possibility of an outbreak related to the luncheon was strengthened when several of the guests telephoned Jean on Sunday and reported illness. On Monday, 23rd April 1990, there was an unusually large number of department members absent from work and reporting illness. Data from this outbreak is stored in the file bateman.dat.

The variables in the file bateman.dat are:

 

ILL Ill?
CHEESE Cheddar cheese
CRABDIP Crab dip
CRISPS Crisps
BREAD French bread
CHICKEN Chicken (roasted, served warm)
RICE Rice (boiled, served warm)
CAESAR Caesar salad
TOMATO Tomato salad
ICECREAM Vanilla ice-cream
CAKE Chocolate cake
JUICE Orange juice
WINE White wine
COFFEE Coffee

Data is available for all forty-five guests at the luncheon. All of the variables are coded 1=yes, 2=no. Retrieve and attach the bateman dataset in R:

 

bateman <- read.table("bateman.dat", header = TRUE)
bateman[1:6, ]
attach(bateman)
##   ILL CHEESE CRABDIP CRISPS BREAD CHICKEN RICE CAESAR TOMATO ICECREAM CAKE
## 1   1      1       1      1     2       1    1      1      1        1    1
## 2   2      1       1      1     2       1    2      2      2        1    1
## 3   1      2       2      1     2       1    2      1      2        1    1
## 4   1      1       2      1     1       1    2      1      2        1    1
## 5   1      1       1      1     2       1    1      1      1        2    1
## 6   1      1       1      1     1       1    2      1      1        2    1
##   JUICE WINE COFFEE
## 1     1    1      1
## 2     1    1      2
## 3     2    1      2
## 4     2    1      2
## 5     1    1      1
## 6     1    2      2
## The following objects are masked from salex:
## 
##     CRISPS, ILL, RICE, TOMATO

 

We will use our tab2by2() function to analyse this data. Retrieve this function:

 

load("tab2by2.r")

Use the tab2by2() function to analyse the data:

 

tab2by2(CHEESE, ILL)
tab2by2(CRABDIP, ILL)
tab2by2(CRISPS, ILL)
tab2by2(BREAD, ILL)
tab2by2(CHICKEN, ILL)
tab2by2(RICE, ILL)
tab2by2(CAESAR, ILL)
tab2by2(TOMATO, ILL)
tab2by2(ICECREAM, ILL)
tab2by2(CAKE, ILL)
tab2by2(JUICE, ILL)
tab2by2(WINE, ILL)
tab2by2(COFFEE, ILL)

 

tab2by2(CHEESE, ILL)
## 
##         outcome
## exposure  1  2
##        1 15  7
##        2 14  9
## 
## Relative Risk     : 1.12013 
## 95% CI            : 0.7253229 1.729838 
## 
## Sample Odds Ratio : 1.377551 
## 95% CI            : 0.4037553 4.699992 
## 
## MLE Odds Ratio    : 1.367743 
## 95% CI            : 0.3427732 5.649399
tab2by2(CRABDIP, ILL)
## 
##         outcome
## exposure  1  2
##        1 18  9
##        2 11  7
## 
## Relative Risk     : 1.090909 
## 95% CI            : 0.6921784 1.719329 
## 
## Sample Odds Ratio : 1.272727 
## 95% CI            : 0.3682028 4.3993 
## 
## MLE Odds Ratio    : 1.265848 
## 95% CI            : 0.3042941 5.188297
tab2by2(CRISPS, ILL)
## 
##         outcome
## exposure  1  2
##        1 21 12
##        2  8  4
## 
## Relative Risk     : 0.9545455 
## 95% CI            : 0.5930168 1.536478 
## 
## Sample Odds Ratio : 0.875 
## 95% CI            : 0.2170373 3.527619 
## 
## MLE Odds Ratio    : 0.8775841 
## 95% CI            : 0.1587568 4.184763
tab2by2(BREAD, ILL)
## 
##         outcome
## exposure  1  2
##        1  9  8
##        2 20  8
## 
## Relative Risk     : 0.7411765 
## 95% CI            : 0.4469843 1.228997 
## 
## Sample Odds Ratio : 0.45 
## 95% CI            : 0.1280647 1.581232 
## 
## MLE Odds Ratio    : 0.4584416 
## 95% CI            : 0.1072622 1.897017
tab2by2(CHICKEN, ILL)
## 
##         outcome
## exposure  1  2
##        1 25 11
##        2  4  5
## 
## Relative Risk     : 1.5625 
## 95% CI            : 0.7293337 3.347448 
## 
## Sample Odds Ratio : 2.840909 
## 95% CI            : 0.637796 12.65415 
## 
## MLE Odds Ratio    : 2.76979 
## 95% CI            : 0.4912167 16.93409
tab2by2(RICE, ILL)
## 
##         outcome
## exposure  1  2
##        1 22 10
##        2  7  6
## 
## Relative Risk     : 1.276786 
## 95% CI            : 0.7330759 2.223756 
## 
## Sample Odds Ratio : 1.885714 
## 95% CI            : 0.5027038 7.073586 
## 
## MLE Odds Ratio    : 1.85813 
## 95% CI            : 0.4026256 8.531602
tab2by2(CAESAR, ILL)
## 
##         outcome
## exposure  1  2
##        1 26  5
##        2  3 11
## 
## Relative Risk     : 3.913978 
## 95% CI            : 1.418617 10.7987 
## 
## Sample Odds Ratio : 19.06667 
## 95% CI            : 3.866585 94.02038 
## 
## MLE Odds Ratio    : 17.33517 
## 95% CI            : 3.179027 133.7994
tab2by2(TOMATO, ILL)
## 
##         outcome
## exposure  1  2
##        1 24  6
##        2  5 10
## 
## Relative Risk     : 2.4 
## 95% CI            : 1.14769 5.018775 
## 
## Sample Odds Ratio : 8 
## 95% CI            : 1.97785 32.35836 
## 
## MLE Odds Ratio    : 7.553116 
## 95% CI            : 1.642249 41.02567
tab2by2(ICECREAM, ILL)
## 
##         outcome
## exposure  1  2
##        1 20  9
##        2  9  7
## 
## Relative Risk     : 1.226054 
## 95% CI            : 0.7463643 2.01404 
## 
## Sample Odds Ratio : 1.728395 
## 95% CI            : 0.4889138 6.110177 
## 
## MLE Odds Ratio    : 1.7069 
## 95% CI            : 0.4021245 7.255001
tab2by2(CAKE, ILL)
## 
##         outcome
## exposure  1  2
##        1 22 11
##        2  7  5
## 
## Relative Risk     : 1.142857 
## 95% CI            : 0.6689315 1.95255 
## 
## Sample Odds Ratio : 1.428571 
## 95% CI            : 0.3678242 5.548347 
## 
## MLE Odds Ratio    : 1.416945 
## 95% CI            : 0.2847257 6.685098
tab2by2(JUICE, ILL)
## 
##         outcome
## exposure  1  2
##        1  8  5
##        2 21 11
## 
## Relative Risk     : 0.9377289 
## 95% CI            : 0.5701453 1.542301 
## 
## Sample Odds Ratio : 0.8380952 
## 95% CI            : 0.2206785 3.182927 
## 
## MLE Odds Ratio    : 0.8414367 
## 95% CI            : 0.185464 4.101313
tab2by2(WINE, ILL)
## 
##         outcome
## exposure  1  2
##        1 22 12
##        2  7  4
## 
## Relative Risk     : 1.016807 
## 95% CI            : 0.6099343 1.695094 
## 
## Sample Odds Ratio : 1.047619 
## 95% CI            : 0.2543383 4.315141 
## 
## MLE Odds Ratio    : 1.046515 
## 95% CI            : 0.1855742 5.186546
tab2by2(COFFEE, ILL)
## 
##         outcome
## exposure  1  2
##        1 17 11
##        2 12  5
## 
## Relative Risk     : 0.860119 
## 95% CI            : 0.5607997 1.319196 
## 
## Sample Odds Ratio : 0.6439394 
## 95% CI            : 0.1772875 2.338901 
## 
## MLE Odds Ratio    : 0.6502015 
## 95% CI            : 0.1388979 2.729586

Two variables (CAESAR and TOMATO) are associated with ILL.

These two variables are also associated with each other:

 

tab2by2(CAESAR, TOMATO)
chisq.test(table(CAESAR, TOMATO))
fisher.test(table(CAESAR, TOMATO))
tab2by2(CAESAR, TOMATO)
## 
##         outcome
## exposure  1  2
##        1 27  4
##        2  3 11
## 
## Relative Risk     : 4.064516 
## 95% CI            : 1.477162 11.1838 
## 
## Sample Odds Ratio : 24.75 
## 95% CI            : 4.738936 129.2616 
## 
## MLE Odds Ratio    : 22.10962 
## 95% CI            : 3.850174 183.4671

 

chisq.test(table(CAESAR, TOMATO))
## Warning in chisq.test(table(CAESAR, TOMATO)): Chi-squared approximation may be
## incorrect
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(CAESAR, TOMATO)
## X-squared = 15.877, df = 1, p-value = 6.759e-05
fisher.test(table(CAESAR, TOMATO))
## 
##  Fisher's Exact Test for Count Data
## 
## data:  table(CAESAR, TOMATO)
## p-value = 3.442e-05
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##    3.850174 183.467108
## sample estimates:
## odds ratio 
##   22.10962

 

This suggests the potential for one of these associations to be due to confounding. We can perform a simple stratified analysis using the table() function:

 

table(CAESAR, ILL, TOMATO)
## , , TOMATO = 1
## 
##       ILL
## CAESAR  1  2
##      1 23  4
##      2  1  2
## 
## , , TOMATO = 2
## 
##       ILL
## CAESAR  1  2
##      1  3  1
##      2  2  9
table(TOMATO, ILL, CAESAR)
## , , CAESAR = 1
## 
##       ILL
## TOMATO  1  2
##      1 23  4
##      2  3  1
## 
## , , CAESAR = 2
## 
##       ILL
## TOMATO  1  2
##      1  1  2
##      2  2  9

 

It would be useful to calculate odds ratios for each stratum. We can define a simple function to calculate an odds ratio from a two-by-two table:

 

# odds ratio calculated as the cross-product ratio of a two-by-two table
or <- function(x) {(x[1,1] / x[1,2]) / (x[2,1] / x[2,2])}

 

We can use apply() to apply the or() function to the two-by-two table in each stratum:

 

tabC <- table(CAESAR, ILL, TOMATO)
apply(tabC, 3, or)
##    1    2 
## 11.5 13.5
tabT <- table(TOMATO, ILL, CAESAR)
apply(tabT, 3, or)
##        1        2 
## 1.916667 2.250000

 

The 3 instructs the apply() function to apply the or() function to the third dimension of the table objects (i.e. levels of the potential confounder in tabC and tabT).

The mantelhaen.test() function performs the stratified analysis:

 

mantelhaen.test(tabC)
## 
##  Mantel-Haenszel chi-squared test with continuity correction
## 
## data:  tabC
## Mantel-Haenszel X-squared = 5.752, df = 1, p-value = 0.01647
## alternative hypothesis: true common odds ratio is not equal to 1
## 95 percent confidence interval:
##   1.878994 83.156212
## sample estimates:
## common odds ratio 
##              12.5
mantelhaen.test(tabT)
## 
##  Mantel-Haenszel chi-squared test with continuity correction
## 
## data:  tabT
## Mantel-Haenszel X-squared = 0.049144, df = 1, p-value = 0.8246
## alternative hypothesis: true common odds ratio is not equal to 1
## 95 percent confidence interval:
##   0.3156862 13.4192331
## sample estimates:
## common odds ratio 
##          2.058219
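
The common odds ratio reported for tabC can be verified by hand: the Mantel-Haenszel estimate is the sum of a * d / n across strata divided by the sum of b * c / n, where a, b, c, and d are the cells of each stratum's two-by-two table and n is the stratum total:

# stratum TOMATO = 1: a = 23, b = 4, c = 1, d = 2, n = 30
# stratum TOMATO = 2: a = 3,  b = 1, c = 2, d = 9, n = 15
((23 * 2) / 30 + (3 * 9) / 15) / ((4 * 1) / 30 + (1 * 2) / 15)
## [1] 12.5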

 

It is likely that CAESAR salad was a vehicle of food-poisoning and that TOMATO salad was not. Many of those at the luncheon ate both CAESAR salad and TOMATO salad. CAESAR confounded the relationship between TOMATO and ILL, and this resulted in a spurious association between TOMATO and ILL.

It only makes sense to calculate a common odds ratio in the absence of interaction. We can check for interaction ‘by eye’ by examining and comparing the odds ratios for each stratum as we did above.

There does appear to be an interaction between CAESAR, WINE, and ILL:

 

tabW <- table(CAESAR, ILL, WINE)
apply(tabW, 3, or)
##    1    2 
## 63.0  2.5

 

Woolf’s test for interaction (also known as Woolf’s test for the homogeneity of odds ratios) provides a formal test for interaction. It compares the stratum-specific log odds ratios with their weighted mean, weighting each stratum by the inverse of the variance of its log odds ratio.

R does not provide a function to perform Woolf’s test for the homogeneity of odds ratios but it is possible to write a function to perform this test.

First we will create a template for the function:

 

woolf.test <- function(x) {}

 

And then use the fix() function to edit the woolf.test() function:

 

fix(woolf.test)

 

We can now edit this function to make it do something useful:

 

function(x) {
  # add 0.5 to each cell to avoid division by zero in sparse tables
  x <- x + 0.5
  # k is the number of strata
  k <- dim(x)[3]
  # stratum-specific odds ratios
  or <- apply(x, 3, function(x)
              {(x[1, 1] / x[1, 2]) / (x[2, 1] / x[2, 2])})
  # weights: the inverse of the variance of each stratum's log odds ratio
  w <- apply(x, 3, function(x) {1 / sum(1 / x)})
  # weighted sum of squared deviations of the stratum log odds
  # ratios about their weighted mean
  chi.sq <- sum(w * (log(or) - weighted.mean(log(or), w))^2)
  p <- pchisq(chi.sq, df = k - 1, lower.tail = FALSE)
  cat("\nWoolf's X2 :", chi.sq,
      "\np-value    :", p, "\n")
}

Once you have made the changes shown above, check your work, save the file, and quit the editor. We can use the woolf.test() function to test for a three-way interaction between CAESAR, WINE, and ILL:

 

woolf.test(tabW)

 

Which returns:

 

## 
## Woolf's X2 : 3.319492 
## p-value    : 0.06846297

 

Which is weak evidence of an interaction.

We should test for interaction between CAESAR, TOMATO, and ILL before accepting the results reported by the mantelhaen.test() function:

 

woolf.test(tabC)
## 
## Woolf's X2 : 0.0001233783 
## p-value    : 0.9911376

 

We can repeat this analysis using logistic regression.

We need to change the coding of the variables to 0 and 1 before specifying the model:

 

detach(bateman)
bateman <- 2 - bateman
bateman
bateman.lreg <- glm(formula = ILL ~ CAESAR + TOMATO,
                    family = binomial(logit), data = bateman)
summary(bateman.lreg)
bateman.lreg <- update(bateman.lreg, . ~ . - TOMATO)
summary(bateman.lreg)
## 
## Call:
## glm(formula = ILL ~ CAESAR + TOMATO, family = binomial(logit), 
##     data = bateman)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.960  -0.641   0.563   0.563   1.835  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  -1.4780     0.7101  -2.082  0.03739 * 
## CAESAR        2.5202     0.9653   2.611  0.00904 **
## TOMATO        0.7197     0.9552   0.753  0.45116   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 58.574  on 44  degrees of freedom
## Residual deviance: 41.408  on 42  degrees of freedom
## AIC: 47.408
## 
## Number of Fisher Scoring iterations: 4
## 
## Call:
## glm(formula = ILL ~ CAESAR, family = binomial(logit), data = bateman)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9103  -0.6945   0.5931   0.5931   1.7552  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -1.2993     0.6513  -1.995 0.046066 *  
## CAESAR        2.9479     0.8141   3.621 0.000293 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 58.574  on 44  degrees of freedom
## Residual deviance: 41.940  on 43  degrees of freedom
## AIC: 45.94
## 
## Number of Fisher Scoring iterations: 4

Interactions are specified using the multiply (*) symbol in the model formula. In R formulas, CAESAR * WINE expands to CAESAR + WINE + CAESAR:WINE (the main effects plus the interaction term), so the formula below is spelled out redundantly but harmlessly:

 

bateman.lreg <- glm(formula = ILL ~ CAESAR + WINE + CAESAR * WINE,
                    family = binomial(logit), data = bateman)
summary(bateman.lreg)
## 
## Call:
## glm(formula = ILL ~ CAESAR + WINE + CAESAR * WINE, family = binomial(logit), 
##     data = bateman)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0393  -0.4590   0.5168   0.5168   2.1460  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -1.092e-15  1.000e+00   0.000   1.0000  
## CAESAR       9.163e-01  1.304e+00   0.703   0.4822  
## WINE        -2.197e+00  1.453e+00  -1.512   0.1305  
## CAESAR:WINE  3.227e+00  1.787e+00   1.806   0.0709 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 58.574  on 44  degrees of freedom
## Residual deviance: 38.508  on 41  degrees of freedom
## AIC: 46.508
## 
## Number of Fisher Scoring iterations: 4
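
Because the * operator already expands to the main effects plus their interaction, the same model can be specified more compactly:

bateman.lreg <- glm(formula = ILL ~ CAESAR * WINE,
                    family = binomial(logit), data = bateman)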

 

Before we continue, it is probably a good idea to save the woolf.test() function for later use:

 

save(woolf.test, file = "woolf.r")

 

3.1 Matched data

Matching is another way to control for the effects of potential confounding variables. Matching is usually performed during data-collection as part of the design of a study.

In a matched case-control study, each case is matched with one or more controls chosen to have the same values over a set of potential confounding variables. In order to illustrate how matched data may be analysed using tabulation and stratification in R we will start with the simple case of one-to-one matching (i.e. each case has a single matched control):

 

octe <- read.table("octe.dat", header = TRUE)
octe[1:10, ]
##    ID CASE OC
## 1   1    1  1
## 2   1    2  1
## 3   2    1  1
## 4   2    2  1
## 5   3    1  1
## 6   3    2  1
## 7   4    1  1
## 8   4    2  1
## 9   5    1  1
## 10  5    2  1

 

This data is from a matched case-control study investigating the association between oral contraceptive use and thromboembolism. The cases are 175 women aged between 15 and 44 years admitted to hospital for thromboembolism and discharged alive. The controls are female patients admitted for conditions believed to be unrelated to oral contraceptive use. Cases and controls were matched on age, ethnic group, marital status, parity, income, place of residence, and date of hospitalisation. The variables in the dataset are:

ID Identifier for the matched sets of cases and controls
CASE Case (1) or control (2)
OC Used oral contraceptives in the previous month (1=yes, 2=no)

 

The dataset consists of 350 records:

 

nrow(octe)
## [1] 350

 

There are 175 matched sets of cases and controls:

 

length(unique(octe$ID))
## [1] 175

 

In each matched set of cases and controls there is one case and one control (only the first ten rows of the table are shown below):

 

table(octe$ID, octe$CASE)
##     
##      1 2
##   1  1 1
##   2  1 1
##   3  1 1
##   4  1 1
##   5  1 1
##   6  1 1
##   7  1 1
##   8  1 1
##   9  1 1
##   10 1 1

This data may be analysed using McNemar’s chi-squared test, which uses the number of pairs of matched cases and controls that are discordant with respect to exposure.

To find the number of discordant pairs we need to split the dataset into cases and controls:

 

octe.cases <- subset(octe, CASE == 1)
octe.controls <- subset(octe, CASE == 2)

 

Sorting these two datasets (i.e. octe.cases and octe.controls) by the ID variable simplifies the analysis:

 

octe.cases <- octe.cases[order(octe.cases$ID), ]
octe.controls <- octe.controls[order(octe.controls$ID), ]

 

Since the two datasets (i.e. octe.cases and octe.controls) are now sorted by the ID variable, we can use the table() function to retrieve the number of concordant and discordant pairs and store them in a table object:

 

tab <- table(octe.cases$OC, octe.controls$OC)
tab
##    
##      1  2
##   1 10 57
##   2 13 95

 

This table object (i.e. tab) can then be passed to the mcnemar.test() function:

 

mcnemar.test(tab)
## 
##  McNemar's Chi-squared test with continuity correction
## 
## data:  tab
## McNemar's chi-squared = 26.414, df = 1, p-value = 2.755e-07
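
The test statistic depends only on the discordant pairs: with a continuity correction it is (|r - s| - 1)^2 / (r + s), where r and s are the two discordant counts. A quick hand check against the table above reproduces the reported value:

(abs(57 - 13) - 1)^2 / (57 + 13)
## [1] 26.41429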

 

The mcnemar.test() function does not provide an estimate of the odds ratio. This is the ratio of the discordant pairs:

 

r <- tab[1,2]
s <- tab[2,1]
rdp <- r / s
rdp
## [1] 4.384615

 

A confidence interval can also be calculated. The ratio of discordant pairs r / s is equal to p / (1 - p), where p = r / (r + s), so we can transform an exact binomial confidence interval for p:

 

ci.p <- binom.test(r, r + s)$conf.int
ci.rdp <- ci.p / (1 - ci.p)
ci.rdp
## [1] 2.371377 8.731311
## attr(,"conf.level")
## [1] 0.95

 

This provides a 95% confidence interval. Other (e.g. 99%) confidence intervals can be produced by specifying appropriate values for the conf.level parameter of the binom.test() function:

 

ci.p <- binom.test(r, r + s, conf.level = 0.99)$conf.int
ci.rdp <- ci.p / (1 - ci.p)
ci.rdp
## [1]  2.010478 10.949095
## attr(,"conf.level")
## [1] 0.99

An alternative way of analysing this data is to use the mantelhaen.test() function:

 

tab <- table(octe$OC, octe$CASE, octe$ID)
mantelhaen.test(tab)
## 
##  Mantel-Haenszel chi-squared test with continuity correction
## 
## data:  tab
## Mantel-Haenszel X-squared = 26.414, df = 1, p-value = 2.755e-07
## alternative hypothesis: true common odds ratio is not equal to 1
## 95 percent confidence interval:
##  2.400550 8.008521
## sample estimates:
## common odds ratio 
##          4.384615

 

Note that for one-to-one matched pairs the Mantel-Haenszel common odds ratio (4.384615) is identical to the ratio of discordant pairs calculated earlier. The Mantel-Haenszel approach is preferred because it can also be used with data from matched case-control studies that match more than one control to each case. Multiple matching is useful when the condition being studied is rare or during the early stages of an outbreak (i.e. when cases are hard to find and controls are easy to find).

We will now work with some data where each case has one or more controls:

 

tsstamp <- read.table("tsstamp.dat", header = TRUE)
tsstamp
##    ID CASE RBTAMP
## 1   1    1      1
## 2   1    2      1
## 3   1    2      1
## 4   2    1      1
## 5   2    2      2
## 6   2    2      1
## 7   2    2      1
## 8   3    1      2
## 9   3    2      2
## 10  3    2      1
## 11  4    1      1
## 12  4    2      2
## 13  5    1      1
## 14  5    2      1
## 15  5    2      2
## 16  6    1      1
## 17  6    2      1
## 18  7    1      1
## 19  7    2      2
## 20  7    2      2
## 21  8    1      1
## 22  8    2      1
## 23  8    2      2
## 24  9    1      2
## 25  9    2      1
## 26  9    2      2
## 27  9    2      2
## 28 10    1      1
## 29 10    2      2
## 30 10    2      2
## 31 11    1      1
## 32 11    2      2
## 33 11    2      2
## 34 12    1      1
## 35 12    2      2
## 36 12    2      2
## 37 12    2      2
## 38 13    1      1
## 39 13    2      2
## 40 13    2      2
## 41 14    1      2
## 42 14    2      2
## 43 14    2      2

 

This data is from a matched case-control study investigating the association between the use of different brands of tampon and toxic shock syndrome undertaken during an outbreak. Only a subset of the original dataset is used here. The variables in the dataset are:

ID Identifier for the matched sets of cases and controls
CASE Case (1) or control (2)
RBTAMP Used Rely brand tampons (1=yes, 2=no)

 

The dataset consists of forty-three (43) records:

 

nrow(tsstamp)
## [1] 43

 

There are fourteen (14) matched sets of cases and controls:

 

length(unique(tsstamp$ID))
## [1] 14

 

Each matched set of cases and controls consists of one case and one or more controls:

 

table(tsstamp$ID, tsstamp$CASE)
##     
##      1 2
##   1  1 2
##   2  1 3
##   3  1 2
##   4  1 1
##   5  1 2
##   6  1 1
##   7  1 2
##   8  1 2
##   9  1 3
##   10 1 2
##   11 1 2
##   12 1 3
##   13 1 2
##   14 1 2

 

McNemar’s chi-squared test is not useful for this data as it is limited to the special case of one-to-one matching.

Analysing this data using a simple tabulation such as:

 

fisher.test(table(tsstamp$RBTAMP, tsstamp$CASE))
## 
##  Fisher's Exact Test for Count Data
## 
## data:  table(tsstamp$RBTAMP, tsstamp$CASE)
## p-value = 0.007805
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##   1.542686 53.734756
## sample estimates:
## odds ratio 
##   7.709932

 

ignores the matched nature of the data and is, therefore, also not useful for this data.

The matched nature of the data may be accounted for by stratifying on the variable that identifies the matched sets of cases and controls (i.e. the ID variable) using the mantelhaen.test() function:

 

mantelhaen.test(table(tsstamp$RBTAMP, tsstamp$CASE, tsstamp$ID))
## 
##  Mantel-Haenszel chi-squared test with continuity correction
## 
## data:  table(tsstamp$RBTAMP, tsstamp$CASE, tsstamp$ID)
## Mantel-Haenszel X-squared = 5.9384, df = 1, p-value = 0.01481
## alternative hypothesis: true common odds ratio is not equal to 1
## 95 percent confidence interval:
##   1.589505 43.191463
## sample estimates:
## common odds ratio 
##          8.285714

 

Analysis of several risk factors, or adjustment for confounding variables not matched for in the design of a matched case-control study, cannot be performed using tabulation-based procedures such as McNemar’s chi-squared test and Mantel-Haenszel procedures. In these situations a special form of logistic regression, called conditional logistic regression, should be used.
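
As a pointer for further study (not part of this exercise), conditional logistic regression is available in the clogit() function of the survival package. A minimal sketch, assuming the package is installed and using hypothetical 0/1 recodes of the study variables:

library(survival)
# recode so that 1 = case / exposed and 0 = control / not exposed
tsstamp$CASE01 <- 2 - tsstamp$CASE
tsstamp$RBTAMP01 <- 2 - tsstamp$RBTAMP
# conditional logistic regression stratified on the matched sets
clogit(CASE01 ~ RBTAMP01 + strata(ID), data = tsstamp)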

We can now quit R:

 

q()

 

For this exercise there is no need to save the workspace image so click the No or Don’t Save button (GUI) or enter n when prompted to save the workspace image (terminal).

3.2 Summary

  • R provides functions for many kinds of complex statistical analysis. We have looked at using the generalised linear model glm() function to perform logistic regression, at the mantelhaen.test() function to perform stratified analyses, and at the mantelhaen.test() and mcnemar.test() functions to analyse data from matched case-control studies.

  • R can be extended by writing new functions. New functions can perform simple or complex data analysis. New functions can be composed of parts of existing functions. New functions can be saved and used in subsequent R sessions. By building your own functions you can use R to build your own statistical analysis system.