Exercise 4 Analysing some data with R
In this exercise we will use the R
functions we have already used and the functions we have added to R
to
analyse a small dataset. First we will start R
and retrieve our functions:
load("tab2by2.r")
load("lregor.r")
And then retrieve and attach the sample dataset:
<- read.table("gudhiv.dat", header = TRUE, na.strings = "X")
gudhiv attach(gudhiv)
This data is from a cross-sectional study of 435 male patients who presented with sexually transmitted infections at an outpatient clinic in The Gambia between August 1988 and June 1990. Several studies have documented an association between genital ulcer disease (GUD) and HIV infection. A study of Gambian prostitutes documented an association between seropositivity for HIV-2 and antibodies against Treponema pallidum
(a serological test for syphilis). Prostitutes are not the ideal population for such studies as they may have experienced multiple sexually transmitted infections and it is difficult to quantify the number of times they may have had sex with HIV-2 seropositive customers. A sample of males with sexually transmitted infections is easier to study as they have probably had fewer sexual partners than prostitutes and much less contact with sexually transmitted infection pathogens. In such a sample it is also easier to find subjects and collect data. The variables in the dataset are:
MARRIED | Married (1=yes, 0=no) |
GAMBIAN | Gambian Citizen (1=yes, 0=no) |
GUD | History of GUD or syphilis (1=yes, 0=no) |
UTIGC | History of urethral discharge (1=yes, 0=no) |
CIR | Circumcised (1=yes, 0=no) |
TRAVOUT | Travelled outside of Gambia and Senegal (1=yes, 0=no) |
SEXPRO | Ever had sex with a prostitute (1=yes, 0=no) |
INJ12M | Injection in previous 12 months (1=yes, 0=no) |
PARTNERS | Sexual partners in previous 12 months (number) |
HIV HIV-2 | positive serology (1=yes, 0=no) |
Data is available for all 435 patients enrolled in the study.
We will start our analysis by examining pairwise associations between the binary exposure variables and the
HIV variable using the tab2by2()
function that we wrote earlier:
tab2by2(MARRIED, HIV)
tab2by2(GAMBIAN, HIV)
tab2by2(GUD, HIV)
tab2by2(UTIGC, HIV)
tab2by2(CIR, HIV)
tab2by2(TRAVOUT, HIV)
tab2by2(SEXPRO, HIV)
tab2by2(INJ12M, HIV)
tab2by2(MARRIED, HIV)
##
## outcome
## exposure 0 1
## 0 321 13
## 1 93 8
##
## Relative Risk : 1.043751
## 95% CI : 0.9818512 1.109554
##
## Sample Odds Ratio : 2.124069
## 95% CI : 0.8545749 5.279433
##
## MLE Odds Ratio : 2.119801
## 95% CI : 0.7380371 5.714354
tab2by2(GAMBIAN, HIV)
##
## outcome
## exposure 0 1
## 0 73 4
## 1 341 17
##
## Relative Risk : 0.9953155
## 95% CI : 0.9400068 1.053879
##
## Sample Odds Ratio : 0.909824
## 95% CI : 0.2974059 2.783333
##
## MLE Odds Ratio : 0.9100104
## 95% CI : 0.2853202 3.826485
tab2by2(GUD, HIV)
##
## outcome
## exposure 0 1
## 0 339 12
## 1 72 9
##
## Relative Risk : 1.086538
## 95% CI : 1.003531 1.176412
##
## Sample Odds Ratio : 3.53125
## 95% CI : 1.434372 8.693509
##
## MLE Odds Ratio : 3.517408
## 95% CI : 1.258556 9.491924
tab2by2(UTIGC, HIV)
##
## outcome
## exposure 0 1
## 0 261 12
## 1 151 9
##
## Relative Risk : 1.013027
## 95% CI : 0.9678841 1.060275
##
## Sample Odds Ratio : 1.296358
## 95% CI : 0.5338453 3.147997
##
## MLE Odds Ratio : 1.295532
## 95% CI : 0.4703496 3.438842
tab2by2(CIR, HIV)
##
## outcome
## exposure 0 1
## 0 10 3
## 1 392 17
##
## Relative Risk : 0.8025903
## 95% CI : 0.5955085 1.081682
##
## Sample Odds Ratio : 0.1445578
## 95% CI : 0.0364195 0.5737851
##
## MLE Odds Ratio : 0.1460183
## 95% CI : 0.03322189 0.899754
tab2by2(TRAVOUT, HIV)
##
## outcome
## exposure 0 1
## 0 152 2
## 1 256 19
##
## Relative Risk : 1.060268
## 95% CI : 1.02181 1.100173
##
## Sample Odds Ratio : 5.640625
## 95% CI : 1.295879 24.55218
##
## MLE Odds Ratio : 5.624226
## 95% CI : 1.32716 50.45859
tab2by2(SEXPRO, HIV)
##
## outcome
## exposure 0 1
## 0 268 13
## 1 143 8
##
## Relative Risk : 1.007093
## 95% CI : 0.9621259 1.054161
##
## Sample Odds Ratio : 1.153308
## 95% CI : 0.4671083 2.847562
##
## MLE Odds Ratio : 1.152912
## 95% CI : 0.4042323 3.083152
tab2by2(INJ12M, HIV)
##
## outcome
## exposure 0 1
## 0 146 7
## 1 268 14
##
## Relative Risk : 1.004097
## 95% CI : 0.9610996 1.049018
##
## Sample Odds Ratio : 1.089552
## 95% CI : 0.4301305 2.759916
##
## MLE Odds Ratio : 1.089351
## 95% CI : 0.4006202 3.263814
Note that our tab2by2()
function returns misleading risk ratio estimates and confidence intervals for this dataset. This is because the function expects the exposure
and outcome
variables to be ordered with exposure-present and outcome-present as the first category (e.g. 1 = present, 2 = absent). This coding is reversed (i.e. 0 = absent, 1 = present) in the gudhiv
dataset.
We can produce risk ratio estimates for variables in the gudhiv
data using the tab2by2()
function and a
simple transformation of the exposure
and outcome
variables. For example:
tab2by2(2 - GUD, 2 - HIV)
##
## outcome
## exposure 1 2
## 1 9 72
## 2 12 339
##
## Relative Risk : 3.25
## 95% CI : 1.417411 7.451965
##
## Sample Odds Ratio : 3.53125
## 95% CI : 1.434372 8.693509
##
## MLE Odds Ratio : 3.517408
## 95% CI : 1.258556 9.491924
The odds ratio estimates returned by the tab2by2()
function, with or without this transformation, are correct. The GUD
and TRAVOUT
variables are associated with HIV
.
PARTNERS
is a continuous variable and we should examine its distribution before doing anything with it:
table(PARTNERS)
hist(PARTNERS)
table(PARTNERS)
## PARTNERS
## 1 2 3 4 5 6 7 8 9
## 61 129 133 71 25 6 6 2 2
The distribution of PARTNERS
is severely non-normal. Instead of attempting to transform the variable we will produce summary statistics for each level of the HIV
variable and perform a non-parametric test:
by(PARTNERS, HIV, summary)
kruskal.test(PARTNERS ~ HIV)
## HIV: 0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 2.00 3.00 2.72 3.00 8.00
## ------------------------------------------------------------
## HIV: 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 4.000 5.000 5.381 7.000 9.000
##
## Kruskal-Wallis rank sum test
##
## data: PARTNERS by HIV
## Kruskal-Wallis chi-squared = 32.036, df = 1, p-value = 1.514e-08
An alternative way of looking at the data is as a tabulation:
table(PARTNERS, HIV)
## HIV
## PARTNERS 0 1
## 1 60 1
## 2 128 1
## 3 131 2
## 4 68 3
## 5 21 4
## 6 3 3
## 7 2 4
## 8 1 1
## 9 0 2
You can use the plot()
function to represent this table graphically:
plot(table(PARTNERS, HIV), color = c("lightgreen", "red"))
There appears to be an association between the number of sexual PARTNERS
in the previous twelve months and positive HIV
serology. The proportion with positive HIV
serology increases as the number of sexual partners increases:
prop.table(table(PARTNERS, HIV), 1) * 100
## HIV
## PARTNERS 0 1
## 1 98.3606557 1.6393443
## 2 99.2248062 0.7751938
## 3 98.4962406 1.5037594
## 4 95.7746479 4.2253521
## 5 84.0000000 16.0000000
## 6 50.0000000 50.0000000
## 7 33.3333333 66.6666667
## 8 50.0000000 50.0000000
## 9 0.0000000 100.0000000
The ‘1’ instructs the prop.table()
function to calculate row proportions. You can also use the plot()
function to represent this table graphically:
plot(prop.table(table(PARTNERS, HIV), 1) * 100, color = c("lightgreen", "red"))
The chi-square test for trend is an appropriate test to perform on this data. The prop.trend.test()
function that performs the chi-square test for trend requires you to specify the number of events and the number of trials. In this table:
table(PARTNERS, HIV)
## HIV
## PARTNERS 0 1
## 1 60 1
## 2 128 1
## 3 131 2
## 4 68 3
## 5 21 4
## 6 3 3
## 7 2 4
## 8 1 1
## 9 0 2
The number of events in each row is in the second column (labelled 1) and the number of trials is the total number of cases in each row of the table.
We can extract this data from a table object:
<- table(PARTNERS, HIV)
tab <- tab[ ,2]
events <- tab[ ,1] + tab[ ,2] trials
<- table(PARTNERS, HIV)
tab <- tab[ ,2]
events <- tab[ ,1] + tab[ ,2] trials
Another way of creating the trials
object would be to use the apply()
function to sum the rows of the tab object:
<- apply(tab, 1, sum) trials
Pass this data to the prop.trend.test()
function:
prop.trend.test(events, trials)
##
## Chi-squared Test for Trend in Proportions
##
## data: events out of trials ,
## using scores: 1 2 3 4 5 6 7 8 9
## X-squared = 76.389, df = 1, p-value < 2.2e-16
With a linear trend such as this we can use PARTNERS
in a logistic model without recoding or creating indicator variables. We can now specify and fit the logistic regression model:
<- glm(formula = HIV ~ GUD + TRAVOUT + PARTNERS,
gudhiv.lreg family = binomial(logit))
summary(gudhiv.lreg)
##
## Call:
## glm(formula = HIV ~ GUD + TRAVOUT + PARTNERS, family = binomial(logit))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.70415 -0.19849 -0.11148 -0.06247 3.11742
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.4854 1.4663 -6.469 9.86e-11 ***
## GUD 1.3869 0.5937 2.336 0.0195 *
## TRAVOUT 2.0867 0.9547 2.186 0.0288 *
## PARTNERS 1.1605 0.2050 5.662 1.50e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 167.364 on 425 degrees of freedom
## Residual deviance: 99.377 on 422 degrees of freedom
## (9 observations deleted due to missingness)
## AIC: 107.38
##
## Number of Fisher Scoring iterations: 8
We can use the lreg.or()
function that we wrote earlier to calculate and display odds ratios and confidence intervals:
lreg.or(gudhiv.lreg)
## OR LCI UCI
## (Intercept) 0.00 0.00 0.00
## GUD 4.00 1.25 12.81
## TRAVOUT 8.06 1.24 52.35
## PARTNERS 3.19 2.14 4.77
PARTNERS
is incorporated into the logistic model as a continuous variable.
The odds ratio reported for PARTNERS
is the odds ratio associated with a unit increase in the number of sexual PARTNERS
. A man reporting five sexual partners, for example, was over three times as likely (odds ratio = 3.19) to have a positive HIV-2 serology than a man reporting four sexual partners.
An alternative approach would be to have created an indicator variables:
.5 <- ifelse(PARTNERS > 5, 1, 0) part.gt
This creates a new variable (part.gt.5
) that indicates whether or not an individual subject reported having more than five sexual partners in the previous twelve months:
table(PARTNERS, part.gt.5)
## part.gt.5
## PARTNERS 0 1
## 1 61 0
## 2 129 0
## 3 133 0
## 4 71 0
## 5 25 0
## 6 0 6
## 7 0 6
## 8 0 2
## 9 0 2
You can also inspect this on a case-by-case basis:
cbind(PARTNERS, part.gt.5)
## PARTNERS part.gt.5
## [1,] 2 0
## [2,] 2 0
## [3,] 1 0
## [4,] 2 0
## [5,] 3 0
## [6,] 2 0
We can now specify and fit the logistic regression model using our indicator variable:
<- glm(formula = HIV ~ GUD + TRAVOUT + part.gt.5,
gudhiv.lreg family = binomial(logit))
summary(gudhiv.lreg)
lreg.or(gudhiv.lreg)
##
## Call:
## glm(formula = HIV ~ GUD + TRAVOUT + part.gt.5, family = binomial(logit))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6092 -0.2205 -0.2205 -0.0719 3.4521
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.9559 0.9850 -6.046 1.48e-09 ***
## GUD 1.4930 0.5805 2.572 0.0101 *
## TRAVOUT 2.2514 0.9319 2.416 0.0157 *
## part.gt.5 4.6791 0.7560 6.189 6.05e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 167.36 on 425 degrees of freedom
## Residual deviance: 106.43 on 422 degrees of freedom
## (9 observations deleted due to missingness)
## AIC: 114.43
##
## Number of Fisher Scoring iterations: 7
## OR LCI UCI
## (Intercept) 0.00 0.00 0.02
## GUD 4.45 1.43 13.89
## TRAVOUT 9.50 1.53 59.02
## part.gt.5 107.67 24.47 473.84
We can now quit R:
q()
For this exercise there is no need to save the workspace image so click the No or Don’t Save button (GUI) or enter n
when prompted to save the workspace image (terminal).
4.1 Summary
Using built-in functions and our own functions we can use
R
to analyse epidemiological data.The power of
R
is that it can be easily extended. Many user-contributed functions (usually packages of related functions) are available for download over the Internet. We will use one of these packages in the next exercise.