Exercise 2 Manipulating objects and creating new functions
In this exercise we will explore how to manipulate R
objects and how to write functions that can manipulate and extract data and information from R
objects and produce useful analyses.
Before we go any further we should start R
and retrieve a dataset:
<- read.table("salex.dat", header = TRUE, na.strings = "9") salex
Missing values are coded as 9 throughout this dataset so we can use the na.strings
parameter of the read.table()
function to replace all 9’s with the special NA
code when we retrieve the dataset. Check that this works by examining the salex
data.frame:
salexnames(salex)
## ILL HAM BEEF EGGS MUSHROOM PEPPER PORKPIE PASTA RICE LETTUCE TOMATO COLESLAW
## 1 1 1 1 1 1 1 2 2 2 2 2 2
## 2 1 1 1 1 2 2 1 2 2 2 1 2
## 3 1 1 1 1 1 1 1 1 1 1 2 2
## 4 1 1 1 1 2 2 2 2 2 1 1 2
## 5 1 1 1 1 1 1 1 1 1 1 1 1
## 6 1 1 1 1 2 2 2 2 2 2 1 1
## CRISPS PEACHCAKE CHOCOLATE FRUIT TRIFLE ALMONDS
## 1 2 2 2 2 2 2
## 2 2 2 2 2 2 2
## 3 1 2 1 2 2 2
## 4 2 2 1 2 2 2
## 5 2 2 1 2 1 2
## 6 1 2 1 2 2 2
## [1] "ILL" "HAM" "BEEF" "EGGS" "MUSHROOM" "PEPPER"
## [7] "PORKPIE" "PASTA" "RICE" "LETTUCE" "TOMATO" "COLESLAW"
## [13] "CRISPS" "PEACHCAKE" "CHOCOLATE" "FRUIT" "TRIFLE" "ALMONDS"
This data comes from a food-borne outbreak. On Saturday 17th October 1992, eighty-two people attended a buffet meal at a sports club. Within fourteen to twenty-four hours, fifty-one of the participants developed diarrhoea, with nausea, vomiting, abdominal pain and fever.
The columns in the dataset are as follows:
ILL | Ill or not-ill |
HAM | Baked ham |
BEEF | Roast beef |
EGGS | Eggs |
MUSHROOM | Mushroom flan |
PEPPER | Pepper flan |
PORKPIE | Pork pie |
PASTA | Pasta salad |
RICE | Rice salad |
LETTUCE | Lettuce |
TOMATO | Tomato salad |
COLESLAW | Coleslaw |
CRISPS | Crisps |
PEACHCAKE | Peach cake |
CHOCOLATE | Chocolate cake |
FRUIT | Tropical fruit salad |
TRIFLE | Trifle |
ALMONDS | Almonds |
Data is available for seventy-seven of the eighty-two people who attended the sports club buffet. All of the variables are coded 1=yes, 2=no.
We can use the attach()
function to make it easier to access our data:
attach(salex)
The two-by-two table is a basic epidemiological tool. In analysing data from a food-borne outbreak collected as a retrospective cohort study, for example, we would tabulate each exposure (suspect foodstuffs) against the outcome (illness) and calculate risk ratios and confidence intervals. R
has no explicit function to calculate risk ratios from two-by-two tables but we can easily write one ourselves.
The first step in writing such a function would be to create the two-by-two table. This can be done with the table()
function. We will use a table of HAM
by ILL
as an illustration:
table(HAM, ILL)
This command produces the following output:
## ILL
## HAM 1 2
## 1 46 17
## 2 5 9
We can manipulate the output directly but it is easier if we instruct R
to save the output of the table()
function in an object:
<- table(HAM, ILL) tab
The tab
object contains the output of the table()
function:
tab
## ILL
## HAM 1 2
## 1 46 17
## 2 5 9
As it is stored in an object we can examine its contents on an item by item basis.
The tab
object is an object of class table
:
class(tab)
## [1] "table"
We can extract data from a table object by using indices or row and column co-ordinates:
1,1]
tab[1,2]
tab[2,1] tab[
## [1] 46
## [1] 17
## [1] 5
The numbers in the square brackets refer to the position (as row and column co-ordinates) of the data item in the table not the values of the variables. We can extract data using the values of the row and column variables by enclosing the index values in double quotes (“). For example:
"1","1"] tab[
## [1] 46
The two methods of extracting data may be combined. For example:
1,"1"] tab[
## [1] 46
We can calculate a risk ratio using the extracted data:
1,1]/(tab[1,1]+tab[1,2]))/(tab[2,1]/(tab[2,1]+tab[2,2])) (tab[
Which returns a risk ratio of
## [1] 2.044444
This is a tedious calculation to have to type in every time you need to calculate a risk ratio from a two-by-two table. It would be better to have a function that calculates and displays the risk ratio automatically. Fortunately, R
allows us to do just that.
The function()
function allows us to create new functions in R
:
<- function(exposure, outcome) {} tab2by2
This creates an empty function called tab2by2
that expects two parameters called exposure
and outcome
. We could type the whole function in at the R
command prompt but it is easier to use a text editor:
fix(tab2by2)
This will start an editor with the empty tab2by2()
function already loaded. We can now edit this function to make it do something useful:
function(exposure, outcome)
{<- table(exposure, outcome)
tab <- tab[1,1]
a <- tab[1,2]
b <- tab[2,1]
c <- tab[2,2]
d <- (a / (a + b)) / (c / (c + d))
rr print(tab)
print(rr)
}
Once you have made the changes shown above, check your work, save the file, and quit the editor.
Before proceeding we should examine the tab2by2()
function to make sure we understand what the function will do:
The first line defines
tab2by2
as a function that expects to be given two parameters which are calledexposure
andoutcome
.The body of the function (i.e. the work of the function) is enclosed within curly brackets (
{}
).The first line of the body of the function creates a table object (
tab
) using the variables specified when thetab2by2()
function is called (these are the parametersexposure
andoutcome
).The next line creates four new objects (called
a
,b
,c
, andd
) which contain the values of the four cells in the two-by-two table.The following line calculates the risk ratio using the objects
a
,b
,c
, andd
and stores the result of the calculation in an object calledrr
.The final two lines print the contents of the
tab
andrr
objects.
Let’s try the tab2by2()
function with our test data:
tab2by2(HAM, ILL)
## outcome
## exposure 1 2
## 1 46 17
## 2 5 9
## [1] 2.044444
The tab2by2()
function displays a table of HAM
by ILL
followed by the risk ratio calculated from the data in the table.
Try producing another table:
tab2by2(PASTA, ILL)
## outcome
## exposure 1 2
## 1 25 3
## 2 26 23
## [1] 1.682692
Have a look at the R
objects available to you:
ls()
## [1] "fem" "salex" "tab" "tab2by2"
Note that there are no a
, b
, c
, d
, or rr
objects.
Examine the tab
object:
tab
## ILL
## HAM 1 2
## 1 46 17
## 2 5 9
This is the table of HAM
by ILL
that you created earlier not the table of PASTA
by ILL
that was created by the tab2by2()
function.
The tab
, a
, b
, c
, d
, and rr
objects in the tab2by2()
function are local to that function and do not change anything outside of that function. This means that the tab
object inside the function is independent of any object of the same name outside of the function.
When a function completes its work, all of the objects that are local to that function are automatically removed. This is useful as it means that you can use object names inside functions that will not interfere with objects of the same name that are stored elsewhere. It also means that you do not clutter up the R
workspace with temporary objects.
Just to prove that tab
in the tab2by2()
function exists only in the tab2by2()
function we can delete the tab object from the R
workspace:
rm(tab)
Now try another call to the tab2by2()
function:
tab2by2(FRUIT, ILL)
## outcome
## exposure 1 2
## 1 1 4
## 2 49 22
## [1] 0.2897959
Now list the R
objects available to you:
ls()
## [1] "fem" "salex" "tab2by2"
Note that there are no tab
, a
, b
, c
, d
, or rr
objects.
The tab2by2()
function is very limited. It only displays a table and calculates and displays a simple ratio. A more useful function would also calculate and display a confidence interval for the risk ratio. This is what we will do now. Use the fix()
function to edit the tab2by2()
function:
fix(tab2by2)
We can now edit this function to calculate and display a 95% confidence interval for the risk ratio.
function(exposure, outcome) {
<- table(exposure, outcome)
tab <- tab[1,1]
a <- tab[1,2]
b <- tab[2,1]
c <- tab[2,2]
d <- (a / (a + b)) / (c / (c + d))
rr <- sqrt((b / a) / (a + b) + (d / c) / (c + d))
se.log.rr <- exp(log(rr) - 1.96 * se.log.rr)
lci.rr <- exp(log(rr) + 1.96 * se.log.rr)
uci.rr print(tab)
print(rr)
print(lci.rr)
print(uci.rr)
}
Once you have made the changes shown above, check your work, save the file, and quit the editor. We should test our revised function:
tab2by2(EGGS, ILL)
which produces the following output:
## outcome
## exposure 1 2
## 1 40 6
## 2 10 20
## [1] 2.608696
## [1] 1.553564
## [1] 4.38044
The function works but the output could be improved. Use the fix()
function to edit the tab2by2()
function:
function(exposure, outcome) {
<- table(exposure, outcome)
tab <- tab[1,1]
a <- tab[1,2]
b <- tab[2,1]
c <- tab[2,2]
d <- (a / (a + b)) / (c / (c + d))
rr <- sqrt((b / a) / (a + b) + (d / c) / (c + d))
se.log.rr <- exp(log(rr) - 1.96 * se.log.rr)
lci.rr <- exp(log(rr) + 1.96 * se.log.rr)
uci.rr print(tab)
cat("\nRR :", rr,
"\n95% CI :", lci.rr, uci.rr, "\n")
}
Once you have made the changes shown above, save the file and quit the editor.
Now we can test our function again:
tab2by2(EGGS, ILL)
Which produces the following output:
## outcome
## exposure 1 2
## 1 40 6
## 2 10 20
##
## RR : 2.608696
## 95% CI : 1.553564 4.38044
The tab2by2()
function displays output but does not behave like a standard R
function in the sense that you cannot save the results of the tab2by2()
function into an object:
<- tab2by2(EGGS, ILL) test2by2
## outcome
## exposure 1 2
## 1 40 6
## 2 10 20
##
## RR : 2.608696
## 95% CI : 1.553564 4.38044
displays output but does not save anything in the test2by2
object:
test2by2
## NULL
The returned value (NULL
) means that test2by2
is an empty object. We will not worry about this at the moment as the tab2by2()
function is good-enough for our current purposes. In Exercise 6 we will explore how to make our own functions behave like standard R
functions.
We will now add the calculation of the odds ratio and its 95% confidence interval to the tab2by2()
function using the fix()
function.
There are two ways of doing this. We could either calculate the odds ratio from the table and use (e.g.) the method of Woolf to calculate the confidence interval:
<- (a / b) / (c / d)
or <- sqrt(1 / a + 1 / b + 1 / c + 1 / d)
se.log.or <- exp(log(or) - 1.96 * se.log.or)
lci.or <- exp(log(or) + 1.96 * se.log.or)
uci.or cat("\nOR :", or,
"\n95% CI :", lci.or, uci.or, "\n")
or use the output of the fisher.test()
function:
<- fisher.test(tab)
ft cat("\nOR :", ft$estimate,
"\n95% CI :", ft$conf.int, "\n")
Note that we can refer to components of a function’s output using the same syntax as when we refer to columns in a data.frame (e.g. ft$estimate
to examine the estimate of the odds ratio from the fisher.test()
function stored in the object ft
).
The names of elements in the output of a standard function such as fisher.test()
can be found in the documentation or the help system. For example:
help(fisher.test)
Output elements are listed under the Value
heading.
Revise the tab2by2()
function to include the calculation of the odds ratio and the 95% confidence interval. The revised function will look something like this:
function(exposure, outcome) {
<- table(exposure, outcome)
tab <- tab[1,1]
a <- tab[1,2]
b <- tab[2,1]
c <- tab[2,2]
d <- (a / (a + b)) / (c / (c + d))
rr <- sqrt((b / a) / (a + b) + (d / c) / (c + d))
se.log.rr <- exp(log(rr) - 1.96 * se.log.rr)
lci.rr <- exp(log(rr) + 1.96 * se.log.rr)
uci.rr <- (a / b) / (c / d)
or <- sqrt(1 / a + 1 / b + 1 / c + 1 / d)
se.log.or <- exp(log(or) - 1.96 * se.log.or)
lci.or <- exp(log(or) + 1.96 * se.log.or)
uci.or <- fisher.test(tab)
ft cat("\n")
print(tab)
cat("\nRelative Risk :", rr,
"\n95% CI :", lci.rr, uci.rr, "\n")
cat("\nSample Odds Ratio :", or,
"\n95% CI :", lci.or, uci.or, "\n")
cat("\nMLE Odds Ratio :", ft$estimate,
"\n95% CI :", ft$conf.int, "\n\n")
}
Once you have made the changes shown above, check your work, save the file, and quit the editor.
Test the tab2by2()
function when you have added the calculation of the odds ratio and its 95% confidence interval.
Now that we have a function that will calculate risk ratios and odds ratios with confidence intervals from a two- by-two table we can use it to analyse the salex
data:
tab2by2(HAM, ILL)
tab2by2(BEEF, ILL)
tab2by2(EGGS, ILL)
tab2by2(MUSHROOM, ILL)
tab2by2(PEPPER, ILL)
tab2by2(PORKPIE, ILL)
tab2by2(PASTA, ILL)
tab2by2(RICE, ILL)
tab2by2(LETTUCE, ILL)
tab2by2(TOMATO, ILL)
tab2by2(COLESLAW, ILL)
tab2by2(CRISPS, ILL)
tab2by2(PEACHCAKE, ILL)
tab2by2(CHOCOLATE, ILL)
tab2by2(FRUIT, ILL)
tab2by2(TRIFLE, ILL)
tab2by2(ALMONDS, ILL)
##
## outcome
## exposure 1 2
## 1 46 17
## 2 5 9
##
## Relative Risk : 2.044444
## 95% CI : 0.9964841 4.194501
##
## Sample Odds Ratio : 4.870588
## 95% CI : 1.428423 16.60756
##
## MLE Odds Ratio : 4.75649
## 95% CI : 1.22777 20.82921
##
## outcome
## exposure 1 2
## 1 45 22
## 2 6 4
##
## Relative Risk : 1.119403
## 95% CI : 0.6568821 1.907592
##
## Sample Odds Ratio : 1.363636
## 95% CI : 0.3485746 5.334594
##
## MLE Odds Ratio : 1.357903
## 95% CI : 0.2547114 6.428414
##
## outcome
## exposure 1 2
## 1 40 6
## 2 10 20
##
## Relative Risk : 2.608696
## 95% CI : 1.553564 4.38044
##
## Sample Odds Ratio : 13.33333
## 95% CI : 4.240168 41.92706
##
## MLE Odds Ratio : 12.74512
## 95% CI : 3.762787 50.05419
##
## outcome
## exposure 1 2
## 1 24 6
## 2 25 19
##
## Relative Risk : 1.408
## 95% CI : 1.028944 1.926697
##
## Sample Odds Ratio : 3.04
## 95% CI : 1.037274 8.909506
##
## MLE Odds Ratio : 2.995207
## 95% CI : 0.9421008 10.7953
##
## outcome
## exposure 1 2
## 1 24 3
## 2 23 22
##
## Relative Risk : 1.73913
## 95% CI : 1.26876 2.383882
##
## Sample Odds Ratio : 7.652174
## 95% CI : 2.013718 29.07844
##
## MLE Odds Ratio : 7.448216
## 95% CI : 1.861728 44.12015
##
## outcome
## exposure 1 2
## 1 21 9
## 2 29 17
##
## Relative Risk : 1.110345
## 95% CI : 0.8044752 1.532509
##
## Sample Odds Ratio : 1.367816
## 95% CI : 0.5113158 3.659032
##
## MLE Odds Ratio : 1.362228
## 95% CI : 0.4636016 4.190667
##
## outcome
## exposure 1 2
## 1 25 3
## 2 26 23
##
## Relative Risk : 1.682692
## 95% CI : 1.255392 2.255433
##
## Sample Odds Ratio : 7.371795
## 95% CI : 1.964371 27.66451
##
## MLE Odds Ratio : 7.195422
## 95% CI : 1.829867 42.07488
##
## outcome
## exposure 1 2
## 1 28 4
## 2 23 22
##
## Relative Risk : 1.711957
## 95% CI : 1.250197 2.344268
##
## Sample Odds Ratio : 6.695652
## 95% CI : 2.017327 22.22335
##
## MLE Odds Ratio : 6.532868
## 95% CI : 1.852297 29.84928
##
## outcome
## exposure 1 2
## 1 28 1
## 2 23 25
##
## Relative Risk : 2.014993
## 95% CI : 1.488481 2.727744
##
## Sample Odds Ratio : 30.43478
## 95% CI : 3.826938 242.041
##
## MLE Odds Ratio : 29.32825
## 95% CI : 4.161299 1284.306
##
## outcome
## exposure 1 2
## 1 29 9
## 2 22 17
##
## Relative Risk : 1.352871
## 95% CI : 0.974698 1.877771
##
## Sample Odds Ratio : 2.489899
## 95% CI : 0.9347213 6.632562
##
## MLE Odds Ratio : 2.459981
## 95% CI : 0.8467562 7.558026
##
## outcome
## exposure 1 2
## 1 29 3
## 2 21 23
##
## Relative Risk : 1.89881
## 95% CI : 1.366876 2.63775
##
## Sample Odds Ratio : 10.5873
## 95% CI : 2.806364 39.9417
##
## MLE Odds Ratio : 10.26269
## 95% CI : 2.600771 60.35431
##
## outcome
## exposure 1 2
## 1 21 10
## 2 30 16
##
## Relative Risk : 1.03871
## 95% CI : 0.7529065 1.433004
##
## Sample Odds Ratio : 1.12
## 95% CI : 0.4258139 2.945888
##
## MLE Odds Ratio : 1.118358
## 95% CI : 0.3858206 3.340535
##
## outcome
## exposure 1 2
## 1 2 2
## 2 49 24
##
## Relative Risk : 0.744898
## 95% CI : 0.27594 2.010846
##
## Sample Odds Ratio : 0.4897959
## 95% CI : 0.06497947 3.691936
##
## MLE Odds Ratio : 0.4947099
## 95% CI : 0.03393887 7.209143
##
## outcome
## exposure 1 2
## 1 12 2
## 2 38 24
##
## Relative Risk : 1.398496
## 95% CI : 1.045064 1.871456
##
## Sample Odds Ratio : 3.789474
## 95% CI : 0.7791326 18.43089
##
## MLE Odds Ratio : 3.733535
## 95% CI : 0.7318646 37.28268
##
## outcome
## exposure 1 2
## 1 1 4
## 2 49 22
##
## Relative Risk : 0.2897959
## 95% CI : 0.04985828 1.684408
##
## Sample Odds Ratio : 0.1122449
## 95% CI : 0.01185022 1.06318
##
## MLE Odds Ratio : 0.1157141
## 95% CI : 0.002240848 1.256134
##
## outcome
## exposure 1 2
## 1 19 5
## 2 32 21
##
## Relative Risk : 1.311198
## 95% CI : 0.9718621 1.769016
##
## Sample Odds Ratio : 2.49375
## 95% CI : 0.8067804 7.708156
##
## MLE Odds Ratio : 2.465794
## 95% CI : 0.7363311 9.778463
##
## outcome
## exposure 1 2
## 1 3 3
## 2 38 19
##
## Relative Risk : 0.75
## 95% CI : 0.3300089 1.7045
##
## Sample Odds Ratio : 0.5
## 95% CI : 0.09203498 2.716358
##
## MLE Odds Ratio : 0.505905
## 95% CI : 0.06170211 4.141891
Make a note of any positive associations (i.e. with a risk ratio > 1 with a 95% confidence intervals that does not include one). We will use these for the next exercise when we will use logistic regression to analyse this data.
Save the tab2by2()
function:
save(tab2by2, file = "tab2by2.r")
We can now quit R
:
q()
For this exercise there is no need to save the workspace image so click the No or Don’t Save button (GUI) or enter n
when prompted to save the workspace image (terminal).
2.1 Summary
R
objects contain information that can be examined and manipulated.R
can be extended by writing new functions.New functions can perform simple or complex data analysis.
New functions can be composed of parts of existing function.
New functions can be saved and used in subsequent
R
sessions.Objects defined within functions are local to that function and only exist while that function is being used. This means that you can re-use meaningful names within functions without them interfering with each other.