Exercise 2 Manipulating objects and creating new functions

In this exercise we will explore how to manipulate R objects and how to write functions that can manipulate and extract data and information from R objects and produce useful analyses.

Before we go any further we should start R and retrieve a dataset:

 

salex <- read.table("salex.dat", header = TRUE, na.strings = "9")

 

Missing values are coded as 9 throughout this dataset so we can use the na.strings parameter of the read.table() function to replace all 9’s with the special NA code when we retrieve the dataset. Check that this works by examining the salex data.frame:

 

salex
names(salex)
##   ILL HAM BEEF EGGS MUSHROOM PEPPER PORKPIE PASTA RICE LETTUCE TOMATO COLESLAW
## 1   1   1    1    1        1      1       2     2    2       2      2        2
## 2   1   1    1    1        2      2       1     2    2       2      1        2
## 3   1   1    1    1        1      1       1     1    1       1      2        2
## 4   1   1    1    1        2      2       2     2    2       1      1        2
## 5   1   1    1    1        1      1       1     1    1       1      1        1
## 6   1   1    1    1        2      2       2     2    2       2      1        1
##   CRISPS PEACHCAKE CHOCOLATE FRUIT TRIFLE ALMONDS
## 1      2         2         2     2      2       2
## 2      2         2         2     2      2       2
## 3      1         2         1     2      2       2
## 4      2         2         1     2      2       2
## 5      2         2         1     2      1       2
## 6      1         2         1     2      2       2
##  [1] "ILL"       "HAM"       "BEEF"      "EGGS"      "MUSHROOM"  "PEPPER"   
##  [7] "PORKPIE"   "PASTA"     "RICE"      "LETTUCE"   "TOMATO"    "COLESLAW" 
## [13] "CRISPS"    "PEACHCAKE" "CHOCOLATE" "FRUIT"     "TRIFLE"    "ALMONDS"

 

This data comes from a food-borne outbreak. On Saturday 17th October 1992, eighty-two people attended a buffet meal at a sports club. Within fourteen to twenty-four hours, fifty-one of the participants developed diarrhoea, with nausea, vomiting, abdominal pain and fever.

The columns in the dataset are as follows:

ILL Ill or not-ill
HAM Baked ham
BEEF Roast beef
EGGS Eggs
MUSHROOM Mushroom flan
PEPPER Pepper flan
PORKPIE Pork pie
PASTA Pasta salad
RICE Rice salad
LETTUCE Lettuce
TOMATO Tomato salad
COLESLAW Coleslaw
CRISPS Crisps
PEACHCAKE Peach cake
CHOCOLATE Chocolate cake
FRUIT Tropical fruit salad
TRIFLE Trifle
ALMONDS Almonds

Data is available for seventy-seven of the eighty-two people who attended the sports club buffet. All of the variables are coded 1=yes, 2=no.

We can use the attach() function to make it easier to access our data:

 

attach(salex)

 

The two-by-two table is a basic epidemiological tool. In analysing data from a food-borne outbreak collected as a retrospective cohort study, for example, we would tabulate each exposure (suspect foodstuffs) against the outcome (illness) and calculate risk ratios and confidence intervals. R has no explicit function to calculate risk ratios from two-by-two tables but we can easily write one ourselves.

The first step in writing such a function would be to create the two-by-two table. This can be done with the table() function. We will use a table of HAM by ILL as an illustration:

 

table(HAM, ILL)

 

This command produces the following output:

 

##    ILL
## HAM  1  2
##   1 46 17
##   2  5  9

 

We can manipulate the output directly but it is easier if we instruct R to save the output of the table() function in an object:

 

tab <- table(HAM, ILL)

The tab object contains the output of the table() function:

 

tab
##    ILL
## HAM  1  2
##   1 46 17
##   2  5  9

 

As it is stored in an object we can examine its contents on an item by item basis.

The tab object is an object of class table:

 

class(tab)
## [1] "table"

 

We can extract data from a table object by using indices or row and column co-ordinates:

 

tab[1,1]
tab[1,2]
tab[2,1]
## [1] 46
## [1] 17
## [1] 5

 

The numbers in the square brackets refer to the position (as row and column co-ordinates) of the data item in the table not the values of the variables. We can extract data using the values of the row and column variables by enclosing the index values in double quotes (“). For example:

 

tab["1","1"]
## [1] 46

The two methods of extracting data may be combined. For example:

 

tab[1,"1"]
## [1] 46

 

We can calculate a risk ratio using the extracted data:

 

(tab[1,1]/(tab[1,1]+tab[1,2]))/(tab[2,1]/(tab[2,1]+tab[2,2]))

 

Which returns a risk ratio of

 

## [1] 2.044444

 

This is a tedious calculation to have to type in every time you need to calculate a risk ratio from a two-by-two table. It would be better to have a function that calculates and displays the risk ratio automatically. Fortunately, R allows us to do just that.

The function() function allows us to create new functions in R:

 

tab2by2 <- function(exposure, outcome) {}

 

This creates an empty function called tab2by2 that expects two parameters called exposure and outcome. We could type the whole function in at the R command prompt but it is easier to use a text editor:

 

fix(tab2by2)

 

This will start an editor with the empty tab2by2() function already loaded. We can now edit this function to make it do something useful:

function(exposure, outcome)
  {
  tab <- table(exposure, outcome)
  a <- tab[1,1]
  b <- tab[1,2]
  c <- tab[2,1]
  d <- tab[2,2]
  rr <- (a / (a + b)) / (c / (c + d))
  print(tab)
  print(rr) 
  }

 

Once you have made the changes shown above, check your work, save the file, and quit the editor. Before proceeding we should examine the tab2by2() function to make sure we understand what the function will do:

  • The first line defines tab2by2 as a function that expects to be given two parameters which are called exposure and outcome.

  • The body of the function (i.e. the work of the function) is enclosed within curly brackets ({}).

  • The first line of the body of the function creates a table object (tab) using the variables specified when the tab2by2() function is called (these are the parameters exposure and outcome).

  • The next line creates four new objects (called a, b, c, and d) which contain the values of the four cells in the two-by-two table.

  • The following line calculates the risk ratio using the objects a, b, c, and d and stores the result of the calculation in an object called rr.

  • The final two lines print the contents of the tab and rr objects.

Let’s try the tab2by2() function with our test data:

 

tab2by2(HAM, ILL)
##         outcome
## exposure  1  2
##        1 46 17
##        2  5  9
## [1] 2.044444

 

The tab2by2() function displays a table of HAM by ILL followed by the risk ratio calculated from the data in the table.

Try producing another table:

 

tab2by2(PASTA, ILL)
##         outcome
## exposure  1  2
##        1 25  3
##        2 26 23
## [1] 1.682692

 

Have a look at the R objects available to you:

 

ls()
## [1] "fem"     "salex"   "tab"     "tab2by2"

 

Note that there are no a, b, c, d, or rr objects.

Examine the tab object:

 

tab
##    ILL
## HAM  1  2
##   1 46 17
##   2  5  9

 

This is the table of HAM by ILL that you created earlier not the table of PASTA by ILL that was created by the tab2by2() function.

The tab, a, b, c, d, and rr objects in the tab2by2() function are local to that function and do not change anything outside of that function. This means that the tab object inside the function is independent of any object of the same name outside of the function.

When a function completes its work, all of the objects that are local to that function are automatically removed. This is useful as it means that you can use object names inside functions that will not interfere with objects of the same name that are stored elsewhere. It also means that you do not clutter up the R workspace with temporary objects.

Just to prove that tab in the tab2by2() function exists only in the tab2by2() function we can delete the tab object from the R workspace:

 

rm(tab)

Now try another call to the tab2by2() function:

 

tab2by2(FRUIT, ILL)
##         outcome
## exposure  1  2
##        1  1  4
##        2 49 22
## [1] 0.2897959

Now list the R objects available to you:

 

ls()
## [1] "fem"     "salex"   "tab2by2"

 

Note that there are no tab, a, b, c, d, or rr objects.

The tab2by2() function is very limited. It only displays a table and calculates and displays a simple ratio. A more useful function would also calculate and display a confidence interval for the risk ratio. This is what we will do now. Use the fix() function to edit the tab2by2() function:

 

fix(tab2by2)

 

We can now edit this function to calculate and display a 95% confidence interval for the risk ratio.

 

function(exposure, outcome) {
  tab <- table(exposure, outcome)
  a <- tab[1,1]
  b <- tab[1,2]
  c <- tab[2,1]
  d <- tab[2,2]
  rr <- (a / (a + b)) / (c / (c + d))
  se.log.rr <- sqrt((b / a) / (a + b) + (d / c) / (c + d)) 
  lci.rr <- exp(log(rr) - 1.96 * se.log.rr)
  uci.rr <- exp(log(rr) + 1.96 * se.log.rr)
  print(tab)
  print(rr)
  print(lci.rr)
  print(uci.rr)
}

Once you have made the changes shown above, check your work, save the file, and quit the editor. We should test our revised function:

 

tab2by2(EGGS, ILL)

 

which produces the following output:

 

##         outcome
## exposure  1  2
##        1 40  6
##        2 10 20
## [1] 2.608696
## [1] 1.553564
## [1] 4.38044

 

The function works but the output could be improved. Use the fix() function to edit the tab2by2() function:

 

function(exposure, outcome) {
  tab <- table(exposure, outcome)
  a <- tab[1,1]
  b <- tab[1,2]
  c <- tab[2,1]
  d <- tab[2,2]
  rr <- (a / (a + b)) / (c / (c + d))
  se.log.rr <- sqrt((b / a) / (a + b) + (d / c) / (c + d)) 
  lci.rr <- exp(log(rr) - 1.96 * se.log.rr)
  uci.rr <- exp(log(rr) + 1.96 * se.log.rr)
  print(tab)
  cat("\nRR :", rr,
      "\n95% CI :", lci.rr, uci.rr, "\n")
}

 

Once you have made the changes shown above, save the file and quit the editor.

Now we can test our function again:

 

tab2by2(EGGS, ILL)

 

Which produces the following output:

 

##         outcome
## exposure  1  2
##        1 40  6
##        2 10 20
## 
## RR : 2.608696 
## 95% CI : 1.553564 4.38044

 

The tab2by2() function displays output but does not behave like a standard R function in the sense that you cannot save the results of the tab2by2() function into an object:

 

test2by2 <- tab2by2(EGGS, ILL)
##         outcome
## exposure  1  2
##        1 40  6
##        2 10 20
## 
## RR : 2.608696 
## 95% CI : 1.553564 4.38044

 

displays output but does not save anything in the test2by2 object:

 

test2by2
## NULL

The returned value (NULL) means that test2by2 is an empty object. We will not worry about this at the moment as the tab2by2() function is good-enough for our current purposes. In Exercise 6 we will explore how to make our own functions behave like standard R functions.

We will now add the calculation of the odds ratio and its 95% confidence interval to the tab2by2() function using the fix() function.

There are two ways of doing this. We could either calculate the odds ratio from the table and use (e.g.) the method of Woolf to calculate the confidence interval:

 

or <- (a / b) / (c / d)
se.log.or <- sqrt(1 / a + 1 / b + 1 / c + 1 / d)
lci.or <- exp(log(or) - 1.96 * se.log.or)
uci.or <- exp(log(or) + 1.96 * se.log.or)
cat("\nOR     :", or,
    "\n95% CI :", lci.or, uci.or, "\n")

 

or use the output of the fisher.test() function:

 

ft <- fisher.test(tab)
cat("\nOR     :", ft$estimate,
    "\n95% CI :", ft$conf.int, "\n")

 

Note that we can refer to components of a function’s output using the same syntax as when we refer to columns in a data.frame (e.g. ft$estimate to examine the estimate of the odds ratio from the fisher.test() function stored in the object ft).

The names of elements in the output of a standard function such as fisher.test() can be found in the documentation or the help system. For example:

 

help(fisher.test)

Output elements are listed under the Value heading.

Revise the tab2by2() function to include the calculation of the odds ratio and the 95% confidence interval. The revised function will look something like this:

 

function(exposure, outcome) {
  tab <- table(exposure, outcome)
  a <- tab[1,1]
  b <- tab[1,2]
  c <- tab[2,1]
  d <- tab[2,2]
  rr <- (a / (a + b)) / (c / (c + d))
  se.log.rr <- sqrt((b / a) / (a + b) + (d / c) / (c + d)) 
  lci.rr <- exp(log(rr) - 1.96 * se.log.rr)
  uci.rr <- exp(log(rr) + 1.96 * se.log.rr)
  or <- (a / b) / (c / d)
  se.log.or <- sqrt(1 / a + 1 / b + 1 / c + 1 / d)
  lci.or <- exp(log(or) - 1.96 * se.log.or)
  uci.or <- exp(log(or) + 1.96 * se.log.or)
  ft <- fisher.test(tab)
  cat("\n")
  print(tab)
  
  cat("\nRelative Risk     :", rr,
      "\n95% CI            :", lci.rr, uci.rr, "\n")
  
  cat("\nSample Odds Ratio :", or,
      "\n95% CI            :", lci.or, uci.or, "\n")

  cat("\nMLE Odds Ratio    :", ft$estimate,
      "\n95% CI            :",  ft$conf.int, "\n\n")
}

 

Once you have made the changes shown above, check your work, save the file, and quit the editor.

Test the tab2by2() function when you have added the calculation of the odds ratio and its 95% confidence interval.

Now that we have a function that will calculate risk ratios and odds ratios with confidence intervals from a two- by-two table we can use it to analyse the salex data:

 

tab2by2(HAM, ILL)
tab2by2(BEEF, ILL)
tab2by2(EGGS, ILL)
tab2by2(MUSHROOM, ILL)
tab2by2(PEPPER, ILL)
tab2by2(PORKPIE, ILL)
tab2by2(PASTA, ILL)
tab2by2(RICE, ILL)
tab2by2(LETTUCE, ILL)
tab2by2(TOMATO, ILL)
tab2by2(COLESLAW, ILL)
tab2by2(CRISPS, ILL)
tab2by2(PEACHCAKE, ILL)
tab2by2(CHOCOLATE, ILL)
tab2by2(FRUIT, ILL)
tab2by2(TRIFLE, ILL)
tab2by2(ALMONDS, ILL)
## 
##         outcome
## exposure  1  2
##        1 46 17
##        2  5  9
## 
## Relative Risk     : 2.044444 
## 95% CI            : 0.9964841 4.194501 
## 
## Sample Odds Ratio : 4.870588 
## 95% CI            : 1.428423 16.60756 
## 
## MLE Odds Ratio    : 4.75649 
## 95% CI            : 1.22777 20.82921
## 
##         outcome
## exposure  1  2
##        1 45 22
##        2  6  4
## 
## Relative Risk     : 1.119403 
## 95% CI            : 0.6568821 1.907592 
## 
## Sample Odds Ratio : 1.363636 
## 95% CI            : 0.3485746 5.334594 
## 
## MLE Odds Ratio    : 1.357903 
## 95% CI            : 0.2547114 6.428414
## 
##         outcome
## exposure  1  2
##        1 40  6
##        2 10 20
## 
## Relative Risk     : 2.608696 
## 95% CI            : 1.553564 4.38044 
## 
## Sample Odds Ratio : 13.33333 
## 95% CI            : 4.240168 41.92706 
## 
## MLE Odds Ratio    : 12.74512 
## 95% CI            : 3.762787 50.05419
## 
##         outcome
## exposure  1  2
##        1 24  6
##        2 25 19
## 
## Relative Risk     : 1.408 
## 95% CI            : 1.028944 1.926697 
## 
## Sample Odds Ratio : 3.04 
## 95% CI            : 1.037274 8.909506 
## 
## MLE Odds Ratio    : 2.995207 
## 95% CI            : 0.9421008 10.7953
## 
##         outcome
## exposure  1  2
##        1 24  3
##        2 23 22
## 
## Relative Risk     : 1.73913 
## 95% CI            : 1.26876 2.383882 
## 
## Sample Odds Ratio : 7.652174 
## 95% CI            : 2.013718 29.07844 
## 
## MLE Odds Ratio    : 7.448216 
## 95% CI            : 1.861728 44.12015
## 
##         outcome
## exposure  1  2
##        1 21  9
##        2 29 17
## 
## Relative Risk     : 1.110345 
## 95% CI            : 0.8044752 1.532509 
## 
## Sample Odds Ratio : 1.367816 
## 95% CI            : 0.5113158 3.659032 
## 
## MLE Odds Ratio    : 1.362228 
## 95% CI            : 0.4636016 4.190667
## 
##         outcome
## exposure  1  2
##        1 25  3
##        2 26 23
## 
## Relative Risk     : 1.682692 
## 95% CI            : 1.255392 2.255433 
## 
## Sample Odds Ratio : 7.371795 
## 95% CI            : 1.964371 27.66451 
## 
## MLE Odds Ratio    : 7.195422 
## 95% CI            : 1.829867 42.07488
## 
##         outcome
## exposure  1  2
##        1 28  4
##        2 23 22
## 
## Relative Risk     : 1.711957 
## 95% CI            : 1.250197 2.344268 
## 
## Sample Odds Ratio : 6.695652 
## 95% CI            : 2.017327 22.22335 
## 
## MLE Odds Ratio    : 6.532868 
## 95% CI            : 1.852297 29.84928
## 
##         outcome
## exposure  1  2
##        1 28  1
##        2 23 25
## 
## Relative Risk     : 2.014993 
## 95% CI            : 1.488481 2.727744 
## 
## Sample Odds Ratio : 30.43478 
## 95% CI            : 3.826938 242.041 
## 
## MLE Odds Ratio    : 29.32825 
## 95% CI            : 4.161299 1284.306
## 
##         outcome
## exposure  1  2
##        1 29  9
##        2 22 17
## 
## Relative Risk     : 1.352871 
## 95% CI            : 0.974698 1.877771 
## 
## Sample Odds Ratio : 2.489899 
## 95% CI            : 0.9347213 6.632562 
## 
## MLE Odds Ratio    : 2.459981 
## 95% CI            : 0.8467562 7.558026
## 
##         outcome
## exposure  1  2
##        1 29  3
##        2 21 23
## 
## Relative Risk     : 1.89881 
## 95% CI            : 1.366876 2.63775 
## 
## Sample Odds Ratio : 10.5873 
## 95% CI            : 2.806364 39.9417 
## 
## MLE Odds Ratio    : 10.26269 
## 95% CI            : 2.600771 60.35431
## 
##         outcome
## exposure  1  2
##        1 21 10
##        2 30 16
## 
## Relative Risk     : 1.03871 
## 95% CI            : 0.7529065 1.433004 
## 
## Sample Odds Ratio : 1.12 
## 95% CI            : 0.4258139 2.945888 
## 
## MLE Odds Ratio    : 1.118358 
## 95% CI            : 0.3858206 3.340535
## 
##         outcome
## exposure  1  2
##        1  2  2
##        2 49 24
## 
## Relative Risk     : 0.744898 
## 95% CI            : 0.27594 2.010846 
## 
## Sample Odds Ratio : 0.4897959 
## 95% CI            : 0.06497947 3.691936 
## 
## MLE Odds Ratio    : 0.4947099 
## 95% CI            : 0.03393887 7.209143
## 
##         outcome
## exposure  1  2
##        1 12  2
##        2 38 24
## 
## Relative Risk     : 1.398496 
## 95% CI            : 1.045064 1.871456 
## 
## Sample Odds Ratio : 3.789474 
## 95% CI            : 0.7791326 18.43089 
## 
## MLE Odds Ratio    : 3.733535 
## 95% CI            : 0.7318646 37.28268
## 
##         outcome
## exposure  1  2
##        1  1  4
##        2 49 22
## 
## Relative Risk     : 0.2897959 
## 95% CI            : 0.04985828 1.684408 
## 
## Sample Odds Ratio : 0.1122449 
## 95% CI            : 0.01185022 1.06318 
## 
## MLE Odds Ratio    : 0.1157141 
## 95% CI            : 0.002240848 1.256134
## 
##         outcome
## exposure  1  2
##        1 19  5
##        2 32 21
## 
## Relative Risk     : 1.311198 
## 95% CI            : 0.9718621 1.769016 
## 
## Sample Odds Ratio : 2.49375 
## 95% CI            : 0.8067804 7.708156 
## 
## MLE Odds Ratio    : 2.465794 
## 95% CI            : 0.7363311 9.778463
## 
##         outcome
## exposure  1  2
##        1  3  3
##        2 38 19
## 
## Relative Risk     : 0.75 
## 95% CI            : 0.3300089 1.7045 
## 
## Sample Odds Ratio : 0.5 
## 95% CI            : 0.09203498 2.716358 
## 
## MLE Odds Ratio    : 0.505905 
## 95% CI            : 0.06170211 4.141891

 

Make a note of any positive associations (i.e. with a risk ratio > 1 with a 95% confidence intervals that does not include one). We will use these for the next exercise when we will use logistic regression to analyse this data.

Save the tab2by2() function:

 

save(tab2by2, file = "tab2by2.r")

 

We can now quit R:

 

q()

 

For this exercise there is no need to save the workspace image so click the No or Don’t Save button (GUI) or enter n when prompted to save the workspace image (terminal).

2.1 Summary

  • R objects contain information that can be examined and manipulated.

  • R can be extended by writing new functions.

  • New functions can perform simple or complex data analysis.

  • New functions can be composed of parts of existing function.

  • New functions can be saved and used in subsequent R sessions.

  • Objects defined within functions are local to that function and only exist while that function is being used. This means that you can re-use meaningful names within functions without them interfering with each other.