11 Hypothesis testing with randomization, Part 2
2.0
11.1 Introduction
Now that we have learned about hypothesis testing, we’ll explore a different example. Although the rubric for performing the hypothesis test will not change, the individual steps will be implemented in a different way due to the research question we’re asking and the type of data used to answer it.
11.1.1 Install new packages
If you are using RStudio Workbench, you do not need to install any packages. (Any packages you need should already be installed by the server administrators.)
If you are using R and RStudio on your own machine instead of accessing RStudio Workbench through a browser, you’ll need to type the following command at the Console:
install.packages("MASS")
11.1.2 Download the R notebook file
Check the upper-right corner in RStudio to make sure you’re in your intro_stats
project. Then click on the following link to download this chapter as an R notebook file (.Rmd
).
Once the file is downloaded, move it to your project folder in RStudio and open it there.
11.2 Load packages
In additional to tidyverse
and janitor
, we load the MASS
package to access the Melanoma
data on patients in Denmark with malignant melanoma, and the infer
package for inference tools.
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
11.3 Our research question
We know that certain types of cancer are more common among females or males. Is there a sex bias among patients with malignant melanoma?
Let’s jump into the “Exploratory data analysis” part of the rubric first.
11.4 Exploratory data analysis
11.4.1 Use data documentation (help files, code books, Google, etc.) to determine as much as possible about the data provenance and structure.
You can look at the help file by typing ?Melanoma
at the Console. However, do not put that command here in a code chunk. The R Notebook has no way of displaying a help file when it’s processed. Be careful: there’s another data set called melanoma
with a lower-case “m”. Make sure you are using an uppercase “M”.
There is a reference at the bottom of the help file.
Exercise 1
Using the reference in the help file, do an internet search to find the source of this data. How can you tell that this reference is not, in fact, a reference to a study of cancer patients in Denmark?
Please write up your answer here.
From the exercise above, we can see that it will be very difficult, if not impossible, to discover anything useful about the true provenance of the data (unless you happen to have a copy of that textbook, which in theory provided another more primary source). We will not know, for example, how the data was collected and if the patients consented to having their data shared publicly. The data is suitably anonymized, though, so we don’t have any serious concerns about the privacy of the data. Having said that, if a condition is rare enough, a dedicated research can often “de-anonymize” data by cross-referencing information in the data to other kinds of public records. But melanoma is not particularly rare. At any rate, all we can do is assume that the textbook authors obtained the data from a source that used proper procedures for collecting and handling the data.
We print the data frame:
Melanoma
Use glimpse
to examine the structure of the data:
glimpse(Melanoma)
## Rows: 205
## Columns: 7
## $ time <int> 10, 30, 35, 99, 185, 204, 210, 232, 232, 279, 295, 355, 386,…
## $ status <int> 3, 3, 2, 3, 1, 1, 1, 3, 1, 1, 1, 3, 1, 1, 1, 3, 1, 1, 1, 1, …
## $ sex <int> 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, …
## $ age <int> 76, 56, 41, 71, 52, 28, 77, 60, 49, 68, 53, 64, 68, 63, 14, …
## $ year <int> 1972, 1968, 1977, 1968, 1965, 1971, 1972, 1974, 1968, 1971, …
## $ thickness <dbl> 6.76, 0.65, 1.34, 2.90, 12.08, 4.84, 5.16, 3.22, 12.88, 7.41…
## $ ulcer <int> 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
11.4.2 Prepare the data for analysis.
It appears that sex
is coded as an integer. You will recall that we need to convert it to a factor variable since it is categorical, not numerical.
Exercise 2
According to the help file, which number corresponds to which sex?
Please write up your answer here.
We can convert a numerical variable a couple of different ways. In Chapter 3, we used the as_factor
command. That command works fine, but it doesn’t give you a way to change the levels of the variable. In other words, if we used as_factor
here, we would get a factor variable that still contained zeroes and ones.
Instead, we will use the factor
command. It allows us to manually relabel the levels. The levels
argument requires a vector (with c
) of the current levels, and the labels
argument requires a vector listing the new names you want to assign, as follows:
Melanoma <- Melanoma %>%
mutate(sex_fct = factor(sex, levels = c(0, 1), labels = c("female", "male")))
glimpse(Melanoma)
## Rows: 205
## Columns: 8
## $ time <int> 10, 30, 35, 99, 185, 204, 210, 232, 232, 279, 295, 355, 386,…
## $ status <int> 3, 3, 2, 3, 1, 1, 1, 3, 1, 1, 1, 3, 1, 1, 1, 3, 1, 1, 1, 1, …
## $ sex <int> 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, …
## $ age <int> 76, 56, 41, 71, 52, 28, 77, 60, 49, 68, 53, 64, 68, 63, 14, …
## $ year <int> 1972, 1968, 1977, 1968, 1965, 1971, 1972, 1974, 1968, 1971, …
## $ thickness <dbl> 6.76, 0.65, 1.34, 2.90, 12.08, 4.84, 5.16, 3.22, 12.88, 7.41…
## $ ulcer <int> 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ sex_fct <fct> male, male, male, female, male, male, male, female, male, fe…
You should check to make sure the first few entries of sex_fct
agree with the numbers in the sex
variable according to the labels explained in the help file. (If not, it means that you put the levels
in one order and the labels
in a different order.)
11.4.3 Make tables or plots to explore the data visually.
We only have one categorical variable, so we only need a frequency table. Since we are concerned with proportions, we’ll also look at a relative frequency table which the tabyl
command provides for free.
tabyl(Melanoma, sex_fct) %>%
adorn_totals()
11.5 The logic of inference and randomization
This is a good place to pause and remember why statistical inference is important. There are certainly more females than males in this data set. So why don’t we just show the table above, declare females are more likely to have malignant melanoma, and then go home?
Think back to coin flips. Even though there was a 50% chance of seeing heads, did that mean that exactly half of our flips came up heads? No. We have to acknowledge sampling variability: even if the truth were 50%, when w sample, we could accidentally get more or less than 50%, just by pure chance alone. Perhaps these 205 patients just happen to have more females than average.
The key, then, is to figure out if 61.5% is significantly larger than 50%, or if a number like 61.5% (or one even more extreme) could easily come about from random chance.
As we know from the last chapter, we can run a formal hypothesis test to find out. As we do so, make note of the things that are the same and the things that have changed from the last hypothesis tests you ran. For example, we are not comparing two groups anymore. We have one group of patients, and all we’re doing is measuring the percentage of this group that is female. It’s tempting to think that we’re comparing males and females, but that’s not the case. We are not using sex to divide our data into two groups for the purpose of exploring whether some other variable differs between men and women. We just have one sample. “Female” and “Male” are simply categories in a single categorical variable. Also, because we are only asking about one variable (sex_fct
), the mathematical form of the hypotheses will look a little different.
Because this is no longer a question about two variables being independent or associated, the “permuting” idea we’ve been using no longer makes sense. So what does make sense?
It helps to start by figuring out what our null hypothesis is. Remember, our question of interest is whether there is a sex bias in malignant melanoma. In other words, are there more or fewer females than males with malignant melanoma? As this is our research question, it will be the alternative hypothesis. So what is the null? What is the “default” situation in which nothing interesting is going on? Well, there would be no sex bias. In other words, there would be the same number of females and males with malignant melanoma. Or another way of saying that—with respect to the “success” condition of being female that we discussed earlier—is that females comprise 50% of all patients with malignant melanoma.
Okay, given our philosophy about the null hypothesis, let’s take the skeptical position and assume that, indeed, 50% of all malignant melanoma patients in our population are female. Then let’s take a sample of 205 patients. We can’t get exactly 50% females from a sample of 205 (that would be 102.5 females!), so what numbers can we get?
Randomization will tell us. What kind of randomization? As we come across each patient in our sample, there is a 50% chance of them being female. So instead of sampling real patients, what if we just flipped a coin? A simulated coin flip will come up heads just as often as our patients will be female under the assumption of the null.
This brings us full circle, back to the first randomization idea we explored. We can simulate coin flips, graph our results, and calculate a P-value. More specifically, we’ll flip a coin 205 times to represent sampling 205 patients. Then we’ll repeat this procedure a bunch of times and establish a range of plausible percentages that can come about by chance from this procedure. Instead of doing coin flips with the rflip
command as we did then, however, we’ll use our new favorite friend, the infer
package.
Let’s dive back into the remaining steps of the formal hypothesis test.
11.6 Hypotheses
11.6.1 Identify the sample (or samples) and a reasonable population (or populations) of interest.
The sample consists of 205 patients from Denmark with malignant melanoma. Our population is presumably all patients with malignant melanoma, although in checking conditions below, we’ll take care to discuss whether patients in Denmark are representative of patients elsewhere.
11.7 Model
11.7.1 Identify the sampling distribution model.
We will randomize to simulate the sampling distribution.
11.7.2 Check the relevant conditions to ensure that model assumptions are met.
- Random
- As mentioned above, these 205 patients are not a random sample of all people with malignant melanoma. We don’t even have any evidence that they are a random sample of melanoma patients in Denmark. Without such evidence, we have to hope that these 205 patients are representative of all patients who have malignant melanoma. Unless there’s something special about Danes in terms of their genetics or diet or something like that, one could imagine that their physiology makes them just as susceptible to melanoma as anyone else. More specifically, though, our question is about females and males getting malignant melanoma. Perhaps there are more female sunbathers in Denmark than in other countries. That might make Danes unrepresentative in terms of the gender balance among melanoma patients. We should be cautious in interpreting any conclusion we might reach in light of these doubts.
- 10%
- Whether in Denmark or not, given that melanoma is a fairly common form of cancer, I assume 205 is less than 10% of all patients with malignant melanoma.
11.8 Mechanics
11.8.1 Compute the test statistic.
female_prop <- Melanoma %>%
observe(response = sex_fct, success = "female",
stat = "prop")
female_prop
Note: Pay close attention to the difference in the observe
command above. Unlike in the last chapter, we don’t have any tildes. That’s because there are not two variables involved. There is only one variable, which observe
needs to see as the “response” variable. (Don’t forget to use the factor version sex_fct
and not sex
!) We still have to specify a “success” condition. Since the hypotheses are about measuring females, we have to tell observe
to calculate the proportion of females. Finally, the stat
is no longer “diff in props” There are not two proportions with which to find a difference. There is just one proportion, hence, “prop”.
11.8.2 Report the test statistic in context (when possible).
The observed percentage of females with melanoma in our sample is 61.4634146%.
Note: As explained in the last chapter, we have to use pull
to pull out the number from the female_prop
tibble.
11.8.3 Plot the null distribution.
Since this is the first step for which we need the simulated values, it will be convenient to run the simulation here. We’ll need to set the seed as well.
set.seed(42)
melanoma_test <- Melanoma %>%
specify(response = sex_fct, success = "female") %>%
hypothesize(null = "point", p = 0.5) %>%
generate(reps = 1000, type = "draw") %>%
calculate(stat = "prop")
melanoma_test
This list of proportions is the sampling distribution. It represents possible sample proportions of females with melanoma under the assumption that the null is true. In other words, even if the “true” proportion of female melanoma patients were 0.5, these are all values that can result from random samples.
In the hypothesize
command, we use “point” to tell infer
that we want the null to be centered at the point 0.5. In the generate
command, we need to specify the type
as “draw” instead of “permute”. We are not shuffling any values here; we are “drawing” values from a probability distribution like coin flips. Everything else in the command is pretty self-explanatory.
The value of our test statistic, female_prop
, is 0.6146341. It appears in the right tail:
melanoma_test %>%
visualize() +
shade_p_value(obs_stat = female_prop, direction = "two-sided")
Although the line only appears on the right, keep in mind that we are conducting a two-sided test, so we are interested in values more extreme than the red line on the right, but also more extreme than a similarly placed line on the left.
Exercise 3
The red line sits at about 0.615. If you were to draw a red line on the above histogram that represented a value equally distant from 0.5, but on the left instead of the right, where would that line be? Do a little arithmetic to figure it out and show your work.
Please write up your answer here.
11.8.4 Calculate the P-value.
melanoma_test %>%
get_p_value(obs_stat = female_prop, direction = "two-sided")
## Warning: Please be cautious in reporting a p-value of 0. This result is an
## approximation based on the number of `reps` chosen in the `generate()` step.
## See `?get_p_value()` for more information.
The P-value appears to be zero. Indeed, among the 1000 simulated values, we saw none that exceeded 0.615 and none that were less than 0.385. However, a true P-value can never be zero. If you did millions or billions of simulations (please don’t try!), surely there would be one or two with even more extreme values. In cases when the P-value is really, really tiny, it is traditional to report \(P < 0.001\). It is incorrect to say \(P = 0\).
11.8.5 Interpret the P-value as a probability given the null.
\(P < 0.001\). If there were no sex bias in malignant melanoma patients, there would be less than a 0.1% chance of seeing a percentage of females at least as extreme as the one we saw in our data.
Note: Don’t forget to interpret the P-value in a contextually meaningful way. The P-value is the probability under the assumption of the null hypothesis of seeing data at least as extreme as the data we saw. In this context, that means that if we assume 50% of patients are female, it would be extremely rare to see more than 61.5% or less than 38.5% females in a sample of size 205.
11.9 Conclusion
11.9.2 State (but do not overstate) a contextually meaningful conclusion.
There is sufficient evidence that there is a sex bias in patients who suffer from malignant melanoma.
11.9.3 Express reservations or uncertainty about the generalizability of the conclusion.
We have no idea how these patients were sampled. Are these all the patients in Denmark with malignant melanoma over a certain period of time? Were they part of a convenience sample? As a result of our uncertainly about the sampling process, we can’t be sure if the results generalize to a larger population, either in Denmark or especially outside of Denmark.
11.9.4 Identify the possibility of either a Type I or Type II error and state what making such an error means in the context of the hypotheses.
As we rejected the null, we run the risk of making a Type I error. If we have made such an error, that would mean that patients with malignant melanoma are equally likely to be male or female, but that we got a sample with an unusual number of female patients.
11.10 Your turn
Determine if the percentage of patients in Denmark with malignant melanoma who also have an ulcerated tumor (measured with the ulcer
variable) is significantly different from 50%.
As before, you have the outline of the rubric for inference below. Some of the steps will be the same or similar to steps in the example above. It is perfectly okay to copy and paste R code, making the necessary changes. It is not okay to copy and paste text. You need to put everything into your own words.
The template below is exactly the same as in the appendix (Rubric for inference) up to the part about confidence intervals which we haven’t learned yet.
Exploratory data analysis
Use data documentation (help files, code books, Google, etc.) to determine as much as possible about the data provenance and structure.
# Add code here to understand the data.
Hypotheses
Identify the sample (or samples) and a reasonable population (or populations) of interest.
Please write up your answer here.
Mechanics
Plot the null distribution.
set.seed(42)
# Add code here to simulate the null distribution.
# Run 1000 simulations like in the earlier example.
# Add code here to plot the null distribution.
11.11 Conclusion
Now you have seen two fully-worked examples of hypothesis tests using randomization, and you have created two more examples on your own. Hopefully, the logic of inference and the process of running a formal hypothesis test are starting to make sense.
Keep in mind that the outline of steps will not change. However, the way each step is carried out will vary from problem to problem. Not only does the context change (one example involved sex discrimination, the other melanoma patients), but the statistics you compute also change (one example compared proportions from two samples and the other only had one proportion from a single sample). Pay close attention to the research question and the data that will be used to answer that question. That will be the only information you have to help you know which hypothesis test applies.
11.11.1 Preparing and submitting your assignment
- From the “Run” menu, select “Restart R and Run All Chunks”.
- Deal with any code errors that crop up. Repeat steps 1–-2 until there are no more code errors.
- Spell check your document by clicking the icon with “ABC” and a check mark.
- Hit the “Preview” button one last time to generate the final draft of the
.nb.html
file. - Proofread the HTML file carefully. If there are errors, go back and fix them, then repeat steps 1–5 again.
If you have completed this chapter as part of a statistics course, follow the directions you receive from your professor to submit your assignment.