11  Hypothesis testing with randomization, Part 2

Functions introduced in this chapter

factor

11.1 Introduction

Now that we have learned about hypothesis testing, we’ll explore a different example. Although the rubric for performing the hypothesis test will not change, the individual steps will be implemented in a different way due to the research question we’re asking and the type of data used to answer it.

11.1.1 Install new packages

Type the following at the Console:

install.packages("MASS")

11.1.2 Download the Quarto file

Look at either the top (Posit Cloud) or the upper right corner of the RStudio screen to make sure you are in your intro_stats project.

Then click on the following link to download this chapter as a Quarto file (.qmd).

https://vectorposse.github.io/intro_stats/chapter_downloads/11-hypothesis_testing_with_randomization_2.qmd

Once the file is downloaded, move it to your project folder in RStudio and open it there.

11.1.3 Restart R and run all chunks

In RStudio, select “Restart R and Run All Chunks” from the “Run” menu.

11.2 Load packages

In additional to tidyverse and janitor, we load the MASS package to access the melanoma data on patients in Denmark with malignant melanoma, and the infer package for inference tools.

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor)

Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test
library(MASS)

Attaching package: 'MASS'

The following object is masked from 'package:dplyr':

    select
library(infer)

11.3 Our research question

We know that certain types of cancer are more common among females or males. Is there a sex bias among patients with malignant melanoma?

Let’s jump into the “Exploratory data analysis” part of the rubric first.

11.4 Exploratory data analysis

11.4.1 Use data documentation (help files, code books, Google, etc.) to determine as much as possible about the data provenance and structure.

You can look at the help file by typing ?Melanoma at the Console. However, do not put that command here in a code chunk. The Quarto document has no way of displaying a help file when it’s processed. Be careful: there’s another data set called melanoma with a lower-case “m”. Make sure you are using an uppercase “M”.

There is a reference at the bottom of the help file.

Exercise 1

At the bottom of the help file, there is a “Source” listed (by authors Andersen, Borgan, Gill, and Keiding). Look up this reference on the internet. How can you tell that this reference is not, in fact, a direct reference to a study of cancer patients in Denmark?

Please write up your answer here.


From the exercise above, we can see that it will be very difficult, if not impossible, to discover anything useful about the true provenance of the data (unless you happen to have a copy of that textbook, which in theory provided another more primary source). We will not know, for example, how the data was collected and if the patients consented to having their data shared publicly. The data is suitably anonymized, though, so we don’t have any serious concerns about the privacy of the data. Having said that, if a condition is rare enough, a dedicated research can often “de-anonymize” data by cross-referencing information in the data to other kinds of public records. But melanoma is not particularly rare. At any rate, all we can do is assume that the textbook authors obtained the data from a source that used proper procedures for collecting and handling the data.

We print the data frame:

Melanoma

Use glimpse to examine the structure of the data:

glimpse(Melanoma)
Rows: 205
Columns: 7
$ time      <int> 10, 30, 35, 99, 185, 204, 210, 232, 232, 279, 295, 355, 386,…
$ status    <int> 3, 3, 2, 3, 1, 1, 1, 3, 1, 1, 1, 3, 1, 1, 1, 3, 1, 1, 1, 1, …
$ sex       <int> 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, …
$ age       <int> 76, 56, 41, 71, 52, 28, 77, 60, 49, 68, 53, 64, 68, 63, 14, …
$ year      <int> 1972, 1968, 1977, 1968, 1965, 1971, 1972, 1974, 1968, 1971, …
$ thickness <dbl> 6.76, 0.65, 1.34, 2.90, 12.08, 4.84, 5.16, 3.22, 12.88, 7.41…
$ ulcer     <int> 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …

11.4.2 Prepare the data for analysis.

It appears that sex is coded as an integer. You will recall that we need to convert it to a factor variable since it is categorical, not numerical.

Exercise 2

According to the help file, which number corresponds to which sex?

Please write up your answer here.


We can convert a numerical variable a couple of different ways. In Chapter 3, we used the as_factor command. That command works fine, but it doesn’t give you a way to change the levels of the variable. In other words, if we used as_factor here, we would get a factor variable that still contained zeroes and ones.

Instead, we will use the factor command. It allows us to manually relabel the levels. The levels argument requires a vector (with c) of the current levels, and the labels argument requires a vector listing the new names you want to assign, as follows:

Melanoma <- Melanoma |>
  mutate(sex_fct = factor(sex,
                          levels = c(0, 1),
                          labels = c("female", "male")))
glimpse(Melanoma)
Rows: 205
Columns: 8
$ time      <int> 10, 30, 35, 99, 185, 204, 210, 232, 232, 279, 295, 355, 386,…
$ status    <int> 3, 3, 2, 3, 1, 1, 1, 3, 1, 1, 1, 3, 1, 1, 1, 3, 1, 1, 1, 1, …
$ sex       <int> 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, …
$ age       <int> 76, 56, 41, 71, 52, 28, 77, 60, 49, 68, 53, 64, 68, 63, 14, …
$ year      <int> 1972, 1968, 1977, 1968, 1965, 1971, 1972, 1974, 1968, 1971, …
$ thickness <dbl> 6.76, 0.65, 1.34, 2.90, 12.08, 4.84, 5.16, 3.22, 12.88, 7.41…
$ ulcer     <int> 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
$ sex_fct   <fct> male, male, male, female, male, male, male, female, male, fe…

You should check to make sure the first few entries of sex_fct agree with the numbers in the sex variable according to the labels explained in the help file. (If not, it means that you put the levels in one order and the labels in a different order.)

We should also check if there are any rows with missing data:

Melanoma |>
  drop_na(sex_fct)

There are still 205 rows, so that means that no data is missing from the sex_fct variable.

11.4.3 Make tables or plots to explore the data visually.

We only have one categorical variable, so we only need a frequency table. Since we are concerned with proportions, we’ll also look at a relative frequency table which the tabyl command provides for free.

tabyl(Melanoma, sex_fct) |>
  adorn_totals()

11.5 The logic of inference and randomization

This is a good place to pause and remember why statistical inference is important. There are certainly more females than males in this data set. So why don’t we just show the table above, declare females are more likely to have malignant melanoma, and then go home?

Think back to coin flips. Even though there was a 50% chance of seeing heads, did that mean that exactly half of our flips came up heads? No. We have to acknowledge sampling variability: even if the truth were 50%, when we sample, we could accidentally get more or less than 50%, just by pure chance alone. Perhaps these 205 patients just happen to have more females than average.

The key, then, is to figure out if 61.5% is a “statistically discernible” difference from 50%. In other words, is it possible that a number like 61.5% (or one even more extreme) could easily come about from random chance? Or is 61.5% far enough away from 50% that it is not likely to be a random fluke?1

As we know from the last chapter, we can run a formal hypothesis test to find out. As we do so, make note of the things that are the same and the things that have changed from the last hypothesis tests you ran. For example, we are not comparing two groups anymore. We have one group of patients, and all we’re doing is measuring the percentage of this group that is female. It’s tempting to think that we’re comparing males and females, but that’s not the case. We are not using sex to divide our data into two groups for the purpose of exploring whether some other variable differs between men and women. We just have one sample. “Female” and “Male” are simply categories in a single categorical variable. Also, because we are only asking about one variable (sex_fct), the mathematical form of the hypotheses will look a little different.

Because this is no longer a question about two variables being independent or associated, the “permuting” idea we’ve been using no longer makes sense. So what does make sense?

It helps to start by figuring out what our null hypothesis is. Remember, our question of interest is whether there is a sex bias in malignant melanoma. In other words, are there more or fewer females than males with malignant melanoma? As this is our research question, it will be the alternative hypothesis. So what is the null? What is the “default” situation in which nothing interesting is going on? Well, there would be no sex bias. In other words, there would be the same number of females and males with malignant melanoma. Or another way of saying that—with respect to the “success” condition of being female—is that females comprise 50% of all patients with malignant melanoma.

Okay, given our philosophy about the null hypothesis, let’s take the skeptical position and assume that, indeed, 50% of all malignant melanoma patients in our population are female. Then let’s take a sample of 205 patients. We can’t get exactly 50% females from a sample of 205 (that would be 102.5 females!), so what numbers can we get?

Randomization will tell us. What kind of randomization? As we come across each patient in our sample, there is a 50% chance of them being female. So instead of sampling real patients, what if we just flipped a coin? A simulated coin flip will come up heads just as often as our patients will be female under the assumption of the null.

This brings us full circle, back to the first randomization idea we explored. We can simulate coin flips, graph our results, and calculate a P-value. More specifically, we’ll flip a coin 205 times to represent sampling 205 patients. Then we’ll repeat this procedure a bunch of times and establish a range of plausible percentages that can come about by chance from this procedure. Instead of doing coin flips with the rflip command as we did then, however, we’ll use our new favorite friend, the infer package.

Let’s dive back into the remaining steps of the formal hypothesis test.

11.6 Hypotheses

11.6.1 Identify the sample (or samples) and a reasonable population (or populations) of interest.

The sample consists of 205 patients from Denmark with malignant melanoma. Our population is presumably all patients with malignant melanoma, although in checking conditions below, we’ll take care to discuss whether patients in Denmark are representative of patients elsewhere.

11.6.2 Express the null and alternative hypotheses as contextually meaningful full sentences.

\(H_{0}:\) Half of malignant melanoma patients are female.

\(H_{A}:\) There is a sex bias among patients with malignant melanoma (meaning that females are either over-represented or under-represented).

11.6.3 Express the null and alternative hypotheses in symbols (when possible).

\(H_{0}: p_{female} = 0.5\)

\(H_{A}: p_{female} \neq 0.5\)

11.7 Model

11.7.1 Identify the sampling distribution model.

We will randomize to simulate the sampling distribution.

11.7.2 Check the relevant conditions to ensure that model assumptions are met.

  • Random
    • As mentioned above, these 205 patients are not a random sample of all people with malignant melanoma. We don’t even have any evidence that they are a random sample of melanoma patients in Denmark. Without such evidence, we have to hope that these 205 patients are representative of all patients who have malignant melanoma. Unless there’s something special about Danes in terms of their genetics or diet or something like that, one could imagine that their physiology makes them just as susceptible to melanoma as anyone else. More specifically, though, our question is about females and males getting malignant melanoma. Perhaps there are more female sunbathers in Denmark than in other countries. That might make Danes unrepresentative in terms of the gender balance among melanoma patients. We should be cautious in interpreting any conclusion we might reach in light of these doubts.
  • 10%
    • Whether in Denmark or not, given that melanoma is a fairly common form of cancer, we assume 205 is less than 10% of all patients with malignant melanoma.

11.8 Mechanics

11.8.1 Compute the test statistic.

female_prop <- Melanoma |>
  observe(response = sex_fct, success = "female",
          stat = "prop")
female_prop

Note: Pay close attention to the difference in the observe command above. Unlike in the last chapter, we don’t have any tildes. That’s because there are not two variables involved. There is only one variable, which observe needs to see as the “response” variable. (Don’t forget to use the factor version sex_fct and not sex!) We still have to specify a “success” condition. Since the hypotheses are about measuring females, we have to tell observe to calculate the proportion of females. Finally, the stat is no longer “diff in props” There are not two proportions with which to find a difference. There is just one proportion, hence, prop.

11.8.2 Report the test statistic in context (when possible).

The observed percentage of females with melanoma in our sample is 61.4634146%.

Note: As explained in the last chapter, we have to use pull to pull out the number from the female_prop tibble.

11.8.3 Plot the null distribution.

set.seed(42)
melanoma_test <- Melanoma |>
  specify(response = sex_fct, success = "female") |>
  hypothesize(null = "point", p = 0.5) |>
  generate(reps = 1000, type = "draw") |>
  calculate(stat = "prop")
melanoma_test
melanoma_test |>
  visualize() +
  shade_p_value(obs_stat = female_prop, direction = "two-sided")

Note: There will always be two code chunks in this step. The first performs the simulation (1000 “coin flips”). The output is a list of 1000 simulated proportions. This is the sampling distribution. It represents possible sample proportions of females with melanoma under the assumption that the null is true. In other words, even if the “true” proportion of female melanoma patients were 0.5, these are all values that can result from random samples.

In the hypothesize command, we use “point” to tell infer that we want the null to be centered at the point 0.5. In the generate command, we need to specify the type as “draw” instead of “permute”. We are not shuffling any values here; we are “drawing” values from a probability distribution like coin flips. Everything else in the command is pretty self-explanatory.

The second code chunk creates a histogram of the simulated values under the assumption of the null. The value of our test statistic, female_prop (0.6146341) appears in the right tail as a red line. Although the line only appears on the right, keep in mind that we are conducting a two-sided test, so we are interested in values more extreme than the red line on the right, but also more extreme than a similarly placed (invisible) line on the left.

Exercise 3

The red line sits at about 0.615. If you were to draw a red line on the above histogram that represented a value equally distant from 0.5, but on the left instead of the right, where would that line be? Do a little arithmetic to figure it out and show your work.

Please write up your answer here.

11.8.4 Calculate the P-value.

P1 <- melanoma_test |>
  get_p_value(obs_stat = female_prop, direction = "two-sided")
Warning: Please be cautious in reporting a p-value of 0. This result is an approximation
based on the number of `reps` chosen in the `generate()` step.
ℹ See `get_p_value()` (`?infer::get_p_value()`) for more information.
P1

The P-value appears to be zero. If we were to report this using inline R code, we would say that the P-value was 0.

Indeed, among the 1000 simulated values, we saw none that exceeded 0.615 and none that were less than 0.385. However, a true P-value can never be zero. If you did millions or billions of simulations (please don’t try!), surely there would be one or two with even more extreme values. In cases when the P-value is really, really tiny, it is traditional to report \(P < 0.001\). It is incorrect to say \(P = 0\).

11.8.5 Interpret the P-value as a probability given the null.

\(P < 0.001\). If there were no sex bias in malignant melanoma patients, there would be less than a 0.1% chance of seeing a percentage of females at least as extreme as the one we saw in our data.

Note: Don’t forget to interpret the P-value in a contextually meaningful way. The P-value is the probability under the assumption of the null hypothesis of seeing data at least as extreme as the data we saw. In this context, that means that if we assume 50% of patients are female, it would be extremely rare to see more than 61.5% or less than 38.5% females in a sample of size 205.

Also note that if the P-value were not listed as 0, we would want to use inline code to report our sentence: “If there were no sex bias in malignant melanoma patients, there would be a 0% chance of seeing a percentage of females at least as extreme as the one we saw in our data.” But this sentence is incorrect here because it mistakenly calculates the P-value as if it were exactly 0 (which it never can be).

11.9 Conclusion

11.9.1 State the statistical conclusion.

We reject the null hypothesis.

11.9.2 State (but do not overstate) a contextually meaningful conclusion.

There is sufficient evidence that there is a sex bias in patients who suffer from malignant melanoma.

11.9.3 Express reservations or uncertainty about the generalizability of the conclusion.

We have no idea how these patients were sampled. Are these all the patients in Denmark with malignant melanoma over a certain period of time? Were they part of a convenience sample? As a result of our uncertainly about the sampling process, we can’t be sure if the results generalize to a larger population, either in Denmark or especially outside of Denmark.

Exercise 4

Can you find on the internet any evidence that females do indeed suffer from malignant melanoma more often than males (not just in Denmark, but anywhere)?

Please write up your answer here.

11.9.4 Identify the possibility of either a Type I or Type II error and state what making such an error means in the context of the hypotheses.

As we rejected the null, we run the risk of making a Type I error. If we have made such an error, that would mean that patients with malignant melanoma are equally likely to be male or female, but that we got a sample with an unusual number of female patients.

11.10 Your turn

Past studies have shown that the presence of ulcerated tumors (tumors that break through the skin on top of the melanoma) are more dangerous, significantly increasing the risk of death from melanoma. There is some disagreement among studies, however, about the rate at which melanoma ulcerations occur. A review article from 2022 cited another study that estimated the ulceration rate to be 18%.2 But that number was based on a sample of patients suffering from melanoma of varying degrees of severity, not just patients diagnosed with malignant melanoma.

Determine if the percentage of patients in Denmark with malignant melanoma who also have an ulcerated tumor (measured with the ulcer variable) is significantly different from 18%. (Keep in mind that when you refer back to the example above, you will need to modify the null hypothesis value to reflect that it is now 18% and not 50%!)

As before, you have the outline of the rubric for inference below. Some of the steps will be the same or similar to steps in the example above. It is perfectly okay to copy and paste R code, making the necessary changes. It is not okay to copy and paste text. You need to put everything into your own words.

The template below is exactly the same as in the appendix (Rubric for inference) up to the part about confidence intervals which we haven’t learned yet.

Exploratory data analysis
Use data documentation (help files, code books, Google, etc.) to determine as much as possible about the data provenance and structure.

Please write up your answer here.

# Add code here to print the data
# Add code here to glimpse the variables
Prepare the data for analysis.
# Add code here to prepare the data for analysis.

Please write up your answer here.

Make tables or plots to explore the data visually.
# Add code here to make tables or plots.
Hypotheses
Identify the sample (or samples) and a reasonable population (or populations) of interest.

Please write up your answer here.

Express the null and alternative hypotheses as contextually meaningful full sentences.

\(H_{0}:\) Null hypothesis goes here.

\(H_{A}:\) Alternative hypothesis goes here.

Express the null and alternative hypotheses in symbols (when possible).

\(H_{0}: math\)

\(H_{A}: math\)

Model
Identify the sampling distribution model.

Please write up your answer here.

Check the relevant conditions to ensure that model assumptions are met.

Please write up your answer here. (Some conditions may require R code as well.)

Mechanics
Compute the test statistic.
# Add code here to compute the test statistic.
Report the test statistic in context (when possible).

Please write up your answer here.

Plot the null distribution.
set.seed(42)
# Add code here to simulate the null distribution.
# Run 1000 simulations like in the earlier example.
# Add code here to plot the null distribution.
Calculate the P-value.
# Add code here to calculate the P-value.
# Don't forget to use P_2 instead.
Interpret the P-value as a probability given the null.

Please write up your answer here.

Conclusion
State the statistical conclusion.

Please write up your answer here.

State (but do not overstate) a contextually meaningful conclusion.

Please write up your answer here.

Express reservations or uncertainty about the generalizability of the conclusion.

Please write up your answer here.

Identify the possibility of either a Type I or Type II error and state what making such an error means in the context of the hypotheses.

Please write up your answer here.

11.11 Conclusion

Now you have seen two fully-worked examples of hypothesis tests using randomization, and you have created two more examples on your own. Hopefully, the logic of inference and the process of running a formal hypothesis test are starting to make sense.

Keep in mind that the outline of steps will not change. However, the way each step is carried out will vary from problem to problem. Not only does the context change (one example involved sex discrimination, the other melanoma patients), but the statistics you compute also change (one example compared proportions from two samples and the other only had one proportion from a single sample). Pay close attention to the research question and the data that will be used to answer that question. That will be the only information you have to help you know which hypothesis test applies.

11.11.1 Preparing and submitting your assignment

  1. From the “Run” menu, select “Restart R and Run All Chunks”.
  2. Deal with any code errors that crop up. Repeat steps 1–2 until there are no more code errors.
  3. Spell check your document by clicking the icon with “ABC” and a check mark.
  4. Hit the “Render” button one last time to generate the final draft of the HTML file. (If there are errors here, you may need to go back and fix broken inline code or other markdown issues.)
  5. Proofread the HTML file carefully. If there are errors, go back and fix them, then repeat steps 1–5 again.

If you have completed this chapter as part of a statistics course, follow the directions you receive from your professor to submit your assignment.


  1. Most statisticians will use the term “statistically significant” instead of “statistically discernible.” The problem with “statistically significant” is that the word “significant” is often synonymous with “important.” An 11.5 % increase over 50% might be clinically important, or it might not be, depending on the context of the problem. All we can say from a hypothesis test is whether it is mathematically probable to see a number like 61.5% (or more extreme) if the null value of 50% were true. That mathematical fact does not say anything about whether the measurement is a clinically important one. With large enough sample sizes, even minor deviations from the null value will be statistically discernible.↩︎

  2. https://pmc.ncbi.nlm.nih.gov/articles/PMC9440663/↩︎