18 Chi-square test for independence
2.0
18.1 Introduction
In this chapter we will learn how to run the chi-square test for independence.
A chi-square test for independence tests the relationship between two categorical variables. This is an extension of the test for two proportions, except now applied in situations where either the predictor or response variables (or both) have three or more categories.
18.1.2 Download the R notebook file
Check the upper-right corner in RStudio to make sure you’re in your intro_stats
project. Then click on the following link to download this chapter as an R notebook file (.Rmd
).
https://vectorposse.github.io/intro_stats/chapter_downloads/18-chi_square_test_for_independence.Rmd
Once the file is downloaded, move it to your project folder in RStudio and open it there.
18.2 Load packages
We load the standard tideverse
, janitor
, and infer
packages. We also use the MASS
package for the birthwt
data, and the openintro
package for the smoking
data.
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
##
## Attaching package: 'openintro'
##
## The following objects are masked from 'package:MASS':
##
## housing, mammals
18.3 Research question
Are mothers from certain racial groups more or less likely to have low birth weight babies? In other words, are low birth weight and race associated?
Let’s look at the data. The birthwt
data was collected at Baystate Medical Center, Springfield, Mass during 1986. In terms of addressing the research question, we are, of course, limited to conclusions about women in that area of the country in the mid-1980s.
birthwt
glimpse(birthwt)
## Rows: 189
## Columns: 10
## $ low <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ age <int> 19, 33, 20, 21, 18, 21, 22, 17, 29, 26, 19, 19, 22, 30, 18, 18, …
## $ lwt <int> 182, 155, 105, 108, 107, 124, 118, 103, 123, 113, 95, 150, 95, 1…
## $ race <int> 2, 3, 1, 1, 1, 3, 1, 3, 1, 1, 3, 3, 3, 3, 1, 1, 2, 1, 3, 1, 3, 1…
## $ smoke <int> 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0…
## $ ptl <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ht <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ui <int> 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1…
## $ ftv <int> 0, 3, 1, 2, 0, 0, 1, 1, 1, 0, 0, 1, 0, 2, 0, 0, 0, 3, 0, 1, 2, 3…
## $ bwt <int> 2523, 2551, 2557, 2594, 2600, 2622, 2637, 2637, 2663, 2665, 2722…
The low
variable is an indicator of birth weight less than 2.5 kg. So even though birth weight is numerical, we have a convenient categorical variable that serves as a marker of low birth weight, gathering all low birth weight babies into a single group. The race
variable is categorical, coded as 1 = white, 2 = black, 3 = other.
Neither variable appears in the data frame as a factor variable, so we will need to change that. The new tibble will be called birthwt2
.
birthwt2 <- birthwt %>%
mutate(low_fct = factor(low, levels = c(0, 1),
labels = c("no", "yes")),
race_fct = factor(race, levels = c(1, 2, 3),
labels = c("white", "black", "other")))
birthwt2
glimpse(birthwt2)
## Rows: 189
## Columns: 12
## $ low <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ age <int> 19, 33, 20, 21, 18, 21, 22, 17, 29, 26, 19, 19, 22, 30, 18, 1…
## $ lwt <int> 182, 155, 105, 108, 107, 124, 118, 103, 123, 113, 95, 150, 95…
## $ race <int> 2, 3, 1, 1, 1, 3, 1, 3, 1, 1, 3, 3, 3, 3, 1, 1, 2, 1, 3, 1, 3…
## $ smoke <int> 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0…
## $ ptl <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0…
## $ ht <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ui <int> 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0…
## $ ftv <int> 0, 3, 1, 2, 0, 0, 1, 1, 1, 0, 0, 1, 0, 2, 0, 0, 0, 3, 0, 1, 2…
## $ bwt <int> 2523, 2551, 2557, 2594, 2600, 2622, 2637, 2637, 2663, 2665, 2…
## $ low_fct <fct> no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, n…
## $ race_fct <fct> black, other, white, white, white, other, white, other, white…
18.4 Chi-square test for independence
In a previous chapter, we learned about the chi-square goodness-of-fit test. With a single categorical variable, we summarized data in a frequency table. Each cell of the table had an observed count from the data that we compared to an expected count from the assumption of a null hypothesis. The chi-square statistic measured the discrepancy between observed and expected.
With two categorical variables, we use a contingency table instead of a frequency table. But the principle of the chi-square statistic is the same: each cell in the contingency table has an observed count and an expected count. This forms the basis of a chi-square test for independence.
Below is the contingency table for these two variables. Normally, we only care about column totals because we care how the response variable (here, low_fct
) is distributed in each group of the predictor variable (i.e., each racial group). But for the calculation of chi-squared, we will need both row and column totals.
tabyl(birthwt2, low_fct, race_fct) %>%
adorn_totals(where = c("row", "col"))
A test for independence has a simple null hypothesis: the two variables are independent. This gives us a way to compute expected counts. To see how, look at the sum of all the normal weight babies (\(73 + 15 + 42 = 130\)) and all the low birth weight babies (\(23 + 11 + 25 = 59\)). In other words, if race is ignored, there were 130 normal weight babies and 59 low birth weight babies out of 189 total babies. 59 of 189 is 0.31217 or 31.217%, and 130 of 189 is 0.68783 or 68.783%.
Now, if low birth weight and race are truly independent, it shouldn’t matter if the mothers were white, black, or some other race. In other words, of 96 white mothers, we should still expect 68.783% of them to have normal weight babies and 31.217% of them to have low birth weight babies. 68.783% of 96 is 66.032. This is the expected cell count for normal birth weight babies of white women. 31.217% of 96 is 29.968. This is the expected cell count for low birth weight babies of white women. The same analysis can be done for the next two columns as well.
Exercise 1
Complete the list of expected cell counts in the table above. In other words, apply the percentages 68.783% and 31.217% to the totals of the “black” and “other” columns. Put them in the table below:
white | black | other | |
---|---|---|---|
no | 66.032 | ? | ? |
yes | 29.968 | ? | ? |
Unlike the goodness-of-fit test that requires one to specify expected counts for each cell, the test for independence uses only the data to determine the expected counts. For any given cell, if \(R\) is the row total, \(C\) is the column total, and \(n\) is the grand total (the sample size), the expected count in any cell is simply
\[ E = \frac{R C}{n}. \]
This is equivalent to the explanation in the previous paragraph. Using low birth weight babies among white mothers as an example, \(R/n\) is \(59/189\) which is 0.31217. Then we multiply this by the column total \(C = 96\) to get
\[ \left(\frac{R}{n}\right) C = \frac{R C}{n} = \frac{59 \times 96}{189} = 29.96825. \]
Everything else works almost the same as it did for a chi-square goodness-of-fit test. We still compute \(\chi^{2}\) by adding up deviations across all cells:
\[ \chi^{2} = \sum \frac{(O - E)^{2}}{E}. \]
Even under the assumption of the null, there will still be some sampling variability. Like any hypothesis test, our job is to determine whether the deviations we see are possible due to pure chance alone. The random values of \(\chi^{2}\) that result from sampling variability will follow a chi-square model. But how many degrees of freedom are there? This is a little different from the goodness-of-fit test. Instead of the number of cells minus one, we use the following formula:
\[ df = (\#rows - 1)(\#columns - 1). \]
In our example we have 2 rows (“yes”, “no”) and 3 columns (“white”, “black”, “other”); therefore,
\[ df = (2 - 1)(3 - 1) = 1 \times 2 = 2 \]
and we have 2 degrees of freedom (even though there are 6 cells).
Let’s run through the rubric in its entirety.
18.5 Exploratory data analysis
18.5.1 Use data documentation (help files, code books, Google, etc.) to determine as much as possible about the data provenance and structure.
You should type ?birthwt
at the Console to read the help file. We don’t have any information about how these mothers were selected. The “Source” at the end of the help file is a statistics textbook, so we’d have to track down that book to see where they got the data and if traced back to a primary source.
birthwt
glimpse(birthwt)
## Rows: 189
## Columns: 10
## $ low <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ age <int> 19, 33, 20, 21, 18, 21, 22, 17, 29, 26, 19, 19, 22, 30, 18, 18, …
## $ lwt <int> 182, 155, 105, 108, 107, 124, 118, 103, 123, 113, 95, 150, 95, 1…
## $ race <int> 2, 3, 1, 1, 1, 3, 1, 3, 1, 1, 3, 3, 3, 3, 1, 1, 2, 1, 3, 1, 3, 1…
## $ smoke <int> 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0…
## $ ptl <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ht <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ ui <int> 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1…
## $ ftv <int> 0, 3, 1, 2, 0, 0, 1, 1, 1, 0, 0, 1, 0, 2, 0, 0, 0, 3, 0, 1, 2, 3…
## $ bwt <int> 2523, 2551, 2557, 2594, 2600, 2622, 2637, 2637, 2663, 2665, 2722…
18.5.2 Prepare the data for analysis.
# Although we've already done this above,
# we include it here again for completeness.
birthwt2 <- birthwt %>%
mutate(low_fct = factor(low, levels = c(0, 1),
labels = c("no", "yes")),
race_fct = factor(race, levels = c(1, 2, 3),
labels = c("white", "black", "other")))
birthwt2
18.5.3 Make tables or plots to explore the data visually.
tabyl(birthwt2, low_fct, race_fct) %>%
adorn_totals()
tabyl(birthwt2, low_fct, race_fct) %>%
adorn_totals() %>%
adorn_percentages("col") %>%
adorn_pct_formatting()
Commentary: Earlier we used row and column total to explain how expected cell counts arise. Here, however, we will revert back to our previous standard practice of generating one contingency table with counts and another with column percentages.
18.6 Hypotheses
18.6.1 Identify the sample (or samples) and a reasonable population (or populations) of interest.
The sample consists of 189 mothers who gave birth at the Baystate Medical Center in Springfield, Massachusetts in 1986. The population is presumably all mothers, although it’s safest to conclude only about mothers who gave birth at this hospital.
18.7 Model
18.7.1 Identify the sampling distribution model.
We will use a chi-square model with 2 degrees of freedom.
18.7.2 Check the relevant conditions to ensure that model assumptions are met.
- Random
- We hope that these 189 women are representative of all women who gave birth in this hospital (or, at best, in that region) around that time.
- 10%
- We don’t know how many women gave birth at this hospital, but perhaps over many years we might have more than 1890 women.
- Expected cell counts
- You checked the cell counts as a part of Exercise 1. Note that all expected cell counts are larger than 5, so the condition is met.
18.8 Mechanics
18.8.1 Compute the test statistic.
obs_chisq <- birthwt2 %>%
specify(response = low_fct, explanatory = race_fct) %>%
hypothesize(null = "independence") %>%
calculate(stat = "chisq")
obs_chisq
18.8.2 Report the test statistic in context (when possible).
The value of \(\chi^{2}\) is 5.004813.
Commentary: As in the last chapter, there’s not much context to report with a value of \(\chi^{2}\), so the most we can do here is just report it in a full sentence.
18.8.3 Plot the null distribution.
low_race_test <- birthwt2 %>%
specify(response = low_fct, explanatory = race_fct) %>%
assume(distribution = "chisq")
low_race_test
## A Chi-squared distribution with 2 degrees of freedom.
low_race_test %>%
visualize() +
shade_p_value(obs_chisq, direction = "greater")
18.8.4 Calculate the P-value.
low_race_test_p <- low_race_test %>%
get_p_value(obs_chisq, direction = "greater")
low_race_test_p
18.9 Conclusion
18.9.2 State (but do not overstate) a contextually meaningful conclusion.
There is insufficient evidence that low birth weight and race are associated.
18.9.3 Express reservations or uncertainty about the generalizability of the conclusion.
Given our uncertainly about how the data was collected, it’s not clear what our conclusion means. Also, failing to reject the null is really a “non-conclusion” in that it leaves us basically knowing nothing. We don’t have evidence of such an association (and there are good reasons to believe there may not be one), but failing to reject the null does not prove anything.
18.9.4 Identify the possibility of either a Type I or Type II error and state what making such an error means in the context of the hypotheses.
It’s possible that we have made a Type II error. It may be that low birth weight and race are associated, but our sample has not given enough evidence of such an association.
18.10 Confidence interval
There are no parameters of interest in a chi-square test, so there is no confidence interval to report.
18.11 Your turn
Use the smoking
data set from the openintro
package. Run a chi-square test for independence to determine if smoking status is associated with marital status.
The rubric outline is reproduced below. You may refer to the worked example above and modify it accordingly. Remember to strip out all the commentary. That is just exposition for your benefit in understanding the steps, but is not meant to form part of the formal inference process.
Another word of warning: the copy/paste process is not a substitute for your brain. You will often need to modify more than just the names of the data frames and variables to adapt the worked examples to your own work. Do not blindly copy and paste code without understanding what it does. And you should never copy and paste text. All the sentences and paragraphs you write are expressions of your own analysis. They must reflect your own understanding of the inferential process.
Also, so that your answers here don’t mess up the code chunks above, use new variable names everywhere.
Exploratory data analysis
Use data documentation (help files, code books, Google, etc.) to determine as much as possible about the data provenance and structure.
Please write up your answer here
# Add code here to print the data
# Add code here to glimpse the variables
Hypotheses
Identify the sample (or samples) and a reasonable population (or populations) of interest.
Please write up your answer here.
18.12 Bonus section: Residuals
Just like with the chi-square test for goodness of fit, rejecting the null hypothesis using the chi-square test for independence informs us that two variables are associated, but it doesn’t tell us the useful information about which combinations of variables have higher and lower counts than expected. And just like the chi-square test for goodness of fit, we can examine the residuals table to find that information.
A word of caution: You should only examine the residuals if your test was statistically significant! The residuals table for tests in which we fail to reject the null hypothesis can be misleading.
Because we failed to reject the null hypothesis in the low_race_test
, it would be unwise for us to examine the residuals table in that test. Instead, we’ll use a different example.
The diabetes2
dataset in the openintro
package contains information about an experiment evaluating three treatments for Type 2 diabetes in patients aged 10-17 who were being treated with metformin. The three treatments summarized in the treatment
variable were: continued treatment with metformin (met
), treatment with metformin combined with rosiglitazone (rosi
), or a lifestyle intervention program (lifestyle
). Each patient had a primary outcome
, which was either “lacked glycemic control” (failure
) or did not lack that control (success
). Here is the summary of the results of the experiment:
tabyl(diabetes2, treatment, outcome)
For the sake of a streamlined presentation, we’ll omit the usual details of condition-checking, hypothesis-writing, etc., and skip right to the conclusion.
tabyl(diabetes2, treatment, outcome) %>%
chisq.test() -> outcome_treatment_chisq.test
outcome_treatment_chisq.test
##
## Pearson's Chi-squared test
##
## data: .
## X-squared = 8.1645, df = 2, p-value = 0.01687
Notice that the p-value obtained from the test is below our usual significance level \(\alpha = 0.05\), so it makes sense for us to examine the residuals.
outcome_treatment_chisq.test$residuals
Again, these values don’t mean much in the real world; our job is to look at the most positive and most negative values.
- Since the
rosi
andfailure
cell has the most negative value, the count of people who failed to achieve glycemic control with rosiglitazone is the most below expected. (That’s a good result!) - Since the
rosi
andsuccess
cell has the most positive value, the count of people who succeeded in achieving glycemic control with rosiglitazone is the most above expected. (That’s also a good result!)
Overall, we can conclude that the rosiglitazone treatment was quite successful in helping people achieve their glycemic control goals.
18.13 Conclusion
With two categorical variables, we can run a chi-square test for independence to test the null hypothesis that the two variables are independent. While technically we can run this test for any two categorical variables, if both variables have only two levels, we would usually choose to run a test for two proportions. The chi-square test for independence is useful when one or both of the response and predictor variables have three or more levels. The expected cell counts are derived from the data and then the chi-squared statistic is computed as usual. Using the correct degrees of freedom, we can test how much the observed cell counts deviate from the expected cell counts and derive a P-value.
18.13.1 Preparing and submitting your assignment
- From the “Run” menu, select “Restart R and Run All Chunks”.
- Deal with any code errors that crop up. Repeat steps 1–-2 until there are no more code errors.
- Spell check your document by clicking the icon with “ABC” and a check mark.
- Hit the “Preview” button one last time to generate the final draft of the
.nb.html
file. - Proofread the HTML file carefully. If there are errors, go back and fix them, then repeat steps 1–5 again.
If you have completed this chapter as part of a statistics course, follow the directions you receive from your professor to submit your assignment.