Glossary
Categorical Data
Data that represents characteristics or qualities, which can be divided into categories or groups, often without a natural order.
Example:
The types of cars people drive (sedan, SUV, truck) or their favorite ice cream flavors (vanilla, chocolate, strawberry) are examples of categorical data.
Chi-Squared Tests
A family of statistical tests used to analyze categorical data, typically to determine if observed frequencies differ significantly from expected frequencies or if there's an association between categorical variables.
Example:
An AP Stats student might use a chi-squared test to see if there's a relationship between a person's favorite color and their preferred music genre.
Chi-squared test statistic (χ²)
A calculated value that measures the discrepancy between the observed frequencies and the expected frequencies in a chi-squared test.
Example:
A chi-squared test statistic that is large relative to its degrees of freedom indicates that the observed counts are far from the expected counts, which produces a small p-value and counts as evidence against the null hypothesis.
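As a rough illustration, the statistic is the sum of (observed − expected)² / expected over all categories. The sketch below uses made-up counts for 60 rolls of a die; the function name and the data are assumptions chosen for illustration only.

```python
def chi_squared_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical data: 60 rolls of a die, so 10 rolls are expected per face if it is fair.
observed_rolls = [8, 12, 9, 11, 6, 14]
expected_rolls = [10] * 6

chi_sq = chi_squared_statistic(observed_rolls, expected_rolls)
print(round(chi_sq, 2))  # 4.2
```

With 5 degrees of freedom, the critical value at α = 0.05 is about 11.07, so a statistic of 4.2 would not be considered large here.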
Conditions for Inference (Chi-Squared)
Assumptions that must be met for the results of a chi-squared test to be valid, including random sampling, independence of observations, and large expected counts (typically ≥ 5 in each cell).
Example:
Before conducting a chi-squared test, a student must verify the conditions for inference, such as ensuring all expected counts are at least 5.
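The large counts condition is the only one that can be checked from the numbers alone; random sampling and independence depend on how the data were collected. A minimal sketch of that check, with a hypothetical helper name and made-up expected counts:

```python
def large_counts_ok(expected_counts, minimum=5):
    """Return True when every expected count is at least the minimum."""
    return all(count >= minimum for count in expected_counts)

# Hypothetical expected counts for a four-category test.
print(large_counts_ok([12.5, 7.5, 20.0, 10.0]))  # True
print(large_counts_ok([12.5, 4.0, 20.0, 13.5]))  # False
```

Note that the check applies to expected counts, not observed counts.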
Degrees of Freedom (df)
A value that indicates the number of independent pieces of information used to calculate a statistic, influencing the shape of the chi-squared distribution.
Example:
In a chi-squared test for independence with a 2 × 3 table, the degrees of freedom would be (2 − 1)(3 − 1) = 2.
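The two degrees-of-freedom rules used in this unit can be sketched as small helpers (the function names are hypothetical): a goodness of fit test uses the number of categories minus 1, and a two-way table uses (rows − 1)(columns − 1).

```python
def df_goodness_of_fit(num_categories):
    """Degrees of freedom for a goodness of fit test."""
    return num_categories - 1

def df_two_way(num_rows, num_cols):
    """Degrees of freedom for a test on a two-way table."""
    return (num_rows - 1) * (num_cols - 1)

print(df_goodness_of_fit(6))  # 5, e.g. the six faces of a die
print(df_two_way(2, 3))       # 2, matching the 2 x 3 example
```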
Expected Counts
The frequencies that would be anticipated in each category (goodness of fit) or in each cell of a two-way table (independence or homogeneity) if the null hypothesis were true (i.e., if there were no association or difference).
Example:
When testing if a die is fair, the expected counts for each face would be 1/6th of the total rolls.
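For a two-way table, each expected count is (row total × column total) / grand total. The sketch below shows both cases; the function name and all counts are assumptions made up for illustration.

```python
def expected_counts(table):
    """Expected counts for a two-way table, given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

# Goodness of fit case: a fair die rolled 60 times (hypothetical).
total_rolls = 60
expected_per_face = total_rolls / 6
print(expected_per_face)  # 10.0

# Two-way case: a hypothetical 2 x 2 table of observed counts.
observed = [[15, 25],
            [35, 25]]
print(expected_counts(observed))  # [[20.0, 20.0], [30.0, 30.0]]
```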
Goodness of Fit Test
A chi-squared test used to determine if an observed frequency distribution for a single categorical variable matches an expected or theoretical distribution.
Example:
A candy company might use a Goodness of Fit Test to see if the color distribution of candies in their new batch matches the advertised proportions.
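A goodness of fit test along these lines could be sketched as follows. The advertised proportions and the sample of 200 candies are invented for illustration. The p-value shortcut used here holds only because this example has exactly 2 degrees of freedom.

```python
import math

# Hypothetical advertised color proportions and a hypothetical sample of 200 candies.
advertised = {"red": 0.5, "green": 0.3, "blue": 0.2}
observed = {"red": 90, "green": 70, "blue": 40}

total = sum(observed.values())
expected = {color: share * total for color, share in advertised.items()}

statistic = sum(
    (observed[color] - expected[color]) ** 2 / expected[color]
    for color in observed
)
df = len(observed) - 1

# With exactly 2 degrees of freedom, the upper-tail area is exp(-statistic / 2).
p_value = math.exp(-statistic / 2)

print(round(statistic, 3), df, round(p_value, 3))
```

Here the statistic is about 2.67 and the p-value is about 0.26, so at α = 0.05 there is not convincing evidence that the batch differs from the advertised distribution.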
Hypotheses (Null and Alternative)
The null hypothesis (H0) states there is no effect or no difference, while the alternative hypothesis (Ha) states there is an effect or a difference.
Example:
For a study on a new drug, the null hypothesis might be that the drug has no effect, while the alternative hypothesis is that it reduces symptoms.
P-value
The probability of observing a test statistic as extreme as, or more extreme than, the one calculated, assuming the null hypothesis is true.
Example:
If a p-value is 0.01, there is a 1% chance of getting results at least as extreme as those observed, assuming the null hypothesis is true. It is not the probability that the null hypothesis is true.
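For a chi-squared test, the p-value is the area under the chi-squared curve to the right of the test statistic. In practice a calculator or statistics package does this; the sketch below is one way to compute it with only the Python standard library, using closed-form expressions that hold for whole-number degrees of freedom. The function name is hypothetical.

```python
import math

def chi_squared_p_value(statistic, df):
    """Upper-tail area P(X >= statistic) for a chi-squared variable, integer df >= 1."""
    if df < 1 or statistic < 0:
        raise ValueError("df must be at least 1 and statistic must be non-negative")
    half = statistic / 2
    if df % 2 == 0:
        # Even df: exp(-x/2) times the sum of (x/2)^k / k! for k = 0 .. df/2 - 1.
        terms = sum(half ** k / math.factorial(k) for k in range(df // 2))
        return math.exp(-half) * terms
    # Odd df: erfc(sqrt(x/2)) plus exp(-x/2) times the sum of
    # (x/2)^(k + 1/2) / Gamma(k + 3/2) for k = 0 .. (df - 1)/2 - 1.
    terms = sum(
        half ** (k + 0.5) / math.gamma(k + 1.5) for k in range((df - 1) // 2)
    )
    return math.erfc(math.sqrt(half)) + math.exp(-half) * terms

print(round(chi_squared_p_value(6.0, 2), 4))  # 0.0498
```

A quick sanity check is that the familiar table values 3.841 (df = 1), 5.991 (df = 2), and 7.815 (df = 3) should each give a p-value close to 0.05.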
Significance Level (α)
A pre-determined threshold (e.g., 0.05) used to decide whether to reject the null hypothesis; if the p-value is less than α, the null hypothesis is rejected.
Example:
Setting the significance level at 0.05 means you are willing to accept a 5% chance of making a Type I error (rejecting a true null hypothesis).
Test for Homogeneity
A chi-squared test used to determine if the distribution of a single categorical variable is the same across two or more different populations or groups.
Example:
A marketing team might use a Test for Homogeneity to compare if the distribution of customer satisfaction ratings is the same across three different store locations.
Test for Independence
A chi-squared test used to determine if there is a statistically significant association between two categorical variables from a single sample.
Example:
Researchers could use a Test for Independence to investigate if there's a relationship between a student's chosen major and their participation in extracurricular activities.
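The arithmetic behind such a test might look like the sketch below. The 2 × 3 table (participation by three majors) is invented data, and the function name is hypothetical. The p-value line relies on the table having exactly 2 degrees of freedom.

```python
import math

def independence_test(table):
    """Return (statistic, df, expected) for a two-way table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    statistic = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df, expected

# Hypothetical counts. Rows: participates / does not participate. Columns: three majors.
observed = [[30, 20, 10],
            [20, 30, 40]]

statistic, df, expected = independence_test(observed)

# With exactly 2 degrees of freedom, the upper-tail area is exp(-statistic / 2).
p_value = math.exp(-statistic / 2)

print(round(statistic, 2), df, p_value < 0.05)
```

For this made-up table the statistic is about 16.67 with 2 degrees of freedom, and the p-value is far below 0.05, which would be convincing evidence of an association. A Test for Homogeneity uses the same arithmetic; the difference lies in how the data were collected and how the hypotheses are worded.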
Two-way table
A table that displays the counts of observations for two categorical variables, with rows representing categories of one variable and columns representing categories of the other.
Example:
A survey collecting data on gender and preferred social media platform would typically summarize its findings in a two-way table.
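A two-way table can be tallied from raw responses by counting each (row category, column category) pair. The survey responses below are invented purely to illustrate the idea.

```python
from collections import Counter

# Hypothetical survey responses: (gender, preferred platform).
responses = [
    ("female", "Instagram"), ("male", "TikTok"),
    ("female", "TikTok"), ("male", "Instagram"),
    ("female", "Instagram"), ("male", "TikTok"),
    ("male", "TikTok"), ("female", "Instagram"),
]

counts = Counter(responses)
row_labels = sorted({gender for gender, _ in responses})
col_labels = sorted({platform for _, platform in responses})

# Rows are genders, columns are platforms.
table = [[counts[(row, col)] for col in col_labels] for row in row_labels]

print(row_labels)  # ['female', 'male']
print(col_labels)  # ['Instagram', 'TikTok']
print(table)       # [[3, 1], [1, 3]]
```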