Glossary
Bivariate Data
Data that involves two different variables, allowing for the analysis of relationships between them.
Example:
Studying the relationship between a student's hours of sleep and their test score is an example of analyzing bivariate data.
Categorical Data
Data that represents characteristics or qualities, often divided into categories or groups, rather than numerical measurements.
Example:
A survey asking about a person's favorite color (e.g., red, blue, green) or type of pet (e.g., dog, cat, fish) collects categorical data.
Computer Outputs
Statistical results and graphs generated by software, which AP Statistics students must be able to interpret rather than create from scratch.
Example:
On the AP exam, you'll often analyze a computer output to find the R² value and the regression equation for a dataset.
Conditional Relative Frequencies
The proportion of observations in a specific category of one variable, given a specific category of the other variable, calculated within a row or column.
Example:
The conditional relative frequency of 'liking chocolate given that a person is male' would be the number of males who like chocolate divided by the total number of males.
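As a minimal sketch of the calculation above, the following Python computes a conditional relative frequency from one row of a two-way table. The counts and the helper name conditional_relative_frequency are hypothetical, chosen only for illustration.

```python
# Hypothetical two-way table of counts: rows are sex, columns are chocolate preference.
counts = {
    "male": {"likes": 30, "dislikes": 20},
    "female": {"likes": 40, "dislikes": 10},
}


def conditional_relative_frequency(table, given_row, column):
    """Proportion of the observations in `given_row` that fall in `column`."""
    row_total = sum(table[given_row].values())
    return table[given_row][column] / row_total


# Males who like chocolate divided by all males: 30 / 50.
p_likes_given_male = conditional_relative_frequency(counts, "male", "likes")
print(p_likes_given_male)  # 0.6
```

The denominator is the row total (all males), not the grand total, which is what makes the frequency conditional.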
Correlation Coefficient (r)
A numerical measure that quantifies the strength and direction of a linear relationship between two quantitative variables, ranging from -1 to 1.
Example:
An r value of 0.92 indicates a strong, positive linear relationship, such as between daily temperatures and ice cream sales.
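One way to see where r comes from is to compute it directly from deviations about the means. The sketch below uses made-up temperature and sales figures (not real data), and the function name correlation is an illustrative choice.

```python
import math


def correlation(xs, ys):
    """Pearson correlation coefficient r for paired quantitative data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical daily temperatures (degrees C) and ice cream sales (units sold).
temps = [20, 24, 27, 30, 33]
sales = [110, 135, 160, 170, 200]

r = correlation(temps, sales)
print(round(r, 2))  # close to 1: a strong, positive linear relationship
```

Because r is built from deviations scaled by the spread of each variable, it has no units and does not change if the temperatures are converted to another scale.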
Correlation does not imply causation
A fundamental principle stating that just because two variables are related or move together, it does not mean one variable causes the other to change.
Example:
Finding a strong correlation between ice cream sales and drowning incidents doesn't mean ice cream causes drowning; a lurking variable like summer heat is likely the cause of both.
DUFS (Memory Aid)
An acronym used to remember the four key aspects to describe when interpreting a scatterplot: Direction, Unusual Features, Form, and Strength.
Example:
When describing the relationship between temperature and ice cream sales, remember to address the DUFS: positive direction, no unusual features, linear form, and strong association.
Direction (of scatterplot)
Describes whether the relationship between two quantitative variables in a scatterplot is positive (upward trend), negative (downward trend), or shows no clear direction.
Example:
The direction of a scatterplot showing hours of TV watched and GPA would likely be negative.
Extrapolation
Using a regression line to predict values of the dependent variable for independent variable values that fall outside the range of the observed data.
Example:
Predicting the height of a 50-year-old based on a regression model built only from data of children aged 5-10 would be extrapolation and likely unreliable.
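A simple guard against extrapolation is to check whether a new x-value lies inside the range of the observed x-values before trusting a prediction. The sketch below assumes a hypothetical helper named is_extrapolation and uses the ages from the example above.

```python
def is_extrapolation(x_new, observed_xs):
    """True when x_new falls outside the range of the observed x-values."""
    return x_new < min(observed_xs) or x_new > max(observed_xs)


# Ages (in years) of the children used to build the hypothetical model.
ages = [5, 6, 7, 8, 9, 10]

print(is_extrapolation(50, ages))   # True: far outside the observed ages
print(is_extrapolation(7.5, ages))  # False: within the observed range
```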
Form (of scatterplot)
Describes the overall shape or pattern of the relationship between two quantitative variables in a scatterplot, such as linear, curved, or no pattern.
Example:
A scatterplot showing age vs. reaction time might have a curved form, initially decreasing then increasing.
Influential Points
Points in a scatterplot that, if removed, would substantially change the slope, y-intercept, or correlation of the least-squares regression line. Outliers and points with extreme x-values are often, but not always, influential.
Example:
A single data point representing a very old car whose extremely low mileage has kept its value unusually high could be an influential point in a regression of age vs. value.
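To check whether a point is influential, one common approach is to fit the least-squares line with and without it and compare the results. The sketch below uses invented car data (ages in years, values in thousands of dollars); the helper least_squares is a hand-written illustration, not a library function.

```python
def least_squares(xs, ys):
    """Return (intercept a, slope b) of the least-squares line y-hat = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Hypothetical data; the last car is very old yet unusually valuable.
ages = [1, 2, 3, 4, 5, 30]
values = [20, 18, 16, 14, 12, 25]

a_all, b_all = least_squares(ages, values)
a_without, b_without = least_squares(ages[:-1], values[:-1])

print(round(b_all, 2))      # slope with the unusual car: slightly positive
print(round(b_without, 2))  # slope without it: -2.0
```

In this made-up dataset, removing one point flips the sign of the slope, which is exactly the kind of substantial change that marks a point as influential.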
Joint Relative Frequencies
The proportion of observations that fall into a specific cell in a two-way table, calculated by dividing the cell count by the total number of observations.
Example:
In a table of gender vs. favorite sport, the joint relative frequency of 'females who prefer basketball' would be the count of such individuals divided by the total number of people surveyed.
Linear Regression (Least Squares Regression)
A statistical method used to find the line of best fit that minimizes the sum of the squared residuals, modeling the linear relationship between two quantitative variables.
Example:
Using linear regression to predict a student's final exam score based on their midterm score.
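The "least squares" idea can be illustrated by fitting a line and then confirming that nudging the intercept or slope only increases the sum of squared residuals. The scores below are invented, and least_squares and sse are illustrative helper names.

```python
def least_squares(xs, ys):
    """Return (intercept a, slope b) of the least-squares line y-hat = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def sse(a, b, xs, ys):
    """Sum of squared residuals for the line y-hat = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical midterm and final exam scores for five students.
midterm = [60, 70, 75, 80, 90]
final = [65, 72, 70, 85, 88]

a, b = least_squares(midterm, final)
print(round(a, 2), round(b, 2))  # roughly 14.5 and 0.82

best = sse(a, b, midterm, final)
# Any other line, such as one with a shifted intercept or slope, has a larger SSE.
print(best < sse(a + 1, b, midterm, final))    # True
print(best < sse(a, b + 0.1, midterm, final))  # True
```

Here the slope of about 0.82 would be read as: each additional midterm point is associated with a predicted increase of about 0.82 points on the final.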
Marginal Relative Frequencies
The proportion of observations in each category of a single variable, found by dividing a row or column total by the grand total in a two-way table.
Example:
From a two-way table, the marginal relative frequency of 'students who prefer virtual learning' is the total count of virtual learners divided by the total number of students.
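The sketch below computes a marginal relative frequency from a small two-way table and, for contrast, a joint relative frequency from a single cell. The counts and category names are hypothetical.

```python
# Hypothetical two-way table: rows are grade level, columns are learning preference.
counts = {
    "freshman": {"virtual": 20, "in_person": 30},
    "senior": {"virtual": 35, "in_person": 15},
}

grand_total = sum(sum(row.values()) for row in counts.values())

# Marginal: a column total divided by the grand total.
virtual_total = sum(row["virtual"] for row in counts.values())
marginal_virtual = virtual_total / grand_total

# Joint: a single cell count divided by the grand total.
joint_senior_virtual = counts["senior"]["virtual"] / grand_total

print(marginal_virtual)      # 0.55
print(joint_senior_virtual)  # 0.35
```

Both frequencies share the grand total as the denominator; they differ only in whether the numerator is a row or column total (marginal) or one cell (joint).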
Mosaic Plots
A graphical display that visualizes the relationship between two categorical variables, where the width of the columns corresponds to the marginal distribution of one variable and the height of the segments within columns corresponds to the conditional distribution of the other.
Example:
A mosaic plot could illustrate the relationship between political affiliation and stance on a specific policy, with the area of each rectangle representing the proportion of observations.
Positive Correlation
A relationship between two quantitative variables where, as one variable increases, the other variable also tends to increase.
Example:
We'd expect a positive correlation between the number of hours spent exercising and overall fitness level.
Quantitative Data
Data that consists of numerical measurements or counts, where arithmetic operations like averaging make sense.
Example:
Recording the height in centimeters or the number of siblings for each student in a class involves quantitative data.
Residual Plots
A scatterplot of the residuals against the independent variable (x) or the predicted y-values, used to assess whether a linear model is appropriate.
Example:
A residual plot with a random scatter of points above and below zero suggests that a linear model is appropriate; a curved pattern suggests otherwise.
Residuals
The difference between the actual observed y-value and the y-value predicted by the regression line (actual - predicted).
Example:
If a student scored 85 on a test but the regression line predicted 80, their residual would be 85 - 80 = 5.
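The arithmetic above can be sketched in a few lines. The regression line used here (intercept 50, slope 5, with hours studied as x) is an assumption made up so that the prediction matches the example; it does not come from real data.

```python
def predict(x, a, b):
    """Predicted y-value from the line y-hat = a + b*x."""
    return a + b * x


def residual(actual, predicted):
    """Residual = actual - predicted."""
    return actual - predicted


# Hypothetical regression line: predicted score = 50 + 5 * (hours studied).
a, b = 50.0, 5.0
hours_studied = 6

predicted_score = predict(hours_studied, a, b)
print(predicted_score)                # 80.0
print(residual(85, predicted_score))  # 5.0
```

A positive residual means the line underpredicted the actual value; a negative residual means it overpredicted.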
R² (Coefficient of Determination)
The proportion (or percentage) of the variation in the dependent variable (y) that can be explained by the linear relationship with the independent variable (x).
Example:
An R² of 0.75 means that 75% of the variation in exam scores can be explained by the linear relationship with the number of hours studied.
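A minimal sketch of how R² can be computed: fit the least-squares line, then compare the leftover variation (sum of squared residuals) with the total variation in y. The study-hours data and the function name r_squared are hypothetical.

```python
def r_squared(xs, ys):
    """R^2 = 1 - SSE/SST for the least-squares line of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    # Variation left over after using the line (SSE) vs. total variation (SST).
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - mean_y) ** 2 for y in ys)
    return 1 - sse / sst


# Hypothetical hours studied and exam scores for five students.
hours = [1, 2, 3, 4, 5]
scores = [55, 65, 70, 72, 88]

r2 = r_squared(hours, scores)
print(round(r2, 2))  # about 0.92
```

For a regression with one explanatory variable, this value equals the square of the correlation coefficient r.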
Scatterplots
A graphical display used to show the relationship between two quantitative variables, with one variable plotted on the x-axis and the other on the y-axis.
Example:
Plotting a student's study hours on the x-axis against their exam score on the y-axis creates a scatterplot to visualize their relationship.
Segmented Bar Graphs
A graphical display that shows the proportion of each category within a group, with each bar representing a group and segmented to show the relative frequencies of categories.
Example:
A segmented bar graph could show the proportion of A's, B's, C's, etc. within each different class period.
Side-by-Side Bar Graphs
A graphical display used to compare the distributions of a categorical variable across different groups of another categorical variable.
Example:
Using a side-by-side bar graph to compare the distribution of favorite subjects for freshmen versus seniors.
Slope (of regression line)
In a linear regression equation (ŷ = a + bx), the slope (b) represents the predicted change in the dependent variable (y) for every one-unit increase in the independent variable (x).
Example:
If the slope of a regression line predicting test score from hours studied is 5, it means for every additional hour studied, the predicted test score increases by 5 points.
Strength (of scatterplot)
Describes how closely the points in a scatterplot follow a clear pattern or form, typically categorized as strong, moderate, or weak.
Example:
A scatterplot of height vs. shoe size would likely show a strong association.
Transforming Data Sets
Applying mathematical functions (like logarithms or square roots) to one or both variables in a dataset to make a non-linear relationship appear more linear, allowing for linear regression.
Example:
If a scatterplot of population growth over time looks curved, transforming the population data using a logarithm might make the relationship linear.
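To illustrate why a logarithm can straighten exponential growth, the sketch below builds an invented population that doubles every year and compares year-to-year changes before and after taking the natural log. Constant changes indicate a linear pattern.

```python
import math

# Hypothetical population that doubles each year, starting at 100.
years = [0, 1, 2, 3, 4, 5]
population = [100 * 2 ** t for t in years]

# Transform the response variable with the natural logarithm.
log_population = [math.log(p) for p in population]

# Year-to-year changes: growing for the raw data, constant for the logged data.
raw_changes = [population[i + 1] - population[i] for i in range(len(years) - 1)]
log_changes = [log_population[i + 1] - log_population[i] for i in range(len(years) - 1)]

print(raw_changes)                         # [100, 200, 400, 800, 1600]
print([round(c, 3) for c in log_changes])  # each about 0.693, i.e. ln(2)
```

After the transformation, a linear regression of log(population) on year is appropriate, and predictions can be converted back to the original scale by exponentiating.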
Two-Way Tables
A table used to display the relationship between two categorical variables by showing the counts or frequencies of observations for each combination of categories.
Example:
A two-way table could show the number of students who prefer online learning versus in-person learning broken down by grade level.
Unusual Features (of scatterplot)
Any deviations from the overall pattern in a scatterplot, including outliers (points far from the main cluster), clusters (distinct groupings), or gaps.
Example:
A scatterplot of income vs. education level might show an unusual feature like an outlier representing someone with very little education but extremely high income.
Y-intercept (of regression line)
In a linear regression equation (ŷ = a + bx), the y-intercept (a) represents the predicted value of the dependent variable (y) when the independent variable (x) is zero.
Example:
If the y-intercept of a regression line predicting plant height from days of growth is 2 cm, it means the predicted plant height at zero days of growth is 2 cm.
s (Standard Deviation of the Residuals)
A measure of the typical size of the residuals, indicating roughly how far the observed y-values typically fall from the regression line.
Example:
An s of 3 points means that the actual test scores typically differ from the scores predicted by the regression line by about 3 points.
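As a rough sketch, s can be computed from the residuals as the square root of the sum of squared residuals divided by n - 2. The residuals below are invented for illustration, and residual_std is a hypothetical helper name.

```python
import math


def residual_std(residuals):
    """Standard deviation of the residuals: sqrt(SSE / (n - 2))."""
    n = len(residuals)
    sse = sum(e ** 2 for e in residuals)
    return math.sqrt(sse / (n - 2))


# Hypothetical residuals (in points) from a regression fitted to six students.
residuals = [2.0, -3.0, 1.0, -4.0, 3.0, 1.0]

s = residual_std(residuals)
print(round(s, 2))  # 3.16
```

The divisor is n - 2 rather than n because two quantities, the slope and the y-intercept, were estimated from the data.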