Glossary
Bivariate quantitative data
Data that consists of measurements for two quantitative variables for each individual or observation.
Example:
Collecting data on students' heights and their arm spans involves bivariate quantitative data.
Deterministic language
Language that implies a cause-and-effect relationship with absolute certainty, which is generally inappropriate in statistical inference due to inherent variability.
Example:
Stating 'Every time you study an extra hour, your exam score will increase by 5 points' is an example of deterministic language and should be avoided.
Equal Variance (condition)
A condition for linear regression inference stating that the variability of the residuals should be constant across all levels of the explanatory variable.
Example:
A fan-shaped pattern in a residual plot would indicate a violation of the equal variance condition.
Explanatory variable
The variable that is thought to explain or influence changes in the response variable, typically plotted on the x-axis.
Example:
In a study of study hours and exam scores, the number of hours studied would be the explanatory variable.
Independence (condition)
A condition for linear regression inference stating that the observations in the sample must be independent of each other.
Example:
When sampling students for a study, ensuring each student's data is collected without influencing others helps satisfy the independence condition.
Inference
The process of using sample data to make predictions or test claims about a larger population parameter.
Example:
A political pollster uses a sample of 1000 voters to make an inference about the proportion of all voters who support a particular candidate.
Least-squares regression line (equation)
The unique line that minimizes the sum of the squared vertical distances between the observed y-values and the predicted y-values.
Example:
The least-squares regression line for predicting house price (ŷ) from square footage (x) takes the form ŷ = a + bx, where b is the slope and a is the y-intercept.
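As a quick illustration, the least-squares slope and intercept can be computed directly from the definitional formulas b = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and a = ȳ − b·x̄. The data below are hypothetical, chosen only to make the arithmetic clean:

```python
# Hypothetical data: x = square footage (hundreds), y = price (hundreds of thousands)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Sums of squares and cross-products used in the least-squares formulas
sxx = sum((x - x_bar) ** 2 for x in xs)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

b = sxy / sxx          # slope: minimizes the sum of squared residuals
a = y_bar - b * x_bar  # intercept: the line always passes through (x̄, ȳ)
```

For this data, b = 0.6 and a = 2.2, so the fitted line is ŷ = 2.2 + 0.6x.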
Linearity (condition)
A condition for linear regression inference stating that the relationship between the explanatory and response variables must be approximately linear.
Example:
Before performing a regression analysis, we check the linearity condition by examining the scatterplot for a straight-line pattern.
Normality (condition)
A condition for linear regression inference stating that the residuals (errors) of the model must be approximately normally distributed.
Example:
We can check the normality condition by creating a normal probability plot of the residuals and looking for a roughly linear pattern.
Null hypothesis (for slope)
In a t-test for a slope, the null hypothesis typically states that the true population slope is zero, implying no linear relationship between the variables.
Example:
For a test on rainfall and corn yield, the null hypothesis would be that the true slope is 0, meaning rainfall has no linear effect on corn yield.
P-value
The probability of observing a test statistic as extreme as, or more extreme than, the one calculated from the sample data, assuming the null hypothesis is true.
Example:
A p-value of 0.01 means there's a 1% chance of seeing a slope at least as extreme as ours if there truly were no linear relationship.
Predictive language
Language used in statistics that acknowledges uncertainty and uses terms like 'predicted,' 'estimated,' or 'on average' when discussing relationships or outcomes.
Example:
Saying 'For every additional hour of study, the predicted exam score increases by 5 points' uses predictive language.
Response variable
The variable that measures the outcome of a study, which is thought to be affected by the explanatory variable and is plotted on the y-axis.
Example:
In a study of fertilizer amount and crop yield, the crop yield would be the response variable.
R² (coefficient of determination)
The proportion of the variation in the response variable that is explained by the linear regression model using the explanatory variable.
Example:
If R² is 0.75, it means 75% of the variability in corn yield can be explained by the amount of rainfall.
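R² can be computed as 1 − SSE/SST, where SSE is the sum of squared residuals (unexplained variation) and SST is the total sum of squares. A sketch with the same hypothetical data used throughout these notes:

```python
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
b = sxy / sxx
a = y_bar - b * x_bar

preds = [a + b * x for x in xs]
sse = sum((y - p) ** 2 for y, p in zip(ys, preds))  # variation left over after the fit
sst = sum((y - y_bar) ** 2 for y in ys)             # total variation in y
r_squared = 1 - sse / sst
```

For bivariate data, R² also equals the square of the correlation coefficient r.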
Scatterplots
Graphical displays used to visualize the relationship between two quantitative variables, with one on the x-axis and the other on the y-axis.
Example:
A scatterplot showing ice cream sales versus daily temperature might reveal a positive association.
Slope (of regression line)
The estimated change in the response variable for every one-unit increase in the explanatory variable, assuming a linear relationship.
Example:
A slope of 5 in the context of study hours and exam scores means that for every additional hour studied, the predicted exam score increases by 5 points.
Standard deviation of the residuals (s)
A measure of the typical distance or average size of the prediction errors (residuals) from the regression line.
Example:
A small standard deviation of the residuals (s) indicates that the actual data points are generally close to the regression line, meaning the model's predictions are precise.
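The standard deviation of the residuals is s = √(SSE / (n − 2)); the divisor n − 2 reflects the two parameters (slope and intercept) estimated from the data. A sketch with hypothetical data:

```python
import math

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
b = sxy / sxx
a = y_bar - b * x_bar

sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
s = math.sqrt(sse / (n - 2))  # typical size of a prediction error
```

Here s ≈ 0.894, so predictions from the line are typically off by about 0.9 units.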
Standard error of the slope
A measure of the variability of sample slopes around the true population slope, used in constructing confidence intervals and hypothesis tests for the slope.
Example:
A smaller standard error of the slope means our sample slope is likely a more precise estimate of the true population slope.
T-Interval for Slopes (Confidence Interval for Slope)
A range of plausible values for the true population slope, constructed using sample data and a specified confidence level.
Example:
A 95% T-Interval for Slopes of (0.8, 1.2) for a relationship between study hours and exam scores suggests we are 95% confident the true increase in score per hour of study is between 0.8 and 1.2 points.
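The interval is built as b ± t*·SE_b, where SE_b = s / √Σ(x − x̄)² and t* is the critical value from the t-distribution with n − 2 degrees of freedom. A sketch with hypothetical data (for 95% confidence and df = 3, t* ≈ 3.182 from a t-table):

```python
import math

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
b = sxy / sxx
a = y_bar - b * x_bar

sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
s = math.sqrt(sse / (n - 2))      # standard deviation of the residuals
se_b = s / math.sqrt(sxx)         # standard error of the slope
t_star = 3.182                    # 95% critical value for df = n - 2 = 3
lower, upper = b - t_star * se_b, b + t_star * se_b
```

For this data the interval is roughly (−0.30, 1.50); because it contains 0, the data would not provide convincing evidence of a linear relationship.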
T-Test for a Slope (Hypothesis Test for Slope)
A statistical test used to determine if there is a statistically significant linear relationship between two quantitative variables in the population.
Example:
A T-Test for a Slope might be used to determine if there's significant evidence that increased advertising spending leads to increased sales.
Test statistic (t-statistic)
A standardized value calculated from sample data that measures how many standard errors the sample slope is away from the hypothesized population slope (usually zero).
Example:
A large absolute test statistic (t-statistic) suggests that the observed sample slope is far from the hypothesized null value, providing strong evidence against the null hypothesis.
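For a null hypothesis of zero slope, the test statistic is t = (b − 0) / SE_b. A sketch with hypothetical data, continuing the formulas above:

```python
import math

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
b = sxy / sxx
a = y_bar - b * x_bar

sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
s = math.sqrt(sse / (n - 2))
se_b = s / math.sqrt(sxx)

t_stat = (b - 0) / se_b  # how many standard errors b sits from the null value 0
```

Here t ≈ 2.12 with df = n − 2 = 3; the p-value would then come from the t-distribution (via a table or software).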
Y-intercept (of regression line)
The predicted value of the response variable when the explanatory variable is zero.
Example:
If the y-intercept for a regression of tree height (y) on age (x) is 2 feet, it means the predicted height of a tree at age 0 is 2 feet.
r (correlation coefficient)
A standardized measure that describes the strength and direction of a linear relationship between two quantitative variables, ranging from -1 to 1.
Example:
An r value of 0.92 suggests a strong positive linear relationship between hours spent exercising and calories burned.
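The correlation coefficient can be computed as r = Σ(x − x̄)(y − ȳ) / √(Σ(x − x̄)² · Σ(y − ȳ)²). A sketch with hypothetical data:

```python
import math

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

r = sxy / math.sqrt(sxx * syy)  # unitless; always between -1 and 1
```

For this data r ≈ 0.775, a moderately strong positive linear relationship, and r² equals the R² of the least-squares fit.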