
Statistics By Jim

Making statistics intuitive

Linear Regression Explained with Examples

By Jim Frost

What is Linear Regression?

Linear regression models the relationship between one or more explanatory variables and an outcome variable. This flexible analysis lets you untangle complicated research questions by isolating each variable’s role. Additionally, linear models can fit curvature and interaction effects.

Statisticians refer to the explanatory variables in linear regression as independent variables (IVs) and the outcome as the dependent variable (DV). When a linear model has one IV, the procedure is known as simple linear regression. When there is more than one IV, statisticians refer to it as multiple regression. These models assume that the average value of the dependent variable is a linear function of the independent variables.

Linear regression has two primary purposes—understanding the relationships between variables and prediction.

  • The coefficients represent the estimated magnitude and direction (positive/negative) of the relationship between each independent variable and the dependent variable.
  • The equation allows you to predict the mean value of the dependent variable given the values of the independent variables that you specify.

Linear regression finds the constant and coefficient values for the IVs for a line that best fits your sample data. The graph below shows the best linear fit for the height and weight data points, revealing the mathematical relationship between them. Additionally, you can use the line’s equation to predict weight values for a given height.

Linear regression was one of the earliest types of regression analysis to be rigorously studied and widely applied in real-world scenarios. This popularity stems from the relative ease of fitting linear models to data and the straightforward nature of analyzing the statistical properties of these models. Unlike more complex models that relate to their parameters in a non-linear way, linear models simplify both the estimation and the interpretation of data.

In this post, you’ll learn how to interpret linear regression with an example, about the linear formula, how it finds the coefficient estimates, and its assumptions.

Learn more about when you should use regression analysis and independent and dependent variables.

Linear Regression Example

Suppose we use linear regression to model how the outside temperature in Celsius and insulation thickness in centimeters, our two independent variables, relate to air conditioning costs in dollars (dependent variable).

Let’s interpret the results for the following multiple linear regression equation:

Air Conditioning Costs ($) = 2 * Temperature (C) − 1.5 * Insulation (cm)

The coefficient sign for Temperature is positive (+2), which indicates a positive relationship between temperature and costs. As the temperature increases, so do air conditioning costs. More specifically, the coefficient value of 2 indicates that for every 1 C increase, the average air conditioning cost increases by two dollars.

On the other hand, the negative coefficient for insulation (–1.5) represents a negative relationship between insulation and air conditioning costs. As insulation thickness increases, air conditioning costs decrease. For every 1 cm increase, the average air conditioning cost drops by $1.50.

We can also enter values for temperature and insulation into this linear regression equation to predict the mean air conditioning cost.
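
To make that prediction step concrete, here is a minimal sketch in Python using the example equation above. The function name and the input values are illustrative only, not from the original article:

```python
def predict_cost(temperature_c, insulation_cm):
    """Mean air conditioning cost ($) from the example equation above."""
    return 2 * temperature_c - 1.5 * insulation_cm

# Hypothetical inputs: a 30 C day with 10 cm of insulation
print(predict_cost(30, 10))  # 2*30 - 1.5*10 = 45.0 dollars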

Learn more about interpreting regression coefficients and using regression to make predictions.

Linear Regression Formula

The name linear regression refers to the form of the regression equation these models use. These models follow a particular formula arrangement that requires all terms to be one of the following:

  • The constant
  • A parameter multiplied by an independent variable (IV)

Then, you build the linear regression formula by adding the terms together. These rules limit the form to just one type:

Dependent variable = constant + parameter * IV + … + parameter * IV


This formula is linear in the parameters. However, despite the name linear regression, it can model curvature. While the formula must be linear in the parameters, you can raise an independent variable to a power to model curvature. For example, if you square an independent variable, linear regression can fit a U-shaped curve.
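
As a quick illustration of that point, here is a sketch that fits a U-shaped curve with least squares by including a squared term in the design matrix. The data are synthetic, generated for this example only:

```python
import numpy as np

# Synthetic U-shaped data (illustrative only, not from the article)
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 50)
y = 1.5 * x**2 - 2 * x + 4 + rng.normal(scale=1.0, size=x.size)

# Design matrix with a constant, x, and x squared: the squared term adds
# curvature, yet the model stays linear in the parameters
X = np.column_stack([np.ones_like(x), x, x**2])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coefs)  # approximately [4, -2, 1.5]
```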

Specifying the correct linear model requires balancing subject-area knowledge, statistical results, and satisfying the assumptions.

Learn more about the difference between linear and nonlinear models and specifying the correct regression model .

How to Find the Linear Regression Line

Linear regression can use various estimation methods to find the best-fitting line. However, analysts use least squares most frequently because, when you can satisfy all its assumptions, it is the most precise estimation method and doesn’t systematically overestimate or underestimate the correct values.

The beauty of the least squares method is its simplicity and efficiency. The calculations required to find the best-fitting line are straightforward, making it accessible even for beginners and widely used in various statistical applications. Here’s how it works:

  • Objective: Minimize the differences between the observed values and the linear regression model’s predicted values. These differences are known as “residuals” and represent the model’s errors.
  • Minimizing Errors: This method focuses on making the sum of these squared differences as small as possible.
  • Best-Fitting Line: By finding the parameter values that achieve this minimum sum, the least squares method effectively determines the best-fitting line through the data points.

By employing the least squares method in linear regression and checking the assumptions in the next section, you can ensure that your model is as precise and unbiased as possible. This method’s ability to minimize errors and find the best-fitting line is a valuable asset in statistical analysis.
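
Here is a minimal sketch of those steps using the normal equations; the five data points are made up for illustration:

```python
import numpy as np

# Illustrative data: fit y = b0 + b1*x by ordinary least squares
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

X = np.column_stack([np.ones_like(x), x])   # constant + one IV
beta = np.linalg.solve(X.T @ X, X.T @ y)    # normal equations: (X'X)b = X'y
residuals = y - X @ beta                    # observed minus fitted
print(beta, (residuals**2).sum())           # coefficients and the minimized sum of squares
```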

Assumptions

Linear regression using the least squares method has the following assumptions:

  • A linear model satisfactorily fits the relationship.
  • The residuals follow a normal distribution.
  • The residuals have a constant scatter.
  • Independent observations.
  • The IVs are not perfectly correlated.

Residuals are the differences between the observed values and the mean values that the model predicts for those observations. If you fail to satisfy the assumptions, the results might not be valid.
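
One common way to check several of these assumptions at once is a residuals-versus-fitted plot. The sketch below uses statsmodels and matplotlib on synthetic data; with real data you would substitute your own y and X:

```python
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Illustrative data; replace with your own observations
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 3 + 2 * x + rng.normal(scale=1.5, size=100)

results = sm.OLS(y, sm.add_constant(x)).fit()

# Residuals vs fitted values: look for random scatter around zero,
# with no curvature and no funnel shape
plt.scatter(results.fittedvalues, results.resid)
plt.axhline(0, linestyle="--")
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
```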

Learn more about the assumptions for ordinary least squares and How to Assess Residual Plots .



Reader Interactions


May 9, 2024 at 9:10 am

Why not perform centering or standardization with all linear regression to arrive at a better estimate of the y-intercept?


May 9, 2024 at 4:48 pm

I talk about centering elsewhere. This article just covers the basics of what linear regression does.

A little statistical niggle on centering creating a “better estimate” of the y-intercept. In statistics, there’s a specific meaning to “better estimate,” relating to precision and a lack of bias. Centering (or standardizing) doesn’t create a better estimate in that sense. It can create a more interpretable value in some situations, which is better in common usage.
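
For readers curious what centering actually changes, here is a small sketch on synthetic data: the slope estimate is untouched, while the intercept becomes the predicted response at the mean of x rather than at x = 0:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
x = rng.uniform(20, 40, 200)                     # values far from zero, like temperatures
y = 5 + 0.8 * x + rng.normal(scale=2, size=200)

raw = sm.OLS(y, sm.add_constant(x)).fit()
centered = sm.OLS(y, sm.add_constant(x - x.mean())).fit()

print(raw.params)       # intercept = predicted y at x = 0 (an extrapolation here)
print(centered.params)  # intercept = predicted y at the mean of x; slope unchanged
```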


August 16, 2023 at 5:10 pm

Hi Jim, I’m trying to understand why the beta and significance change in a linear regression when I add another independent variable to the model. I am currently working on a mediation analysis, and as you know, linear regression is part of that. A simple linear regression between the IV (X) and the DV (Y) returns a statistically significant result. But when I add another IV (M), X becomes insignificant. Can you explain this? Seeking some clarity, Peta.

August 16, 2023 at 11:12 pm

This is a common occurrence in linear regression and is crucial for mediation analysis.

By adding M (mediator), it might be capturing some of the variance that was initially attributed to X. If M is a mediator, it means the effect of X on Y is being channeled through M. So when M is included in the model, it’s possible that the direct effect of X on Y becomes weaker or even insignificant, while the indirect effect (through M) becomes significant.

If X and M share variance in predicting Y, when both are in the model, they might “compete” for explaining the variance in Y. This can lead to a situation where the significance of X drops when M is added.

I hope that helps!
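
A short simulation makes this mediation pattern easy to see. In the sketch below, the data-generating process is hypothetical: Y depends on X only through M, so X’s coefficient shrinks toward zero once M enters the model:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 500
x = rng.normal(size=n)
m = 0.9 * x + rng.normal(scale=0.5, size=n)   # M is driven by X
y = 0.8 * m + rng.normal(scale=0.5, size=n)   # Y is driven by M, not directly by X

# Y ~ X alone: X looks strongly related to Y
print(sm.OLS(y, sm.add_constant(x)).fit().params)

# Y ~ X + M: the mediator absorbs the effect, X's coefficient shrinks toward 0
print(sm.OLS(y, sm.add_constant(np.column_stack([x, m]))).fit().params)
```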


July 30, 2022 at 2:49 pm

Jim, Hi! I am working on an interpretation of multiple linear regression. I am having a bit of trouble getting help. Is there a way to post the table so that I may initiate a coherent discussion on my interpretation?


April 28, 2022 at 3:24 pm

Is it possible that we get significant correlations but no significant prediction in a multiple regression analysis? I am seeing that with my data and I am so confused. Could mediation be a factor (i.e., the IVs are not predicting the outcome variable because the relationship operates through mediators)?

April 29, 2022 at 4:37 pm

I’m not sure what you mean by “significant prediction.” Typically, the predictions you obtain from regression analysis will be a fitted value (the prediction) and a prediction interval that indicates the precision of the prediction (how close is it likely to be to the correct value). We don’t usually refer to “significance” when talking about predictions. Can you explain what you mean? Thanks!


March 25, 2022 at 7:19 am

I want to do a multiple regression analysis in SPSS (creating a predictive model), where IQ is my dependent variable and my independent variables consist of different cognitive domains. The IQ scores are already scaled for age. How can I control my independent variables for age without doing it again for the IQ scores? I can’t add age as an independent variable in the model.

I hope that you can give me some advice, thank you so much!

March 28, 2022 at 9:27 pm

If you include age as an independent variable, the model controls for it while calculating the effects of the other IVs. And don’t worry, including age as an IV won’t double count it for IQ because that is your DV.


March 2, 2022 at 8:23 am

Hi Jim, Is there a reason you would want your covariates to be associated with your independent variable before including them in the model? So in deciding which covariates to include in the model, it was specified that covariates associated with both the dependent variable and independent variable at p<0.10 will be included in the model.

My question is why would you want the covariates to be associated with the independent variable?

March 2, 2022 at 4:38 pm

In some cases, it’s absolutely crucial to include covariates that correlate with other independent variables, although it’s not a sufficient reason by itself. When you have a potential independent variable that correlates with other IVs and it also correlates with the dependent variable, it becomes a confounding variable and omitting it from the model can cause a bias in the variables that you do include. In this scenario, the degree of bias depends on the strengths of the correlations involved. Observational studies are particularly susceptible to this type of omitted variable bias. However, when you’re performing a true, randomized experiment, this type of bias becomes a non-issue.

I’ve never heard of a formalized rule such as the one that you mention. Personally, I wouldn’t use p-values to make this determination. You can have low p-values for weak correlations in some cases. Instead, I’d look at the strength of the correlations between IVs. However, it’s not as simple as a single criterion like that. The strength of the correlation between the potential IV and the DV also plays a role.

I’ve written an article that discusses these issues in more detail: read Confounding Variables Can Bias Your Results.
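
A short simulation illustrates this omitted variable bias. In the sketch below, the data-generating process is hypothetical: omitting a confounder that correlates with both the IV and the DV inflates the coefficient on the included IV:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 1000
confounder = rng.normal(size=n)
x = 0.7 * confounder + rng.normal(scale=0.7, size=n)  # IV correlated with the confounder
y = 1.0 * x + 2.0 * confounder + rng.normal(size=n)   # DV depends on both

# Omitting the confounder biases x's coefficient upward (true value is 1.0)
print(sm.OLS(y, sm.add_constant(x)).fit().params)

# Including it recovers roughly the true coefficients
print(sm.OLS(y, sm.add_constant(np.column_stack([x, confounder]))).fit().params)
```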


February 28, 2022 at 8:19 am

Jim, as if by serendipity: having been on your mailing list for years, I looked up your information on multiple regression this weekend for a grad school advanced statistics case study. I’m a fan of your admirable gift for making complicated topics approachable and digestible.

Specifically, I was looking for information on how pronounced the triangular/funnel shape must be, and in what directions it may point, to suggest heteroscedasticity in a regression scatterplot of standardized residuals vs standardized predicted values. It seemed to me that my resulting plot of a 5 predictor variable regression model featured an obtuse triangular left point that violated homoscedasticity; my professors disagreed, stating the triangular “funnel” aspect would be more prominent and overt.

Thus, should you be looking for a new future discussion point, my query to you then might be some pearls on the nature of a qualifying heteroscedastic funnel shape: How severe must it be? Is there a quantifiable magnitude to said severity, and if so, how would one quantify this and/or what numeric outputs in common statistical software would best support or deny a suspicion based on graphical interpretation? What directions can the funnel point; are only some directions suggestive, whereby others are not? Thanks for entertaining my comment, and, as always, thanks for doing what you do.
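
For readers with the same question, one standard numeric complement to eyeballing the funnel is the Breusch-Pagan test. The sketch below uses statsmodels on synthetic data whose noise grows with x (a deliberate funnel); a small p-value supports heteroscedasticity:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Illustrative data with variance that grows with x (a "funnel" shape)
rng = np.random.default_rng(5)
x = rng.uniform(1, 10, 200)
y = 2 + 3 * x + rng.normal(scale=0.5 * x)   # noise scale increases with x

X = sm.add_constant(x)
results = sm.OLS(y, X).fit()

# Breusch-Pagan: a low p-value is evidence of heteroscedasticity
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(results.resid, X)
print(lm_pvalue)
```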


Data Science: Linear Regression

Implement linear regression and adjust for confounding in practice using R.

Learn how to use R to implement linear regression, one of the most common statistical modeling approaches in data science.


What You'll Learn

Linear regression is commonly used to quantify the relationship between two or more variables. It is also used to adjust for confounding. This course, part of our Professional Certificate Program in Data Science, covers how to implement linear regression and adjust for confounding in practice using R.

In data science applications, it is very common to be interested in the relationship between two or more variables. The motivating case study we examine in this course relates to the data-driven approach used to construct baseball teams described in Moneyball. We will try to determine which measured outcomes best predict baseball runs by using linear regression.

We will also examine confounding, where extraneous variables affect the relationship between two or more other variables, leading to spurious associations. Linear regression is a powerful technique for removing confounders, but it is not a magical process. It is essential to understand when it is appropriate to use, and this course will teach you when to apply this technique.

The course will be delivered via edX and connect learners around the world. By the end of the course, participants will understand the following concepts:

  • How linear regression was originally developed by Galton
  • What is confounding and how to detect it
  • How to examine the relationships between variables by implementing linear regression in R

Your Instructors

Rafael Irizarry

Professor of Biostatistics at Harvard University

Ways to take this course

When you enroll in this course, you will have the option of pursuing a Verified Certificate or Auditing the Course.

A Verified Certificate costs $149 and provides unlimited access to full course materials, activities, tests, and forums. At the end of the course, learners who earn a passing grade can receive a certificate. 

Alternatively, learners can Audit the course for free and have access to select course materials, activities, tests, and forums. Please note that this track does not offer a certificate for learners who earn a passing grade.

Introduction to Linear Models and Matrix Algebra

Learn to use R programming to apply linear models to analyze data in life sciences.

Data Science: Inference and Modeling

Learn inference and modeling: two of the most widely used statistical tools in data analysis.

Data Science: Capstone

Show what you’ve learned from the Professional Certificate Program in Data Science.


Linear Regression In Real Life

Real world problems solved with math.

Carolina Bento

Towards Data Science

We learn a lot of interesting and useful concepts in school but sometimes it's not very clear how we can use them in real life.

One concept/tool that might be widely underestimated is Linear Regression.

Say you’re planning a road trip to Las Vegas with two of your best friends. You start off in San Francisco and you know it’s going to be a ~9h drive. While your friends are in charge of the party operations, you’re in charge of all the logistics involved. You have to plan every detail: the schedule, when to stop and where, make sure you get there on time...

So, what’s the first thing you do? You sneakily disappear from the face of the Earth and stop answering your friends’ calls, because they’ll have fun while you’ll be the party police?! No, you get yourself a blank sheet of paper and start planning!

First item on your checklist? Budget! It’s a 9h fun ride each way — approximately 1,200 miles round trip — so a total of 18h on the road. The follow-up question: How much money should I allocate for gas?

This is a very important question. You don't want to stop in the middle of the highway and possibly walk a few miles just because you ran out of gas!

How much money should you allocate for gas?

You approach this problem with a science-oriented mindset, thinking that there must be a way to estimate the amount of money needed, based on the distance you're travelling.

First, you look at some data.

You've been laboriously tracking your car’s efficiency for the last year — because who doesn’t! — so somewhere in your computer there's this spreadsheet

At this point these are just numbers. It's not very easy to get any valuable information from this spreadsheet.

However, plotted like this, it’s clear that there is some “connection” between how much you spend on gas and how far you can drive without filling the tank. Not that you didn’t know that already, but now — with data — it becomes clear.

What you really want to figure out is

"If I drive for 1200 miles, how much will I pay for gas?"

To answer this question, you’ll use the data you’ve been collecting so far to predict how much you are going to spend. The idea is that you can make estimated guesses about the future — your trip to Vegas — based on data from the past — the data points you’ve been laboriously logging.

You end up with a mathematical model that describes the relationship between miles driven and money spent to fill the tank.

Once that model is defined, you can provide it with new information — how many miles you're driving from San Francisco to Las Vegas — and the model will predict how much money you're going to need.

The model will use data from the past to learn the relationship between the total miles driven and the total amount paid for gas.

When presented with a new data point — how many miles you drove from San Francisco to Las Vegas — the model will leverage the knowledge it got from all the past data and provide its best guess: a prediction, i.e., your data point from the future.

Looking back at your data you see that usually, the more you spend on gas, the longer you can drive before running dry — assuming that the price of gas stays constant.

If you were to best describe — or "explain" — that relationship in the plot above, it would look somewhat like this

Clearly there is a linear relationship between miles driven and total paid for gas. Because this relationship is linear, if you spend less/more money — e.g. half vs full tank — you'll be able to drive fewer/more miles.

And because that relationship is linear and you know how long your drive from San Francisco to Las Vegas is, using a linear model will help you predict how much you need to budget for gas.

Linear Regression Model

The type of model that best describes the relationship between total miles driven and total paid for gas is a Linear Regression Model . The regression bit is there, because what you're trying to predict is a numerical value.

There are a few concepts to unpack here:

  • Dependent Variable
  • Independent Variable(s)
  • Coefficients

The amount of money you'll have to budget for gas depends on how many miles you are going to drive, in this case, to go from San Francisco to Las Vegas. Thus, the total paid for gas is the dependent variable in the model.

On the other hand, Las Vegas isn’t going anywhere, so how many miles you need to drive from San Francisco to Las Vegas is independent of the amount you pay at the gas station — miles driven is the independent variable in the model. Let’s for a moment assume that the price of gas remains constant.

Since we’re only dealing with one independent variable, the model can be specified as:

Total paid for gas = B0 + B1 * miles driven

This is a simple version of a linear combination, where there’s only one variable. If you wanted to be more rigorous in your calculations, you could also add the price of the oil barrel as an independent variable in this model (with coefficient B2), since it affects the price of gas.

With all the necessary pieces of the model in place, the only question that remains is: what about B0, B1, and B2?

B0, read “Beta 0”, is the intercept of the model, meaning it’s the value the dependent variable takes when every independent variable equals zero. Graphically, it’s where the regression line crosses the vertical axis; if the intercept is zero, the line goes through the origin.

"Beta 1" and "Beta 2" are the called coefficients . You have one coefficient per each independent variable in your model. They determine the slope of your regression line, the line that describes your model.

If we take the example above, a model specified by y = Beta0 + Beta1 * x, and play around with different Beta 1 values, we have something like

The coefficients explain the rate of change of the dependent variable, the amount you’ll pay in gas, as each independent variable changes by one unit.

So, in the case of the blue line above, the dependent variable y changes by a factor of 1 every time the independent variable x changes by one unit. For the green line, y changes by 4 for every unit change in the independent variable x.

Ordinary Least Squares

At this point we’ve discussed the linear model and even experimented with plugging in different values for both the intercept and the coefficient.

However, to figure out how much you’re going to pay for gas on your trip to Las Vegas, we need a mechanism to estimate those values.

There are different techniques to estimate the parameters of a model. One of the most popular is Ordinary Least Squares (OLS).

The premise of the Ordinary Least Squares method is to minimize the sum of the squares of the model’s residuals: the differences, think distances, between the predicted values and the actual values in the dataset.

This way the model finds the best parameters, so that the regression line is as close as possible to the points in the dataset.

At the end of your budgeting exercise, with the model parameters in hand, you can plug in the total miles you expect to drive and estimate how much you’ll need to allocate for gas.

Great, now you know that you should budget $114.50 for gas!

* You’ll notice that we don’t have the parameter Beta 0 in our model. In our use case, having an intercept — a constant cost when the independent variable equals zero — doesn’t make much sense. For this specific model, we’re forcing it to go through the origin, because if you’re not driving, you won’t be spending any gas money.
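
Putting the whole exercise together, here is a sketch of that zero-intercept fit. The fuel-log numbers below are made up, since the article’s spreadsheet isn’t reproduced here, so the dollar figure it prints will differ from the $114.50 above:

```python
import numpy as np

# Hypothetical fuel log: miles driven per tank and dollars paid to fill it
miles = np.array([180.0, 250.0, 310.0, 425.0, 500.0])
paid = np.array([17.5, 24.0, 29.0, 41.0, 47.5])

# Zero-intercept least squares, matching the note above: cost = B1 * miles
b1, *_ = np.linalg.lstsq(miles.reshape(-1, 1), paid, rcond=None)

trip_miles = 1200
print(b1[0] * trip_miles)  # estimated gas budget for the trip
```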

Next time you find yourself in a situation where you need to estimate a quantity based on a number of factors that can be described by a straight line — you know you can use a Linear Regression Model .

Thanks for reading!

Written by Carolina Bento

Articles about Data Science and Machine Learning | @carolinabento

