Joseph Nguyen is a contributing author at Investopedia and a research analyst with experience at a securities brokerage firm.
Updated January 07, 2022. Fact checked by Suzanne Kvilhaug. Suzanne is a content marketer, writer, and fact-checker. She holds a Bachelor of Science in Finance degree from Bridgewater State University and helps develop content strategies for financial brands.
If you've ever wondered how two or more pieces of data relate to each other (e.g. how GDP is impacted by changes in unemployment and inflation), or if you've ever had your boss ask you to create a forecast or analyze predictions based on relationships between variables, then learning regression analysis would be well worth your time.
In this article, you'll learn the basics of simple linear regression, which is typically estimated by 'ordinary least squares' (OLS) and is commonly used in forecasting and financial analysis. We will begin with the core principles of regression, covering covariance and correlation first, and then move on to building and interpreting a regression output. Popular business software such as Microsoft Excel can do all the regression calculations and outputs for you, but it is still important to learn the underlying mechanics.
At the heart of a regression model is the relationship between two different variables, called the dependent and independent variables. For instance, suppose you want to forecast sales for your company and you've concluded that your company's sales go up and down depending on changes in GDP.
The sales you are forecasting would be the dependent variable because their value "depends" on the value of GDP, and GDP would be the independent variable. You would then need to determine the strength of the relationship between these two variables in order to forecast sales. If GDP increases or decreases by 1%, how much will your sales increase or decrease?
$$Cov(x,y) = \sum \frac{(x_n - x_u)(y_n - y_u)}{N}$$
The formula to calculate the relationship between two variables is called covariance. This calculation shows you the direction of the relationship. If one variable increases and the other variable tends to also increase, the covariance would be positive. If one variable goes up and the other tends to go down, then the covariance would be negative.
The actual number you get from calculating this can be hard to interpret because it isn't standardized. A covariance of five, for instance, can be interpreted as a positive relationship, but the strength of the relationship can only be said to be stronger than if the number was four or weaker than if the number was six.
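To make the covariance formula concrete, here is a minimal Python sketch. It assumes the sample sales and GDP figures from the table later in this article, and it divides by N as the formula above does (a sample covariance would divide by N - 1 instead). The function and variable names are illustrative only.

```python
# Illustrative data: the sample figures from this article's table.
gdp_change = [1.0, 1.9, 2.4, 2.6, 2.9]   # change in GDP, in percent
sales = [100, 250, 275, 200, 300]


def covariance(x, y):
    """Population covariance: the average product of deviations from the means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n


cov_gdp_sales = covariance(gdp_change, sales)

# Rescaling one variable (here, sales multiplied by 1,000) rescales the
# covariance too, which is why the raw number is hard to interpret.
sales_rescaled = [s * 1000 for s in sales]
cov_rescaled = covariance(gdp_change, sales_rescaled)

print(f"Cov(GDP, Sales)          = {cov_gdp_sales:.2f}")
print(f"Cov(GDP, Sales x 1,000)  = {cov_rescaled:.2f}")
```

The sign is positive in both cases, but the magnitude depends entirely on the units chosen, so the number alone says little about how strong the relationship is.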
$$Correlation = \rho_{xy} = \frac{Cov_{xy}}{s_x s_y}$$
We need to standardize the covariance in order to better interpret it and use it in forecasting, and the result is the correlation calculation. The correlation calculation simply takes the covariance and divides it by the product of the standard deviations of the two variables. This bounds the correlation between -1 and +1.
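A correlation of +1 means the two variables move together in a perfectly positive linear relationship, and -1 means they are perfectly negatively correlated. In our previous example, if the correlation is +1, every increase in GDP would be matched by a proportional increase in sales along a straight line; if the correlation is -1, every increase in GDP would be matched by a proportional decrease in sales. Note that correlation measures only the direction and strength of the relationship. How much sales change for a 1% change in GDP is given by the slope of the regression line, not by the correlation itself.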
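The standardization step can be sketched in Python as follows. This is an illustration, again assuming the sample figures from the table later in this article. It pairs a population covariance with population standard deviations (`statistics.pstdev`), since the numerator and denominator must use the same convention. The names are hypothetical.

```python
from statistics import pstdev

# Illustrative data: the sample figures from this article's table.
gdp_change = [1.0, 1.9, 2.4, 2.6, 2.9]   # change in GDP, in percent
sales = [100, 250, 275, 200, 300]


def covariance(x, y):
    """Population covariance (divides by N)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    return sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n


def correlation(x, y):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(x, y) / (pstdev(x) * pstdev(y))


r = correlation(gdp_change, sales)

# Unlike covariance, correlation is unaffected by a change of units.
sales_rescaled = [s * 1000 for s in sales]
r_rescaled = correlation(gdp_change, sales_rescaled)

print(f"rho              = {r:.4f}")
print(f"rho (rescaled)   = {r_rescaled:.4f}")
```

For this data the correlation comes out at roughly 0.83, a fairly strong positive relationship, and it stays the same when sales are expressed in different units.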
Now that we know how the relative relationship between the two variables is calculated, we can develop a regression equation to forecast or predict the variable we desire. Below is the formula for a simple linear regression. The "y" is the value we are trying to forecast, the "b" is the slope of the regression line, the "x" is the value of our independent variable, and the "a" represents the y-intercept.
The regression equation simply describes the relationship between the dependent variable (y) and the independent variable (x).
$$y = bx + a$$
The intercept, or "a," is the value of y (dependent variable) if the value of x (independent variable) is zero, and so is sometimes simply referred to as the 'constant.' So if there was no change in GDP, your company would still make some sales. This value, when the change in GDP is zero, is the intercept.
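As a rough sketch of where "b" and "a" come from, the standard least-squares estimates are b = Cov(x, y) / Var(x) and a = mean(y) - b × mean(x). The Python below applies them to the sample figures from the table later in this article. The function name `fit_line` and the 3.0% GDP scenario are assumptions made purely for illustration.

```python
# Illustrative data: the sample figures from this article's table.
gdp_change = [1.0, 1.9, 2.4, 2.6, 2.9]   # x: change in GDP, in percent
sales = [100, 250, 275, 200, 300]        # y: sales


def fit_line(x, y):
    """Return (intercept a, slope b) of the least-squares line y = b*x + a."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sums of cross-deviations and squared deviations; the common 1/N factor
    # in covariance and variance cancels, so it is omitted here.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


a, b = fit_line(gdp_change, sales)
print(f"sales = {b:.2f} * gdp_change + {a:.2f}")

# Hypothetical scenario: forecast sales if GDP were to grow by 3.0%.
hypothetical_gdp = 3.0
forecast = a + b * hypothetical_gdp
print(f"forecast at {hypothetical_gdp}% GDP growth: {forecast:.2f}")
```

With this data the slope is about 88.16 and the intercept about 34.58. In other words, each additional percentage point of GDP growth is associated with roughly 88 more units of sales, on top of a baseline of about 35 when GDP does not change.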
Take a look at the graph below to see a graphical depiction of a regression equation. In this graph, there are only five data points represented by the five dots on the graph. Linear regression attempts to estimate a line that best fits the data (a line of best fit) and the equation of that line results in the regression equation.
Now that you understand some of the background that goes into a regression analysis, let's do a simple example using Excel's regression tools. We'll build on the previous example of trying to forecast next year's sales based on changes in GDP. The next table lists some artificial data points, but comparable figures are easy to obtain in real life.
Year | Sales | Change in GDP |
2015 | 100 | 1.00% |
2016 | 250 | 1.90% |
2017 | 275 | 2.40% |
2018 | 200 | 2.60% |
2019 | 300 | 2.90% |
Just eyeballing the table, you can see that there is going to be a positive correlation between sales and GDP. Both tend to go up together. Using Excel, open the Data Analysis dialog (under the Tools drop-down menu in older versions, or on the Data tab in newer versions once the Analysis ToolPak add-in is enabled) and choose Regression. The popup box is easy to fill in from there. Your Input Y Range is your "Sales" column and your Input X Range is the change in GDP column. Choose the output range for where you want the data to show up on your spreadsheet and press OK. You should see something similar to what is given in the table below:
Regression Statistics Coefficients