Linear regression is a valuable measure to understand the correlation between variables and how they can impact each other. But starting from a quick overview, let’s see what linear regression is, the role of the correlation coefficient, and what relationship can be predicted through linear regression. Afterwards, we’ll have a step-by-step guide on how to find linear regression in Power BI. This article will also demonstrate how to find correlation coefficients and what can be the limitations of linear regression.
What is linear regression?
Linear regression models or predicts the relationship between two factors (variables) via a linear equation to observe the data. Where each observation consists of two variables and a straight line, called a trend line or regression line, estimates the connection between the two variables.
One variable is an explanatory i.e., an independent variable while the other is a dependent variable. The factor that is being predicted is called the dependent variable or regressand. While the factor that is used to predict the dependent variable is called the independent variable or predictor or regressor.
What is the impact of correlation analysis on linear regression?
However, a modeller needs to verify the existence of any relationship between the variables prior to fitting a linear model to the observations. This implies that there should be a significant correlation between the two variables of interest. A scatterplot is useful in deciding the relationship strength between the two variables. If no relationship between the intended independent and dependent variables exists, then the scatterplot does not specify any trend. In such cases, employing a linear regression model to the observed data will not deliver a practical model. An invaluable numerical measure for determining the strength of correlation between two variables is the correlation coefficient. It is a value between -1.00 and +1.00 and clarifies the direction and magnitude of strength between the two variables.
The equation for linear regression
A linear regression line has an equation of the form
y is the predicted value of the dependent/output variable, for any given value of the independent variable (x).
c is the intercept, the predicted value of y when the x is 0.
m is the regression coefficient – how much we expect y to change as x increases.
x is the independent/input variable.
The direction of the Regression line and relation between factors
Depending upon the direction of the trend line in a scatter plot, any one of the three relations between factors can be predicted.
- No relation
Having a flat graph line i.e., not inclined, in a linear regression means no relation exists between the two variables.
- Positive Relation
The relation between two variables will be positive if the graph line is inclining upward with its lower point at the Y-axis and its upper point reaching up in the graph field and moving away from X-axis.
- Negative Relation
The relation between two variables will be negative if the graph line is inclining downward with its upper point at the Y-axis and its lower point reaching down in the graph field and moving towards from X-axis.
How to find Linear regression in Power BI?
To show the linear regression in Microsoft Power BI, we have used the sales data i.e., how the price ($) of houses is related to the area (ft2). For the simplicity of understanding and avoiding complexity, our dataset comprised 50 observations being stored in an excel file. So, starting from this blank interface on Power BI,
Let’s have a step-by-step guide on how we can create a scatter plot and show linear regression on Power BI.
Step 1: Getting Data for linear regression in Power BI
The first step is to get your dataset into Power BI.
Dataset can be on SQL Server, Oracle, Power BI datasets, Power BI dataflows, MySQL database, Text/CSV, PDF, Access, XML, JSON, or on any other source. Our is in Excel. So, referring to that, go and click on the Get Data tab and select Excel workbook.
A dialogue box will open. Reach your destination, select, and open your Excel file which in this example is “House Pricing.xlsx”.
Now select a sheet in your workbook having your dataset and load the dataset into Power BI.
Step 2: Creating Scatter Plot for Linear Regression
Now dataset has been loaded into Power BI and the second step would be to design a scatter plot for linear regression analysis. For this,
Go into Fields, click on your sheet, and then on a column to select it and set a measure. In this example, we first selected the column “Area in Square Feet”.
After selecting a column, go to the “Column tools” tab and then to “Properties” in the ribbon. After that, set the measure for “Summarization” from “sum” to “Don’t Summerize”.
Repeat this step for your other variable. In this case, the other column i.e., “Price in USD” represents the other variable. We selected that column and set the measure to “Don’t Summarize” for this too.
Now, we have selected our columns of variables and now we have to select for SCATTER PLOT. For this, go to the “Visualizations” and select “SCATTER PLOT”
Now select your independent and dependent variable for your x and y-axis (in the plot) either from “fields” or from the “Details” section below the “Visualizations”. In this example, the factor “Area” is an independent variable and is set at the x-axis while the factor “Price” being the dependent variable is set on the y-axis.
Step 3: Adding Trend Line in Scatter Plot for linear regression
After specifying, the x and y-axis, the next step is to add a trend line. Now, we are going to add a trend line to our plot. For this, click on “Analytics” and then on “Trend Line”.
In the “Trend Line” click on “ADD”
And select the colour of your choice. In this example, we set the colour of the trend line to red.
This trend line is showing the relation i.e., the correlation between two values which is positive and implies as the independent variable i.e., Area of House (in ft2) increases, the dependent variable i.e., Price (in USD), also increases.
How to find a correlation coefficient in Power BI?
To find the correlation coefficient, first either click on the “quick measure” button or right click on your sheet (in fields) and select “New quick measures”. A dialogue box will open. In it, go to “Calculation”, click on it, scroll down the drop-down menu and select “correlation coefficient”. Select “Area” on “Measure X”, “Price” on “Measure Y” and in “Category” and click “OK”.
The correlation coefficient will be available in the table (In fields). To check the value of the correlation coefficient, select “card” from “visualizations” and drag “correlation coefficient” from the table into the field. In this example, the value of the correlation coefficient came out to be 1.00 which shows a strong positive relationship between our interested variables.
What are the limits of linear regression?
- Limited to linear relationships
One of the limitations of simple linear regression is its constraint on linear relationships. It cannot validate the relationship between variables whose outcomes come as a curved line in the graph.
- Does not illustrate the complete relationship between variables
The other limitation is that linear regression considers the mean of the dependent variable. Where mean cannot completely illustrate a single variable, in the same context, linear regression cannot completely illustrate the relationships between variables.
- Sensitive to outliers
Outliers are data that lie at an abnormal distance from other values and are surprising. These outliers have a huge impact on regression. This implies that linear regression is sensitive to outliers.
- Cannot predict the type of relationship
Also, correlation is not the same as causation. By this, it means that linear regression considers the data as independent i.e., one variable does not cause the other to occur. Linear regression only permits finding out if the connection between variables exists or not. To see exactly what the relationship is between variables and is there any causation, we need extra investigation and statistical analysis.
Linear Regression is handy in various verticals and business cases. While it is crucial to perform such analysis on large datasets, putting it all in Power BI and performing your crucial analysis can make significant plausible capacities.