Multiple Regression

Machine learning and statistical modelling are very fashionable in data science right now. Many of these methods aim to predict a dependent variable from several independent variables.

One of the simplest forms of machine learning is multiple regression. It uses the observations in your dataset to fit a linear relationship between several independent variables and one dependent variable. While more complex AI and machine learning methods receive a lot of fanfare, they can underperform compared to this simple technique.
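To make the idea of "forming a linear relationship" concrete, here is a minimal sketch using NumPy's least-squares solver. The data are invented for illustration (the outcome is built as exactly 1 + 2*x1 + 3*x2, so the fit should recover those numbers), and the variable names are assumptions, not part of any particular dataset.

```python
import numpy as np

# Hypothetical toy data: two predictors (x1, x2) and one outcome y.
# y is constructed without noise so the recovered coefficients are easy to check.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = 1.0 + 2.0 * x1 + 3.0 * x2

# Design matrix: a column of ones for the intercept, then one column per predictor.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Ordinary least squares: find the coefficients that minimise the squared residuals.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, b1, b2 = coef

# Use the fitted line to predict the outcome for a new observation (x1 = 7, x2 = 8).
prediction = intercept + b1 * 7.0 + b2 * 8.0
print(coef, prediction)
```

With real data the outcome will contain noise, so the coefficients are estimates rather than exact values.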

 

Regression vs ANOVA

ANOVAs are good for testing a null hypothesis and looking for differences between or within groups. Regression, on the other hand, can model and predict your dependent variable. When you are using regression, you are developing models that best describe and represent your data.

There are a few underlying assumptions for regression:

  1. There needs to be a linear relationship between dependent and independent variables.
  2. The independent variables cannot be too highly correlated with each other (a problem known as multicollinearity).
  3. All of your observations must be selected randomly and independently from your population.
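The second assumption can be screened for quickly by looking at pairwise correlations between predictors. The sketch below is one simple way to do that; the function name, the predictor names, the toy data, and the 0.8 cut-off are all assumptions chosen for illustration, not a standard rule.

```python
import numpy as np

def highly_correlated_pairs(X, names, threshold=0.8):
    """Return (name_a, name_b, r) for predictor pairs whose |r| exceeds threshold."""
    corr = np.corrcoef(X, rowvar=False)  # rowvar=False: each column is a variable
    flagged = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                flagged.append((names[i], names[j], float(corr[i, j])))
    return flagged

# Hypothetical predictors: "sessions" is almost exactly twice "hours".
X = np.array([
    [1.0, 2.1, 5.0],
    [2.0, 3.9, 3.0],
    [3.0, 6.2, 6.0],
    [4.0, 7.8, 2.0],
    [5.0, 10.1, 4.0],
])
pairs = highly_correlated_pairs(X, ["hours", "sessions", "age"])
print(pairs)  # only the hours/sessions pair is flagged
```

Pairwise correlation will not catch every case (a predictor can be a combination of several others), so treat this as a first check only.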

The residual of each data point is the difference between its observed value and the value predicted by your regression model. Residuals should be approximately normally distributed with a mean close to 0. After conducting the regression, the coefficient of determination (R²) measures how well your model fits the data: it is the proportion of variance in the dependent variable that the model explains. Your R² lies between 0 and 1. The higher the value, the better the fit.
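As a rough illustration, the residuals and the coefficient of determination can be computed by hand after a least-squares fit. The numbers below are made up, and the fit includes an intercept, which is what keeps the residual mean at zero and the result between 0 and 1.

```python
import numpy as np

# Hypothetical toy data: two predictors and one noisy outcome.
X = np.array([
    [1.0, 5.0],
    [2.0, 3.0],
    [3.0, 6.0],
    [4.0, 2.0],
    [5.0, 4.0],
    [6.0, 1.0],
])
y = np.array([7.2, 7.9, 12.1, 10.8, 14.3, 12.9])

# Fit ordinary least squares with an intercept column.
design = np.column_stack([np.ones(len(y)), X])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)

# Residual = observed value minus the value predicted by the model.
y_pred = design @ coef
residuals = y - y_pred

# Coefficient of determination: 1 - (unexplained variation / total variation).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print(residuals.mean(), r2)
```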

Does adding more predictors to your model improve its R²? Yes: R² never decreases when a predictor is added, even one that has nothing to do with the outcome. The adjusted R² takes this into account by penalizing the number of predictors, giving you a better approximation of your model's predictive value. Thus, adjusted R² will be no higher than R².
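The usual adjustment is 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. A small sketch of that formula follows; the function name and the example numbers are assumptions for illustration.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R^2 for a model with an intercept.

    r2: ordinary coefficient of determination
    n_obs: number of observations (n)
    n_predictors: number of predictors, not counting the intercept (p)
    """
    if n_obs - n_predictors - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Hypothetical example: the same R^2 of 0.80 with 5 versus 20 predictors.
few = adjusted_r2(0.80, n_obs=50, n_predictors=5)    # approximately 0.777
many = adjusted_r2(0.80, n_obs=50, n_predictors=20)
print(few, many)
```

The same R² is penalized more heavily when it took more predictors to reach it.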

 

Different Flavors of Regression

We have already introduced standard multiple regression, where independent variables are combined to model a dependent variable or outcome. But there are other ways of developing your regression model.

Hierarchical regression measures the impact of different independent variables by adding them to your regression model in stages. You can compare how well the successive models perform by looking at the coefficient of determination (R²). The higher the R², the more of the variance in your dependent variable the model explains. Often, this will require you to run the regression several times manually.
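A hierarchical comparison might be sketched as follows: fit a first model with one block of predictors, fit a second model that adds another block, and look at the change in the coefficient of determination. The data are simulated, and the variable names (age, income, exercise) and the helper `fit_r2` are hypothetical.

```python
import numpy as np

def fit_r2(X, y):
    """Fit ordinary least squares with an intercept and return R^2."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Simulated data (fixed seed): the outcome depends on all three predictors.
rng = np.random.default_rng(0)
n = 200
age = rng.normal(size=n)
income = rng.normal(size=n)
exercise = rng.normal(size=n)
y = 1.0 * age + 0.5 * income + 2.0 * exercise + rng.normal(size=n)

# Block 1: background variables only.
r2_block1 = fit_r2(np.column_stack([age, income]), y)

# Block 2: background variables plus the predictor of interest.
r2_block2 = fit_r2(np.column_stack([age, income, exercise]), y)

# The change in R^2 is the extra variance explained by the added block.
delta_r2 = r2_block2 - r2_block1
print(r2_block1, r2_block2, delta_r2)
```

The order of the blocks is chosen by you, usually on theoretical grounds, which is why the process tends to be manual.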

On the other hand, stepwise multiple regression is a more automated process. It automatically combs through your independent variables, adding or removing them one at a time according to a statistical criterion, to find a combination that describes your data well.
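One possible version of this process is forward selection: start with no predictors and repeatedly add the one that most improves the model, stopping when nothing helps. The sketch below uses adjusted R² as the criterion, which is an assumption made here for illustration; statistical packages often use p-values or information criteria instead. The function names and simulated data are hypothetical.

```python
import numpy as np

def adjusted_r2(X, y):
    """Fit OLS with an intercept on the columns of X and return adjusted R^2."""
    n, p = X.shape
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    ss_res = np.sum((y - design @ coef) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

def forward_stepwise(X, y):
    """Greedy forward selection of column indices by adjusted R^2."""
    remaining = list(range(X.shape[1]))
    selected = []
    best_score = 0.0  # adjusted R^2 of the intercept-only model
    while remaining:
        # Try adding each remaining predictor to the current selection.
        scores = [(adjusted_r2(X[:, selected + [j]], y), j) for j in remaining]
        score, j = max(scores)
        if score <= best_score:
            break  # no candidate improves the model, so stop
        selected.append(j)
        remaining.remove(j)
        best_score = score
    return selected, best_score

# Simulated data (fixed seed): y depends on columns 0 and 2; column 1 is pure noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.5, size=100)

selected, best_score = forward_stepwise(X, y)
print(selected, best_score)
```

Because the search is greedy, it is not guaranteed to find the best possible subset, and a noise variable can occasionally slip in by chance.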

 

Takeaway

Linear regression is a powerful machine learning technique that determines how well independent variables explain a dependent outcome. Regression can develop models of different phenomena, allowing for predictions to be made. But it does not explicitly test a null hypothesis to measure whether group differences exist.

The coefficient of determination describes how well the regression fits your dataset. It is a value between 0 and 1, where a higher value indicates that a higher percentage of the sample variance is explained by the model. But a high value can reflect overfitting if many irrelevant variables are also included in the model.

To counter this problem, hierarchical and stepwise regression are used to determine the most important predictive features of a dataset.