Linear Regression: Notes and Interview Questions

What is Linear Regression?

Linear regression is a linear approach to modeling the relationship between a dependent variable (the scalar response) and one or more independent variables (explanatory variables).

What Are the Basic Assumptions?

- Linear relationship: there is a linear relationship between the features and the target.
- Multivariate normality: the variables should be multivariate normal. When the data is not normally distributed, a non-linear transformation might help. (The Kolmogorov–Smirnov test can be used to check normality.)
- No multicollinearity: independent variables should not be too highly correlated with each other. (If they are, drop one of the correlated variables.)
- No autocorrelation: residuals should not be dependent on each other. (The Durbin–Watson test can be used to detect autocorrelation.)
- Homoscedasticity: the variance/spread of the errors should be constant. (A Box–Cox transformation of the Y variable can be used to achieve homoscedasticity.)
- Normality: error terms should be normally distributed. (A code sketch of some of these checks follows this list.)
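
Below is a minimal sketch of how a few of these checks might be run with statsmodels and scipy, assuming a toy NumPy feature matrix X and target y (all data and names here are illustrative, not a prescribed workflow).

```python
# A minimal sketch of assumption checks on toy data.
import numpy as np
import statsmodels.api as sm
from scipy import stats
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                               # toy features (assumed data)
y = 2 + X @ np.array([1.5, -0.7, 0.3]) + rng.normal(scale=0.5, size=200)

X_const = sm.add_constant(X)                                # add an intercept column
model = sm.OLS(y, X_const).fit()
resid = model.resid

# Normality of residuals (KS test against a normal with the residuals' mean/std)
ks_stat, ks_p = stats.kstest(resid, 'norm', args=(resid.mean(), resid.std()))

# Autocorrelation of residuals (Durbin-Watson statistic near 2 means no autocorrelation)
dw = durbin_watson(resid)

# Multicollinearity (VIF per feature; values above roughly 5-10 flag a problem)
vif = [variance_inflation_factor(X_const, i) for i in range(1, X_const.shape[1])]

print(f"KS p-value: {ks_p:.3f}, Durbin-Watson: {dw:.2f}, VIF: {np.round(vif, 2)}")
```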

Advantages

Linear regression performs exceptionally well when the relationship between the features and the target is (approximately) linear.
It is easy to implement and train.
Overfitting can be handled using dimensionality reduction techniques, regularization, and cross-validation.

Disadvantages

Limited to linear relationships.
Only models the relationship between the mean of the dependent variable and the independent variables.
Sensitive to outliers.
The observations must be independent.

Is Feature Scaling Required?

Yes. Scaling is not strictly needed for the closed-form OLS solution, but it speeds up gradient-descent convergence considerably and matters when regularization is used, since the penalty treats all coefficients on a common scale.
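
As a hedged illustration, scaling is most naturally done inside a scikit-learn Pipeline; the toy data, feature scales, and the choice of SGDRegressor below are all assumptions made just for this sketch.

```python
# A minimal sketch: standardize features before fitting a gradient-based regressor.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(1)
X = rng.normal(loc=[0, 100], scale=[1, 50], size=(300, 2))   # features on very different scales
y = 3 + 0.5 * X[:, 0] + 0.01 * X[:, 1] + rng.normal(size=300)

# Features with wildly different scales hurt SGD convergence, so scale first.
model = make_pipeline(StandardScaler(), SGDRegressor(max_iter=1000, random_state=0))
model.fit(X, y)
print(model.score(X, y))   # R^2 on the training data
```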

Impact of Missing Values?

Linear regression is sensitive to missing values; they should be imputed or the affected rows dropped before fitting.

Impact of outliers?

Linear regression needs the relationship between the independent and dependent variables to be linear. It is also important to check for outliers since linear regression is sensitive to outlier effects. Model accuracy can be improved by handling the outliers.

Can we apply linear regression to nonlinear data?

Not directly. But we can apply it after transforming the data (for example, with a log transform) so that the relationship becomes approximately linear.
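
A minimal sketch of this idea, assuming toy data that is exponential in x: fitting a line to log(y) linearizes the relationship.

```python
# A minimal sketch: fit a line on log-transformed targets, assuming exponential-style toy data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
x = np.linspace(1, 10, 100).reshape(-1, 1)
y = np.exp(0.8 * x.ravel() + rng.normal(scale=0.1, size=100))   # nonlinear in x

model = LinearRegression().fit(x, np.log(y))    # linear in log-space
y_pred = np.exp(model.predict(x))               # back-transform predictions to the original scale
```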

Why do we square the error instead of using modulus?

Squared error is differentiable everywhere, while the absolute error is not (its derivative is undefined at 0). This makes the squared error more amenable to mathematical optimization: to minimize it, we can set its derivative equal to 0 and solve in closed form.
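
For a constant model, setting the derivative of the squared error to zero gives the mean, while minimizing the absolute error gives the median. A small numeric check on toy values (the data is arbitrary and chosen to include one large point):

```python
# A minimal numeric check: the mean minimizes squared error, the median minimizes absolute error.
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])        # toy values with one large point
candidates = np.linspace(0, 100, 10001)           # candidate constants c

sq_loss = [np.sum((y - c) ** 2) for c in candidates]
abs_loss = [np.sum(np.abs(y - c)) for c in candidates]

print(candidates[np.argmin(sq_loss)], y.mean())     # both 22.0 (the mean)
print(candidates[np.argmin(abs_loss)], np.median(y))  # both 3.0 (the median)
```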

What are techniques to find the slope and the intercept of the linear regression line which best fits the model?

Ordinary Least Squares (Statistics domain)
Gradient Descent (Calculus family)

Linear Regression vs Ordinary Least Square Regression.

Although Linear Regression refers to any approach to modeling the relationship between a dependent variable and one or more independent variables, Least Squares (OLS) is just one technique for doing linear regression.

Explain Ordinary Least Squares Regression.

OLS is called the least squares method because it finds the slope and intercept that minimize the sum of the squares of the differences between the actual and estimated values of the target.
The closed-form OLS solution becomes computationally expensive on large datasets; it performs well with small data, while Gradient Descent is preferred for larger data.
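
A minimal sketch of simple OLS on toy data, using the classic closed-form results slope = cov(x, y) / var(x) and intercept = mean(y) − slope · mean(x):

```python
# A minimal sketch of simple (one-feature) OLS on toy data.
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 100)
y = 4 + 2.5 * x + rng.normal(scale=1.0, size=100)

slope = np.cov(x, y, bias=True)[0, 1] / np.var(x)   # cov(x, y) / var(x)
intercept = y.mean() - slope * x.mean()
print(slope, intercept)   # close to 2.5 and 4
```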

Explain Gradient Descent.

Gradient Descent is an optimization algorithm for finding a local minimum of a differentiable function.
It tweaks the parameters iteratively to move the function towards its (local) minimum; for a convex cost function, that local minimum is also the global one.
In machine learning problems, we train our algorithm with gradient descent to minimize the cost function J(w, b) and reach its minimum by tweaking the parameters w and b.
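
A minimal gradient-descent sketch for a one-feature linear regression, assuming toy data, an MSE cost J(w, b), and an arbitrarily chosen learning rate:

```python
# A minimal gradient-descent sketch for linear regression with MSE cost J(w, b).
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 5, 200)
y = 1.0 + 3.0 * x + rng.normal(scale=0.5, size=200)

w, b = 0.0, 0.0
lr = 0.01                                          # learning rate (hyperparameter)
for _ in range(2000):
    y_hat = w * x + b
    dw = (2 / len(x)) * np.sum((y_hat - y) * x)    # dJ/dw
    db = (2 / len(x)) * np.sum(y_hat - y)          # dJ/db
    w -= lr * dw
    b -= lr * db
print(w, b)   # close to 3.0 and 1.0
```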

How to evaluate regression models?

Mean Absolute Error (MAE)
Mean Squared Error (MSE)
Root Mean Squared Error (RMSE)
R-Squared (Coefficient of Determination)
Adjusted R-Squared
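
A minimal sketch of computing these metrics with scikit-learn, assuming arrays y_true and y_pred and an assumed predictor count p (Adjusted R² is computed by hand from R²):

```python
# A minimal sketch of the usual regression metrics on toy predictions.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0, 12.0])
y_pred = np.array([2.8, 5.3, 6.6, 9.4, 11.5])
p = 1                                              # assumed number of predictors

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_true, y_pred)
n = len(y_true)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)      # Adjusted R-squared

print(mae, mse, rmse, r2, adj_r2)
```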

Which evaluation technique should you prefer to use for data having a lot of outliers in it?

MAE is preferable for data with many outliers because it is robust to them, whereas MSE and RMSE are very susceptible to outliers: squaring the residuals penalizes large errors heavily.

What’s the intuition behind R-Squared?

R2 explains the degree to which your input variables explain the variation of your predicted variable. So, if R-square is 0.8, it means 80% of the variation in the output variable is explained by the input variables. R² is also called the coefficient of determination.

R-squared = (TSS-RSS)/TSS
TSS = Total Sum of Squares
RSS = Residual Sum of Squares

What are the flaws in R-squared?

R² increases (or at least never decreases) with every predictor added to a model, so the fit can appear to get better the more terms we add, even when they carry no real explanatory power. This can be completely misleading.

What is adjusted R²?

Adjusted R-squared penalizes you for adding variables that do not improve the existing model. It is generally recommended to use Adjusted R-squared to judge the goodness of fit of a model.

What happens when we add a variable and it increases the R-Sq but decreases the Adj R-Sq?

The variable can be omitted, since it adds no predictive power.
We should also look at the p-value of the added variable to confirm the decision.

Differences between correlation and regression.

Correlation quantifies the strength of the association between two variables and is symmetric: it does not distinguish which variable drives the other.
Regression is directional: it models how the dependent variable changes with the independent variable(s), though on its own it still does not prove cause and effect.

Can we use linear regression for time series analysis?

One can use linear regression for time series analysis, but the results are usually not promising.
Time series data is mostly used for forecasting, and plain linear regression seldom gives good results for predicting the future.
Time series data usually has patterns, such as peak hours or festive seasons, which would most likely be treated as outliers in a linear regression analysis.

Explain Normal Equation in Linear Regression.

The Normal Equation is an analytical approach to Linear Regression with a Least Squares cost function: we can find the value of θ directly, without using Gradient Descent.

The model is ŷ = βᵀx, or Y = Xβ in matrix form (with a column of 1s in X for the intercept).
The normal equation for linear regression is β = (XᵀX)⁻¹XᵀY (the same formula is often written with θ in place of β).
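
A minimal sketch of the normal equation on toy data (a column of 1s is prepended to X so the intercept is estimated along with the other coefficients):

```python
# A minimal sketch of the normal equation theta = (X^T X)^-1 X^T y on toy data.
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

X_b = np.c_[np.ones(len(X)), X]                  # prepend a column of 1s for the intercept
theta = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y   # closed-form solution
# np.linalg.pinv(X_b) @ y is the numerically safer equivalent
print(theta)   # close to [1.0, 2.0, -3.0]
```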


Gradient Descent vs Normal Equation.

GD - Needs hyper-parameter tuning for the learning rate, is an iterative process, and is preferred when the number of features n is extremely large.
Normal Equation - Needs no hyper-parameter tuning and is non-iterative, but becomes quite slow for large n because it requires inverting an n × n matrix (roughly O(n³)).

How do you interpret a linear regression model?

Y = 3 + 5X1 + 6X2

Interpreting intercept:
Y=3 if both X1 = 0 and X2 = 0.

Interpreting Coefficients of Continuous Predictor Variables:
A unit increase in X1 results in an increase in the average Y of 5 units, all other variables held constant.

Interpreting Coefficients of Categorical Predictor Variables:
If X2 is sex (Male or Female), encode it as a dummy variable (Male: 0, Female: 1).
Average Y is then higher by 6 units for females than for males, all other variables held constant.
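
A minimal sketch of this interpretation on simulated data, where the coefficients 3, 5 and 6 are baked in and then recovered from the fitted model (all data and column names are illustrative):

```python
# A minimal sketch: fit on toy data and read off intercept and coefficients.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(6)
df = pd.DataFrame({
    "X1": rng.uniform(0, 10, 500),       # continuous predictor
    "X2": rng.integers(0, 2, 500),       # dummy-coded sex (Male: 0, Female: 1)
})
df["Y"] = 3 + 5 * df["X1"] + 6 * df["X2"] + rng.normal(size=500)

model = LinearRegression().fit(df[["X1", "X2"]], df["Y"])
print(model.intercept_)   # ~3: expected Y when X1 = 0 and X2 = 0
print(model.coef_)        # ~[5, 6]: per-unit effect of X1, and the female-vs-male gap
```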

Interpolation vs Extrapolation.

Regression models predict a value of Y given known values of X. Prediction within the range of X values in the data set used for model fitting is known as interpolation.
Prediction outside this range of the data is known as extrapolation. The further the extrapolation goes beyond the data, the more room there is for the model to fail.

What is the use of regularisation?

Regularisation is used to tackle the problem of overfitting.
It adds a penalty on the coefficient terms (betas) to the cost function so that large coefficients are penalized and kept small in magnitude.

L2 or Ridge regularisation?

We add a penalty term to the cost function that is equal to the sum of the squares of the coefficients.
Ridge decreases the complexity of a model but does not reduce the number of variables, since it never drives a coefficient to exactly zero; it only shrinks it. Hence, this model is not suitable for feature reduction.

L1 or LASSO regularisation?

We add a penalty term to the cost function that is equal to the sum of the absolute values of the coefficients.
The difference from ridge regression is that lasso tends to drive some coefficients to exactly zero, so it can also be used for feature selection.
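
A minimal sketch contrasting the two penalties on toy data where only the first two of ten features matter: Ridge shrinks every coefficient, while Lasso drives the irrelevant ones to exactly zero. The alpha values are arbitrary choices made for the illustration.

```python
# A minimal sketch: Ridge shrinks coefficients, Lasso zeroes out irrelevant ones.
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(7)
X = rng.normal(size=(200, 10))
y = 4 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=0.5, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print(np.round(ridge.coef_, 2))   # all coefficients shrunk but non-zero
print(np.round(lasso.coef_, 2))   # irrelevant coefficients driven to exactly 0
```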

How to choose the value of the regularisation parameter (λ)?

If λ is too high, the regression coefficients β become extremely small, leading to underfitting (high bias, low variance).
If λ is 0 (or very small), the model will tend to overfit the training data (low bias, high variance).
Run the algorithm multiple times on different data splits (cross-validation) and pick the λ that gives the best bias–variance trade-off.
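
A minimal sketch of choosing λ (called alpha in scikit-learn) by cross-validation on toy data, using LassoCV; RidgeCV or a GridSearchCV over alpha would work the same way.

```python
# A minimal sketch: pick the regularisation strength by 5-fold cross-validation.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(8)
X = rng.normal(size=(200, 10))
y = 4 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Try a grid of lambda (alpha) values and keep the one with the best CV error.
model = LassoCV(alphas=np.logspace(-3, 1, 50), cv=5).fit(X, y)
print(model.alpha_)   # the selected lambda
```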

ElasticNet regularisation?

Sometimes lasso regression can cause a small bias in the model, where the prediction becomes too dependent on a particular variable. In these cases, Elastic Net has been shown to do better: it combines the regularization of both Lasso and Ridge.


**All questions and notes have been compiled from various sources.
