Posts

Recommendation Systems: Notes and Interview Questions

What is Content-based Filtering?
Recommends items based on a user's own purchase history, ratings and feedback. Eg: Flipkart.

What is Collaborative Filtering?
Matches customers who bought/watched similar items/movies and uses those matches to recommend products. Eg: Netflix.

How are items recommended in Content-based Filtering?
Let's take the example of Netflix. They save all the information related to each user in vector form, which captures the past behavior of the user (movies liked/disliked by the user and the ratings given by them). This vector is known as the profile vector. All the information related to movies is stored in the item vector. The item vector contains the details of each movie, like genre, cast, director, etc. The content-based filtering algorithm finds the cosine of the angle between the profile vector and the item vector, i.e. the cosine similarity. Suppose A is the profile vector and B is the item vector, then the similarity between them can be calculated as:

cos(θ) = (A · B) / (‖A‖ ‖B‖)

Based on the cosine value (between -1
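The cosine similarity step described above can be sketched in a few lines of Python. This is only an illustrative sketch: the three-element vectors, the movie names and the `cosine_similarity` helper are made up for the example and do not reflect how Netflix actually stores profiles or items.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: treat an all-zero vector as having no similarity.
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical profile vector (A) and item vectors (B), one per movie.
profile = [1.0, 0.0, 1.0]
items = {
    "movie_a": [1.0, 0.0, 1.0],
    "movie_b": [0.0, 1.0, 0.0],
    "movie_c": [1.0, 1.0, 0.0],
}

# Rank movies from most to least similar to the user's profile.
ranked = sorted(
    items,
    key=lambda name: cosine_similarity(profile, items[name]),
    reverse=True,
)
```

Here `ranked` lists the movies in descending order of similarity, so the first entries would be the candidates to recommend.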

Random Forest: Notes and Interview Questions

What is bias? What is variance?
Bias is the difference between the average prediction of our model and the correct value which we are trying to predict. For high bias the difference is high, and for low bias it's low. A model with high bias tends to have high error on both training and test data. High bias can cause an algorithm to miss relevant relations between the input features and the target outputs. This is referred to as underfitting.
- Low bias: suggests fewer assumptions about the form of the target function.
- High bias: suggests more assumptions about the form of the target function.
- Examples of low bias: Decision Trees, k-Nearest Neighbors, Support Vector Machines.
- Examples of high bias: Linear Regression, Linear Discriminant Analysis, Logistic Regression.

Variance tells us about the spread of the model's predictions. High variance means the predicted values are more scattered in relation to each other, and low variance means less scattered. Model with high varian

Decision Tree: Notes and Interview Questions

What is a Decision Tree?
A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node (terminal node) holds a class label.

Advantages of Decision Tree.
- Simple to understand, interpret and visualize.
- Used for both classification and regression problems.
- Handles both continuous and categorical variables.
- No feature scaling required, as it uses a rule-based approach instead of distance calculation.
- Handles non-linear relationships efficiently.
- Can handle missing values automatically (depending on the implementation).
- Robust to outliers and can handle them automatically.

Disadvantages of Decision Tree.
- Prone to overfitting the training data, which leads to wrong predictions on new data.
- Due to the overfitting, there is a high chance of high variance in the output, which leads to many errors in the final estimation.
- Adding a new data point can lead to regeneration of the overall tree and all nodes

Linear Regression: Notes and Interview Questions

What is Linear Regression?
Linear regression is a linear approach to modeling the relationship between a dependent variable (scalar response) and one or more independent variables (explanatory variables).

What Are the Basic Assumptions?
- Linear relationship: there is a linear relationship between the features and the target.
- Multivariate normality: all variables should be multivariate normal. When the data is not normally distributed, a non-linear transformation might help. (The Kolmogorov-Smirnov (KS) test can be used to check normality.)
- No multi-collinearity: independent variables should not be too highly correlated with each other. (If they are, drop one of the variables.)
- No auto-correlation: residuals should not be dependent on each other. (The Durbin-Watson (DW) test is used to detect autocorrelation.)
- Homoscedasticity: the variance/spread of the errors should be constant. (A Box-Cox transformation of the Y variable can help achieve homoscedasticity.)
- Normality: error terms should be normally distributed.

Advantages Line
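As a rough illustration of the no-autocorrelation check mentioned above, the Durbin-Watson statistic can be computed directly from a model's residuals. This is a minimal sketch: the `durbin_watson` helper and the residual values are made up for the example. The statistic ranges from 0 to 4, and a value near 2 suggests no first-order autocorrelation.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences
    of the residuals divided by the sum of squared residuals."""
    if len(residuals) < 2:
        raise ValueError("need at least two residuals")
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2
        for i in range(1, len(residuals))
    )
    denominator = sum(e * e for e in residuals)
    if denominator == 0:
        raise ValueError("residuals are all zero")
    return numerator / denominator

# Hypothetical residuals that flip sign at every step
# (negative autocorrelation pushes the statistic toward 4).
alternating = [1.0, -1.0, 1.0, -1.0]
dw_alternating = durbin_watson(alternating)

# Hypothetical residuals that never change
# (positive autocorrelation pushes the statistic toward 0).
constant = [1.0, 1.0, 1.0, 1.0]
dw_constant = durbin_watson(constant)
```

In practice the residuals would come from a fitted regression model, and a library routine would normally be used instead of a hand-written helper.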

Logistic Regression: Notes and Interview Questions

What is Logistic Regression?
It's a classification algorithm that is used where the response variable is categorical. The idea of Logistic Regression is to find a relationship between features and the probability of a particular outcome.
- Binomial Logistic Regression: response variable has two values, 0 and 1 or pass and fail.
- Multinomial Logistic Regression: response variable can have three or more possible values.

The idea of Logistic Regression.
f(z) = 1 / (1 + e^(-z))
The values of z vary from -infinity to +infinity. The values of the logistic function range from 0 to 1. Logistic regression converts the logits (log-odds), which can range from -infinity to +infinity, to a range between 0 and 1.

What are the assumptions of Logistic Regression?
- Linear relation between the independent features and the log odds (logit of the outcome).
- No multicollinearity among predictors.
- Observations should be independent of each other.

Advantages Logistic Regression Are very easy to understa
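The mapping between log-odds and probabilities described above can be sketched as follows. This is an illustrative sketch only: the function names `sigmoid` and `logit` are chosen for the example, and the two-branch form of `sigmoid` is one common way to avoid overflow for large negative z, not something taken from the notes.

```python
import math

def sigmoid(z):
    """Logistic function f(z) = 1 / (1 + e^(-z)), mapping any real z into [0, 1]."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form e^z / (1 + e^z)
    # so that math.exp never receives a large positive argument.
    ez = math.exp(z)
    return ez / (1.0 + ez)

def logit(p):
    """Log-odds of a probability p in (0, 1); the inverse of sigmoid."""
    return math.log(p / (1.0 - p))

# A log-odds value of 0 corresponds to a probability of 0.5.
p_even = sigmoid(0.0)

# Converting log-odds to a probability and back recovers the log-odds.
round_trip = logit(sigmoid(2.0))
```

Large positive z gives a probability close to 1, and large negative z gives a probability close to 0, which matches the range stated above.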