3.a) What is Machine Learning? Explain the linear regression algorithm.
Answer:
Machine learning algorithms that are the basis of artificial intelligence (AI) such as image recognition, speech recognition, recommendation systems, ranking and personalization of content— often the basis of data products—are not usually part of a core statistics curriculum or department.
Linear regression
- Linear regression is a common statistical method used to show the mathematical relationship between two variables.
- It assumes a linear connection between an outcome variable (also called the response variable, dependent variable, or label, like sales) and a predictor variable (also called an independent variable, explanatory variable, or feature, like advertising spend). Essentially, it helps us understand how changes in one variable can predict changes in another.
- Sometimes, it makes sense that changes in one variable correlate linearly with changes in another variable. For example, it makes sense that the more umbrellas you sell, the more money you make.
Example 1:
- Suppose you run a social networking site that charges a monthly subscription fee of $25, and that this is your only source of revenue. Each month you collect data and count your number of users and total revenue.
- You’ve done this daily over the course of two years, recording it all in a spreadsheet. You could express this data as a series of points.
Here are the first four:
S= {(x, y) = (1,25) , (10,250) , (100,2500) , (200,5000)}
From the given data it is observed that y=25x, which shows that,
- There’s a linear pattern.
- The coefficient relating x and y is 25.
- It seems deterministic.
Figure: An observed linear pattern
Example 2:
The dataset, keyed by user, contains weekly behavior data for hundreds of thousands of users on a social networking site, with columns like total_num_friends, total_new_friends_this_week, num_visits, time_spent, number_apps_downloaded, number_ads_shown, gender, and age. During exploratory data analysis (EDA), a random sample of 100 users was used to plot pairs of variables, such as total_new_friends vs. time_spent. The business goal is to forecast the number of users to promise advertisers, but the current focus is on building intuition and understanding the dataset. The first few rows are listed below :
- When plotted the graph looks like below,
Figure: Dataset Plotted
- There seems to be a linear relationship between the number of new friends and the time spent on the social networking site, suggesting that more new friends lead to more time spent on the site.
- This relationship can be described using statistical methods like correlation and linear regression. Although there is an association between these variables, it is not perfectly deterministic, indicating that other factors also influence the time users spend on the site.