4.b) Explain KNN algorithm with example
Answer:
KNN algorithm
- K-Nearest Neighbors (K-NN) is an algorithm for automatically labeling unclassified objects based on their similarity to already classified objects in a dataset. For instance, it could be applied to classify data scientists as “best” or “worst,” individuals as “high credit” or “low credit,” restaurants by star ratings, or patients as “high cancer risk” or “low cancer risk,” among various other applications.
- The intuition behind K-Nearest Neighbors (K-NN) is to identify the most similar items based on their attributes, examine their labels, and assign the unclassified item the majority label.
- To automate the process, two key decisions must be made. The first is the measure of similarity or closeness between items, which is used to identify the items most similar to an unrated item, known as its neighbors. These neighbors contribute their “votes” towards the classification or labeling of the unrated item.
- The second decision is the number of neighbors that vote, denoted “k.” As a data scientist, you choose this value, which determines how many neighboring items influence the classification or labeling of the unrated item.
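The two decisions above, a distance metric and a majority vote among the k nearest neighbors, can be sketched in a few lines of Python. The Euclidean metric and the function names here are illustrative choices, not part of the original answer:

```python
import math
from collections import Counter

def euclidean(a, b):
    # Straight-line distance between two feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_points, train_labels, query, k):
    # Rank labeled points by distance to the query point
    neighbors = sorted(zip(train_points, train_labels),
                       key=lambda pl: euclidean(pl[0], query))
    # The k nearest neighbors vote; the majority label wins
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]
```

Other distance metrics (Manhattan, cosine similarity) can be swapped in by replacing `euclidean`.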
Overview of the KNN process:
1. Decide on your similarity or distance metric.
2. Split the original labeled dataset into training and test data.
3. Pick an evaluation metric.
4. Run k-NN a few times, changing k and checking the evaluation measure.
5. Optimize k by picking the one with the best evaluation measure.
6. Once you’ve chosen k, use the same training set and create a new test set containing the ages and incomes of the people you have no labels for and want to predict. In this example, the new test set has only one row: the 57-year-old.
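The six steps above could be sketched as follows. The labeled rows and the candidate values of k are made-up placeholders, and accuracy is used as the evaluation metric purely for illustration:

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def predict(train, query, k):
    # Step 1's metric plus the voting rule: rank by distance, take the majority label
    neighbors = sorted(train, key=lambda row: euclidean(row[0], query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Step 2: split made-up (age, income-in-thousands) rows into train and test sets
labeled = [
    ((25, 30), "low"), ((35, 40), "low"), ((45, 35), "low"), ((55, 45), "low"),
    ((28, 90), "high"), ((40, 110), "high"), ((50, 95), "high"), ((60, 120), "high"),
    ((33, 38), "low"), ((48, 100), "high"),
]
train, test = labeled[:8], labeled[8:]

# Steps 3-5: pick accuracy as the evaluation metric, run k-NN for a few
# odd values of k, and keep the k with the best score
best_k, best_acc = None, -1.0
for k in (1, 3, 5):
    acc = sum(predict(train, feats, k) == label
              for feats, label in test) / len(test)
    if acc > best_acc:
        best_k, best_acc = k, acc

# Step 6: with the chosen k, predict labels for new, unlabeled people
print(best_k, predict(train, (57, 37), best_k))
```

Note that income in thousands dwarfs age in the Euclidean distance; in practice the features would usually be rescaled first.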
Example:
Consider a dataset containing the age, income, and a credit category of “high” or “low” for a group of people. You want to use age and income to predict the credit label of “high” or “low” for a new person.
For example, take the first few rows of such a dataset, with income represented in thousands. Plot the people as points on the plane, labeling a person with an empty circle if they have a low credit rating.
- What if a new guy comes in who is 57 years old and who makes $37,000? What’s his likely credit rating label?
- Given the credit labels of the individuals nearby, what label would you propose assigning to him? Let’s use K-Nearest Neighbors (K-NN) to automate this process.
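A worked version of this example for the 57-year-old earning $37,000 might look like the sketch below. The six labeled rows are invented stand-ins for the dataset’s first few rows, which are not reproduced in this answer:

```python
import math
from collections import Counter

# Invented (age, income in $1000s, credit label) rows -- hypothetical
# stand-ins for the dataset's first few rows.
people = [
    (69, 3, "low"), (66, 57, "low"), (49, 79, "low"),
    (49, 17, "low"), (58, 26, "high"), (44, 71, "high"),
]

def classify(age, income, k):
    # Rank the labeled people by Euclidean distance to the new person
    ranked = sorted(people,
                    key=lambda p: math.hypot(p[0] - age, p[1] - income))
    # The k nearest neighbors vote; the majority label wins
    votes = Counter(label for _, _, label in ranked[:k])
    return votes.most_common(1)[0][0]

print(classify(57, 37, k=3))
```

With k = 3, the three nearest of these invented rows vote, and the new person receives their majority label; a different (and especially an even) choice of k could change the outcome, which is why step 5 above tunes k on a test set.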