3.a) Explain the data science process with a neat diagram.
Answer:
Data Science Process:
- The process starts in the real world, where many different types of data are generated. In the real world, lots of people are busy with various activities: some are using Google+, others are competing in the Olympics; there are spammers sending spam, and there are people getting their blood drawn. Say we have data on one of these things.
- Raw data is recorded. Many aspects of these real-world activities are lost even when we have the raw data. Real-world data is not clean, so the raw data is processed to make it clean for analysis. We build and use data munging pipelines (joining, scraping, wrangling). This is done with tools such as Python, R, SQL, and shell scripts.
- Eventually the data is brought into a tabular format with columns, for example:
name | event | year | gender | event time
- Exploratory data analysis (EDA) can now be started. During the course of EDA we may find that the data is not actually clean: there may be missing values, outliers, incorrectly logged data, or data that was not logged.
- In such a case, we may have to collect more data or spend more time cleaning the data (for example, by imputation). Next, a model is designed using some algorithm (k-NN, linear regression, Naïve Bayes, decision tree, random forest, etc.). Model selection depends on the type of problem being addressed: prediction, classification, or basic description.
- Alternatively, rather than only reporting results, our goal may be to build or prototype a “data product” such as a spam classifier, search ranking algorithm, or recommendation system. The key difference between data science and statistics here is that the data product is incorporated back into the real world; users interact with it, and that generates more data, which creates a feedback loop.
- For example, a movie recommendation system generates evidence that lots of people love a movie. This leads to more people watching the movie, which is a feedback loop.
- Take this loop into account in any analysis you do by adjusting for any biases your model caused. Your models are not just predicting the future, but causing it!
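The munging, EDA, cleaning, and modelling steps above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the records in `RAW` are invented, the function names (`load`, `explore`, `clean`, `knn_predict`) are hypothetical, the 60-second plausibility threshold is arbitrary, and k-NN is just one of the algorithms listed above.

```python
import csv
import io
import statistics

# Hypothetical raw records in the column layout shown above (values invented).
RAW = """name|event|year|gender|event time
A. Runner|100m|2012|F|10.9
B. Sprinter|100m|2012|M|9.8
C. Dash|100m|2012|F|
D. Quick|100m|2012|M|9.9
E. Swift|100m|2012|F|11.1
F. Typo|100m|2012|M|99.0
"""

def load(raw):
    """Munging: parse raw text into rows; empty times become None."""
    rows = list(csv.DictReader(io.StringIO(raw), delimiter="|"))
    for row in rows:
        text = row["event time"].strip()
        row["event time"] = float(text) if text else None
    return rows

def explore(rows, max_plausible=60.0):
    """EDA: count missing values and implausibly large times (assumed threshold)."""
    times = [r["event time"] for r in rows]
    return {
        "rows": len(rows),
        "missing": sum(1 for t in times if t is None),
        "outliers": sum(1 for t in times if t is not None and t > max_plausible),
    }

def clean(rows, max_plausible=60.0):
    """Cleaning: drop outliers, then impute missing times with the per-gender mean."""
    kept = [dict(r) for r in rows
            if r["event time"] is None or r["event time"] <= max_plausible]
    for gender in {r["gender"] for r in kept}:
        known = [r["event time"] for r in kept
                 if r["gender"] == gender and r["event time"] is not None]
        if not known:
            continue  # nothing to impute from for this group
        fill = statistics.mean(known)
        for r in kept:
            if r["gender"] == gender and r["event time"] is None:
                r["event time"] = fill
    return kept

def knn_predict(rows, event_time, k=3):
    """Model: classify gender from event time by majority vote of the k nearest rows."""
    nearest = sorted(rows, key=lambda r: abs(r["event time"] - event_time))[:k]
    labels = [r["gender"] for r in nearest]
    return max(set(labels), key=labels.count)

if __name__ == "__main__":
    rows = load(RAW)
    print(explore(rows))          # summary of data quality problems
    cleaned = clean(rows)
    print(knn_predict(cleaned, 9.7))
```

The sketch stops at the model; it does not capture the feedback loop, which only appears once a data product is deployed and its outputs start influencing the data that is collected next.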