4.a) Explain Exploratory Data Analysis
Answer:
Exploratory Data Analysis
Exploratory Data Analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods.
- It is the first step towards building a model. It is called “exploratory” because your understanding of the problem you are working on changes as you go.
Philosophy of EDA
- Gain intuition about the data, compare distributions, sanity-check the data (ensure it is on the expected scale and in the expected format), analyze missing data and outliers, and summarize the data.
- In the context of data generated from logs, EDA helps with the debugging process. A pattern found in the data may actually be caused by a fault in the logging process that needs fixing. If you never go to the trouble of debugging, you’ll continue to think your patterns are real.
- EDA helps to ensure that the product is performing as intended.
- The insights drawn from EDA can be used to improve the development of algorithms.
- Example: to develop a ranking algorithm that ranks the content shown to users, you first need to develop a notion of “popular”.
- Before deciding how to quantify popularity (for example, number of clicks, number of comments, or some average), the behaviour of the data needs to be understood.
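The checks listed above (format and scale, missing data, outliers) can be sketched in R on a tiny made-up log table. This is only an illustration: the data frame `logs`, its columns, its values, the plausible age range of 0 to 120, and the 1.5 * IQR outlier rule are all assumptions chosen for the example, not part of the original material.

```r
# Made-up log data for illustration only; names and values are hypothetical.
logs <- data.frame(
  user_id = 1:8,
  clicks  = c(0, 1, 0, 2, NA, 1, 0, 250),      # one missing value, one suspicious spike
  age     = c(23, 35, 41, -1, 29, 52, 38, 47)  # -1 is outside any plausible range
)

str(logs)             # format check: column types
summary(logs)         # scale check: min, max, quartiles, NA counts
colSums(is.na(logs))  # missing data analysis: NA count per column

# Sanity check: ages should fall in a plausible range (assumed here: 0 to 120).
bad_age <- logs[logs$age < 0 | logs$age > 120, ]

# Outlier analysis with the 1.5 * IQR rule (one common convention, assumed here).
q <- quantile(logs$clicks, c(0.25, 0.75), na.rm = TRUE)
upper_fence <- q[2] + 1.5 * diff(q)
outliers <- logs[!is.na(logs$clicks) & logs$clicks > upper_fence, ]

print(bad_age)   # rows that fail the age sanity check
print(outliers)  # rows flagged as click outliers
```

A flagged row is not automatically an error. In log data it is a prompt to check whether the logging process itself is at fault before treating the pattern as real.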
Exercise: EDA
There are 31 datasets named nyt1.csv, nyt2.csv, …, nyt31.csv, which you can find here: https://github.com/oreillymedia/doing_data_science.
Each one represents one (simulated) day’s worth of ads shown and clicks recorded on the New York Times home page in May 2012. Each row represents a single user. There are five columns: age, gender (0=female, 1=male), number of impressions, number of clicks, and logged in.
We use R to handle these data. It’s a programming language designed specifically for data analysis, and it’s pretty intuitive to start using. The code can be written using the following logic:
- Reading Data: Loading a dataset from a URL.
- Categorization: Creating age categories based on the ‘Age’ variable.
- Summary Statistics: Generating summary statistics for the dataset and for age categories.
- Visualization: Creating histograms and boxplots to visualize data distribution.
- Click-Through Rate (CTR): Calculating and visualizing the click-through rate (clicks divided by impressions).
- Creating Categories: Creating a new column ‘scode’ to categorize data based on impressions and clicks.
- Converting to Factor: Converting the newly created column into a factor.
- Summary Table: Generating a summary table for impressions based on the created categories.
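The steps above can be sketched in base R as follows. This is a minimal sketch under stated assumptions, not the exercise's reference solution. To keep it runnable without a download, it simulates a small data frame in place of nyt1.csv. The column names (`Age`, `Gender`, `Impressions`, `Clicks`, `Signed_In`), the age breakpoints, and the `scode` labels are assumptions made for illustration.

```r
# Reading data: with a real file this step would be something like
#   data1 <- read.csv("nyt1.csv")  # hypothetical local path; read.csv() also accepts a URL
# Here a simulated stand-in is used; column names are assumed, not taken from the text.
set.seed(1)
n <- 1000
data1 <- data.frame(
  Age         = sample(10:80, n, replace = TRUE),
  Gender      = rbinom(n, 1, 0.5),   # 0 = female, 1 = male
  Impressions = rpois(n, 5),
  Signed_In   = rbinom(n, 1, 0.7)
)
data1$Clicks <- rbinom(n, size = data1$Impressions, prob = 0.02)

# Categorization: bin Age into categories (breakpoints are an illustrative choice).
data1$agecat <- cut(data1$Age, breaks = c(-Inf, 18, 24, 34, 44, 54, 64, Inf))

# Summary statistics: whole dataset, then Age within each age category.
summary(data1)
age_stats <- aggregate(Age ~ agecat, data = data1,
                       FUN = function(x) c(n = length(x), min = min(x),
                                           mean = mean(x), max = max(x)))
print(age_stats)

# Visualization: histogram and boxplot of impressions.
hist(data1$Impressions, main = "Impressions per user", xlab = "Impressions")
boxplot(Impressions ~ agecat, data = data1,
        xlab = "Age category", ylab = "Impressions")

# Click-through rate: clicks / impressions, defined only where impressions > 0.
shown <- subset(data1, Impressions > 0)
shown$ctr <- shown$Clicks / shown$Impressions
hist(shown$ctr, main = "Click-through rate", xlab = "CTR")

# Creating categories: label each user by impression/click behaviour.
data1$scode <- "NoImps"
data1$scode[data1$Impressions > 0] <- "Imps"
data1$scode[data1$Clicks > 0] <- "Clicks"

# Converting to factor.
data1$scode <- factor(data1$scode)

# Summary table: number of users per scode, gender and age category.
etable <- aggregate(Impressions ~ scode + Gender + agecat,
                    data = data1, FUN = length)
print(head(etable))
```

In `etable`, the `Impressions` column holds the count of users in each group, because `length` is used as the summary function. Swapping in `mean` or `sum` would summarize the impression values themselves.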