1. Data Collection:

Data collection is the first step in the data analytics process. It involves gathering high-quality, relevant datasets. Good data is essential for accurate machine learning and analytics.

Qualities of Good Data:

Timeliness – Data must be recent and up to date.
Relevancy – Data should match the machine learning or business need.
Understandability – Data should be clear, interpretable, and consistent with domain knowledge.

Sources of Data:

Open/Public Data
- Free and available without restrictions.
- Examples: Government census, scientific experiments, healthcare data.

Social Media Data
- Data from platforms like Twitter, Facebook, Instagram, YouTube.

Multimodal Data
- Data with multiple formats: text, audio, image, video.
- Examples: Image archives, web pages.

2. Data Preprocessing:

Raw data is often “dirty”, meaning it contains errors, missing values, and inconsistencies. Preprocessing is essential to clean and prepare data for analytics or ML.

Common Data Issues:

Incomplete data
Inaccurate or inconsistent values
Missing values
Duplicate entries
Outliers and noise

Example Table – Illustration of ‘Bad’ Data

Patient ID	Name	Age	Date of Birth (DoB)	Fever	Salary
1.	John	21	–	Low	–1500
2.	Andre	36	–	High	Yes
3.	David	5	10/10/1980	Low	“ ”
4.	Raju	136	–	High	Yes

Explanation of Dirty Data

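Missing Data: Salary is missing for David. DoB is missing for John, Andre, and Raju.
Inconsistent Data: David’s age is 5, but his DoB is 10/10/1980, so the two fields contradict each other. Salary is recorded as “Yes” for Andre and Raju, which is not a numeric value.
Noisy Data: John’s salary is –1500, but a salary cannot be negative.
Outlier: Raju’s age is 136, which is implausible and likely a typographical error.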

These errors affect machine learning accuracy and must be handled during data cleaning.
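
The checks described above can be sketched in plain Python. This is a minimal illustration, not a complete validation routine: the table is transcribed by hand, `MAX_PLAUSIBLE_AGE` and `REFERENCE_YEAR` are assumed values that do not come from the text, and `find_issues` is a hypothetical helper name.

```python
from datetime import date

# The example table, transcribed as records.
# None stands for a missing cell (shown as a dash or blank in the table).
patients = [
    {"id": 1, "name": "John", "age": 21, "dob": None, "fever": "Low", "salary": -1500},
    {"id": 2, "name": "Andre", "age": 36, "dob": None, "fever": "High", "salary": "Yes"},
    {"id": 3, "name": "David", "age": 5, "dob": date(1980, 10, 10), "fever": "Low", "salary": None},
    {"id": 4, "name": "Raju", "age": 136, "dob": None, "fever": "High", "salary": "Yes"},
]

MAX_PLAUSIBLE_AGE = 120   # assumed threshold, not from the text
REFERENCE_YEAR = 2025     # assumed "current year" for the age/DoB check


def find_issues(row):
    """Return a list of data-quality problems found in one record."""
    issues = []
    if row["dob"] is None:
        issues.append("missing DoB")
    if row["salary"] is None:
        issues.append("missing salary")
    elif not isinstance(row["salary"], (int, float)):
        issues.append("non-numeric salary")
    elif row["salary"] < 0:
        issues.append("negative salary")
    if row["dob"] is not None:
        age_from_dob = REFERENCE_YEAR - row["dob"].year
        # Allow one year of slack for birthdays not yet reached.
        if abs(age_from_dob - row["age"]) > 1:
            issues.append("age does not match DoB")
    if row["age"] > MAX_PLAUSIBLE_AGE:
        issues.append("implausible age")
    return issues


report = {row["name"]: find_issues(row) for row in patients}
for name, issues in report.items():
    print(name, issues)
```

With these assumptions, John is flagged for a missing DoB and a negative salary, David for a missing salary and an age/DoB mismatch, and Raju for an implausible age, among other findings.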

Data Cleaning Techniques:

A. Handling Missing Data:

Ignore the tuple – Remove rows with missing data.
Manual filling – Domain expert fills values.
Global constant – Use ‘Unknown’ or ‘Infinity’.
Attribute mean – Fill missing value with average.
Class-wise mean – Average of values within a class.
Prediction models – Use ML to predict missing values.

Note: Filled-in values are estimates and may introduce bias if they are not accurate.
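
Several of the strategies above can be sketched in a few lines of Python. The records, class labels, and function names below are hypothetical and chosen only for illustration. Manual filling and prediction models are not shown.

```python
from statistics import mean

# Hypothetical records: (class label, value). None marks a missing value.
rows = [("A", 10.0), ("A", None), ("A", 14.0),
        ("B", 40.0), ("B", 44.0), ("B", None)]


def drop_missing(rows):
    """Ignore the tuple: remove rows whose value is missing."""
    return [r for r in rows if r[1] is not None]


def fill_constant(rows, constant):
    """Global constant: replace every missing value with one fixed value."""
    return [(c, constant if v is None else v) for c, v in rows]


def fill_attribute_mean(rows):
    """Attribute mean: replace missing values with the mean of observed values."""
    overall = mean(v for _, v in rows if v is not None)
    return [(c, overall if v is None else v) for c, v in rows]


def fill_class_mean(rows):
    """Class-wise mean: replace missing values with the mean of the same class.

    Assumes every class has at least one observed value.
    """
    observed = {}
    for c, v in rows:
        if v is not None:
            observed.setdefault(c, []).append(v)
    class_means = {c: mean(vs) for c, vs in observed.items()}
    return [(c, class_means[c] if v is None else v) for c, v in rows]


print(fill_attribute_mean(rows))  # missing values become 27.0
print(fill_class_mean(rows))      # missing values become 12.0 (A) and 42.0 (B)
```

In this sketch the class-wise mean stays closer to each group's typical value than the overall mean does, which is why it is often preferred when class labels are available.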

B. Noise Removal Techniques:

Noise is random error or variance in the data.
Binning smooths noisy data: the sorted values are divided into bins, and each value is replaced using the values in its own bin.

Types of Binning:

Smoothing by mean – Replace each value with its bin’s average.
Smoothing by median – Replace each value with its bin’s median.
Smoothing by bin boundaries – Replace each value with the closer of its bin’s minimum and maximum.
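
The three smoothing variants can be sketched as follows. The sample values and the choice of equal-depth bins of size 3 are assumptions made for illustration, and the function names are hypothetical.

```python
from statistics import mean, median


def equal_depth_bins(values, bin_size):
    """Sort the values and split them into bins of bin_size items each."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]


def smooth_by_mean(bins):
    """Replace every value in a bin with the bin's mean."""
    return [[mean(b)] * len(b) for b in bins]


def smooth_by_median(bins):
    """Replace every value in a bin with the bin's median."""
    return [[median(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the nearer bin boundary (ties go to the lower one)."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # each bin is already sorted
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


data = [4, 8, 15, 21, 21, 24, 25, 28, 34]  # assumed sample values
bins = equal_depth_bins(data, 3)

print(bins)                        # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_mean(bins))        # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(smooth_by_median(bins))      # [[8, 8, 8], [21, 21, 21], [28, 28, 28]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```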

3. Data Integration:

Data integration merges data from multiple sources into a single dataset. Merging can introduce redundant data, so the aim is to identify and remove duplicates as the sources are combined.
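
A minimal sketch of merging two sources while dropping duplicates is shown below. The two source lists, the `customer_id` key, and the `integrate` function are hypothetical. Real integration usually also has to reconcile conflicting values and differing schemas, which this sketch ignores.

```python
# Two hypothetical sources that partly describe the same customers.
crm_records = [
    {"customer_id": 1, "name": "John"},
    {"customer_id": 2, "name": "Andre"},
]
billing_records = [
    {"customer_id": 2, "name": "Andre"},  # duplicate of a CRM record
    {"customer_id": 3, "name": "David"},
]


def integrate(*sources, key="customer_id"):
    """Merge sources, keeping the first record seen for each key value."""
    merged = {}
    for source in sources:
        for record in source:
            # setdefault leaves an existing entry untouched, so later
            # duplicates of the same key are discarded.
            merged.setdefault(record[key], record)
    return list(merged.values())


customers = integrate(crm_records, billing_records)
print(customers)  # three records: customer_id 1, 2, and 3
```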

4. Data Transformation:

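Data transformation converts data into a form that gives better model performance.
Normalization scales data to a standard range, so that attributes with large values do not dominate attributes with small values.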

Normalization Techniques:

A. Min-Max Normalization

Rescales values linearly to a specific range (usually 0 to 1).

Commonly used to prepare inputs for neural networks.
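
As a sketch, min-max normalization maps a value v to (v - min) / (max - min) * (new_max - new_min) + new_min. The function name and the handling of a constant column below are assumptions, not something the text prescribes.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumed convention: a constant column maps to new_min
        # (avoids division by zero).
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


scaled = min_max_normalize([10, 20, 30, 40, 50])
print(scaled)  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

Because the result depends on the minimum and maximum, a single extreme outlier can squeeze all other values into a narrow part of the range.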

B. Z-Score Normalization

Standardizes each value by subtracting the mean and dividing by the standard deviation, so the result has mean 0 and standard deviation 1.
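
A minimal sketch of z-score normalization, z = (v - mean) / standard deviation, follows. It assumes the population standard deviation (`statistics.pstdev`). Using the sample standard deviation instead is an equally common choice and gives slightly different values.

```python
from statistics import mean, pstdev


def z_score_normalize(values):
    """Return (v - mean) / population standard deviation for each value."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Assumed convention: a constant column maps to all zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Sample values with mean 5 and population standard deviation 2.
z = z_score_normalize([2, 4, 4, 4, 5, 5, 7, 9])
print(z)  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```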

5. Data Reduction:

Data reduction shrinks the size of the data while producing the same, or nearly the same, analysis results.

Techniques:

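Data aggregation – Combine detailed records into summaries (for example, quarterly figures into yearly totals).
Feature selection – Keep only the important features.
Dimensionality reduction – Reduce the number of attributes (for example, with PCA).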
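
Two of these techniques can be sketched briefly. The sales figures and column names are hypothetical. Dropping low-variance columns is only one simple feature-selection filter, chosen here for illustration. PCA is not shown.

```python
from collections import defaultdict
from statistics import pvariance

# Hypothetical quarterly sales: (year, quarter, amount).
quarterly_sales = [
    (2023, 1, 200), (2023, 2, 250), (2023, 3, 300), (2023, 4, 250),
    (2024, 1, 300), (2024, 2, 350), (2024, 3, 400), (2024, 4, 350),
]


def aggregate_by_year(records):
    """Data aggregation: collapse quarterly rows into one total per year."""
    totals = defaultdict(int)
    for year, _quarter, amount in records:
        totals[year] += amount
    return dict(totals)


def drop_low_variance(columns, threshold=0.0):
    """Feature selection (one simple filter): keep columns whose variance exceeds threshold."""
    return {name: vals for name, vals in columns.items()
            if pvariance(vals) > threshold}


yearly = aggregate_by_year(quarterly_sales)
print(yearly)  # {2023: 1000, 2024: 1400}

# Hypothetical feature columns; "country_code" never varies, so it carries no information.
features = {"age": [21, 36, 45], "country_code": [1, 1, 1]}
selected = drop_low_variance(features)
print(list(selected))  # ['age']
```

Here eight rows reduce to two, and two columns reduce to one, without changing a yearly-total analysis.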
