Data preprocessing

1. Data Collection:

Data collection is the first step in the data analytics process. It involves gathering high-quality, relevant datasets. Good data is essential for accurate machine learning and analytics.

Qualities of Good Data:

  1. Timeliness – Data must be recent and up-to-date.
  2. Relevancy – It should match the machine learning or business needs.
  3. Understandability – Data should be clear, interpretable, and aligned with domain knowledge.

Sources of Data:

  1. Open/Public Data
    • Free and available without restrictions.
    • Examples: Government census, scientific experiments, healthcare data.
  2. Social Media Data
    • Data from platforms like Twitter, Facebook, Instagram, YouTube.
  3. Multimodal Data
    • Data with multiple formats: text, audio, image, video.
    • Examples: Image archives, web pages.

2. Data Preprocessing:

Raw data is often “dirty”, meaning it contains errors, missing values, and inconsistencies. Preprocessing is essential to clean and prepare data for analytics or ML.

Common Data Issues:

  • Incomplete data
  • Inaccurate or inconsistent values
  • Missing values
  • Duplicate entries
  • Outliers and noise

Example Table – Illustration of ‘Bad’ Data

Patient ID   Name    Age   Date of Birth (DoB)   Fever   Salary
1            John    21    (blank)               Low     –1500
2            Andre   36    (blank)               High    Yes
3            David   5     10/10/1980            Low     “ ”
4            Raju    136   (blank)               High    Yes

Explanation of Dirty Data

  • Missing Data: Salary is missing for David. DoB is missing for John, Andre, and Raju.
  • Inconsistent Data: David’s age is 5, but DoB is 10/10/1980 → mismatch.
  • Noisy Data: John’s salary is –1500, which is not possible.
  • Outlier: Raju’s age is 136 – likely a typographical error.

These errors affect machine learning accuracy and must be handled during data cleaning.
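As a rough illustration, the checks described above can be written as simple rules over the example table. This is only a sketch: the 120-year age limit, the fixed reference date, and the decision to flag the non-numeric salary value "Yes" are assumptions made for this example and are not part of the original table's explanation.

```python
from datetime import date

# The example table, with None marking a blank cell.
patients = [
    {"id": 1, "name": "John",  "age": 21,  "dob": None,               "salary": -1500},
    {"id": 2, "name": "Andre", "age": 36,  "dob": None,               "salary": "Yes"},
    {"id": 3, "name": "David", "age": 5,   "dob": date(1980, 10, 10), "salary": None},
    {"id": 4, "name": "Raju",  "age": 136, "dob": None,               "salary": "Yes"},
]

MAX_PLAUSIBLE_AGE = 120             # assumed threshold for an outlier age
REFERENCE_DATE = date(2024, 1, 1)   # assumed "today", fixed so results are reproducible


def find_issues(row):
    """Return a list of data-quality problems found in one record."""
    issues = []
    if row["dob"] is None:
        issues.append("missing DoB")
    if row["salary"] is None:
        issues.append("missing salary")
    elif not isinstance(row["salary"], (int, float)):
        issues.append("non-numeric salary")   # extra check, assumed for this sketch
    elif row["salary"] < 0:
        issues.append("negative salary")
    if row["age"] > MAX_PLAUSIBLE_AGE:
        issues.append("implausible age")
    if row["dob"] is not None:
        # Age implied by the DoB on the reference date.
        before_birthday = (REFERENCE_DATE.month, REFERENCE_DATE.day) < (
            row["dob"].month, row["dob"].day)
        age_from_dob = REFERENCE_DATE.year - row["dob"].year - before_birthday
        if abs(age_from_dob - row["age"]) > 1:
            issues.append("age does not match DoB")
    return issues


report = {row["name"]: find_issues(row) for row in patients}
for name, issues in report.items():
    print(name, issues)
```

Rules like these only detect problems; deciding how to repair each one is the job of the cleaning techniques that follow.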


Data Cleaning Techniques:

A. Handling Missing Data:

  1. Ignore the tuple – Remove rows with missing data.
  2. Manual filling – Domain expert fills values.
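  3. Global constant – Replace every missing value with one fixed placeholder such as ‘Unknown’ or ‘Infinity’.
  4. Attribute mean – Fill the missing value with the average of that attribute over all records.
  5. Class-wise mean – Fill the missing value with the attribute’s average among records of the same class.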
  6. Prediction models – Use ML to predict missing values.

Note: Every filling method can introduce bias, because the filled-in value may differ from the true one.
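The difference between the attribute mean and the class-wise mean can be seen in a small sketch. The records below are hypothetical, with the Fever column playing the role of the class label; the function names are made up for this example.

```python
from statistics import mean

# Hypothetical records: 'fever' is the class label, None marks a missing age.
records = [
    {"fever": "Low",  "age": 20},
    {"fever": "Low",  "age": 30},
    {"fever": "Low",  "age": None},
    {"fever": "High", "age": 50},
    {"fever": "High", "age": 60},
    {"fever": "High", "age": None},
]


def fill_with_attribute_mean(rows, attr):
    """Fill missing values with the mean of all known values of the attribute."""
    overall = mean(r[attr] for r in rows if r[attr] is not None)
    return [{**r, attr: overall if r[attr] is None else r[attr]} for r in rows]


def fill_with_class_mean(rows, attr, label):
    """Fill missing values with the mean of known values in the same class.

    Assumes every class has at least one known value.
    """
    by_class = {}
    for r in rows:
        if r[attr] is not None:
            by_class.setdefault(r[label], []).append(r[attr])
    class_means = {cls: mean(vals) for cls, vals in by_class.items()}
    return [
        {**r, attr: class_means[r[label]] if r[attr] is None else r[attr]}
        for r in rows
    ]


global_filled = fill_with_attribute_mean(records, "age")
class_filled = fill_with_class_mean(records, "age", "fever")
print([r["age"] for r in global_filled])  # both gaps get the overall mean, 40
print([r["age"] for r in class_filled])   # gaps get 25 (Low) and 55 (High)
```

Here the class-wise mean keeps the two groups apart, while the attribute mean pulls both filled values toward the middle.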


B. Noise Removal Techniques:

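Noise is random error or variance in a measured value.
Binning smooths noisy data: the values are sorted, split into bins, and each value is then replaced using the other values in its bin.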

Types of Binning:

  • Smoothing by mean – Replace every value in a bin with the bin’s mean.
  • Smoothing by median – Replace every value in a bin with the bin’s median.
  • Smoothing by bin boundaries – Replace each value with the nearer of the bin’s minimum and maximum.
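The three smoothing variants can be sketched as follows. The sample values and the bin size of 3 are assumptions for illustration. The sketch uses equal-frequency bins and sends ties to the lower boundary, which is one possible convention.

```python
from statistics import mean, median


def equal_frequency_bins(values, bin_size):
    """Sort the values and split them into consecutive bins of bin_size items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]


def smooth_by_mean(bins):
    return [[mean(b)] * len(b) for b in bins]


def smooth_by_median(bins):
    return [[median(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace each value with the nearer bin edge (ties go to the lower edge)."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]   # bins are sorted, so these are min and max
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]   # hypothetical sample
bins = equal_frequency_bins(prices, 3)

print(bins)                        # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_mean(bins))        # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(smooth_by_median(bins))      # [[8, 8, 8], [21, 21, 21], [28, 28, 28]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```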

3. Data Integration:

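Data integration merges data from multiple sources into one dataset. Because the same record can appear in more than one source, merging may produce redundant data, so the aim is to identify and remove duplicates during the merge.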
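A minimal sketch of merging with duplicate removal is shown below. The two clinic lists and the choice of patient_id as the matching key are hypothetical; real sources often need fuzzier matching than an exact key.

```python
# Two hypothetical sources that share one patient (patient_id 2).
clinic_a = [
    {"patient_id": 1, "name": "John"},
    {"patient_id": 2, "name": "Andre"},
]
clinic_b = [
    {"patient_id": 2, "name": "Andre"},
    {"patient_id": 3, "name": "David"},
]


def integrate(*sources, key):
    """Merge record lists, keeping the first record seen for each key value."""
    merged = {}
    for source in sources:
        for record in source:
            merged.setdefault(record[key], record)   # later duplicates are skipped
    return list(merged.values())


patients = integrate(clinic_a, clinic_b, key="patient_id")
print([p["patient_id"] for p in patients])   # [1, 2, 3]
```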


4. Data Transformation:

Transform data for better model performance.
Normalization helps scale data to a standard range.


Normalization Techniques:

A. Min-Max Normalization

Scales data to a specific range (usually 0 to 1).

It is commonly used to prepare inputs for neural networks.
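Min-max normalization maps a value v to (v - min) / (max - min) * (new_max - new_min) + new_min, where min and max are the attribute's smallest and largest values. The sketch below applies this to a hypothetical list of ages; returning new_min for a constant column is an assumption made here to avoid division by zero.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: no spread to scale, so map everything to new_min (assumed convention).
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


ages = [21, 36, 51, 81]            # hypothetical sample
scaled = min_max_normalize(ages)
print(scaled)                      # [0.0, 0.25, 0.5, 1.0]
```

Because the result depends only on the minimum and maximum, a single extreme value such as an age of 136 would compress all other values toward 0.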


B. Z-Score Normalization

Standardizes each value by subtracting the attribute’s mean and dividing by its standard deviation, so the result has mean 0 and standard deviation 1.
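In formula form, each value v becomes (v - mean) / standard deviation. The sketch below uses the population standard deviation (pstdev) from Python's statistics module, which is one possible choice; the sample scores are hypothetical.

```python
from statistics import mean, pstdev


def z_score_normalize(values):
    """Return (v - mean) / standard deviation for every value."""
    mu = mean(values)
    sigma = pstdev(values)          # population standard deviation
    if sigma == 0:
        # Constant column: every value equals the mean (assumed convention).
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample with mean 5 and std 2
standardized = z_score_normalize(scores)
print(standardized)                # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```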


5. Data Reduction:

Data reduction produces a smaller dataset that still gives the same, or nearly the same, analysis results.

Techniques:

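  • Data aggregation – Summarize detailed records into coarser totals, for example quarterly figures into annual ones.
  • Feature selection – Keep only the features that matter for the analysis.
  • Dimensionality reduction – Reduce the number of attributes, for example with PCA.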
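Two of these techniques can be sketched in a few lines. The sales figures and feature columns are hypothetical. Dropping zero-variance columns is only one simple feature-selection criterion, chosen here for brevity; PCA is not shown.

```python
from statistics import pvariance

# Hypothetical quarterly sales as (year, quarter, amount).
quarterly_sales = [
    ("2023", "Q1", 100), ("2023", "Q2", 150), ("2023", "Q3", 120), ("2023", "Q4", 130),
    ("2024", "Q1", 110), ("2024", "Q2", 160),
]


def aggregate_by_year(rows):
    """Data aggregation: collapse quarterly rows into one total per year."""
    totals = {}
    for year, _quarter, amount in rows:
        totals[year] = totals.get(year, 0) + amount
    return totals


def drop_constant_features(columns, min_variance=0.0):
    """Feature selection: keep columns whose variance exceeds min_variance."""
    return {name: vals for name, vals in columns.items()
            if pvariance(vals) > min_variance}


annual_sales = aggregate_by_year(quarterly_sales)
print(annual_sales)                # {'2023': 500, '2024': 270}

features = {"age": [21, 36, 51], "country": [1, 1, 1], "fever": [0, 1, 0]}
selected = drop_constant_features(features)
print(list(selected))              # ['age', 'fever']
```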
