Version Space:
A Version Space is the subset of hypotheses from the hypothesis space H that are consistent with all training instances. It includes only those hypotheses that correctly classify all observed examples.
List-Then-Eliminate Algorithm:
It is a basic version space algorithm that works as follows:
Algorithm Steps:
- Initialize version space with all hypotheses from H.
- For every training instance, eliminate the hypotheses that are inconsistent.
Limitation:
- Works well only when hypothesis space is small.
- Difficult to implement in real-world large hypothesis spaces.
Candidate Elimination Algorithm:
To overcome the limitations of List-Then-Eliminate, Candidate Elimination Algorithm is used.
It computes the version space using two boundaries:
- S (Specific Boundary) – Most specific hypotheses
- G (General Boundary) – Most general hypotheses
Algorithm Steps:
Input: Set of training instances
Output: Sets S and G – Specific and General boundaries
- Initialize:
- G = most general hypothesis
- S = most specific hypothesis
- For each training instance:
- If positive:
- Generalize S to include the instance
- If attribute differs → change to ‘?’
- If same → no change
- Prune G to remove inconsistent hypotheses
- Generalize S to include the instance
- If negative:
- Specialize G to exclude the instance
- Replace differing attributes with specific values
- If same → convert to ‘?’
- Remove inconsistent hypotheses from G
- Specialize G to exclude the instance
- If positive:
Generating Positive Hypothesis (S):
- For each positive instance, generalize S to include it
- If S and instance differ at a position → put
?
- If they match → no change
- Negative instances are skipped when updating S
Generating Negative Hypothesis (G):
- For a negative instance, refine G to exclude it
- For every differing attribute between S and negative instance, replace that with S’s value
- If attribute is same in S and negative instance, replace it with
?
- Then prune G to remove inconsistent hypotheses
Version Space:
- After processing all instances, version space is computed from S and G.
- A hypothesis is included in version space only if it lies between S and G boundaries.
- In short:
Version Space = {hypotheses h | G ≥ h ≥ S}
Let’s apply the algorithm on the given dataset with 4 training instances:
CGPA | Interactiveness | Practical Knowledge | Communication | Logic | Interest | Job Offer |
---|---|---|---|---|---|---|
≥9 | Yes | Excellent | Good | Fast | Yes | Yes |
≥9 | Yes | Good | Good | Fast | Yes | Yes |
≥8 | No | Good | Good | Fast | No | No |
≥9 | Yes | Good | Good | Slow | No | Yes |
Solution:
Step 1: Initialization
- S =
<ϕ, ϕ, ϕ, ϕ, ϕ, ϕ>
- G =
< ?, ?, ?, ?, ?, ? >
Iteration 1: Positive Instance I1
I1 = <≥9, Yes, Excellent, Good, Fast, Yes>
- Generalize S to include I1:
S1 =<≥9, Yes, Excellent, Good, Fast, Yes>
- G1 = remains unchanged.
Iteration 2: Positive Instance I2
I2 = <≥9, Yes, Good, Good, Fast, Yes>
- Generalize S1 by comparing with I2:
- Mismatch at 3rd attribute → replace with
?
S2 =<≥9, Yes, ?, Good, Fast, Yes>
- Mismatch at 3rd attribute → replace with
- G2 = still consistent → remains unchanged.
Iteration 3: Negative Instance I3
I3 = <≥8, No, Good, Good, Fast, No>
- Specialize G to exclude I3 but stay consistent with S2
- Generate hypotheses differing from I3 but matching S2:
G3 =<≥9, ?, ?, ?, ?, ?>
S3 = remains unchanged.
<?, Yes, ?, ?, ?, ?>
<?, ?, ?, ?, ?, Yes>
Iteration 4: Positive Instance I4
I4 = <≥9, Yes, Good, Good, Slow, No>
- Compare with S3: Mismatch at 3rd and 5th attribute → generalize to
?
S4 =<≥9, Yes, ?, Good, ?, ?>
- Prune G3:
- Third hypothesis
< ?, ?, ?, ?, ?, Yes >
is inconsistent → removed
- Third hypothesis
G4 =<≥9, ?, ?, ?, ?, ?>
<?, Yes, ?, ?, ?, ?>
Final Version Space:
Version Space = All hypotheses between S4 and G4:
G4 =<≥9, ?, ?, ?, ?, ?>
< ?, Yes, ?, ?, ?, ?>
S4 = <≥9, Yes, ?, Good, ?, ?>
