Tokens, Patterns, and Lexemes
In lexical analysis, the terms token, pattern, and lexeme describe how the lexer (lexical analyzer) converts source code into meaningful units.
1. Token
A token is a category or class of language elements.
The lexer represents each token as a pair:
- Token name – Abstract label (like id, number, if, etc.)
- Attribute value (optional) – Holds extra info like the actual lexeme (e.g., variable name)
Think of a token like a label or tag for a specific part of code.
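The name/attribute pair described above can be sketched as a small data structure. This is only an illustration: the class name Token and its field names are assumptions made for this example, not part of any particular compiler.

```python
from typing import NamedTuple, Optional


class Token(NamedTuple):
    """A token as a (name, attribute) pair. Hypothetical type for illustration."""
    name: str                        # abstract label, e.g. "id", "number", "if"
    attribute: Optional[str] = None  # optional extra info, e.g. the lexeme


# An identifier carries its lexeme so later phases know which variable it is.
score_token = Token("id", "score")

# A keyword needs no attribute: the token name alone identifies it.
if_token = Token("if")
```

Here score_token.attribute holds the variable name, while if_token.attribute is left as None.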
2. Pattern
A pattern is a rule or description that defines what a valid lexeme for a token looks like.
It can be:
- A fixed string (e.g., "if" for the keyword if)
- A regular expression (e.g., [a-zA-Z_][a-zA-Z0-9_]* for identifiers)
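As a rough sketch of both kinds of pattern, the snippet below checks candidate strings with Python's re module. The PATTERNS table and the matches helper are hypothetical names introduced for this example.

```python
import re

# Hypothetical pattern table: token name -> compiled regular expression.
PATTERNS = {
    "if": re.compile(r"if"),                      # fixed string
    "id": re.compile(r"[a-zA-Z_][a-zA-Z0-9_]*"),  # regular expression
}


def matches(token_name: str, text: str) -> bool:
    """Return True if the whole of `text` is a valid lexeme for the token."""
    return PATTERNS[token_name].fullmatch(text) is not None
```

With these definitions, matches("id", "score") is true and matches("id", "2D") is false, because an identifier cannot start with a digit. Note that the string if satisfies both patterns, which is why lexers usually give keywords priority over identifiers.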
3. Lexeme
A lexeme is the actual sequence of characters in the source code that matches a pattern and gets recognized as a token.
Table Example – Tokens, Patterns, and Lexemes
Token Name | Pattern (Description) | Sample Lexemes |
---|---|---|
if | The characters i , f | if |
else | The characters e , l , s , e | else |
comparison | < , > , <= , >= , == , != | >= , == |
id | A letter followed by letters/digits | score , pi , D2 |
number | Any numeric constant | 3.14 , 6.02e23 , 0 |
literal | Any string enclosed in quotes | "Total =" , "Hello" |
Example Statement:
printf("Total = ", score);
Lexeme | Token Name | Pattern Description |
---|---|---|
printf | id | Identifier: starts with letter, followed by alphanumerics |
"Total = " | literal | String enclosed in quotes |
score | id | Identifier |
( , ) , , | Symbol | Punctuation; each character is treated as an individual token |
; | Symbol | Token for semicolon |
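The breakdown in the table can be reproduced with a minimal regex-based lexer. This is a sketch under stated assumptions rather than a definitive implementation: the names TOKEN_SPEC, MASTER, and tokenize are made up for this example, the patterns are simplified versions of the ones in the earlier table, and all punctuation is grouped under a single symbol token.

```python
import re
from typing import List, Tuple

# Illustrative token specification. Order matters: when several
# alternatives could match at a position, the earliest one wins.
TOKEN_SPEC = [
    ("literal", r'"[^"\n]*"'),                        # string enclosed in quotes
    ("number",  r"\d+(?:\.\d+)?(?:[eE][+-]?\d+)?"),   # simplified numeric constant
    ("id",      r"[A-Za-z_][A-Za-z0-9_]*"),           # identifier
    ("symbol",  r"[(),;]"),                           # single-character punctuation
    ("skip",    r"\s+"),                              # whitespace, not a token
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source: str) -> List[Tuple[str, str]]:
    """Split `source` into (token name, lexeme) pairs."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at position {pos}")
        if match.lastgroup != "skip":
            # lastgroup is the name of the pattern that matched;
            # group() is the lexeme, i.e. the matched source text.
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


tokens = tokenize('printf("Total = ", score);')
```

Each entry in tokens pairs a token name with its lexeme, so printf and score both come out as id tokens while the quoted text is a single literal token.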
Summary:
- Token: Type/label of lexical unit (id, number, if, etc.)
- Pattern: Rule that defines valid strings for the token
- Lexeme: Actual string from source code that matches a pattern