Overview
Hand-written, LUT-driven scanner. It walks source as raw bytes and produces Token { kind, line, start, end }. Tokens carry byte indices, never text copies. The parser slices lazily for identifier and string content.
Linear time O(n). Per-byte dispatch is branchless through two lookup tables. Lex-time diagnostics (unterminated strings, bad indent, unknown bytes, malformed underscores, oversized f-string nesting) collect in a Vec<LexError> returned alongside the token stream. The parser folds them in for a single coherent report.
A leading UTF-8 BOM (EF BB BF) is stripped before tokenization so the first identifier doesn’t fuse with the marker.
Token kinds
The token set tracks Python 3.13.12 closely. Categories implemented:
- Keywords:
False,None,True,and,as,assert,async,await,break,class,continue,def,del,elif,else,except,finally,for,from,global,if,import,in,is,lambda,nonlocal,not,or,pass,raise,return,try,while,with,yield. - Soft keywords:
type,match, andcasedemote toNamewhen followed by(,:,=,,,),],Newline, orEOF, sotype(),match(...), and identifiers named like them stay usable. At statement start (match x:) they keep keyword force. - Wildcard: Underscore (
_) gets its ownUnderscoretoken. The parser distinguishes wildcard from name use. - Operators: 1-, 2-, and 3-character operator forms (
+,==,**=,//=, etc.). - Delimiters:
( ) [ ] { } : , ; .. - Literals:
Name,Int,Float,String,Bytes. There is noComplextoken. A trailingj/Jis not lexed as a complex suffix.1jtokenises asInt(1)followed byName("j"). - F-string segments:
FstringStart,FstringMiddle,FstringEnd. - Whitespace and structure:
Comment,Newline,Indent,Dedent,Nl,Endmarker.
Dispatch tables
Two compile-time tables in lexer/tables.rs:
// Bit flags per byte: ID_START, ID_CONT, DIGIT, SPACE.
pub static BYTE_CLASS: [u8; 256] = { /* ... */ };
// Single-char operator dispatch.
pub static SINGLE_TOK: [u8; 128] = { /* ... */ };
pub const SINGLE_MAP: [TokenType; 24] = { /* ... */ };Identifiers, digits, and whitespace use a scan_while(pred) driver looping over BYTE_CLASS[b] & FLAG. Single-char operators do b -> SINGLE_TOK[b] -> SINGLE_MAP[i], two indexed loads. Keyword lookup is routed by (length, first_byte) to skip most memcmps.
Numeric literals
42
1_000_000 # underscore separators
0xDEAD_BEEF # hex
0o777 # octal
0b1010_1010 # binary
3.14
.5 # leading-dot float
1e-5 # exponentThe scanner handles base prefixes (0x / 0o / 0b, case-insensitive), underscore separators, optional exponents, and leading-dot form. Int and Float are the only numeric token kinds.
Underscores must sit between digits. Leading, trailing, or doubled underscores raise invalid '_' in numeric literal or consecutive '_' in numeric literal. Empty radix body (0x, 0o, 0b) raises missing digits in numeric literal. Trailing dot (5.) is valid. Empty exponent body (1e) is left to the float parser to avoid false-positives in format specs.
Complex literals unsupported: 1j lexes as Int(1) + Name("j").
String prefixes
'plain' # str
b'bytes' # bytes (lexed as String)
r'raw\n' # raw
u'unicode' # unicode
br'rawbytes' # raw bytes
RB'mixed' # any case combination
f'fstring' # f-string (separate token sequence)
fr'raw fstring' # raw f-string
"""triple""" # triple-quoted, single or doubleA leading prefix is recognised before the opening quote by the identifier scanner, verified against is_string_prefix / is_fstring_prefix / is_bytes_prefix. Triple-quoted strings span newlines (bumping line per \n). Backslash escapes are consumed at lex time but decoded by the parser. Recognised escapes: \n \t \r \a \b \f \v \\ \' \" \xHH \uHHHH \UHHHHHHHH plus 1–3 digit octal (\012 -> \n, \101 -> A). \N{NAME} is unimplemented. The 200 KB Unicode-name database is too costly for the WASM artifact.
Errors anchor on the opening quote so the ^ marker points at the offender, not at end-of-line:
unterminated string literalunterminated triple-quoted string literalunterminated f-string literal
F-strings
F-strings decompose into a token sequence rather than a single String token. The parser consumes it directly:
f'a {x} b {y + 1}!'
FstringStart
FstringMiddle("a ")
Lbrace
Name(x)
Rbrace
FstringMiddle(" b ")
Lbrace
Name(y) Plus Int(1)
Rbrace
FstringMiddle("!")
FstringEndExpression tokens between { and } are emitted by the main lexer, not the f-string scanner. Interpolations get the full expression grammar, without special casing.
{{ and }} are escaped literal braces, with no Lbrace / Rbrace. They survive into FstringMiddle text and are unescaped by the parser.
Triple-quoted f-strings follow the same structure, with newlines embedded in middle segments. Nested f-strings are tracked via fstring_stack so each } resumes the right outer template. Nesting deeper than MAX_FSTRING_DEPTH = 200 raises f-string nesting depth exceeds maximum (200). EOF inside an open f-string raises unterminated f-string literal and synthesises a closing FstringEnd for a balanced sequence.
Indentation
Edge Python uses an INDENT/DEDENT model. The scanner tracks a stack of column counts and emits structural tokens at line boundaries:
| Situation | Tokens emitted |
|---|---|
| Blank line or comment-only line | Nl |
Inside (...), [...], {...} | Nl (no Indent / Dedent) |
| Indentation increased | Indent, Newline |
| Indentation decreased | Dedent (× n levels), Newline |
| Indentation unchanged | Newline |
| Dedent doesn’t match an outer level | diagnostic unindent does not match any outer indentation level |
| Mixed tabs and spaces in indent | Endmarker (lex halt) + diagnostic |
The nesting counter is bumped by (, [, { and decremented by ), ], }. While nesting > 0, line breaks emit Nl and the indent stack is frozen, allowing multi-line expressions inside brackets without spurious INDENT/DEDENT.
At EOF the lexer drains remaining levels off indent_stack for clean block closure, then emits Endmarker. Backslash line continuation (\ + newline) is consumed by the whitespace scanner, joining the two physical lines. ASCII bytes with no operator slot ($, ?, `) raise unexpected character and are skipped.
Soft-keyword disambiguation
type, match, and case are soft keywords. Each collides with a builtin or identifier use (type(), a function named match, a case variable), so the lexer disambiguates by peeking the next token:
type X = int # 'type' is a keyword (alias declaration)
type = None # 'type' is an identifier
match x: # 'match' is a keyword (statement)
match(s, p) # 'match' is an identifier (call)If the token following the word is one of (, :, =, ,, ), ], Newline, or EOF, it downgrades to Name. Otherwise it stays a keyword. A statement subject like match x: starts with a name or literal, so it keeps keyword force. A parenthesized subject like match (a, b): is the one case this heuristic misreads as a call.
_ always emits as Underscore. The parser distinguishes wildcard from name use grammatically.
Comments
# to end-of-line. Emitted as a Comment token (not discarded) so tools can round-trip source. The parser ignores Comment and Nl during peek().
Limits
Hard caps to prevent asymmetric DoS (small input exhausting memory/time). Hitting any halts lexing with Endmarker:
| Constant | Value | Purpose |
|---|---|---|
MAX_SOURCE_SIZE | 10 MiB | Reject oversized input upfront |
MAX_INDENT_DEPTH | 100 | Cap on the indentation stack |
MAX_FSTRING_DEPTH | 200 | Cap on nested f-string contexts |
These match the OWASP A04:2021 advice on resource exhaustion in interpreters.
Why offset-based tokens
A Token carries a kind tag plus three byte offsets:
pub struct Token {
pub kind: TokenType,
pub line: usize,
pub start: usize,
pub end: usize,
}The parser slices &source[t.start..t.end] lazily for identifier names, string content, or numeric literals. Result:
- Lexer never allocates a
Stringper identifier. lexeme(&t)is a zero-copy&strthat lives as long as the source buffer.- Diagnostics get exact byte offsets for free; error column is a single
rfind('\n').
References
- Python language reference, Lexical analysis: docs.python.org/3/reference/lexical_analysis
- OWASP, Insecure Compiler Optimization: owasp.org/www-community/vulnerabilities/Insecure_Compiler_Optimization
- Aho, Sethi & Ullman. Compilers: Principles, Techniques and Tools (1986). LUT-driven scanners.