## Overview

Edge Python uses a hand-written, LUT-driven scanner that walks the source as raw bytes and produces a stream of `Token { kind, line, start, end }`. The scanner is offset-based: tokens carry byte indices into the source buffer, never their own copies of the text. This is enough for diagnostics, debug output, and the parser's lazy slicing of identifier and string content.

Lexing runs in linear time, O(n), with constant-time, branchless per-byte dispatch through two lookup tables.
## Token kinds

The token set tracks Python 3.13.12 closely. Categories implemented:

- Keywords: `False`, `None`, `True`, `and`, `as`, `assert`, `async`, `await`, `break`, `class`, `continue`, `def`, `del`, `elif`, `else`, `except`, `finally`, `for`, `from`, `global`, `if`, `import`, `in`, `is`, `lambda`, `nonlocal`, `not`, `or`, `pass`, `raise`, `return`, `try`, `while`, `with`, `yield`.
- Soft keywords: `case`, `match`, `type`, `_`. Resolved contextually (see below).
- Operators: 1-, 2-, and 3-character operator forms (`+`, `==`, `**=`, `//=`, etc.).
- Delimiters: `(` `)` `[` `]` `{` `}` `:` `,` `;` `.`
- Literals: `Name`, `Int`, `Float`, `Complex`, `String`.
- F-string segments: `FstringStart`, `FstringMiddle`, `FstringEnd`.
- Whitespace and structure: `Comment`, `Newline`, `Indent`, `Dedent`, `Nl`, `Endmarker`.
## Dispatch tables

The lexer hot loop avoids per-byte branching through two compile-time tables in `lexer/tables.rs`:

- A `scan_while(pred)` driver loops over `BYTE_CLASS[b] & FLAG`.
- Single-character operators do `b -> SINGLE_TOK[b] -> SINGLE_MAP[i]`: two indexed loads, no branches.
- Keyword lookup is routed by `(length, first_byte)` to skip most memcmps; most keyword candidates terminate after a single match arm.
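The real tables live in `lexer/tables.rs` and are not reproduced here; the following is a minimal sketch of the two-table scheme. The flag values and table contents are illustrative, not the actual ones:

```rust
// Sketch only: flag values and table contents are illustrative, not the
// real lexer/tables.rs.
const IS_IDENT: u8 = 1 << 0; // byte can appear inside an identifier
const IS_DIGIT: u8 = 1 << 1; // byte is an ASCII digit

// One class byte per input byte value, built at compile time.
const fn build_byte_class() -> [u8; 256] {
    let mut t = [0u8; 256];
    let mut b = 0usize;
    while b < 256 {
        let c = b as u8;
        if c.is_ascii_alphabetic() || c == b'_' {
            t[b] |= IS_IDENT;
        }
        if c.is_ascii_digit() {
            t[b] |= IS_IDENT | IS_DIGIT;
        }
        b += 1;
    }
    t
}
const BYTE_CLASS: [u8; 256] = build_byte_class();

// scan_while: advance while the class flag holds. The loop tests a
// precomputed flag bit, never the byte value itself.
fn scan_while(src: &[u8], mut i: usize, flag: u8) -> usize {
    while i < src.len() && BYTE_CLASS[src[i] as usize] & flag != 0 {
        i += 1;
    }
    i
}

// Single-character operators: b -> SINGLE_TOK[b] -> SINGLE_MAP[i],
// two indexed loads and no branches.
const fn build_single_tok() -> [u8; 256] {
    let mut t = [0u8; 256]; // 0 = not a single-char token
    t[b'+' as usize] = 1;
    t[b'-' as usize] = 2;
    t[b'(' as usize] = 3;
    t
}
const SINGLE_TOK: [u8; 256] = build_single_tok();
const SINGLE_MAP: [&str; 4] = ["<none>", "Plus", "Minus", "Lpar"];

fn single_tok_name(b: u8) -> &'static str {
    SINGLE_MAP[SINGLE_TOK[b as usize] as usize]
}

fn main() {
    let src = b"count42 + 1";
    let end = scan_while(src, 0, IS_IDENT); // identifier "count42" ends at byte 7
    assert_eq!(end, 7);
    assert_eq!(single_tok_name(b'+'), "Plus");
}
```

Building the tables in `const fn` keeps them in read-only data with zero startup cost, which is the usual payoff of this design.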
## Numeric literals

A trailing `j` / `J` suffix marks complex literals.
## String prefixes

Prefixes are recognized by `is_string_prefix` / `is_fstring_prefix`. Triple-quoted strings span newlines and bump `line` for each `\n` inside.
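The two predicates are named but not shown. A plausible reconstruction, using Python's actual prefix rules (`r` may combine with `b` or `f` in any order and case; `u` stands alone), might look like this:

```rust
// Hypothetical reconstruction of the prefix predicates named above; the real
// implementations in the lexer may differ.
fn is_string_prefix(p: &[u8]) -> bool {
    matches!(
        p.to_ascii_lowercase().as_slice(),
        b"" | b"r" | b"b" | b"u" | b"f" | b"rb" | b"br" | b"rf" | b"fr"
    )
}

// An f-string prefix is a valid string prefix that contains an `f`.
fn is_fstring_prefix(p: &[u8]) -> bool {
    is_string_prefix(p) && p.iter().any(|b| b.eq_ignore_ascii_case(&b'f'))
}

fn main() {
    assert!(is_fstring_prefix(b"Rf")); // raw f-string, any case and order
    assert!(!is_string_prefix(b"bu")); // b and u never combine
}
```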
## F-strings

F-strings are decomposed into a sequence of tokens rather than being represented by a single `String` token. The parser consumes the sequence directly:

- `{` and `}` are emitted by the main lexer, not the f-string scanner, which means the full expression grammar is available inside interpolations without special casing.
- `{{` and `}}` are treated as escaped literal braces and produce no `Lbrace` / `Rbrace`. They survive into the `FstringMiddle` text and are unescaped by the parser.
- Triple-quoted f-strings (`f"""..."""`) follow the same structure, with newlines embedded in the middle segments.
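As an illustration of the shapes involved (this is not the real scanner, which hands the bytes between braces back to the main lexer), here is an ASCII-only sketch that splits an f-string body into the segments above. `Expr` is a stand-in for the token stream the main lexer would produce, and nested braces inside expressions are ignored for brevity:

```rust
// Sketch: decompose an f-string body into segment shapes. `Expr` is a
// stand-in for "bytes handed back to the main lexer"; brace nesting inside
// the expression is not handled here.
#[derive(Debug, PartialEq)]
enum Seg {
    Middle(String), // literal text; doubled braces survive unescaped
    Lbrace,
    Expr(String),
    Rbrace,
}

fn split_fstring(body: &str) -> Vec<Seg> {
    let (mut out, mut buf) = (Vec::new(), String::new());
    let bytes = body.as_bytes();
    let (mut i, mut in_expr) = (0, false);
    while i < bytes.len() {
        let b = bytes[i];
        if !in_expr && (b == b'{' || b == b'}') && bytes.get(i + 1) == Some(&b) {
            // {{ or }}: escaped literal brace, stays in the Middle text
            buf.push(b as char);
            buf.push(b as char);
            i += 2;
        } else if !in_expr && b == b'{' {
            if !buf.is_empty() {
                out.push(Seg::Middle(std::mem::take(&mut buf)));
            }
            out.push(Seg::Lbrace);
            in_expr = true;
            i += 1;
        } else if in_expr && b == b'}' {
            out.push(Seg::Expr(std::mem::take(&mut buf)));
            out.push(Seg::Rbrace);
            in_expr = false;
            i += 1;
        } else {
            buf.push(b as char);
            i += 1;
        }
    }
    if !buf.is_empty() {
        out.push(Seg::Middle(buf));
    }
    out
}

fn main() {
    let segs = split_fstring("x={x} {{ok}}");
    assert_eq!(segs[1], Seg::Lbrace);
    assert_eq!(segs[2], Seg::Expr("x".into()));
}
```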
## Indentation

Edge Python uses CPython's INDENT/DEDENT model. The scanner tracks a stack of column counts and emits structural tokens at line boundaries:

| Situation | Tokens emitted |
|---|---|
| Blank line or comment-only line | `Nl` |
| Inside `(...)`, `[...]`, `{...}` | `Nl` (no INDENT/DEDENT) |
| Indentation increased | `Newline`, `Indent` |
| Indentation decreased | `Newline`, `Dedent` (× n levels) |
| Indentation unchanged | `Newline` |
| Mixed tabs and spaces in indent | `Endmarker` (lex halt) |
A `nesting` counter is incremented by `(`, `[`, `{` and decremented by `)`, `]`, `}`. While `nesting > 0`, line breaks emit `Nl` and the indent stack is frozen; this is what allows multi-line expressions inside brackets without spurious INDENT/DEDENT.
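The frozen-while-nested behavior can be sketched with a simple stack of columns. The struct shape and API below are assumptions; only the emitted token pattern follows the table above:

```rust
// Sketch: indent tracking at a line boundary. Struct shape and API are
// assumptions; the token pattern emitted follows the table above.
#[derive(Debug, PartialEq)]
enum Tok { Newline, Indent, Dedent, Nl }

struct IndentState {
    stack: Vec<usize>, // column counts; the bottom entry is always 0
    nesting: u32,      // open ( [ { minus closed ) ] }
}

impl IndentState {
    fn new() -> Self {
        Self { stack: vec![0], nesting: 0 }
    }

    // Called at each line boundary with the next line's indent column.
    fn line_break(&mut self, col: usize, out: &mut Vec<Tok>) {
        if self.nesting > 0 {
            out.push(Tok::Nl); // inside brackets: stack frozen, just Nl
            return;
        }
        out.push(Tok::Newline);
        let cur = *self.stack.last().unwrap();
        if col > cur {
            self.stack.push(col);
            out.push(Tok::Indent);
        } else {
            while *self.stack.last().unwrap() > col {
                self.stack.pop();
                out.push(Tok::Dedent); // one Dedent per level closed
            }
        }
    }
}

fn main() {
    let (mut st, mut out) = (IndentState::new(), Vec::new());
    st.line_break(4, &mut out);
    assert_eq!(out, [Tok::Newline, Tok::Indent]);
}
```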
## Soft-keyword disambiguation

`match`, `case`, and `type` are keywords in some positions and identifiers in others. The lexer resolves the ambiguity by peeking at the next token:

If the token after `match` / `case` / `type` is one of `(`, `)`, `]`, `:`, `=`, `,`, `Newline`, or EOF, the soft keyword is downgraded to `Name`. Otherwise it stays a keyword.

The same logic applies to `_`. In `case _:` it is the wildcard `Underscore`; in `_ = compute()` it is a `Name`.
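The downgrade check amounts to a one-token peek. In this sketch the `Next` enum is an assumption; the trigger set is taken verbatim from the rule above:

```rust
// Sketch of the soft-keyword downgrade rule. `Next` models the kind of the
// peeked token; the trigger set matches the list in the text.
#[derive(Clone, Copy, PartialEq)]
enum Next { Lpar, Rpar, Rsqb, Colon, Eq, Comma, Newline, Eof, Other }

// true  -> keep as keyword
// false -> downgrade the soft keyword to Name
fn stays_keyword(next: Next) -> bool {
    !matches!(
        next,
        Next::Lpar | Next::Rpar | Next::Rsqb | Next::Colon
            | Next::Eq | Next::Comma | Next::Newline | Next::Eof
    )
}

fn main() {
    assert!(!stays_keyword(Next::Eq));   // `match = 1`: match is a Name
    assert!(stays_keyword(Next::Other)); // `match x:`: match stays a keyword
}
```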
## Comments

Comments run from `#` to end of line. They are emitted as a `Comment` token rather than discarded; this is what allows tools to round-trip source. The parser ignores `Comment` and `Nl` tokens during `peek()`.
## Limits

To prevent asymmetric DoS attacks (small input that exhausts memory or time), the lexer enforces hard caps. Exceeding any of them halts lexing with `Endmarker`:

| Constant | Value | Purpose |
|---|---|---|
| `MAX_SOURCE_SIZE` | 10 MiB | Reject oversized input up front |
| `MAX_INDENT_DEPTH` | 100 | Cap on the indentation stack |
| `MAX_FSTRING_DEPTH` | 200 | Cap on nested f-string contexts |
## Why offset-based tokens

A `Token` is 32 bytes. The parser slices `&source[t.start..t.end]` lazily when it needs the lexeme, for identifier names, string content, or numeric literals. This means:

- The lexer never allocates a `String` per identifier.
- The parser's `lexeme(&t)` is a zero-copy `&str` that lives as long as the source buffer.
- Diagnostics get exact byte offsets for free, which makes error column computation a single `rfind('\n')`.
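The access pattern can be sketched as follows. Field widths are an assumption; the docs state only the field names and the 32-byte total:

```rust
// Sketch: offset-based token and zero-copy lexeme access. Field widths are
// assumptions; only { kind, line, start, end } comes from the docs.
#[derive(Clone, Copy)]
struct Token {
    kind: u8,
    line: u32,
    start: u32, // byte offset into the source, inclusive
    end: u32,   // byte offset into the source, exclusive
}

// Zero-copy: the returned &str borrows the source buffer.
fn lexeme<'a>(src: &'a str, t: &Token) -> &'a str {
    &src[t.start as usize..t.end as usize]
}

// Column = distance from the last '\n' before the token start.
fn column(src: &str, t: &Token) -> usize {
    let start = t.start as usize;
    match src[..start].rfind('\n') {
        Some(nl) => start - nl - 1,
        None => start,
    }
}

fn main() {
    let src = "x = 1\nfoo = 2";
    let t = Token { kind: 0, line: 2, start: 6, end: 9 };
    assert_eq!(lexeme(src, &t), "foo"); // borrows src, no allocation
    assert_eq!(column(src, &t), 0);
}
```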
## References

- Python language reference, "Lexical analysis": docs.python.org/3/reference/lexical_analysis
- OWASP, "Insecure Compiler Optimization": owasp.org/www-community/vulnerabilities/Insecure_Compiler_Optimization
- Aho, Sethi & Ullman, *Compilers: Principles, Techniques, and Tools* (1986), on LUT-driven scanners.