# Lexer Implementation Notes
This document provides implementation guidance for the Sharpy lexer. It covers state machine design, token generation, and edge case handling.
## Lexer State Machine Overview

The Sharpy lexer operates as a finite state machine with the following primary states and transitions:

```text
LINE_START ──(non-whitespace)──► NORMAL
NORMAL ──('"' / "'")──► STRING ──(closing quote)──► NORMAL
NORMAL ──('#')──► COMMENT ──(NEWLINE)──► LINE_START
NORMAL ──(NEWLINE)──► LINE_START
NORMAL ──(f" / f')──► FSTRING ──(closing quote)──► NORMAL
FSTRING ──('{')──► FSTRING_EXPR ──('}')──► FSTRING
FSTRING_EXPR ──(nested f" / f')──► FSTRING_NESTED ──(closing quote)──► FSTRING_EXPR
```
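The primary states above (plus MULTILINE_STRING, described later in this document) can be written down as an enum. The following is a minimal Python sketch; the names are assumptions for illustration, not taken from the actual implementation:

```python
from enum import Enum, auto

class LexerState(Enum):
    """Primary lexer states (illustrative names)."""
    LINE_START = auto()        # start of a logical line; measures indentation
    NORMAL = auto()            # default tokenization state
    STRING = auto()            # inside a regular string literal
    FSTRING = auto()           # inside an f-string's literal text
    FSTRING_EXPR = auto()      # inside an f-string {expression}
    FSTRING_NESTED = auto()    # inside an f-string nested in an expression
    COMMENT = auto()           # from '#' to end of line
    MULTILINE_STRING = auto()  # inside a triple-quoted string
```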
## State Descriptions

### LINE_START
Entered at the beginning of each logical line. Responsible for:
- Indentation measurement: Count leading spaces (tabs are errors)
- INDENT/DEDENT emission: Compare with indentation stack
- Blank line handling: Skip lines with only whitespace
- Comment line handling: Comment-only lines don't affect indentation
Transitions:
- Non-whitespace found → emit INDENT/DEDENT as needed → NORMAL
- Comment found → COMMENT (after indentation check if code follows)
- NEWLINE found → stay in LINE_START (blank line)
- EOF → emit remaining DEDENTs → done
### NORMAL
Primary lexing state for identifiers, operators, literals, and keywords.
Transitions:
- '"' or "'" → STRING (with appropriate quote type)
- f" or f' → FSTRING
- '"""' or "'''" → MULTILINE_STRING
- '#' → COMMENT
- NEWLINE → emit NEWLINE token → LINE_START
- Digit → NUMBER
- Letter or '_' → IDENTIFIER
- Operator chars → OPERATOR
### STRING
Lexing a regular string literal.
State data:
- quote_char: '"' or "'"
- is_raw: bool (r-prefix)
Transitions:
- quote_char → emit STRING token → NORMAL
- '\\' (if not raw) → handle escape sequence
- NEWLINE → error: "Unterminated string literal"
- EOF → error: "Unterminated string literal"
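The escape-sequence step can be sketched as follows. This is an illustrative sketch, not the actual Sharpy implementation: the `resolve_escape` name and the escape table are assumptions, and the fallback behavior follows the error-recovery rule given later in this document (include invalid escapes literally and warn).

```python
# Common single-character escapes; a real lexer would also handle
# numeric forms such as \xNN and \uNNNN.
SIMPLE_ESCAPES = {
    "n": "\n", "t": "\t", "r": "\r",
    "\\": "\\", '"': '"', "'": "'", "0": "\0",
}

def resolve_escape(ch: str, is_raw: bool) -> str:
    """Resolve the character following a backslash in the STRING state.

    In a raw string (r-prefix) the backslash is kept literally; unknown
    escapes are also kept literally per the recovery rule
    "include literally, report warning, continue".
    """
    if is_raw:
        return "\\" + ch
    return SIMPLE_ESCAPES.get(ch, "\\" + ch)
```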
### FSTRING
Lexing an f-string with expression interpolation.
State data:
- quote_char: '"' or "'"
- brace_depth: int (for nested braces)
- nesting_level: int (for nested f-strings)
Transitions:
- '{' → FSTRING_EXPR (brace_depth++)
- '{{' → literal '{' (stay in FSTRING)
- quote_char → emit FSTRING_END → NORMAL or parent state
- NEWLINE → error (for non-triple-quoted f-strings)
### FSTRING_EXPR
Inside an f-string expression ({...}).
State data:
- parent_fstring_state
- brace_depth: int
Transitions:
- '}' → brace_depth--; if 0: → back to FSTRING
- '{' → brace_depth++ (nested dict literal)
- f" or f' → FSTRING_NESTED (nested f-string)
- All other tokens → normal tokenization
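As a concrete illustration of the FSTRING/FSTRING_EXPR brace-depth interplay, the following sketch splits an f-string body into literal and expression parts. It is a hypothetical helper, not the Sharpy implementation; for brevity it ignores quotes and nested f-strings inside expressions.

```python
def split_fstring(body: str):
    """Split an f-string body into FSTRING_MIDDLE / expression parts,
    tracking brace depth as the FSTRING and FSTRING_EXPR states do."""
    parts, buf, depth, i = [], [], 0, 0
    while i < len(body):
        c = body[i]
        if depth == 0:
            if body.startswith("{{", i):        # '{{' -> literal '{'
                buf.append("{"); i += 2; continue
            if body.startswith("}}", i):        # '}}' -> literal '}'
                buf.append("}"); i += 2; continue
            if c == "{":                        # enter FSTRING_EXPR
                parts.append(("MIDDLE", "".join(buf)))
                buf, depth = [], 1
                i += 1; continue
            buf.append(c)
        else:
            if c == "{":                        # nested dict/set literal
                depth += 1
            elif c == "}":
                depth -= 1
                if depth == 0:                  # back to FSTRING
                    parts.append(("EXPR", "".join(buf)))
                    buf = []
                    i += 1; continue
            buf.append(c)
        i += 1
    parts.append(("MIDDLE", "".join(buf)))
    return parts
```

For example, `split_fstring("x = {x + 1}!")` yields a middle part `"x = "`, an expression `"x + 1"`, and a trailing middle part `"!"`.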
### COMMENT
Processing a comment from # to end of line.
Transitions:
- NEWLINE → emit COMMENT token (if needed) → LINE_START
- EOF → emit COMMENT token → done
### MULTILINE_STRING
Processing triple-quoted strings ("""...""" or '''...''').
State data:
- quote_sequence: '"""' or "'''"
- is_fstring: bool
Transitions:
- Matching quote_sequence → emit STRING → NORMAL
- NEWLINE → include in string (valid)
- EOF → error: "Unterminated multi-line string"
## Indentation Stack Algorithm

The lexer maintains an indentation stack to emit INDENT/DEDENT tokens:

```python
# Pseudocode implementation
indent_stack = [0]  # Always starts with column 0

def process_line_start():
    spaces = count_leading_spaces()

    # Validate: must be a multiple of 4
    if spaces % 4 != 0:
        error("Indentation must be a multiple of 4 spaces")

    current_indent = indent_stack[-1]
    if spaces > current_indent:
        # Must increase by exactly 4
        if spaces != current_indent + 4:
            error(f"Expected {current_indent + 4} spaces, found {spaces}")
        indent_stack.append(spaces)
        emit(INDENT)
    elif spaces < current_indent:
        # May decrease by multiple levels
        while indent_stack[-1] > spaces:
            indent_stack.pop()
            emit(DEDENT)
        # Must land on an existing level
        if indent_stack[-1] != spaces:
            error("Unindent does not match any outer indentation level")
    # else: same indentation, no tokens
```
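The pseudocode can be turned into a small runnable sketch. This is illustrative only: the `indent_tokens` helper is an assumption, and blank-line skipping stands in for the full LINE_START logic (the real lexer also ignores comment-only lines).

```python
def indent_tokens(lines):
    """Yield "INDENT"/"DEDENT" events for a list of source lines,
    using the indentation-stack algorithm described above."""
    stack = [0]
    for line in lines:
        if not line.strip():
            continue  # blank line: no tokens
        spaces = len(line) - len(line.lstrip(" "))
        if spaces % 4 != 0:
            raise SyntaxError("Indentation must be a multiple of 4 spaces")
        if spaces > stack[-1]:
            if spaces != stack[-1] + 4:
                raise SyntaxError(f"Expected {stack[-1] + 4} spaces, found {spaces}")
            stack.append(spaces)
            yield "INDENT"
        elif spaces < stack[-1]:
            while stack[-1] > spaces:
                stack.pop()
                yield "DEDENT"
            if stack[-1] != spaces:
                raise SyntaxError("Unindent does not match any outer indentation level")
    # EOF: emit remaining DEDENTs
    while len(stack) > 1:
        stack.pop()
        yield "DEDENT"
```

Note how a single line can close several blocks at once, producing multiple DEDENTs, and how EOF drains the stack so every INDENT is balanced.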
## Token Types

The lexer produces the following token categories:

### Structural Tokens

| Token | Description |
|---|---|
| `NEWLINE` | Logical line ending (not inside brackets/parens) |
| `INDENT` | Indentation increase |
| `DEDENT` | Indentation decrease |
| `EOF` | End of file |
### Literal Tokens

| Token | Examples |
|---|---|
| `INT_LITERAL` | `42`, `0xFF`, `0b1010`, `1_000_000` |
| `FLOAT_LITERAL` | `3.14`, `1e10`, `2.5f` |
| `STRING_LITERAL` | `"hello"`, `'world'`, `"""multi"""` |
| `FSTRING_START` | `f"` (begins f-string) |
| `FSTRING_MIDDLE` | Text between expressions in f-string |
| `FSTRING_END` | Closing quote of f-string |
| `TRUE` | `True` |
| `FALSE` | `False` |
| `NONE` | `None` |
### Identifier/Keyword Tokens

| Token | Description |
|---|---|
| `IDENTIFIER` | Variable/function/class names |
| `ESCAPED_IDENT` | Backtick-escaped identifier (`` `class` ``) |
| `KEYWORD` | Reserved words (`def`, `class`, `if`, etc.) |
### Operator Tokens

| Token | Symbols |
|---|---|
| `PLUS`, `MINUS`, `STAR`, `SLASH` | `+`, `-`, `*`, `/` |
| `DOUBLE_STAR` | `**` |
| `DOUBLE_SLASH` | `//` |
| `PERCENT` | `%` |
| `AMPERSAND`, `PIPE`, `CARET` | `&`, `\|`, `^` |
| `TILDE` | `~` |
| `LSHIFT`, `RSHIFT` | `<<`, `>>` |
| `LT`, `LE`, `GT`, `GE` | `<`, `<=`, `>`, `>=` |
| `EQ`, `NE` | `==`, `!=` |
| `ASSIGN` | `=` |
| `AUGMENTED_ASSIGN` | `+=`, `-=`, `*=`, etc. |
| `WALRUS` | `:=` |
| `ARROW` | `->` |
| `PIPE_ARROW` | `\|>` |
| `DOT`, `QUESTION_DOT` | `.`, `?.` |
| `DOUBLE_QUESTION` | `??` |
| `COLON` | `:` |
| `COMMA` | `,` |
| `SEMICOLON` | `;` (for inline compound statements) |
### Delimiter Tokens

| Token | Symbols |
|---|---|
| `LPAREN`, `RPAREN` | `(`, `)` |
| `LBRACKET`, `RBRACKET` | `[`, `]` |
| `LBRACE`, `RBRACE` | `{`, `}` |
## Bracket Depth Tracking

NEWLINEs inside brackets/parentheses are treated as whitespace, not line terminators:

```python
# Pseudocode
bracket_depth = 0

def handle_open_bracket():
    bracket_depth += 1
    emit(LPAREN/LBRACKET/LBRACE)

def handle_close_bracket():
    bracket_depth -= 1
    emit(RPAREN/RBRACKET/RBRACE)

def handle_newline():
    if bracket_depth == 0:
        emit(NEWLINE)
        transition_to(LINE_START)
    else:
        # Implicit line continuation, skip
        pass
```
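The effect of the depth counter can be seen in a minimal sketch. The `logical_newlines` helper is hypothetical; it ignores strings and comments, which a real lexer must exclude from bracket counting.

```python
def logical_newlines(source: str):
    """Return offsets of logical newlines, i.e. '\\n' characters at
    bracket depth 0. Newlines inside (), [], {} are implicit line
    continuations and produce no NEWLINE token."""
    depth, offsets = 0, []
    for i, c in enumerate(source):
        if c in "([{":
            depth += 1
        elif c in ")]}":
            depth -= 1
        elif c == "\n" and depth == 0:
            offsets.append(i)
    return offsets
```

In `"a = [1,\n2]\nb = 3\n"`, the first newline sits inside the list literal and is skipped; only the other two count as logical line endings.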
## Lookahead Requirements

Some tokens require lookahead to disambiguate:

| Context | Lookahead | Tokens |
|---|---|---|
| `<` | 1 char | `<` vs `<=` vs `<<` |
| `>` | 1 char | `>` vs `>=` vs `>>` |
| `*` | 1 char | `*` vs `**` |
| `/` | 1 char | `/` vs `//` |
| `:` | 1 char | `:` vs `:=` |
| `?` | 1 char | `?` vs `?.` vs `??` |
| `.` | 1 char | `.` vs `...` (ellipsis) |
| `\|` | 1 char | `\|` vs `\|>` |
| `0` | 1 char | `0` (decimal) vs `0x`, `0b`, `0o` |
| `"` | 2 chars | `"` vs `"""` |
| String prefix | 1–2 chars | `r"`, `f"`, `rf"` |
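A single character of lookahead covers most of these cases. The sketch below shows one way to structure it; `CharStream` and `lex_question` are illustrative names, not the actual Sharpy API.

```python
class CharStream:
    """Minimal character stream with non-consuming lookahead."""
    def __init__(self, text: str):
        self.text, self.pos = text, 0

    def peek(self, offset: int = 0) -> str:
        """Look ahead without consuming; "" past end of input."""
        i = self.pos + offset
        return self.text[i] if i < len(self.text) else ""

    def advance(self) -> str:
        c = self.peek()
        self.pos += 1
        return c

def lex_question(stream: CharStream) -> str:
    """Disambiguate '?', '?.', and '??' with one character of lookahead."""
    stream.advance()  # consume '?'
    if stream.peek() == ".":
        stream.advance()
        return "QUESTION_DOT"
    if stream.peek() == "?":
        stream.advance()
        return "DOUBLE_QUESTION"
    return "QUESTION"
```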
## Error Recovery in Lexer

The lexer should attempt to recover from errors to report multiple issues:

- Unterminated string: Close at EOL/EOF, report error, continue
- Invalid character: Skip, report error, continue
- Tab in indentation: Report error, treat as 4 spaces, continue
- Invalid escape sequence: Include literally, report warning, continue
- Invalid numeric literal: Tokenize what's valid, report error

```python
# Example: invalid character recovery
def handle_invalid_char(c):
    report_error(f"Invalid character: {repr(c)}")
    advance()  # Skip the invalid character
    # Continue lexing from next character
```
## Position Tracking

Each token should include:

```text
Token {
    type: TokenType
    value: string    # Raw text of token
    line: int        # 1-indexed line number
    column: int      # 0-indexed column (in characters)
    offset: int      # 0-indexed byte offset in source
    length: int      # Length in bytes
}
```

For error messages, positions should be reported as `line:column` (both 1-indexed for display).
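As a sketch, the record above translates directly into the pseudocode's Python (this is not the implementation language; `display_position` is a hypothetical helper showing the 0-indexed-storage, 1-indexed-display convention):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    type: str      # TokenType in the real lexer
    value: str     # Raw text of the token
    line: int      # 1-indexed line number
    column: int    # 0-indexed column (in characters)
    offset: int    # 0-indexed byte offset in source
    length: int    # Length in bytes

def display_position(tok: Token) -> str:
    """Render a position as 1-indexed line:column for error messages."""
    return f"{tok.line}:{tok.column + 1}"
```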
## Implementation Notes

- UTF-8 handling: Source files are UTF-8. Column positions should account for multi-byte characters.
- BOM handling: Skip the UTF-8 BOM (`0xEF 0xBB 0xBF`) if present at the start of the file.
- Line endings: Accept `\n`, `\r\n`, or `\r`. Normalize to `\n` internally.
- Maximum nesting: Limit f-string nesting depth (e.g., 4 levels) to prevent stack overflow.
- Token buffer: For lookahead, buffer up to 2 tokens rather than peeking into the source repeatedly.
- Interning: Consider interning identifier strings for faster comparison in later phases.
## Implementation

The actual Sharpy lexer is in `src/Sharpy.Compiler/Lexer/`. This document describes the specification; the implementation may differ in details.