String Type and UTF-16 Semantics¶

Sharpy's str type maps directly to System.String (C# string). Python-compatible string methods (upper(), find(), split(), etc.) are provided as extension methods on string via Sharpy.StringExtensions. Operations that System.String doesn't natively support (repetition, negative indexing, iteration as single-character strings) use static helper methods in Sharpy.StringHelpers.

s: str = "hello"       # Type is System.String (C# string)

This design follows the Kotlin model — Kotlin's String is java.lang.String with extension functions — and aligns with all three Sharpy axioms:

Axiom 1 (.NET): string is the native .NET type. Zero interop friction.
Axiom 2 (Python): Extension methods provide s.upper(), s.find(), etc. — same surface as Python.
Axiom 3 (Type Safety): No implicit conversions, no boxing, no overload ambiguity.

Historical note: Sharpy originally used a Sharpy.Str readonly struct wrapper. This was removed — see SRP-0007 for rationale.

UTF-16 Code Units¶

All string operations in Sharpy work with UTF-16 code units, not Unicode code points or grapheme clusters. This matches C# behavior exactly.

len() returns UTF-16 code units:

# ASCII characters: 1 code unit each
len("hello")        # 5

# Most common characters: 1 code unit each
len("café")         # 4 (é is a single code unit U+00E9)

# Emoji and rare characters: 2 code units (surrogate pairs)
len("😀")           # 2 (U+1F600 requires surrogate pair)
len("𝄞")            # 2 (musical G clef, U+1D11E)

# Combined
len("Hi 😀!")       # 6 (H=1, i=1, space=1, 😀=2, !=1)

Indexing returns UTF-16 code units:

s = "hello"
s[0]               # 'h'
s[4]               # 'o'

# With emoji
s = "Hi 😀!"
s[0]               # 'H'
s[3]               # '\uD83D' (high surrogate of 😀)
s[4]               # '\uDE00' (low surrogate of 😀)
s[5]               # '!'

Slicing operates on UTF-16 code units:

s = "café"
s[0:4]             # "café"
s[0:3]             # "caf"

# Slicing through a surrogate pair can produce invalid strings
s = "A😀B"
s[0:2]             # "A\uD83D" - contains unpaired surrogate (may cause issues)
s[0:3]             # "A😀" - correct

Comparison with Python¶

Operation	Python 3	Sharpy / C#
`len("😀")`	1 (code point)	2 (UTF-16 code units)
`"😀"[0]`	'😀' (full character)	'\uD83D' (high surrogate)
Internal encoding	Flexible (Latin-1/UCS-2/UCS-4)	Always UTF-16
Iteration unit	Code points	UTF-16 code units

Iterating Over Strings¶

Iterating over a string yields single-character str values (one UTF-16 code unit each), via StringHelpers.Iterate():

for c in "Hi😀":
    print(c)
# Output:
# H
# i
# � (high surrogate)
# � (low surrogate)

Each iteration variable c is a str (not a char), matching Python's behavior where iterating a string yields single-character strings.

Working with Unicode Correctly¶

For applications that need to work with user-perceived characters (grapheme clusters) or Unicode code points, use the appropriate .NET APIs:

from system.globalization import StringInfo

# Get grapheme clusters (user-perceived characters)
text = "café"  # 'e' + combining acute accent (if composed that way)
info = StringInfo(text)
length_in_graphemes = info.length_in_text_elements

# Enumerate code points
from system.text import Rune

for rune in text.enumerate_runes():
    print(rune)

Note: A dedicated grapheme cluster module for Sharpy is planned for a future version.

String Literals and Escapes¶

String literals in source code are UTF-8 encoded (per Sharpy's source file encoding), but are converted to UTF-16 System.String values at compile time:

# All produce valid UTF-16 strings
ascii_str = "hello"
unicode_str = "héllo wörld"
emoji_str = "Hello 😀 World"
escape_str = "\u0048\u0065\u006C\u006C\u006F"  # "Hello"

String Method Availability¶

Sharpy provides Python-compatible string methods as extension methods on string in Sharpy.StringExtensions. The compiler's NameMangler converts snake_case method names to PascalCase (e.g., upper → Upper), and generated code includes using global::Sharpy; to bring these extensions into scope.

Pythonic String Methods (Extension Methods)¶

Sharpy Method	Extension Method	Notes
`s.upper()`	`s.Upper()`	Uppercase (invariant culture)
`s.lower()`	`s.Lower()`	Lowercase (invariant culture)
`s.strip()`	`s.Strip()`	Remove leading/trailing whitespace
`s.lstrip()`	`s.Lstrip()`	Remove leading whitespace
`s.rstrip()`	`s.Rstrip()`	Remove trailing whitespace
`s.startswith(prefix)`	`s.Startswith(prefix)`	Check prefix
`s.endswith(suffix)`	`s.Endswith(suffix)`	Check suffix
`s.find(sub)`	`s.Find(sub)`	Find substring (returns -1 if not found)
`s.rfind(sub)`	`s.Rfind(sub)`	Find last occurrence
`s.replace(old, new)`	`s.Replace(old, new)`	Replace all occurrences
`s.split()`	`s.Split()`	Split on whitespace
`s.split(sep)`	`s.Split(sep)`	Split on separator
`s.join(items)`	`s.Join(items)`	Join with separator
`s.count(sub)`	`s.Count(sub)`	Count occurrences
`s.isdigit()`	`s.Isdigit()`	Check if all digits
`s.isalpha()`	`s.Isalpha()`	Check if all alphabetic
`s.isalnum()`	`s.Isalnum()`	Check if alphanumeric
`s.isspace()`	`s.Isspace()`	Check if all whitespace
`s.casefold()`	`s.Casefold()`	Full Unicode case folding

.NET Methods (Direct Access)¶

Since str is System.String, all .NET string methods are directly available:

s = "Hello, World!"

# .NET methods work directly
s.Contains("World")            # True
s.Substring(0, 5)              # "Hello"
s.PadLeft(20)                  # "       Hello, World!"
s.Insert(7, "Beautiful ")      # "Hello, Beautiful World!"

Method Resolution¶

When both a Sharpy extension method and a .NET method could apply, the Sharpy extension method takes precedence via the compiler's name mangling:

s.upper()    # Mangled to s.Upper() — calls Sharpy extension method

Differences from Python¶

Some Python string methods have slightly different behavior due to .NET semantics:

Operation	Python	Sharpy/.NET
`"ab" * 3`	`"ababab"`	`"ababab"` (✅ same)
`s.split()`	Splits on any whitespace	Splits on whitespace (✅ same)
`s.split("")`	`ValueError`	`ValueError` (✅ same)
`s.count(sub)`	Count non-overlapping	Count non-overlapping (✅ same)
`s[::2]`	Every other char	Slice syntax supported

To split a string into individual characters in Sharpy, use:

chars = list("hello")           # ['h', 'e', 'l', 'l', 'o']
# or
chars = [c for c in "hello"]    # ['h', 'e', 'l', 'l', 'o']

Implications for Sharpy Developers¶

String length may differ from character count: len() returns UTF-16 code units, which may be more than the number of visible characters for strings containing emoji or rare Unicode characters.
Indexing can split surrogate pairs: Be cautious when indexing or slicing strings that may contain characters outside the Basic Multilingual Plane (BMP).
Use .NET APIs for Unicode-aware operations: When correctness with all Unicode text is required, use StringInfo, Rune, or other .NET globalization APIs.
Most common text works as expected: ASCII text and most European/Asian scripts (within the BMP) have a 1:1 correspondence between characters and code units.

Implementation - ✅ str maps to System.String; Python methods via Sharpy.StringExtensions; operators/indexing/iteration via Sharpy.StringHelpers.