String Type and UTF-16 Semantics
Sharpy's str type maps directly to System.String (C# string). Python-compatible string methods (upper(), find(), split(), etc.) are provided as extension methods on string via Sharpy.StringExtensions. Operations that System.String doesn't natively support (repetition, negative indexing, iteration as single-character strings) use static helper methods in Sharpy.StringHelpers.
This design follows the Kotlin model — Kotlin's String is java.lang.String with extension functions — and aligns with all three Sharpy axioms:
- Axiom 1 (.NET): string is the native .NET type. Zero interop friction.
- Axiom 2 (Python): Extension methods provide s.upper(), s.find(), etc., the same surface as Python.
- Axiom 3 (Type Safety): No implicit conversions, no boxing, no overload ambiguity.
Historical note: Sharpy originally used a Sharpy.Str readonly struct wrapper. This was removed; see SRP-0007 for rationale.
UTF-16 Code Units
All string operations in Sharpy work with UTF-16 code units, not Unicode code points or grapheme clusters. This matches C# behavior exactly.
len() returns UTF-16 code units:
# ASCII characters: 1 code unit each
len("hello") # 5
# Most common characters: 1 code unit each
len("café") # 4 (é is a single code unit U+00E9)
# Emoji and rare characters: 2 code units (surrogate pairs)
len("😀") # 2 (U+1F600 requires surrogate pair)
len("𝄞") # 2 (musical G clef, U+1D11E)
# Combined
len("Hi 😀!") # 6 (H=1, i=1, space=1, 😀=2, !=1)
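The counts above can be cross-checked outside Sharpy. The following is a minimal sketch in plain Python (not Sharpy), where len() counts code points; the helper name utf16_len is hypothetical and simply measures the UTF-16 encoding of the string.

```python
def utf16_len(text: str) -> int:
    """Count UTF-16 code units, emulating what len() reports in Sharpy / C#."""
    # Every UTF-16 code unit is two bytes; the explicit "utf-16-le" codec
    # does not prepend a byte order mark, so dividing by two is exact.
    return len(text.encode("utf-16-le")) // 2


# "caf\u00e9" spells café with the precomposed é (U+00E9).
for sample in ["hello", "caf\u00e9", "😀", "𝄞", "Hi 😀!"]:
    print(repr(sample), "code points:", len(sample), "code units:", utf16_len(sample))
```

The two counts agree for text inside the Basic Multilingual Plane and diverge by one for every supplementary character such as 😀 or 𝄞.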
Indexing returns UTF-16 code units:
s = "hello"
s[0] # 'h'
s[4] # 'o'
# With emoji
s = "Hi 😀!"
s[0] # 'H'
s[3] # '\uD83D' (high surrogate of 😀)
s[4] # '\uDE00' (low surrogate of 😀)
s[5] # '!'
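The surrogate values shown for 😀 follow from the standard UTF-16 encoding rule for code points above U+FFFF. As a worked sketch in plain Python (the function name to_surrogate_pair is hypothetical, not part of Sharpy):

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a supplementary code point into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use a surrogate pair")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(ord("😀"))  # U+1F600
print(f"{high:04X} {low:04X}")  # D83D DE00
```

These are the two code units that s[3] and s[4] return in the example above.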
Slicing operates on UTF-16 code units:
s = "café"
s[0:4] # "café"
s[0:3] # "caf"
# Slicing through a surrogate pair can produce invalid strings
s = "A😀B"
s[0:2] # "A\uD83D" - contains unpaired surrogate (may cause issues)
s[0:3] # "A😀" - correct
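One way to guard against cutting through a surrogate pair is to test a slice boundary before using it. The sketch below is plain Python that emulates UTF-16 indexing; utf16_units and splits_surrogate_pair are hypothetical helper names, not Sharpy APIs.

```python
def utf16_units(text: str) -> list:
    """Return the UTF-16 code units of text as integers."""
    # "surrogatepass" lets strings that already hold lone surrogates encode.
    data = text.encode("utf-16-le", "surrogatepass")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def splits_surrogate_pair(units: list, index: int) -> bool:
    """True if a cut placed before units[index] separates a high/low surrogate pair."""
    if index <= 0 or index >= len(units):
        return False
    return 0xD800 <= units[index - 1] <= 0xDBFF and 0xDC00 <= units[index] <= 0xDFFF


units = utf16_units("A😀B")
print([splits_surrogate_pair(units, i) for i in range(len(units) + 1)])
# [False, False, True, False, False]
```

Index 2 is the only unsafe boundary here, which corresponds to the s[0:2] slice that leaves an unpaired surrogate.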
Comparison with Python
| Operation | Python 3 | Sharpy / C# |
|---|---|---|
| len("😀") | 1 (code point) | 2 (UTF-16 code units) |
| "😀"[0] | '😀' (full character) | '\uD83D' (high surrogate) |
| Internal encoding | Flexible (Latin-1/UCS-2/UCS-4) | Always UTF-16 |
| Iteration unit | Code points | UTF-16 code units |
Iterating Over Strings
Iterating over a string yields single-character str values (one UTF-16 code unit each), via StringHelpers.Iterate(). Each value bound by the loop is a str (not a char), matching Python's behavior where iterating a string yields single-character strings.
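To illustrate what iteration by code unit looks like, here is a minimal sketch in plain Python that emulates the described behavior. It is not the implementation of StringHelpers.Iterate(); the generator name iterate_utf16 is hypothetical.

```python
def iterate_utf16(text: str):
    """Yield one single-character string per UTF-16 code unit of text."""
    data = text.encode("utf-16-le", "surrogatepass")
    for i in range(0, len(data), 2):
        # chr() accepts surrogate values, so each half of a pair
        # comes out as its own one-character string.
        yield chr(int.from_bytes(data[i:i + 2], "little"))


for c in iterate_utf16("A😀"):
    # ascii() shows lone surrogates as escapes instead of failing to print.
    print(ascii(c), len(c))
```

A supplementary character therefore shows up as two loop iterations, one per surrogate, whereas native Python iteration would yield it once.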
Working with Unicode Correctly
For applications that need to work with user-perceived characters (grapheme clusters) or Unicode code points, use the appropriate .NET APIs:
from system.globalization import StringInfo
# Get grapheme clusters (user-perceived characters)
text = "café" # 'e' + combining acute accent (if composed that way)
info = StringInfo(text)
length_in_graphemes = info.length_in_text_elements
# Enumerate code points
from system.text import Rune
for rune in text.enumerate_runes():
print(rune)
Note: A dedicated grapheme cluster module for Sharpy is planned for a future version.
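The difference between code units, code points, and user-perceived characters can be seen with a rough sketch in plain Python. The cluster count below is only an approximation that skips combining marks; it is not a full grapheme segmentation (it ignores emoji ZWJ sequences, for example), and the function name is hypothetical.

```python
import unicodedata


def count_units_points_clusters(text: str) -> tuple:
    """Return (UTF-16 code units, code points, approximate grapheme clusters)."""
    code_units = len(text.encode("utf-16-le")) // 2
    code_points = len(text)
    # Approximation: a combining mark attaches to the preceding character,
    # so only characters with combining class 0 start a new cluster.
    approx_clusters = sum(1 for ch in text if unicodedata.combining(ch) == 0)
    return code_units, code_points, approx_clusters


print(count_units_points_clusters("cafe\u0301"))  # decomposed é: (5, 5, 4)
print(count_units_points_clusters("caf\u00e9"))   # precomposed é: (4, 4, 4)
print(count_units_points_clusters("😀"))          # (2, 1, 1)
```

For production text handling, the StringInfo and Rune APIs shown above remain the appropriate tools.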
String Literals and Escapes
String literals in source code are UTF-8 encoded (per Sharpy's source file encoding), but are converted to UTF-16 System.String values at compile time:
# All produce valid UTF-16 strings
ascii_str = "hello"
unicode_str = "héllo wörld"
emoji_str = "Hello 😀 World"
escape_str = "\u0048\u0065\u006C\u006C\u006F" # "Hello"
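The size of a literal differs between its UTF-8 form in the source file and its UTF-16 form at run time. A small sketch in plain Python makes the difference concrete; encoded_sizes is a hypothetical helper, and the measurements come from Python's codecs, not from the Sharpy compiler.

```python
def encoded_sizes(text: str) -> tuple:
    """Return (UTF-8 byte count, UTF-16 code unit count) for text."""
    utf8_bytes = len(text.encode("utf-8"))
    utf16_units = len(text.encode("utf-16-le")) // 2
    return utf8_bytes, utf16_units


print(encoded_sizes("hello"))                 # (5, 5)
print(encoded_sizes("h\u00e9llo w\u00f6rld"))  # héllo wörld: (13, 11)
print(encoded_sizes("Hello 😀 World"))        # (16, 14)
```

Accented Latin letters take two bytes in UTF-8 but one UTF-16 code unit, while an emoji takes four bytes in UTF-8 and two code units in UTF-16.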
String Method Availability
Sharpy provides Python-compatible string methods as extension methods on string in Sharpy.StringExtensions. The compiler's NameMangler converts snake_case method names to PascalCase (e.g., upper → Upper), and generated code includes using global::Sharpy; to bring these extensions into scope.
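The snake_case to PascalCase conversion can be pictured with a short sketch in plain Python. This is an assumption about the general shape of the rule, not the actual NameMangler code, which may treat edge cases (leading underscores, digits, reserved names) differently; to_pascal_case is a hypothetical name.

```python
def to_pascal_case(snake_name: str) -> str:
    """Capitalize the first letter of each underscore-separated part and join them."""
    parts = [part for part in snake_name.split("_") if part]
    return "".join(part[0].upper() + part[1:] for part in parts)


for name in ["upper", "startswith", "enumerate_runes"]:
    print(name, "->", to_pascal_case(name))
```

Note that a name without underscores, such as startswith, only gets its first letter capitalized, which matches the Startswith entry in the table that follows.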
Pythonic String Methods (Extension Methods)
| Sharpy Method | Extension Method | Notes |
|---|---|---|
| s.upper() | s.Upper() | Uppercase (invariant culture) |
| s.lower() | s.Lower() | Lowercase (invariant culture) |
| s.strip() | s.Strip() | Remove leading/trailing whitespace |
| s.lstrip() | s.Lstrip() | Remove leading whitespace |
| s.rstrip() | s.Rstrip() | Remove trailing whitespace |
| s.startswith(prefix) | s.Startswith(prefix) | Check prefix |
| s.endswith(suffix) | s.Endswith(suffix) | Check suffix |
| s.find(sub) | s.Find(sub) | Find substring (returns -1 if not found) |
| s.rfind(sub) | s.Rfind(sub) | Find last occurrence |
| s.replace(old, new) | s.Replace(old, new) | Replace all occurrences |
| s.split() | s.Split() | Split on whitespace |
| s.split(sep) | s.Split(sep) | Split on separator |
| s.join(items) | s.Join(items) | Join with separator |
| s.count(sub) | s.Count(sub) | Count occurrences |
| s.isdigit() | s.Isdigit() | Check if all digits |
| s.isalpha() | s.Isalpha() | Check if all alphabetic |
| s.isalnum() | s.Isalnum() | Check if alphanumeric |
| s.isspace() | s.Isspace() | Check if all whitespace |
| s.casefold() | s.Casefold() | Full Unicode case folding |
.NET Methods (Direct Access)
Since str is System.String, all .NET string methods are directly available:
s = "Hello, World!"
# .NET methods work directly
s.Contains("World") # True
s.Substring(0, 5) # "Hello"
s.PadLeft(20) # "       Hello, World!" (right-aligned to width 20: 7 leading spaces)
s.Insert(7, "Beautiful ") # "Hello, Beautiful World!"
Method Resolution
When both a Sharpy extension method and a .NET method could apply, the Sharpy extension method takes precedence via the compiler's name mangling.
Differences from Python
Despite the underlying .NET semantics, the operations most likely to raise questions behave the same as in Python. The table below compares them:
| Operation | Python | Sharpy/.NET |
|---|---|---|
| "ab" * 3 | "ababab" | "ababab" (✅ same) |
| s.split() | Splits on any whitespace | Splits on whitespace (✅ same) |
| s.split("") | ValueError | ValueError (✅ same) |
| s.count(sub) | Count non-overlapping | Count non-overlapping (✅ same) |
| s[::2] | Every other char | Slice syntax supported |
To split a string into individual characters in Sharpy, use:
chars = list("hello") # ['h', 'e', 'l', 'l', 'o']
# or
chars = [c for c in "hello"] # ['h', 'e', 'l', 'l', 'o']
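For reference, the Python column of the comparison table can be reproduced in plain CPython. The snippet below only demonstrates the Python side; it makes no claim about how Sharpy implements the same operations, and the variable names are illustrative.

```python
repeated = "ab" * 3                       # string repetition
whitespace_split = "  a \t b\n".split()   # no argument: split on runs of any whitespace
non_overlapping = "aaaa".count("aa")      # matches do not overlap
every_other = "abcdef"[::2]               # extended slice with step 2

try:
    "abc".split("")                       # an empty separator is rejected
    empty_sep_error = None
except ValueError as err:
    empty_sep_error = err

print(repeated, whitespace_split, non_overlapping, every_other)
print("split('') raised:", type(empty_sep_error).__name__)
```

Because split("") raises, converting a string to a list of characters goes through list() or a comprehension, as shown above.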
Implications for Sharpy Developers
- String length may differ from character count: len() returns UTF-16 code units, which may be more than the number of visible characters for strings containing emoji or rare Unicode characters.
- Indexing can split surrogate pairs: Be cautious when indexing or slicing strings that may contain characters outside the Basic Multilingual Plane (BMP).
- Use .NET APIs for Unicode-aware operations: When correctness with all Unicode text is required, use StringInfo, Rune, or other .NET globalization APIs.
- Most common text works as expected: ASCII text and most European/Asian scripts (within the BMP) have a 1:1 correspondence between characters and code units.
Implementation
- ✅ str maps to System.String; Python methods via Sharpy.StringExtensions; operators/indexing/iteration via Sharpy.StringHelpers.