Canonical Equivalence and Set<String> in Swift
How Unicode normalization affects Set's hashing and equality
When you create a Set of strings in Swift, you probably expect unique values to collapse into a single entry. Whatβs easy to miss is that this behavior depends on Unicode normalizationβand Swift makes a very specific choice for you.
Even though a and b are built from different Unicode scalar sequences, Swift treats them as equal. Thatβs because Swift normalizes strings for equality and hashing. If two strings are canonically equivalent in Unicode, they are considered the same value when inserted into a Set or used as dictionary keys.
This is a deliberate design choice. Swift optimizes for human-perceived equality, not raw binary representation. From a userβs point of view, βΓ©β is βΓ©β, regardless of how it was typed, copied, or transmitted.
The subtlety is that this behavior is invisible until it matters. It shows up when you deduplicate user input, process international text, or handle data coming from multiple systems. If you assume that different scalar sequences always produce different values, your mental model of Set is already wrong.
import Foundation
// ORIGINAL JSON payload (VALID JSON)
// Two canonically equivalent strings encoded differently
let originalJSON = """
{
"tags": ["\\u00E0", "a\\u0300"]
}
""".data(using: .utf8)!
struct Payload: Codable {
let tags: [String]
}
// 1οΈβ£ Decode original JSON
let decoded = try JSONDecoder().decode(Payload.self, from: originalJSON)
print("Original decoded JSON:")
print(decoded.tags)
print("count =", decoded.tags.count) // β
2
print()
// 2οΈβ£ Inspect internal representations
print("Decoded strings:")
for tag in decoded.tags {
print(
"\"\(tag)\" scalars:", tag.unicodeScalars.count,
"characters:", tag.count
)
}
print()
// 3οΈβ£ Convert Array β Set
// ISSUE:
// Set<String> collapses canonically equivalent strings.
// Information is lost here.
let rawSet = Set(decoded.tags)
print("After Set conversion (DATA LOSS):")
print("count =", rawSet.count) // β 1 (was 2)
print(rawSet)
print()
// 4οΈβ£ Re-encode after Set
// This JSON is NOT equivalent to the original input.
let reencoded = try JSONEncoder().encode(
Payload(tags: Array(rawSet))
)
print("Re-encoded JSON (information lost):")
print(String(data: reencoded, encoding: .utf8)!)Output:
Original decoded JSON:
["Γ ", "aΜ"]
count = 2
Decoded strings:
"Γ " scalars: 1 characters: 1
"aΜ" scalars: 2 characters: 1
After Set conversion (DATA LOSS):
count = 1
["Γ "]
Re-encoded JSON (information lost):
{"tags":["Γ "]}The key takeaway: in Swift, string equality is semantic, not structural. Unicode normalization, for some operations, is baked into the language.
This is one of the edge cases I cover in my Unicode training.


