Canonical Equivalence and Set<String> in Swift

How Unicode normalization affects Set's hashing and equality

Jan 30, 2026

When you create a Set of strings in Swift, you probably expect unique values to collapse into a single entry. What’s easy to miss is that this behavior depends on Unicode normalization—and Swift makes a very specific choice for you.

Even though a and b are built from different Unicode scalar sequences, Swift treats them as equal. That’s because Swift normalizes strings for equality and hashing. If two strings are canonically equivalent in Unicode, they are considered the same value when inserted into a Set or used as dictionary keys.

This is a deliberate design choice. Swift optimizes for human-perceived equality, not raw binary representation. From a user’s point of view, “é” is “é”, regardless of how it was typed, copied, or transmitted.

The subtlety is that this behavior is invisible until it matters. It shows up when you deduplicate user input, process international text, or handle data coming from multiple systems. If you assume that different scalar sequences always produce different values, your mental model of Set is already wrong.

import Foundation

// ORIGINAL JSON payload (VALID JSON)
// Two canonically equivalent strings encoded differently
let originalJSON = """
{
  "tags": ["\\u00E0", "a\\u0300"]
}
""".data(using: .utf8)!

struct Payload: Codable {
    let tags: [String]
}

// 1️⃣ Decode original JSON
let decoded = try JSONDecoder().decode(Payload.self, from: originalJSON)

print("Original decoded JSON:")
print(decoded.tags)
print("count =", decoded.tags.count) // ✅ 2
print()

// 2️⃣ Inspect internal representations
print("Decoded strings:")
for tag in decoded.tags {
    print(
        "\"\(tag)\" scalars:", tag.unicodeScalars.count,
        "characters:", tag.count
    )
}
print()

// 3️⃣ Convert Array → Set
// ISSUE:
// Set<String> collapses canonically equivalent strings.
// Information is lost here.
let rawSet = Set(decoded.tags)

print("After Set conversion (DATA LOSS):")
print("count =", rawSet.count) // ❌ 1 (was 2)
print(rawSet)
print()

// 4️⃣ Re-encode after Set
// This JSON is NOT equivalent to the original input.
let reencoded = try JSONEncoder().encode(
    Payload(tags: Array(rawSet))
)

print("Re-encoded JSON (information lost):")
print(String(data: reencoded, encoding: .utf8)!)

Output:

Original decoded JSON:
["à", "à"]
count = 2
Decoded strings:
"à" scalars: 1 characters: 1
"à" scalars: 2 characters: 1
After Set conversion (DATA LOSS):
count = 1
["à"]
Re-encoded JSON (information lost):
{"tags":["à"]}

The key takeaway: in Swift, string equality is semantic, not structural. Unicode normalization, for some operations, is baked into the language.

This is one of the edge cases I cover in my Unicode training.

Swift with Blaze

Discussion about this post

Ready for more?