What Does It Mean That String and Character Comparisons in Swift Are Not Locale-Sensitive

What does it mean that string and character comparisons in Swift are not locale-sensitive?

(All code examples have now been updated for Swift 3.)

Comparing Swift strings with < does a lexicographical comparison
based on the so-called "Unicode Normalization Form D" (which can be computed with
decomposedStringWithCanonicalMapping).

For example, the decomposition of

"ä" = U+00E4 = LATIN SMALL LETTER A WITH DIAERESIS

is the sequence of two Unicode code points

U+0061,U+0308 = LATIN SMALL LETTER A + COMBINING DIAERESIS
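As a quick check of the decomposition described above, the following minimal sketch prints the code points of both forms. The variable names are only illustrative, and it assumes Foundation is available for decomposedStringWithCanonicalMapping.

```swift
import Foundation

// Precomposed "ä" (U+00E4) and its canonical decomposition "a" + U+0308.
let precomposed = "\u{00E4}"
let decomposed = precomposed.decomposedStringWithCanonicalMapping

// Collect the code points of each form as integers.
let precomposedScalars = precomposed.unicodeScalars.map { $0.value }
let decomposedScalars = decomposed.unicodeScalars.map { $0.value }

print(precomposedScalars) // [228]      (0x00E4)
print(decomposedScalars)  // [97, 776]  (0x0061, 0x0308)

// Swift treats canonically equivalent strings as equal.
print(precomposed == decomposed) // true
```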

For demonstration purposes, I have written a small String extension which dumps the
contents of the String as an array of Unicode code points:

extension String {
    var unicodeData : String {
        return self.unicodeScalars.map {
            String(format: "%04X", $0.value)
        }.joined(separator: ",")
    }
}

Now let's take some strings and sort them with <:

let someStrings = ["ǟ", "ä", "ã", "a", "ă", "b"].sorted()
print(someStrings)
// ["a", "ã", "ă", "ä", "ǟ", "b"]

and dump the Unicode code points of each string (in original and decomposed
form) in the sorted array:

for str in someStrings {
    print("\(str) \(str.unicodeData) \(str.decomposedStringWithCanonicalMapping.unicodeData)")
}

The output

a 0061 0061
ã 00E3 0061,0303
ă 0103 0061,0306
ä 00E4 0061,0308
ǟ 01DF 0061,0308,0304
b 0062 0062

nicely shows that the comparison is done by a lexicographic ordering of the Unicode
code points in the decomposed form.

This is also true for strings of more than one character, as the following example
shows. With

let someStrings = ["ǟψ", "äψ", "ǟx", "äx"].sorted()

the output of the above loop is

äx 00E4,0078 0061,0308,0078
ǟx 01DF,0078 0061,0308,0304,0078
ǟψ 01DF,03C8 0061,0308,0304,03C8
äψ 00E4,03C8 0061,0308,03C8

which means that

"äx" < "ǟx", but "äψ" > "ǟψ"

(which was at least unexpected for me).
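To see why the two pairs order differently, the decomposed code points can be compared explicitly. This is only a sketch: nfdScalars and nfdPrecedes are hypothetical helpers, not part of the standard library, and the built-in < may not give the same result on every Swift version or platform.

```swift
import Foundation

// Hypothetical helper: the NFD (canonically decomposed) code points of a string.
func nfdScalars(_ s: String) -> [UInt32] {
    return s.decomposedStringWithCanonicalMapping.unicodeScalars.map { $0.value }
}

// Hypothetical helper: true if `lhs` sorts before `rhs` when the
// decomposed code points are compared lexicographically.
func nfdPrecedes(_ lhs: String, _ rhs: String) -> Bool {
    return nfdScalars(lhs).lexicographicallyPrecedes(nfdScalars(rhs))
}

// ä = U+00E4, ǟ = U+01DF, ψ = U+03C8
// "äx" vs "ǟx": the first difference is 0078 vs 0304, and 0078 < 0304.
print(nfdPrecedes("\u{00E4}x", "\u{01DF}x"))               // true
// "äψ" vs "ǟψ": the first difference is 03C8 vs 0304, and 03C8 > 0304.
print(nfdPrecedes("\u{00E4}\u{03C8}", "\u{01DF}\u{03C8}")) // false
```

The combining macron (U+0304) of ǟ is compared against whatever follows ä in the other string, which is why the result flips depending on the next character.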

Finally, let's compare this with a locale-sensitive ordering, for example Swedish:

let locale = Locale(identifier: "sv") // svenska
var someStrings = ["ǟ", "ä", "ã", "a", "ă", "b"]
someStrings.sort {
    $0.compare($1, locale: locale) == .orderedAscending
}

print(someStrings)
// ["a", "ă", "ã", "b", "ä", "ǟ"]

As you see, the result is different from the Swift < sorting.

How does the Swift string more than operator work

I believe JavaScript uses exactly the same string comparison approach, and the same syntax. In JavaScript you could also use localeCompare(). And in Swift you could alternatively use localizedCompare(_:) (or one of the other string comparison functions). These are all different ways, with different options, to compare strings alphabetically.

String comparison in Swift is not transitive

It looks like this is not supposed to happen:

Q: Is transitive consistency maintained by the [Unicode Collation Algorithm]?

A: Yes, for any strings A, B, and C, if A < B and B < C, then A < C. However, implementers must be careful to produce implementations that accurately reproduce the results of the Unicode Collation Algorithm as they optimize their own algorithms. It is easy to perform careless optimizations — especially with Incremental Comparison algorithms — that fail this test. Other items to check are the proper distinction between the bases of accents. For example, the sequence <u-macron, u-diaeresis-macron> should compare as less than <u-macron-diaeresis, u-macron>; this is a secondary distinction, based on the weighting of the accents, which must be correctly associated with the primary weights of their respective base letters.

(Source: Unicode Collation FAQ)

In the UnicodeNormalization.cpp file, ucol_strcoll and ucol_strcollIter are called, which are part of the ICU project. This may be a bug in the Swift standard library or the ICU project.
I reported this issue to the Swift Bug Tracker.

String comparison (<) returns different results on different platforms?

This is a known open "bug" (or perhaps rather a known limitation):

  • SR-530 - [String] sort order varies on Darwin vs. Linux

Quoting Dave Abrahams' comment to the open bug report:

This will mostly be fixed by the new string work, wherein String's
default sort order will be implemented as a lexicographical ordering
of FCC-normalized UTF16 code units.

Note that on both platforms we rely on ICU for normalization services,
and normalization differences among different implementations of ICU
are a real possibility, so there will never be a guarantee that two
arbitrary strings sort the same on both platforms.

However, for Latin-1 strings such as those in the example, the new
work will fix the problem.

Moreover, from The String Manifest:

Comparing and Hashing Strings

...

Following this scheme everywhere would also allow us to make sorting
behavior consistent across platforms. Currently, we sort String
according to the UCA, except that--only on Apple platforms--pairs of
ASCII characters are ordered by unicode scalar value.

Most likely, for the OP's particular example (which covers only ASCII characters), Linux compares according to the UCA (Unicode Collation Algorithm), whereas Apple platforms order these single-ASCII-character String instances (or String instances starting with ASCII characters) by Unicode scalar value.

// ASCII value
print("S".unicodeScalars.first!.value) // 83
print("g".unicodeScalars.first!.value) // 103

// Unicode scalar value
print(String(format: "%04X", "S".unicodeScalars.first!.value)) // 0053
print(String(format: "%04X", "g".unicodeScalars.first!.value)) // 0067

print("S" < "g") // 'true' on Apple platforms (comparison by unicode scalar value),
// 'false' on Linux platforms (comparison according to UCA)
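If an ordering that does not depend on the platform is needed, one option is to compare the Unicode scalars directly instead of relying on <. The sketch below uses a hypothetical helper, scalarPrecedes; note that it ignores normalization, so canonically equivalent strings may be ordered differently from each other.

```swift
// Hypothetical helper: order two strings by their Unicode scalar values only,
// without any normalization or collation.
func scalarPrecedes(_ lhs: String, _ rhs: String) -> Bool {
    return lhs.unicodeScalars.lexicographicallyPrecedes(rhs.unicodeScalars)
}

print(scalarPrecedes("S", "g")) // true, since U+0053 < U+0067
print(scalarPrecedes("g", "S")) // false
```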

See also the excellent accepted answer to the following Q&A:

  • What does it mean that string and character comparisons in Swift are not locale-sensitive?

How to compare two strings ignoring case in Swift language?

Try this:

let a = "Cash"
let b = "cash"
let result: ComparisonResult = a.compare(b, options: .caseInsensitive, range: nil, locale: nil)

// You can also omit the last two parameters (thanks 0x7fffffff)
//let result: ComparisonResult = a.compare(b, options: .caseInsensitive)

result is of type ComparisonResult, which is an enum:

enum ComparisonResult : Int {
    case orderedAscending
    case orderedSame
    case orderedDescending
}

So you can use an if statement:

if result == .orderedSame {
    print("equal")
} else {
    print("not equal")
}
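A shorter alternative, sketched here with illustrative variable names, is Foundation's caseInsensitiveCompare(_:). Comparing lowercased copies also works for simple cases, though it allocates two new strings.

```swift
import Foundation

let first = "Cash"
let second = "cash"

// caseInsensitiveCompare(_:) returns a ComparisonResult, like compare(_:options:).
let sameIgnoringCase = first.caseInsensitiveCompare(second) == .orderedSame
print(sameIgnoringCase) // true

// Simple alternative: compare lowercased copies of both strings.
let sameLowercased = first.lowercased() == second.lowercased()
print(sameLowercased) // true
```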

swift string diacriticInsensitive not working correct

This precisely matches the meaning of diacriticInsensitive. UTR #30 covers this. "Diacritic removal" includes "stroke, hook, descender" and all other "diacritics" returning the "related base character." While in Swedish å is considered a separate letter for sorting purposes, it still has a "base character" of (Latin) a. (Similarly for ä and ö.) This is a complex problem in Swedish, but the results should not be surprising.

The ultimate rules are in Unicode's DiacriticFolding. These rules are not locale specific. It's possible that Foundation applies some additional locale rules, but clearly not in this case. The relevant Unicode folding rule is:

0061 030A;  0061    # å → a LATIN SMALL LETTER A, COMBINING RING ABOVE → LATIN SMALL LETTER A

Many cultures have subtle definitions of what is "a letter" vs "an extension of another letter" vs "a half-letter" vs "a non-letter symbol." When computing diacritics, the Turkish "İ" has a base form of "I", but "i" does not have a base form of "ı". That's bizarre, but true, because it's treating "basic latin" as the base alphabet. ("Basic Latin" is itself a bizarre classification, with letters j, u, and w being somewhat modern additions. But still we call it "Latin.")

Unicode tries to "thread the needle" on these complex issues, with varying success. It tends to be biased towards Romance languages (and particularly Western European countries). But it does try. And it has a focus on what users will expect. So should a search for "halla" find "Hallå"? I'm betting that most Swedes would consider that "close enough."
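The "halla" versus "Hallå" search case can be sketched with Foundation's folding and comparison options. The variable names are illustrative, and passing locale: nil is an assumption made here to keep the folding independent of the user's locale.

```swift
import Foundation

// Fold away diacritics without applying any locale-specific rules.
// "Hallå" is written with å as U+00E5.
let folded = "Hall\u{00E5}".folding(options: .diacriticInsensitive, locale: nil)
print(folded) // Halla

// Combine case- and diacritic-insensitive matching for a search-style comparison.
let matches = "halla".compare("Hall\u{00E5}",
                              options: [.caseInsensitive, .diacriticInsensitive]) == .orderedSame
print(matches) // true
```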

Keyboards are designed to be useful to the cultures they're created for, so whether a particular symbol appears on the keyboard shouldn't be assumed to be making any strong claim about how the alphabet works. The iOS Arabic keyboard includes the half-letter "ء". That isn't making a claim about how the alphabet works. It's just saying that ء is somewhat commonly typed when writing Arabic.

Not getting max element of array

That is an array of strings, and those are compared lexicographically:

"1" < "10" < "2" < ... < "9" 

For example "10" < "2" because the initial characters already
satisfy "1" < "2". (For the gory details, see for example
What does it mean that string and character comparisons in Swift are not locale-sensitive?.)

Using an array of integers would be the best solution:

let prodIDArr = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
let maxId = prodIDArr.max()
print(maxId) // Optional(10)

If that is not possible then you can enforce a numeric comparison with

let prodIDArr = ["1","2","3","4","5","6","7","8","9","10"]
let maxId = prodIDArr.max(by: { $0.compare($1, options: .numeric) == .orderedAscending })
print(maxId) // Optional("10")
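Another option, if the strings are known to hold integers, is to convert them before taking the maximum. This is a sketch with illustrative names; it assumes Swift 4.1 or later for compactMap (earlier versions would use flatMap), and it silently drops any entry that is not a valid integer.

```swift
let prodIDStrings = ["1", "2", "3", "4", "5", "6", "7", "8", "9", "10"]

// Plain string comparison picks "9", because "10" < "2" < ... < "9".
let maxAsString = prodIDStrings.max()
print(maxAsString ?? "none") // 9

// Convert to Int first; entries that fail to parse are dropped.
let maxNumericID = prodIDStrings.compactMap { Int($0) }.max()
print(maxNumericID ?? -1) // 10
```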

