Why Is Swift Counting This Grapheme Cluster as Two Characters Instead of One

Why is Swift counting this Grapheme Cluster as two characters instead of one?

Part of the answer is given in the bug report mentioned in emrys57's comment. When splitting a Unicode string into "characters", Swift apparently uses the Grapheme Cluster Boundaries defined in UAX #29 Unicode Text Segmentation. There's a rule not to break between regional indicator symbols, but there is no such rule for Emoji modifiers. So, according to UAX #29, the string "\u{1f6b4}\u{1f3fe}" contains two grapheme clusters. See this message from Ken Whistler on the Unicode mailing list for an explanation:

This results from the fact that the fallback behavior for the modifiers is simply as independent pictographic blorts, i.e. the color swatch images. [...] You need additional, specific knowledge about these sequences -- it doesn't just fall out from a default implementation of UAX #29 rules for grapheme clusters.

Swift string indexing combines \r\n as one char instead of two

TLDR: \r\n is a grapheme cluster and is treated as a single Character in Swift because Unicode.



  • Swift treats \r\n as one Character.

  • Objective-C NSString treats it as two characters (in terms of the result from length); see the short sketch after this list.
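Both points are easy to verify; a minimal sketch (Foundation is imported only for NSString):

import Foundation

let crlf = "\r\n"
print(crlf.count)                  // 1 - one extended grapheme cluster, i.e. one Character
print((crlf as NSString).length)   // 2 - two UTF-16 code units
print(crlf.utf16.count)            // 2 - the same measure, straight from Swift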

On the swift-users forum someone wrote:

– "\r\n" is a single Character. Is this the correct behaviour?

– Yes, a Character corresponds to a Unicode grapheme cluster, and "\r\n" is considered a single grapheme cluster.

The subsequent response linked to the Unicode documentation; see the grapheme cluster boundary table there, which officially lists CR LF as a single grapheme cluster.

Take a look at the Apple documentation on Characters and Grapheme Clusters.

It's common to think of a string as a sequence of characters, but when working with NSString objects, or with Unicode strings in general, in most cases it is better to deal with substrings rather than with individual characters. The reason for this is that what the user perceives as a character in text may in many cases be represented by multiple characters in the string.

The Swift documentation on Strings and Characters is also worth reading.

This overview from objc.io is interesting as well.

NSString represents UTF-16-encoded text. Length, indices, and ranges are all based on UTF-16 code units.

Another example of this is an emoji like 👍🏻. This single character is actually two Unicode scalars, U+1F44D followed by U+1F3FB (the four UTF-16 code units D83D DC4D D83C DFFB). But if you called count on a string containing just that emoji you'd (correctly) get 1.

If you wanted to see the scalars you could iterate them as follows:

for scalar in text.unicodeScalars {
    print("\(scalar.value) ", terminator: "")
}

Which for "\r\n" would give you 13 10

In the Swift documentation you'll find why NSString is different:

The count of the characters returned by the count property isn’t always the same as the length property of an NSString that contains the same characters. The length of an NSString is based on the number of 16-bit code units within the string’s UTF-16 representation and not the number of Unicode extended grapheme clusters within the string.

Thus this isn't really "strange" behaviour of Swift string indexing, but rather a result of how Unicode treats these characters and how String in Swift is designed. Swift string indexing goes by Character and \r\n is a single Character.
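One consequence worth seeing in code: because \r\n is a single Character, indexing by Character steps over the pair in one hop. A small sketch:

let s = "a\r\nb"
print(s.count)                                  // 3 Characters: "a", "\r\n", "b"
let second = s.index(s.startIndex, offsetBy: 1)
print(s[second] == "\r\n")                      // true - the CR LF pair is one Character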

If a sequence of code points forms a Unicode character, does every non-empty prefix of that sequence also form a valid character?

After taking a long hard look at the specification for computing the boundaries for extended grapheme clusters (EGCs) at https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundary_Rules,
it is apparent that the rules for EGCs all have the shape of describing when it is allowed to append a code point to an existing EGC to form a longer EGC. From that fact alone the answers to my two questions follow: 1) Yes, every non-empty prefix of a code point sequence which forms an EGC is itself an EGC. 2) No, adding a code point to a valid Unicode string will not decrease its length in terms of the number of EGCs it consists of.

So, given this, the following Swift code will extract the longest Unicode character from the start of a byte sequence (or return nil if there is no valid Unicode character there):

import Foundation   // String(bytes:encoding:) comes from Foundation

func lex<S: Sequence>(_ input: S) -> (length: Int, out: Character)? where S.Element == UInt8 {
    // This code works under three assumptions, all of which are true:
    // 1) If a sequence of code points does not form a valid character, then appending code points to it does not yield a valid character
    // 2) Appending code points to a sequence of code points does not decrease its length in terms of extended grapheme clusters
    // 3) A code point takes up at most 4 bytes in a UTF-8 encoding
    var chars: [UInt8] = []
    var result: String = ""
    var resultLength = 0
    func value() -> (length: Int, out: Character)? {
        guard let character = result.first else { return nil }
        return (length: resultLength, out: character)
    }
    var length = 0
    var iterator = input.makeIterator()
    while length - resultLength <= 4 {
        guard let char = iterator.next() else { return value() }
        chars.append(char)
        length += 1
        guard let s = String(bytes: chars, encoding: .utf8) else { continue }
        guard s.count == 1 else { return value() }
        result = s
        resultLength = length
    }
    return value()
}
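For illustration, a small usage sketch of the function above (the input bytes are arbitrary):

let bytes = Array("\r\nrest".utf8)
if let (length, character) = lex(bytes) {
    print(length)                  // 2 - CR and LF are two bytes...
    print(character == "\r\n")     // true - ...but they form a single Character
}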

Swift Dictionary is slow?

Because the complexity of count on a String is O(n), you should save the count in a variable instead of recomputing it. You can read more in the chapter Strings and Characters of the Swift book:

Extended grapheme clusters can be composed of multiple Unicode scalars. This means that different characters—and different representations of the same character—can require different amounts of memory to store. Because of this, characters in Swift don’t each take up the same amount of memory within a string’s representation. As a result, the number of characters in a string can’t be calculated without iterating through the string to determine its extended grapheme cluster boundaries. If you are working with particularly long string values, be aware that the count property must iterate over the Unicode scalars in the entire string in order to determine the characters for that string.

The count of the characters returned by the count property isn’t always the same as the length property of an NSString that contains the same characters. The length of an NSString is based on the number of 16-bit code units within the string’s UTF-16 representation and not the number of Unicode extended grapheme clusters within the string.
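As a sketch of that advice (the loop bodies are just placeholders):

let text = String(repeating: "a", count: 100_000)

// Slow: text.count is recomputed (an O(n) walk over the string) on every iteration.
var i = 0
while i < text.count {
    i += 1
}

// Better: pay the O(n) cost once and reuse the result.
let length = text.count
var j = 0
while j < length {
    j += 1
}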

Is creation of an instance of type ArraySlice<Character> via this approach an O(1) time operation?

No, an array slice cannot be created in O(1) this way. Specifically, array slices can only be created from an already existing array in O(1) time; the initializer you are calling on ArraySlice has to copy the sequence you pass in (O(n)) and then creates a slice over the intermediate array it builds (O(1)):

@inlinable
public init<S: Sequence>(_ s: S) where S.Element == Element {
    self.init(_buffer: s._copyToContiguousArray()._buffer)
}

Wouldn't the string have to first be traversed entirely to know the actual size of each character?

This is exactly correct.
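To make the contrast concrete, a small sketch (the names are illustrative):

let str = "hello world"

// O(n): the ArraySlice initializer copies the Characters into a fresh buffer.
let copied: ArraySlice<Character> = ArraySlice(str)

// O(n) once to build the array, then O(1) per slice: the slice shares the array's storage.
let array = Array(str)
let slice: ArraySlice<Character> = array[0..<5]   // "hello"

print(copied.count, slice.count)   // 11 5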

Why should we use String.Index instead of Int as index of Character in String?

First, you can't use Int as an index for a string. The interface requires String.Index.

Why? We are using Unicode, not ASCII. The unit for Swift strings is a Character, which is a grapheme cluster. A character can consist of multiple Unicode code points, and each Unicode code point can take 1 to 4 bytes in UTF-8.

Now let's say you have a 10-megabyte string and you ran a search to find the substring "Wysteria". Would you want the result to be the character number where the match starts? If it's character 123,456, then to find the same substring again you would have to start at the beginning of the string and analyze 123,456 characters just to get back there. That is madly inefficient.

Instead we get a String.Index, which is something that allows Swift to locate that substring quickly. It is most likely a byte offset, so it can be accessed very quickly.

Now adding "1" to that byte offset is nonsense, because you don't know how long the first character is. (It's quite possible that Unicode has another character that equals the ASCII 'W'). So you need to call a function that returns the index of the next character.

You can write code that returns the second Character from a string cheaply; returning the one-millionth Character takes significant time. Swift's design simply doesn't let that enormous cost hide behind an innocent-looking integer subscript.
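A minimal sketch of what working with String.Index looks like in practice:

let text = "Wysteria"

let second = text.index(after: text.startIndex)         // index of the second Character
print(text[second])                                     // "y"

let fourth = text.index(text.startIndex, offsetBy: 3)   // walks forward 3 Characters, O(n) in the offset
print(text[fourth])                                     // "t"

// text[1] would not compile: String is deliberately not subscriptable by Int.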


