Swift String Indexing Combines "\r\n" as One Char Instead of Two

TLDR: \r\n is treated as a single Character in Swift because Unicode defines CR LF as one grapheme cluster, and a Swift Character is a grapheme cluster.



  • Swift treats \r\n as one Character.

  • Objective-C NSString treats it as two characters (in terms of the result from length).

On the swift-users forum someone wrote:

– "\r\n" is a single Character. Is this the correct behaviour?

– Yes, a Character corresponds to a Unicode grapheme cluster, and "\r\n" is considered a single grapheme cluster.

A subsequent response linked to the Unicode documentation, which contains a table officially stating that CRLF is a grapheme cluster.

Take a look at the Apple documentation on Characters and Grapheme Clusters.

It's common to think of a string as a sequence of characters, but when working with NSString objects, or with Unicode strings in general, in most cases it is better to deal with substrings rather than with individual characters. The reason for this is that what the user perceives as a character in text may in many cases be represented by multiple characters in the string.

The Swift documentation on Strings and Characters is also worth reading.

This overview from objc.io is interesting as well.

NSString represents UTF-16-encoded text. Length, indices, and ranges are all based on UTF-16 code units.

Another example of this is an emoji like 👍🏻. This single character is actually %uD83D%uDC4D%uD83C%uDFFB: four UTF-16 code units that encode two Unicode scalars (U+1F44D and the skin tone modifier U+1F3FB). But if you called count on a string with just that emoji, you'd (correctly) get 1.

If you wanted to see the scalars you could iterate them as follows:

let text = "\r\n"
for scalar in text.unicodeScalars {
    print("\(scalar.value) ", terminator: "")
}

Which for "\r\n" would give you 13 10

In the Swift documentation you'll find why NSString is different:

The count of the characters returned by the count property isn’t always the same as the length property of an NSString that contains the same characters. The length of an NSString is based on the number of 16-bit code units within the string’s UTF-16 representation and not the number of Unicode extended grapheme clusters within the string.

Thus this isn't really "strange" behaviour of Swift string indexing, but rather a result of how Unicode treats these characters and how String in Swift is designed. Swift string indexing goes by Character and \r\n is a single Character.
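As a rough illustration of the difference, the sketch below compares the views of one string. The name crlf is only for illustration, and utf16.count stands in here for what NSString's length reports, since that length is based on UTF-16 code units.

```swift
let crlf = "\r\n"

// One Character: CR LF forms a single extended grapheme cluster
print(crlf.count)                 // 1

// Two Unicode scalars: U+000D and U+000A
print(crlf.unicodeScalars.count)  // 2

// Two UTF-16 code units, which is what an NSString length counts
print(crlf.utf16.count)           // 2

// In the opposite order the two scalars do not combine
print("\n\r".count)               // 2
```

Only the CR-then-LF order forms a cluster, so the result depends on the order of the scalars, not just on which scalars are present.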

How do you use String.substringWithRange? (or, how do Ranges work in Swift?)

You can use the substringWithRange method. It takes a start and end String.Index.

var str = "Hello, playground"
str.substringWithRange(Range<String.Index>(start: str.startIndex, end: str.endIndex)) //"Hello, playground"

To change the start and end index, use advancedBy(n).

var str = "Hello, playground"
str.substringWithRange(Range<String.Index>(start: str.startIndex.advancedBy(2), end: str.endIndex.advancedBy(-1))) //"llo, playgroun"

You can also still use the NSString method with NSRange, but you have to make sure you are using an NSString like this:

let myNSString = str as NSString
myNSString.substringWithRange(NSRange(location: 0, length: 3))

Note: as JanX2 mentioned, this second method is not safe with unicode strings.
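The snippets above use older Swift syntax. As a minimal sketch, assuming a Swift 4 or later toolchain, the same slice can be taken with index(_:offsetBy:) and a range subscript; the names start, end, and result are only illustrative.

```swift
let str = "Hello, playground"

// Offsets are counted in Characters, relative to an existing index
let start = str.index(str.startIndex, offsetBy: 2)
let end = str.index(str.endIndex, offsetBy: -1)

// Subscripting with a Range<String.Index> yields a Substring,
// which is converted back to a String for long-term storage
let result = String(str[start..<end])
print(result) // llo, playgroun
```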

How do I make a new line in swift

You should be able to use \n inside a Swift string, and it should work as expected, creating a newline character. You will want to remove the space after the \n for proper formatting like so:

var example: String = "Hello World \nThis is a new line"

Which, if printed to the console, should become:

Hello World
This is a new line

However, there are some other considerations to make depending on how you will be using this string, such as:

  • If you are setting it to a UILabel's text property, make sure that the UILabel's numberOfLines = 0, which allows for infinite lines.
  • In some networking use cases, use \r\n instead, which is the Windows newline.

Edit: You said you're using a UITextField, but it does not support multiple lines. You must use a UITextView.
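To tie the two newline styles back to Characters, here is a small sketch, assuming Swift 5, that splits text into lines regardless of whether \n or \r\n was used. The variable names are made up for the example.

```swift
let unixText = "Hello World\nThis is a new line"
let windowsText = "Hello World\r\nThis is a new line"

// Character.isNewline is true for "\n" and also for "\r\n",
// because "\r\n" is a single Character
let unixLines = unixText.split(whereSeparator: { $0.isNewline })
let windowsLines = windowsText.split(whereSeparator: { $0.isNewline })

print(unixLines.count, windowsLines.count) // 2 2
print(unixLines == windowsLines)           // true
```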

Swift 3.0 iterate over String.Index range

You can traverse a string by using the indices property of the characters property like this:

let letters = "string"
let middle = letters.index(letters.startIndex, offsetBy: letters.characters.count / 2)

for index in letters.characters.indices {
    // Stop at the midpoint to traverse only the first half of the string
    if index == middle { break }

    print(letters[index]) // prints s, t, r (without the break above: s, t, r, i, n, g)
}
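In Swift 4 and later, String is itself a collection of Characters, so the characters view is no longer needed. The following is a sketch of the same traversal under that assumption; firstHalf is a name introduced only for this example.

```swift
let letters = "string"
let middle = letters.index(letters.startIndex, offsetBy: letters.count / 2)

// Walk the indices directly on the String and stop at the midpoint
var firstHalf = ""
for index in letters.indices {
    if index == middle { break }
    firstHalf.append(letters[index])
}
print(firstHalf) // str

// The same result using a partial range subscript
print(String(letters[..<middle])) // str
```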

From the documentation in section Strings and Characters - Counting Characters:

Extended grapheme clusters can be composed of one or more Unicode scalars. This means that different characters—and different representations of the same character—can require different amounts of memory to store. Because of this, characters in Swift do not each take up the same amount of memory within a string’s representation. As a result, the number of characters in a string cannot be calculated without iterating through the string to determine its extended grapheme cluster boundaries.

emphasis is my own.

This will not work:

let secondChar = letters[1] 
// error: subscript is unavailable, cannot subscript String with an Int
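One way to get at the Character at an integer offset is to convert the offset into a String.Index first. The helper below is a hypothetical sketch (character(in:atOffset:) is not a standard library function) that returns nil instead of crashing when the offset is out of range.

```swift
// Hypothetical helper: returns the Character at a Character offset, or nil
func character(in text: String, atOffset offset: Int) -> Character? {
    guard offset >= 0,
          // Walks the string cluster by cluster; gives nil when it would pass endIndex
          let index = text.index(text.startIndex, offsetBy: offset, limitedBy: text.endIndex),
          // endIndex itself is a valid index but cannot be subscripted
          index < text.endIndex
    else { return nil }
    return text[index]
}

let letters = "string"
print(character(in: letters, atOffset: 1) as Any)  // Optional("t")
print(character(in: letters, atOffset: 10) as Any) // nil
```

Each call walks the string from the start, so calling this in a loop costs more than iterating the indices once.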

Efficiently replace all accented characters in a string?

I can't speak to what you are trying to do specifically with the function itself, but if you don't like the regex being built every time, here are two solutions and some caveats about each.

Here is one way to do this:

function makeSortString(s) {
  if (!makeSortString.translate_re) makeSortString.translate_re = /[öäüÖÄÜ]/g;
  var translate = {
    "ä": "a", "ö": "o", "ü": "u",
    "Ä": "A", "Ö": "O", "Ü": "U" // probably more to come
  };
  return ( s.replace(makeSortString.translate_re, function(match) {
    return translate[match];
  }) );
}

This will obviously make the regex a property of the function itself. The only thing you may not like about this (or you may, I guess it depends) is that the regex can now be modified outside of the function's body. So, someone could do this to modify the internally used regex:

makeSortString.translate_re = /[a-z]/g;

So, there is that option.

One way to get a closure, and thus prevent someone from modifying the regex, would be to define this as an anonymous function assignment like this:

var makeSortString = (function() {
  var translate_re = /[öäüÖÄÜ]/g;
  return function(s) {
    var translate = {
      "ä": "a", "ö": "o", "ü": "u",
      "Ä": "A", "Ö": "O", "Ü": "U" // probably more to come
    };
    return ( s.replace(translate_re, function(match) {
      return translate[match];
    }) );
  }
})();

Hopefully this is useful to you.


UPDATE: It's early and I don't know why I didn't see the obvious before, but it might also be useful to put your translate object in a closure as well:

var makeSortString = (function() {
  var translate_re = /[öäüÖÄÜ]/g;
  var translate = {
    "ä": "a", "ö": "o", "ü": "u",
    "Ä": "A", "Ö": "O", "Ü": "U" // probably more to come
  };
  return function(s) {
    return ( s.replace(translate_re, function(match) {
      return translate[match];
    }) );
  }
})();

Swift: How to get substring from start to last index of character

Just accessing backward

The best way is to use substringToIndex combined with the endIndex property and the advance global function.

var string1 = "www.stackoverflow.com"

var index1 = advance(string1.endIndex, -4)

var substring1 = string1.substringToIndex(index1)

Looking for a string starting from the back

Use rangeOfString and set options to .BackwardsSearch

var string2 = "www.stackoverflow.com"

var index2 = string2.rangeOfString(".", options: .BackwardsSearch)?.startIndex

var substring2 = string2.substringToIndex(index2!)

No extensions, pure idiomatic Swift

Swift 2.0

advance is now a part of Index and is called advancedBy. You do it like:

var string1 = "www.stackoverflow.com"

var index1 = string1.endIndex.advancedBy(-4)

var substring1 = string1.substringToIndex(index1)

Swift 3.0

You can't call advancedBy on a String index anymore, because a String has variable-size elements and the index cannot advance without the string. You have to use index(_:offsetBy:) on the string itself.

var string1 = "www.stackoverflow.com"

var index1 = string1.index(string1.endIndex, offsetBy: -4)

var substring1 = string1.substring(to: index1)

A lot of things have been renamed. Enum cases are now written in lowerCamelCase, and a range's startIndex became lowerBound.

var string2 = "www.stackoverflow.com"

var index2 = string2.range(of: ".", options: .backwards)?.lowerBound

var substring2 = string2.substring(to: index2!)

Also, I wouldn't recommend force unwrapping index2. You can use optional binding or map. Personally, I prefer using map:

var substring3 = index2.map(string2.substring(to:))

Swift 4

The Swift 3 version is still valid, but you can now use subscripts with index ranges:

let string1 = "www.stackoverflow.com"

let index1 = string1.index(string1.endIndex, offsetBy: -4)

let substring1 = string1[..<index1]

The second approach remains unchanged:

let string2 = "www.stackoverflow.com"

let index2 = string2.range(of: ".", options: .backwards)?.lowerBound

let substring3 = index2.map(string2.substring(to:))
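For newer toolchains, here is a sketch that avoids substring(to:) altogether. It assumes Swift 4.2 or later, where lastIndex(of:) is available on String without Foundation; the names prefix and trimmed are illustrative.

```swift
let string2 = "www.stackoverflow.com"

// lastIndex(of:) searches from the end and returns nil when there is no match,
// so map keeps the result optional instead of force unwrapping
let prefix = string2.lastIndex(of: ".").map { String(string2[..<$0]) }
print(prefix ?? "no dot found") // www.stackoverflow

// Dropping a fixed number of Characters from the end
let trimmed = String(string2.dropLast(4))
print(trimmed) // www.stackoverflow
```

Both slices are Substrings that share storage with the original, which is why they are wrapped in String(...) before being kept around.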

Find out if Character in String is emoji?

What I stumbled upon is the difference between characters, unicode scalars and glyphs.

For example, the glyph 👨‍👩‍👧‍👧 consists of 7 Unicode scalars:

  • Four emoji characters: 👨👩👧👧
  • In between each emoji is a special character (the zero-width joiner, U+200D), which works like character glue; see the specs for more info

Another example, an emoji glyph with a skin tone, consists of 2 Unicode scalars:

  • The regular emoji
  • A skin tone modifier

Last one, the glyph 1️⃣ contains three Unicode scalars:

  • The digit one: 1
  • The variation selector (U+FE0F)
  • The Combining Enclosing Keycap (U+20E3)

So when rendering the characters, the resulting glyphs really matter.
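The breakdown above can be checked with a short sketch. It assumes a Swift 5 toolchain (grapheme breaking of emoji sequences differs in older versions), builds the strings from escapes so the source does not depend on emoji rendering, and uses a helper scalarCodes(of:) that is made up for this example.

```swift
// Digit one + variation selector + combining enclosing keycap
let keycap = "1\u{FE0F}\u{20E3}"
// Man, woman, girl, girl joined by zero-width joiners (U+200D)
let family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F467}"

// Hypothetical helper: lists the scalars of a string as U+XXXX codes
func scalarCodes(of text: String) -> [String] {
    return text.unicodeScalars.map { "U+" + String($0.value, radix: 16, uppercase: true) }
}

// Each string is one Character but several scalars
print(keycap.count, scalarCodes(of: keycap))
print(family.count, scalarCodes(of: family).count)
```

On a recent toolchain this should report one Character for each string, with 3 and 7 scalars respectively.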

Swift 5.0 and above makes this process much easier and gets rid of some guesswork we needed to do. Unicode.Scalar's new Properties type helps us determine what we're dealing with.
However, those properties only make sense when checking the other scalars within the glyph. This is why we'll be adding some convenience properties to the Character type to help us out.

For more detail, I wrote an article explaining how this works.

For Swift 5.0, this leaves you with the following result:

extension Character {
    /// A simple emoji is one scalar and presented to the user as an Emoji
    var isSimpleEmoji: Bool {
        guard let firstScalar = unicodeScalars.first else { return false }
        return firstScalar.properties.isEmoji && firstScalar.value > 0x238C
    }

    /// Checks if the scalars will be merged into an emoji
    var isCombinedIntoEmoji: Bool { unicodeScalars.count > 1 && unicodeScalars.first?.properties.isEmoji ?? false }

    var isEmoji: Bool { isSimpleEmoji || isCombinedIntoEmoji }
}

extension String {
    var isSingleEmoji: Bool { count == 1 && containsEmoji }

    var containsEmoji: Bool { contains { $0.isEmoji } }

    var containsOnlyEmoji: Bool { !isEmpty && !contains { !$0.isEmoji } }

    var emojiString: String { emojis.map { String($0) }.reduce("", +) }

    var emojis: [Character] { filter { $0.isEmoji } }

    var emojiScalars: [UnicodeScalar] { filter { $0.isEmoji }.flatMap { $0.unicodeScalars } }
}

Which will give you the following results:

"A̛͚̖".containsEmoji // false
"3".containsEmoji // false
"A̛͚̖▶️".unicodeScalars // [65, 795, 858, 790, 9654, 65039]
"A̛͚̖▶️".emojiScalars // [9654, 65039]
"3️⃣".isSingleEmoji // true
"3️⃣".emojiScalars // [51, 65039, 8419]
"quot;.isSingleEmoji // true
"‍♂️".isSingleEmoji // true
"quot;.isSingleEmoji // true
"⏰".isSingleEmoji // true
"quot;.isSingleEmoji // true
"👨‍👩‍👧‍👧".isSingleEmoji // true
"quot;.isSingleEmoji // true
"quot;.containsOnlyEmoji // true
"👨‍👩‍👧‍👧".containsOnlyEmoji // true
"Hello 👨‍👩‍👧‍👧".containsOnlyEmoji // false
"Hello 👨‍👩‍👧‍👧".containsEmoji // true
"👫 Héllo 👨‍👩‍👧‍👧".emojiString // "👫👨‍👩‍👧‍👧"
"👨‍👩‍👧‍👧".count // 1

"👫 Héllœ 👨‍👩‍👧‍👧".emojiScalars // [128107, 128104, 8205, 128105, 8205, 128103, 8205, 128103]
"👫 Héllœ 👨‍👩‍👧‍👧".emojis // ["👫", "👨‍👩‍👧‍👧"]
"👫 Héllœ 👨‍👩‍👧‍👧".emojis.count // 2

"‍‍‍‍‍quot;.isSingleEmoji // false
"‍‍‍‍‍quot;.containsOnlyEmoji // true

For older Swift versions, check out this gist containing my old code.


