Working With Unicode Code Points in Swift

Working with Unicode code points in Swift

Updated for Swift 3

String and Character

For almost everyone in the future who visits this question, String and Character will be the answer for you.

Set Unicode values directly in code:

var str: String = "I want to visit 北京, Москва, मुंबई, القاهرة, and 서울시. quot;
var character: Character = "quot;

Use hexadecimal to set values

var str: String = "\u{61}\u{5927}\u{1F34E}\u{3C0}" // a大π
var character: Character = "\u{65}\u{301}" // é = "e" + accent mark

Note that the Swift Character can be composed of multiple Unicode code points, but appears to be a single character. This is called an Extended Grapheme Cluster.

See this question also.

Convert to Unicode values:

str.utf8
str.utf16
str.unicodeScalars // UTF-32

String(character).utf8
String(character).utf16
String(character).unicodeScalars

Convert from Unicode hex values:

let hexValue: UInt32 = 0x1F34E

// convert hex value to UnicodeScalar
guard let scalarValue = UnicodeScalar(hexValue) else {
// early exit if hex does not form a valid unicode value
return
}

// convert UnicodeScalar to String
let myString = String(scalarValue) // br>

Or alternatively:

let hexValue: UInt32 = 0x1F34E
if let scalarValue = UnicodeScalar(hexValue) {
let myString = String(scalarValue)
}

A few more examples

let value0: UInt8 = 0x61
let value1: UInt16 = 0x5927
let value2: UInt32 = 0x1F34E

let string0 = String(UnicodeScalar(value0)) // a
let string1 = String(UnicodeScalar(value1)) // 大
let string2 = String(UnicodeScalar(value2)) // br>
// convert hex array to String
let myHexArray = [0x43, 0x61, 0x74, 0x203C, 0x1F431] // an Int array
var myString = ""
for hexValue in myHexArray {
myString.append(UnicodeScalar(hexValue))
}
print(myString) // Cat‼br>

Note that for UTF-8 and UTF-16 the conversion is not always this easy. (See UTF-8, UTF-16, and UTF-32 questions.)

NSString and unichar

It is also possible to work with NSString and unichar in Swift, but you should realize that unless you are familiar with Objective C and good at converting the syntax to Swift, it will be difficult to find good documentation.

Also, unichar is a UInt16 array and as mentioned above the conversion from UInt16 to Unicode scalar values is not always easy (i.e., converting surrogate pairs for things like emoji and other characters in the upper code planes).

Custom string structure

For the reasons mentioned in the question, I ended up not using any of the above methods. Instead I wrote my own string structure, which was basically an array of UInt32 to hold Unicode scalar values.

Again, this is not the solution for most people. First consider using extensions if you only need to extend the functionality of String or Character a little.

But if you really need to work exclusively with Unicode scalar values, you could write a custom struct.

The advantages are:

  • Don't need to constantly switch between Types (String, Character, UnicodeScalar, UInt32, etc.) when doing string manipulation.
  • After Unicode manipulation is finished, the final conversion to String is easy.
  • Easy to add more methods when they are needed
  • Simplifies converting code from Java or other languages

Disadavantages are:

  • makes code less portable and less readable for other Swift developers
  • not as well tested and optimized as the native Swift types
  • it is yet another file that has to be included in a project every time you need it

You can make your own, but here is mine for reference. The hardest part was making it Hashable.

// This struct is an array of UInt32 to hold Unicode scalar values
// Version 3.4.0 (Swift 3 update)


struct ScalarString: Sequence, Hashable, CustomStringConvertible {

fileprivate var scalarArray: [UInt32] = []


init() {
// does anything need to go here?
}

init(_ character: UInt32) {
self.scalarArray.append(character)
}

init(_ charArray: [UInt32]) {
for c in charArray {
self.scalarArray.append(c)
}
}

init(_ string: String) {

for s in string.unicodeScalars {
self.scalarArray.append(s.value)
}
}

// Generator in order to conform to SequenceType protocol
// (to allow users to iterate as in `for myScalarValue in myScalarString` { ... })
func makeIterator() -> AnyIterator {
return AnyIterator(scalarArray.makeIterator())
}

// append
mutating func append(_ scalar: UInt32) {
self.scalarArray.append(scalar)
}

mutating func append(_ scalarString: ScalarString) {
for scalar in scalarString {
self.scalarArray.append(scalar)
}
}

mutating func append(_ string: String) {
for s in string.unicodeScalars {
self.scalarArray.append(s.value)
}
}

// charAt
func charAt(_ index: Int) -> UInt32 {
return self.scalarArray[index]
}

// clear
mutating func clear() {
self.scalarArray.removeAll(keepingCapacity: true)
}

// contains
func contains(_ character: UInt32) -> Bool {
for scalar in self.scalarArray {
if scalar == character {
return true
}
}
return false
}

// description (to implement Printable protocol)
var description: String {
return self.toString()
}

// endsWith
func endsWith() -> UInt32? {
return self.scalarArray.last
}

// indexOf
// returns first index of scalar string match
func indexOf(_ string: ScalarString) -> Int? {

if scalarArray.count < string.length {
return nil
}

for i in 0...(scalarArray.count - string.length) {

for j in 0..
if string.charAt(j) != scalarArray[i + j] {
break // substring mismatch
}
if j == string.length - 1 {
return i
}
}
}

return nil
}

// insert
mutating func insert(_ scalar: UInt32, atIndex index: Int) {
self.scalarArray.insert(scalar, at: index)
}
mutating func insert(_ string: ScalarString, atIndex index: Int) {
var newIndex = index
for scalar in string {
self.scalarArray.insert(scalar, at: newIndex)
newIndex += 1
}
}
mutating func insert(_ string: String, atIndex index: Int) {
var newIndex = index
for scalar in string.unicodeScalars {
self.scalarArray.insert(scalar.value, at: newIndex)
newIndex += 1
}
}

// isEmpty
var isEmpty: Bool {
return self.scalarArray.count == 0
}

// hashValue (to implement Hashable protocol)
var hashValue: Int {

// DJB Hash Function
return self.scalarArray.reduce(5381) {
($0 << 5) &+ $0 &+ Int($1)
}
}

// length
var length: Int {
return self.scalarArray.count
}

// remove character
mutating func removeCharAt(_ index: Int) {
self.scalarArray.remove(at: index)
}
func removingAllInstancesOfChar(_ character: UInt32) -> ScalarString {

var returnString = ScalarString()

for scalar in self.scalarArray {
if scalar != character {
returnString.append(scalar)
}
}

return returnString
}
func removeRange(_ range: CountableRange) -> ScalarString? {

if range.lowerBound < 0 || range.upperBound > scalarArray.count {
return nil
}

var returnString = ScalarString()

for i in 0.. if i < range.lowerBound || i >= range.upperBound {
returnString.append(scalarArray[i])
}
}

return returnString
}


// replace
func replace(_ character: UInt32, withChar replacementChar: UInt32) -> ScalarString {

var returnString = ScalarString()

for scalar in self.scalarArray {
if scalar == character {
returnString.append(replacementChar)
} else {
returnString.append(scalar)
}
}
return returnString
}
func replace(_ character: UInt32, withString replacementString: String) -> ScalarString {

var returnString = ScalarString()

for scalar in self.scalarArray {
if scalar == character {
returnString.append(replacementString)
} else {
returnString.append(scalar)
}
}
return returnString
}
func replaceRange(_ range: CountableRange, withString replacementString: ScalarString) -> ScalarString {

var returnString = ScalarString()

for i in 0.. if i < range.lowerBound || i >= range.upperBound {
returnString.append(scalarArray[i])
} else if i == range.lowerBound {
returnString.append(replacementString)
}
}
return returnString
}

// set (an alternative to myScalarString = "some string")
mutating func set(_ string: String) {
self.scalarArray.removeAll(keepingCapacity: false)
for s in string.unicodeScalars {
self.scalarArray.append(s.value)
}
}

// split
func split(atChar splitChar: UInt32) -> [ScalarString] {
var partsArray: [ScalarString] = []
if self.scalarArray.count == 0 {
return partsArray
}
var part: ScalarString = ScalarString()
for scalar in self.scalarArray {
if scalar == splitChar {
partsArray.append(part)
part = ScalarString()
} else {
part.append(scalar)
}
}
partsArray.append(part)
return partsArray
}

// startsWith
func startsWith() -> UInt32? {
return self.scalarArray.first
}

// substring
func substring(_ startIndex: Int) -> ScalarString {
// from startIndex to end of string
var subArray: ScalarString = ScalarString()
for i in startIndex.. subArray.append(self.scalarArray[i])
}
return subArray
}
func substring(_ startIndex: Int, _ endIndex: Int) -> ScalarString {
// (startIndex is inclusive, endIndex is exclusive)
var subArray: ScalarString = ScalarString()
for i in startIndex.. subArray.append(self.scalarArray[i])
}
return subArray
}

// toString
func toString() -> String {
var string: String = ""

for scalar in self.scalarArray {
if let validScalor = UnicodeScalar(scalar) {
string.append(Character(validScalor))
}
}
return string
}

// trim
// removes leading and trailing whitespace (space, tab, newline)
func trim() -> ScalarString {

//var returnString = ScalarString()
let space: UInt32 = 0x00000020
let tab: UInt32 = 0x00000009
let newline: UInt32 = 0x0000000A

var startIndex = self.scalarArray.count
var endIndex = 0

// leading whitespace
for i in 0.. if self.scalarArray[i] != space &&
self.scalarArray[i] != tab &&
self.scalarArray[i] != newline {

startIndex = i
break
}
}

// trailing whitespace
for i in stride(from: (self.scalarArray.count - 1), through: 0, by: -1) {
if self.scalarArray[i] != space &&
self.scalarArray[i] != tab &&
self.scalarArray[i] != newline {

endIndex = i + 1
break
}
}

if endIndex <= startIndex {
return ScalarString()
}

return self.substring(startIndex, endIndex)
}

// values
func values() -> [UInt32] {
return self.scalarArray
}

}

func ==(left: ScalarString, right: ScalarString) -> Bool {
return left.scalarArray == right.scalarArray
}

func +(left: ScalarString, right: ScalarString) -> ScalarString {
var returnString = ScalarString()
for scalar in left.values() {
returnString.append(scalar)
}
for scalar in right.values() {
returnString.append(scalar)
}
return returnString
}

How to get unicode code point(s) representation of character/string in Swift?

Generally, the unicodeScalars property of a String returns a collection
of its unicode scalar values. (A Unicode scalar value is any
Unicode code point except high-surrogate and low-surrogate code points.)

Example:

print(Array("Á".unicodeScalars))  // ["A", "\u{0301}"]
print(Array(".unicodeScalars)) // ["\u{0001F496}"]

Up to Swift 3 there is no way to access
the unicode scalar values of a Character directly, it has to be
converted to a String first (for the Swift 4 status, see below).

If you want to see all Unicode scalar values as hexadecimal numbers
then you can access the value property (which is a UInt32 number)
and format it according to your needs.

Example (using the U+NNNN notation for Unicode values):

extension String {
func getUnicodeCodePoints() -> [String] {
return unicodeScalars.map { "U+" + String($0.value, radix: 16, uppercase: true) }
}
}

extension Character {
func getUnicodeCodePoints() -> [String] {
return String(self).getUnicodeCodePoints()
}
}


print("A".getUnicodeCodePoints()) // ["U+41"]
print("Á".getUnicodeCodePoints()) // ["U+41", "U+301"]
print(".getUnicodeCodePoints()) // ["U+1F496"]
print("SWIFT".getUnicodeCodePoints()) // ["U+53", "U+57", "U+49", "U+46", "U+54"]
print(".getUnicodeCodePoints()) // ["U+1F1EF", "U+1F1F4"]

Update for Swift 4:

As of Swift 4, the unicodeScalars of a Character can be
accessed directly,
see SE-0178 Add unicodeScalars property to Character. This makes the conversion to a String
obsolete:

let c: Character = "br>print(Array(c.unicodeScalars)) // ["\u{0001F1EF}", "\u{0001F1F4}"]

How to express Strings in Swift using Unicode hexadecimal values (UTF-16)

Updated for Swift 3

Character

The Swift syntax for forming a hexadecimal code point is

\u{n}

where n is a hexadecimal number up to 8 digits long. The valid range for a Unicode scalar is U+0 to U+D7FF and U+E000 to U+10FFFF inclusive. (The U+D800 to U+DFFF range is for surrogate pairs, which are not scalars themselves, but are used in UTF-16 for encoding the higher value scalars.)

Examples:

// The following forms are equivalent. They all produce "C". 
let char1: Character = "\u{43}"
let char2: Character = "\u{0043}"
let char3: Character = "\u{00000043}"

// Higher value Unicode scalars are done similarly
let char4: Character = "\u{203C}" // ‼ (DOUBLE EXCLAMATION MARK character)
let char5: Character = "\u{1F431}" // (cat emoji)

// Characters can be made up of multiple scalars
let char7: Character = "\u{65}\u{301}" // é = "e" + accent mark
let char8: Character = "\u{65}\u{301}\u{20DD}" // é⃝ = "e" + accent mark + circle

Notes:

  • Leading zeros can be added or omitted
  • Characters are known as extended grapheme clusters. Even when they are composed of multiple scalars, they are still considered a single character. What is key is that they appear to be a single character (grapheme) to the user.
  • TODO: How to convert surrogate pair to Unicode scalar in Swift

String

Strings are composed of characters. See the following examples for some ways to form them using hexadecimal code points.

Examples:

var string1 = "\u{0043}\u{0061}\u{0074}\u{203C}\u{1F431}" // Cat‼br>
// pass an array of characters to a String initializer
let catCharacters: [Character] = ["\u{0043}", "\u{0061}", "\u{0074}", "\u{203C}", "\u{1F431}"] // ["C", "a", "t", "‼", "]
let string2 = String(catCharacters) // Cat‼br>

Converting Hex Values at Runtime

At runtime you can convert hexadecimal or Int values into a Character or String by first converting it to a UnicodeScalar.

Examples:

// hex values
let value0: UInt8 = 0x43 // 97
let value1: UInt16 = 0x203C // 22823
let value2: UInt32 = 0x1F431 // 127822

// convert hex to UnicodeScalar
let scalar0 = UnicodeScalar(value0)
// make sure that UInt16 and UInt32 form valid Unicode values
guard
let scalar1 = UnicodeScalar(value1),
let scalar2 = UnicodeScalar(value2) else {
return
}

// convert to Character
let character0 = Character(scalar0) // C
let character1 = Character(scalar1) // ‼
let character2 = Character(scalar2) // br>
// convert to String
let string0 = String(scalar0) // C
let string1 = String(scalar1) // ‼
let string2 = String(scalar2) // br>
// convert hex array to String
let myHexArray = [0x43, 0x61, 0x74, 0x203C, 0x1F431] // an Int array
var myString = ""
for hexValue in myHexArray {
if let scalar = UnicodeScalar(hexValue) {
myString.append(Character(scalar))
}
}
print(myString) // Cat‼br>

Further reading

  • Strings and Characters docs
  • Glossary of Unicode Terms
  • Strings in Swift
  • Working with Unicode code points in Swift

If a sequence of code points forms a Unicode character, does every non-empty prefix of that sequence also form a valid character?

After taking a long hard look at the specification for computing the boundaries for extended grapheme clusters (EGCs) at https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundary_Rules,
it is obvious that the rules for EGCs all have the shape of describing when it is allowed to append a code point to an existing EGC to form a longer EGC. From that fact alone my two questions follow: 1) Yes, every non-empty prefix of code points which form an EGC is also an EGC. 2) No, by adding a code point to a valid Unicode string you will not decrease its length in terms of number of EGCs it consists of.

So, given this, the following Swift code will extract the longest Unicode character from the start of a byte sequence (or return nil if there is no valid Unicode character there):

    func lex(_ input : S) -> (length : Int, out: Character)? where S.Element == UInt8 {
// This code works under three assumptions, all of which are true:
// 1) If a sequence of codepoints does not form a valid character, then appending codepoints to it does not yield a valid character
// 2) Appending codepoints to a sequence of codepoints does not decrease its length in terms of extended grapheme clusters
// 3) a codepoint takes up at most 4 bytes in an UTF8 encoding
var chars : [UInt8] = []
var result : String = ""
var resultLength = 0
func value() -> (length : Int, out : Character)? {
guard let character = result.first else { return nil }
return (length: resultLength, out: character)
}
var length = 0
var iterator = input.makeIterator()
while length - resultLength <= 4 {
guard let char = iterator.next() else { return value() }
chars.append(char)
length += 1
guard let s = String(bytes: chars, encoding: .utf8) else { continue }
guard s.count == 1 else { return value() }
result = s
resultLength = length
}
return value()
}

Convert unicode symbols \uXXXX in String to Character in Swift

You can use \u{my_unicode}:

print("Ain\u{2019}t this a beautiful day")
/* Prints "Ain’t this a beautiful day"

From the Language Guide - Strings and Characters - Unicode:

String literals can include the following special characters:

...

  • An arbitrary Unicode scalar, written as \u{n}, where n is a 1–8 digit
    hexadecimal number with a value equal to a valid Unicode code point

How is the 🇩🇪 character represented in Swift strings?

let flag = "\u{1f1e9}\u{1f1ea}"

then flag is .

For more regional indicator symbols, see:

http://en.wikipedia.org/wiki/Regional_Indicator_Symbol

Which unicode code can be used safely as reserved value?

If you insist on using Unicode.Scalar, nothing. Unicode.Scalar is designed to represent all valid characters in Unicode, including not-assigned-yet code points. So, it cannot represent noncharacters nor dangling surrogates ("\u{DC00}" also causes error).

And in Swift, String can contain all valid Unicode.Scalars including U+0000. For example "\u{0000}" (== "\0") is a valid String and its length (count) is 1. With using U+0000 as a meta-character, your code would not work with valid Swift Strings.


Consider using [UInt32: State] instead of [Unicode.Scalar: State]. Unicode uses only 0x0000...0x10FFFF (including noncharacters), so using values greater than 0x10FFFF is safe (in your meaning).

Also getting value property of Unicode.Scalar takes very small cost and its cost can be ignored in optimized code. I'm not sure using Dictionary is really a good way to handle your requirements, but [UInt32: State] would be as efficient as [Unicode.Scalar: State].



Related Topics



Leave a reply



Submit