From any UTF-16 offset, find the corresponding String.Index that lies on a Character boundary

A possible solution, using the rangeOfComposedCharacterSequence(at:)
method:

import Foundation

extension String {
    // Returns the start index of the composed character sequence (Character)
    // containing the given UTF-16 offset, or nil if the offset is out of range.
    func index(utf16Offset: Int) -> String.Index? {
        guard utf16Offset >= 0 && utf16Offset < utf16.count else { return nil }
        let idx = String.Index(encodedOffset: utf16Offset)
        let range = rangeOfComposedCharacterSequence(at: idx)
        return range.lowerBound
    }
}

Example:

let str = "a‍bcd‍‍‍e"
for utf16Offset in 0..<str.utf16.count {
if let idx = str.index(utf16Offset: utf16Offset) {
print(utf16Offset, str[idx])
}
}

Output:


0 a
1 👩🏻‍🚀
2 👩🏻‍🚀
3 👩🏻‍🚀
4 👩🏻‍🚀
5 👩🏻‍🚀
6 👩🏻‍🚀
7 👩🏻‍🚀
8 b
9 🇩🇪
10 🇩🇪
11 🇩🇪
12 🇩🇪
13 c
14 😀
15 😀
16 d
17 👨‍👩‍👧‍👦
18 👨‍👩‍👧‍👦
19 👨‍👩‍👧‍👦
20 👨‍👩‍👧‍👦
21 👨‍👩‍👧‍👦
22 👨‍👩‍👧‍👦
23 👨‍👩‍👧‍👦
24 👨‍👩‍👧‍👦
25 👨‍👩‍👧‍👦
26 👨‍👩‍👧‍👦
27 👨‍👩‍👧‍👦
28 e

Convert UnicodeScalar index to String.Index

The various String views share a common index type. If you have a position given as an offset into the unicodeScalars view, use that view's index(_:offsetBy:) method to convert it to a String.Index. Example:

let s = "br>print(Array(s.unicodeScalars))
// ["\u{0001F1E6}", "\u{0001F1F9}", "\u{0001F1E7}", "\u{0001F1EA}"]

let ucOffset = 2
let sIndex = s.unicodeScalars.index(s.startIndex, offsetBy: ucOffset)
print(s[sIndex...]) // br>

The reverse calculation is done with distance(from:to:). Example:

let s = "br>
if let sIndex = s.index(of:") {
let ucOffset = s.unicodeScalars.distance(from: s.startIndex, to: sIndex)
print(ucOffset) // 2
}

If a sequence of code points forms a Unicode character, does every non-empty prefix of that sequence also form a valid character?

After taking a long hard look at the specification for computing the boundaries of extended grapheme clusters (EGCs) at https://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundary_Rules, it is clear that the rules for EGCs all have the shape of describing when it is allowed to append a code point to an existing EGC to form a longer EGC. From that fact alone, the answers to both questions follow: 1) Yes, every non-empty prefix of a code point sequence that forms an EGC is itself an EGC. 2) No, appending a code point to a valid Unicode string will not decrease its length in terms of the number of EGCs it consists of.
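
As a quick sanity check of claim 1), one can build every scalar prefix of a multi-scalar emoji and verify that Swift counts each prefix as a single Character (a minimal sketch; the family emoji is just an arbitrary example of an EGC):

let cluster = "👨‍👩‍👧"   // one Character consisting of 5 scalars: 👨 ZWJ 👩 ZWJ 👧
let scalars = Array(cluster.unicodeScalars)
for end in 1...scalars.count {
    var prefix = ""
    prefix.unicodeScalars.append(contentsOf: scalars[..<end])
    print(end, prefix.count)   // every prefix has a Character count of 1
}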

So, given this, the following Swift code will extract the longest Unicode character from the start of a byte sequence (or return nil if there is no valid Unicode character there):

    func lex<S : Sequence>(_ input : S) -> (length : Int, out: Character)? where S.Element == UInt8 {
// This code works under three assumptions, all of which are true:
// 1) If a sequence of codepoints does not form a valid character, then appending codepoints to it does not yield a valid character
// 2) Appending codepoints to a sequence of codepoints does not decrease its length in terms of extended grapheme clusters
// 3) a codepoint takes up at most 4 bytes in an UTF8 encoding
var chars : [UInt8] = []
var result : String = ""
var resultLength = 0
func value() -> (length : Int, out : Character)? {
guard let character = result.first else { return nil }
return (length: resultLength, out: character)
}
var length = 0
var iterator = input.makeIterator()
while length - resultLength <= 4 {
guard let char = iterator.next() else { return value() }
chars.append(char)
length += 1
guard let s = String(bytes: chars, encoding: .utf8) else { continue }
guard s.count == 1 else { return value() }
result = s
resultLength = length
}
return value()
}
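
A hypothetical usage example (the byte sequence here is arbitrary):

let bytes: [UInt8] = Array("\u{E9}!".utf8)   // "é" (precomposed) is 2 bytes in UTF-8, "!" is 1
if let (length, character) = lex(bytes) {
    print(length, character)                 // 2 é
}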

Convert between diacritic variants of a character

I think you need to use precomposedStringWithCanonicalMapping. This converts the string to Normalization Form C, which is:

Canonical Decomposition, followed by Canonical Composition

Example:

let string = "à á ả ã ạ й ё"
print(string.unicodeScalars.count) // 20
print(string.precomposedStringWithCanonicalMapping.unicodeScalars.count) // 13
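
For the opposite direction, decomposedStringWithCanonicalMapping converts to Normalization Form D (Canonical Decomposition); a minimal sketch:

let precomposed = "\u{E9}"   // "é" as a single scalar
let decomposed = precomposed.decomposedStringWithCanonicalMapping
print(decomposed.unicodeScalars.count)   // 2 ("e" followed by U+0301 COMBINING ACUTE ACCENT)
print(precomposed == decomposed)         // true – Swift string comparison uses canonical equivalence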

Get range from unicode character symbols. Swift

As it turned out in the discussion:

  • OP is using the Google keyboard,
  • the text view delegate method is called with

    textView.text = "( ͡° ͜ʖ ͡°)༼ つ ͡° ͜ʖ ͡°༽つ😀"
    range = { 27, 1 }
  • and then

    let newRange = Range(range, in: textView.text)

    returns nil.

The reason is that the range points into the “middle” of the 😀 character, which is stored as a UTF-16 surrogate pair. Here is a simplified self-contained example:

let text = "Hello !"
let range = NSRange(location: 7, length: 1)
let newRange = Range(range, in: text)
print(newRange as Any) // nil br>

This looks like a bug (in the Google keyboard?) to me, but there is a possible workaround.

The “trick” is to determine the closest surrounding range of “composed
character sequences,” and here is how that can be done
(compare From any UTF-16 offset, find the corresponding String.Index that lies on a Character boundary):

import Foundation

extension String {
    // Converts a UTF-16 based NSRange to a Range<String.Index>, extended so that
    // it does not cut into the middle of a composed character sequence.
    func safeRange(from nsRange: NSRange) -> Range<String.Index>? {
        guard nsRange.location >= 0 && nsRange.location <= utf16.count else { return nil }
        guard nsRange.length >= 0 && nsRange.location + nsRange.length <= utf16.count else { return nil }
        let from = String.Index(encodedOffset: nsRange.location)
        let to = String.Index(encodedOffset: nsRange.location + nsRange.length)
        return rangeOfComposedCharacterSequences(for: from..<to)
    }
}

Now

let newRange = textView.text.safeRange(from: range)

returns a String range that encloses the entire 😀 character. In our simplified example:

let text = "Hello !"
let range = NSRange(location: 7, length: 1)
let newRange = text.safeRange(from: range)
print(newRange as Any) // Optional(Range(...)) br>print(text.replacingCharacters(in: newRange!, with: ")) // Hello !

What is a GraphemeCluster and what does ExpressibleByExtendedGraphemeClusterLiteral do?

A grapheme cluster is a collection of symbols that together represent a single character that the user sees in a string on the screen. It generally comprises a "base character" plus what Apple calls "combining marks", and is used, for instance, when there is no precomposed single Unicode character available that would do the job for you.

When grapheme clusters are used in strings, you have to take special care that any functions looking for substrings and the like properly demarcate the boundaries between clusters.
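
For instance (a small illustration), a base character plus a combining mark spans two Unicode scalars but still counts as one Character:

let s = "e\u{0301}"            // "e" followed by COMBINING ACUTE ACCENT (U+0301)
print(s.count)                 // 1 – a single Character (grapheme cluster)
print(s.unicodeScalars.count)  // 2
print(s.utf16.count)           // 2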

You can see several examples here:

https://developer.apple.com/library/content/documentation/Swift/Conceptual/Swift_Programming_Language/StringsAndCharacters.html

Conformance to the protocol ExpressibleByExtendedGraphemeClusterLiteral simply means that the type in question can be initialised with a literal consisting of a single grapheme cluster. Again, you can see examples of this in the above link.
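
A minimal sketch of such a conformance (the type EmojiTag and its property are made up for illustration); in practice the only initializer you need to write is init(extendedGraphemeClusterLiteral:):

struct EmojiTag: ExpressibleByExtendedGraphemeClusterLiteral {
    let symbol: String
    init(extendedGraphemeClusterLiteral value: String) {
        self.symbol = value
    }
}

let tag: EmojiTag = "👍"   // initialised directly from a grapheme cluster literal
print(tag.symbol)          // 👍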

What's the best way to transform an Array of type Character to a String in Swift?

Which to me sounds at least homogeneous enough to think it would be reasonable to implement the joinWithSeparator method to support the Character type. So, does anyone have a good answer as to why they don't do that???

This may be an oversight in the design. The error occurs because there are two possible candidates for joinWithSeparator(_:). I suspect the ambiguity exists because of the way Swift can implicitly interpret a double-quoted literal as either a String or a Character, and in this context it is ambiguous which one to choose.

  1. The first candidate is joinWithSeparator(_: String) -> String. It does what you're looking for.

    If the separator is treated as a String, this candidate is picked, and the result would be: "C,a,t,!,🐱"

  2. The second is joinWithSeparator<Separator : SequenceType where Separator.Generator.Element == Generator.Element.Generator.Element>(_: Separator) -> JoinSequence<Self>. It's called on a Sequence of Sequences and is given a Sequence as a separator. The method signature is a bit of a mouthful, so let's break it down. The argument to this function is of type Separator. This Separator is constrained to be a SequenceType whose elements (Separator.Generator.Element) must have the same type as the elements of this sequence of sequences (Generator.Element.Generator.Element).

    The point of that complex constraint is to ensure that the Sequence remains homogeneous. You can't join sequences of Int with sequences of Double, for example.

    If the separator is treated as a Character, this candidate is picked, and the result would be: ["C", ",", "a", ",", "t", ",", "!", ",", "🐱"]

The compiler throws an error to ensure you're aware that there's an ambiguity. Otherwise, the program might behave differently than you'd expect.

You can disambiguate this situation by explicitly making each Character into a String. Because String is NOT a SequenceType, the #2 candidate is no longer possible.

var myChars: [Character] = ["C", "a", "t", "!", "🐱"]
var anotherVar = myChars.map(String.init).joinWithSeparator(",")

print(anotherVar) // C,a,t,!,🐱
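
For reference, in current Swift the method is named joined(separator:), and String can also be initialised directly from a sequence of Characters; a quick sketch:

let chars: [Character] = ["C", "a", "t", "!", "🐱"]
print(chars.map { String($0) }.joined(separator: ",")) // C,a,t,!,🐱
print(String(chars))                                   // Cat!🐱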

Determine the position index of a character within an HTML element when clicked

$('div').click(function () {
    getSelectionPosition();
});

function getSelectionPosition() {
    var selection = window.getSelection();
    console.log(selection.focusNode.data[selection.focusOffset]);
    alert(selection.focusOffset);
}

This works with a "click" as well as with a "range" selection in most browsers (selection.type = "caret" / selection.type = "range").

selection.focusOffset gives you the position within the innermost node. If elements are nested, inside <b> or <span> tags for example, it gives you the position within that inner element rather than within the full text. I was unable to "select" the first letter of a nested tag with focusOffset and the "caret" type (click, not range select): when you click on the first letter, it gives the position of the last element before the start of the tag plus 1, whereas when you click on the second letter, it correctly gives you "1". I didn't find a way to address the first element (offset 0) of the nested element. This "selection/range" stuff seems buggy (or very non-intuitive) to me. ^^

But it's quite simple to use without nested elements! (Works fine with your <div>)

Here is a fiddle

Important edit 2015-01-18:

This answer worked back when it was accepted, but not anymore, for the reasons given below. Other answers are now more useful:

  • Matthew's general answer
  • The working example provided by Douglas Daseeco.

Both Firefox and Chrome have since changed the window.getSelection() behavior. Sadly, it is now useless for this use case. (According to the documentation, IE 9 and later behave the same.)

Now, the middle of a character is used to decide the offset. That means that clicking on a character can give back two different results: 0 or 1 for the first character, 1 or 2 for the second, and so on.

I updated the JSFiddle example.

Please note that if you resize the window (Ctrl + mouse), the behavior is quite buggy on Chrome for some clicks.


