Can the Conversion of a String to Data with UTF-8 Encoding Ever Fail

Can the conversion of a String to Data with UTF-8 encoding ever fail?

UTF-8 can represent all valid Unicode code points, so converting
a Swift string to UTF-8 data cannot fail.

The forced unwrap in

let string = "some string .."
let data = string.data(using: .utf8)!

is safe.

The same would be true for .utf16 or .utf32, but not for
encodings which represent only a restricted character set,
such as .ascii or .isoLatin1.
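
The same distinction can be sketched outside Swift. The following is a Python illustration, not the Swift API; the helper name can_encode is made up for this example. It shows that the Unicode encodings accept the sample text while restricted character sets reject characters they cannot represent.

```python
def can_encode(text: str, encoding: str) -> bool:
    """Return True if `text` can be encoded strictly with `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


sample = "Träume groß"

# The Unicode encodings can represent every character in the sample.
for name in ("utf-8", "utf-16", "utf-32"):
    print(name, can_encode(sample, name))

# ASCII has no 'ä' or 'ß'; Latin-1 has both, but not the euro sign.
print("ascii", can_encode(sample, "ascii"))
print("latin-1", can_encode(sample, "latin-1"))
print("latin-1 with euro sign", can_encode("€", "latin-1"))
```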

You can alternatively use the .utf8 view of a string to create UTF-8 data,
avoiding the forced unwrap:

let string = "some string .."
let data = Data(string.utf8)

What is the foolproof way to convert some string (UTF-8 or otherwise) to a simple ASCII string in Python

If you want an ASCII string that unambiguously represents what you have got, without losing any information, the answer is simple:

Don't muck about with encode/decode; use the repr() function (Python 2.x) or the ascii() function (Python 3.x).
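
A minimal Python 3 sketch of this, assuming the input is already a str. The round trip through ast.literal_eval is an addition for illustration, to show that no information is lost.

```python
import ast

original = "Träume groß"

# ascii() returns a pure-ASCII, quoted representation with escapes.
escaped = ascii(original)
print(escaped)  # 'Tr\xe4ume gro\xdf'

# The representation is a valid string literal, so it can be read back.
restored = ast.literal_eval(escaped)
print(restored == original)  # True
```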

String from NSData fails using UTF8 but succeeds using ASCII

The problem is that not every sequence of bytes is valid when interpreted as UTF-8. For example, a single byte with the value 0xFF (255) is never valid in UTF-8. The ASCII conversion, on the other hand, may accept every byte value, even though strictly speaking ASCII only defines the values 0 through 127.

You had better take a good look at the data and work out what encoding it actually uses. And if it is just random bytes, then please do NOT convert them to a string.
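
The effect is easy to reproduce outside Foundation. Here is a small Python sketch (the helper try_decode is hypothetical). Note that Python's ascii codec is strict and rejects 0xFF, whereas a single-byte encoding such as Latin-1 accepts any byte, which is why a decode can "succeed" and still be meaningless.

```python
from typing import Optional


def try_decode(data: bytes, encoding: str) -> Optional[str]:
    """Decode strictly; return None instead of raising on invalid input."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


raw = b"\xff"

print(try_decode(raw, "utf-8"))    # None: 0xFF never occurs in valid UTF-8
print(try_decode(raw, "ascii"))    # None: outside the 7-bit range
print(try_decode(raw, "latin-1"))  # a one-character string, U+00FF
```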

Convert UTF-8 encoded NSData to NSString

If the data is not null-terminated, you should use -initWithData:encoding:

NSString* newStr = [[NSString alloc] initWithData:theData encoding:NSUTF8StringEncoding];

If the data is null-terminated, you should instead use the class method +stringWithUTF8String: to avoid the extra \0 at the end.

NSString* newStr = [NSString stringWithUTF8String:[theData bytes]];

(Note that if the input is not properly UTF-8-encoded, you will get nil.)



Swift variant:

let newStr = String(data: data, encoding: .utf8)
// note that `newStr` is a `String?`, not a `String`.

If the data is null-terminated, you could go the safe way, which is to remove that null character first, or the unsafe way, similar to the Objective-C version above.

// safe way, provided data is \0-terminated
let newStr1 = String(data: data.subdata(in: 0 ..< data.count - 1), encoding: .utf8)
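// unsafe way, provided data is \0-terminated
// (Swift 5 and later, where withUnsafeBytes passes an UnsafeRawBufferPointer)
let newStr2 = data.withUnsafeBytes { buffer in
    buffer.baseAddress.flatMap { String(utf8String: $0.assumingMemoryBound(to: CChar.self)) }
}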
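
For comparison, the "safe way" can be sketched in Python. This is an illustration only; decode_nul_terminated is a made-up helper name. It drops a single trailing \0 if present, then decodes strictly and returns None for invalid UTF-8, mirroring the optional result above.

```python
from typing import Optional


def decode_nul_terminated(data: bytes) -> Optional[str]:
    """Decode UTF-8 bytes, ignoring one trailing NUL terminator if present."""
    payload = data[:-1] if data.endswith(b"\x00") else data
    try:
        return payload.decode("utf-8")
    except UnicodeDecodeError:
        return None  # input was not properly UTF-8-encoded


print(decode_nul_terminated(b"some string ..\x00"))  # some string ..
print(decode_nul_terminated(b"\xff\x00"))            # None
```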

PHP: Convert any string to UTF-8 without knowing the original character set, or at least try

What you're asking for is extremely hard. If possible, having the user specify the encoding is best. Preventing an attack shouldn't be much easier or harder that way.

However, you could try doing this:

iconv(mb_detect_encoding($text, mb_detect_order(), true), "UTF-8", $text);

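Passing true as the third argument to mb_detect_encoding() enables strict mode, which might help you get a better result. Note that mb_detect_encoding() returns false when no candidate encoding matches, so check its result before passing it to iconv().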
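
The underlying idea, trying candidate encodings in order and accepting the first one that decodes without error, can be sketched in Python. This is not how mb_detect_encoding() works internally; the function name to_utf8 and the candidate list are assumptions for illustration. The order matters: Latin-1 accepts every byte sequence, so it only makes sense as the last resort.

```python
def to_utf8(raw: bytes, candidates=("utf-8", "cp1252", "latin-1")) -> bytes:
    """Re-encode `raw` as UTF-8 using the first candidate that decodes strictly."""
    for encoding in candidates:
        try:
            return raw.decode(encoding).encode("utf-8")
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next one
    raise ValueError("none of the candidate encodings matched")


utf8_input = "groß".encode("utf-8")
cp1252_input = "groß".encode("cp1252")

# Both inputs end up as the same UTF-8 byte sequence.
print(to_utf8(utf8_input) == to_utf8(cp1252_input))  # True
```

A successful decode only means the bytes are valid in that encoding, not that the guess is right, which is why asking for the encoding is still preferable.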

Convert string of unknown encoding to UTF-8

"TrÃ¤ume groÃŸ" is a hint that you got something originally encoded as utf-8, but your process read it as cp1252.

A possible way is to encode your string back to cp1252 and then correctly decode it as utf-8:

print('"TrÃ¤ume groÃŸ"'.encode('cp1252').decode('utf8'))

gives as expected:

"Träume groß"

But this is only a workaround. The correct solution is to understand where you have read the original bytes as cp1252 and directly use the utf8 conversion there.
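
If the workaround is needed anyway, it can be wrapped so that text which is not mojibake passes through unchanged. A minimal sketch, assuming the damage is exactly "UTF-8 bytes read as cp1252"; fix_mojibake is a hypothetical helper name, and the garbled sample is UTF-8 text as it appears when misread as cp1252.

```python
def fix_mojibake(text: str) -> str:
    """Undo 'UTF-8 read as cp1252' damage; return the input if it does not apply."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside cp1252, or the resulting
        # bytes are not valid UTF-8, so it was probably not garbled this way.
        return text


garbled = "TrÃ¤ume groÃŸ"
print(fix_mojibake(garbled))        # Träume groß
print(fix_mojibake("Träume groß"))  # Träume groß (already correct, unchanged)
```

Python's cp1252 codec leaves a few byte values undefined, so some garbled strings cannot be round-tripped this way and are returned as they are.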


