What Is a Safe Way to Turn Streamed (Utf8) Data into a String

The tool you probably want here is UTF8, Swift's UTF-8 codec type. It handles all the decoding state for you. See "How to cast decrypted UInt8 to String?" for a simple example that you can likely adapt.

The major concern in building up a string from UTF-8 data isn't composed characters, but rather multi-byte characters. "LATIN SMALL LETTER A" + "COMBINING GRAVE ACCENT" works fine even if you decode each of those characters separately. What doesn't work is gathering the first byte of 你 (which takes three bytes in UTF-8), decoding it, and then appending the decoded second byte. The UTF8 type will handle this for you, though. All you need to do is bridge your NSInputStream to a GeneratorType.
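The difference between the two cases can be checked in a few lines. The sketch below uses JavaScript (Node.js) with the standard TextDecoder rather than Swift, purely as an illustration; the variable names are made up for this example.

```javascript
// 你 (U+4F60) is three bytes in UTF-8: E4 BD A0.
const ni = Buffer.from('你', 'utf8');

// Decoding each byte on its own cannot produce 你; every call sees an
// incomplete or invalid sequence and emits a replacement character.
const perByte = [...ni]
  .map((b) => new TextDecoder('utf-8').decode(Uint8Array.of(b)))
  .join('');

// A streaming decoder keeps the partial sequence between calls.
const streaming = new TextDecoder('utf-8');
let joined = '';
for (const b of ni) {
  joined += streaming.decode(Uint8Array.of(b), { stream: true });
}
joined += streaming.decode(); // flush any remaining state

// A combining sequence survives being decoded piece by piece, because each
// piece ('a' and U+0300) is a complete code point on its own.
const a = new TextDecoder().decode(Buffer.from('a', 'utf8'));
const grave = new TextDecoder().decode(Buffer.from('\u0300', 'utf8'));
const combined = a + grave;

console.log(perByte === '你', joined === '你', combined === 'a\u0300');
```

Only the per-byte decode loses data; the streaming decode and the piecewise combining sequence both come out intact.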

Here's a basic (not fully production-ready) example of what I'm talking about. First, we need a way to convert an NSInputStream into a generator. That's probably the hardest part:

final class StreamGenerator {
    static let bufferSize = 1024
    let stream: NSInputStream
    var buffer = [UInt8](count: StreamGenerator.bufferSize, repeatedValue: 0)
    var buffGen = IndexingGenerator<ArraySlice<UInt8>>([])

    init(stream: NSInputStream) {
        self.stream = stream
        stream.open()
    }
}

extension StreamGenerator: GeneratorType {
    func next() -> UInt8? {
        // Check the stream status
        switch stream.streamStatus {
        case .NotOpen:
            assertionFailure("Cannot read unopened stream")
            return nil
        case .Writing:
            preconditionFailure("Impossible status")
        case .AtEnd, .Closed, .Error:
            return nil // FIXME: May want a closure to post errors
        case .Opening, .Open, .Reading:
            break
        }

        // First see if we can feed from our buffer
        if let result = buffGen.next() {
            return result
        }

        // Our buffer is empty. Block until there is at least one byte available.
        // Use count, not capacity: capacity may exceed the initialized elements.
        let count = stream.read(&buffer, maxLength: buffer.count)

        if count <= 0 { // FIXME: Probably want a closure or something to handle error cases
            stream.close()
            return nil
        }

        buffGen = buffer.prefix(count).generate()
        return buffGen.next()
    }
}

Calls to next() can block here, so it should not be called on the main queue, but other than that, it's a standard Generator that spits out bytes. (This is also the piece that probably has lots of little corner cases that I'm not handling, so you want to think this through pretty carefully. Still, it's not that complicated.)

With that, creating a UTF-8 decoding generator is almost trivial:

final class UnicodeScalarGenerator<ByteGenerator: GeneratorType where ByteGenerator.Element == UInt8> {
    var byteGenerator: ByteGenerator
    var utf8 = UTF8()
    init(byteGenerator: ByteGenerator) {
        self.byteGenerator = byteGenerator
    }
}

extension UnicodeScalarGenerator: GeneratorType {
    func next() -> UnicodeScalar? {
        switch utf8.decode(&byteGenerator) {
        case .Result(let scalar): return scalar
        case .EmptyInput: return nil
        case .Error: return nil // FIXME: Probably want a closure or something to handle error cases
        }
    }
}

You could of course trivially turn this into a CharacterGenerator instead (using Character(_:UnicodeScalar)).

The last problem is if you want to combine all combining marks, such that "LATIN SMALL LETTER A" followed by "COMBINING GRAVE ACCENT" would always be returned together (rather than as the two characters they are). That's actually a bit trickier than it sounds. First, you'd need to generate Strings, not Characters. And then you'd need a good way to know what all the combining characters are. That's certainly knowable, but I'm having a little trouble deriving a simple algorithm. There's no "combiningMarkCharacterSet" in Cocoa. I'm still thinking about it. Getting something that "mostly works" is easy, but I'm not sure yet how to build it so that it's correct for all of Unicode.
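For what it's worth, Unicode does define these groupings as "grapheme clusters" (UAX #29), and some runtimes ship an implementation. The sketch below is JavaScript rather than Swift and relies on Intl.Segmenter, which is available in recent Node.js versions. The helper names graphemes and pushText are hypothetical, and holding back the last cluster of each chunk is an assumption about how to make this work on a stream, not a tested production approach.

```javascript
// Split text into grapheme clusters using the runtime's UAX #29 implementation.
function graphemes(text) {
  const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
  return Array.from(segmenter.segment(text), (s) => s.segment);
}

// Streaming helper (hypothetical): the last cluster of what we have so far may
// still gain combining marks from the next chunk, so hold it back as `pending`.
function pushText(pending, chunk) {
  const all = graphemes(pending + chunk);
  const held = all.length > 0 ? all.pop() : '';
  return { complete: all, pending: held };
}

// "a" arrives at the end of one chunk, its combining grave accent in the next.
const step1 = pushText('', 'xa');
const step2 = pushText(step1.pending, '\u0300y');
console.log(step1.complete, step2.complete, step2.pending);
```

At end of stream, whatever is left in pending is emitted as the final cluster.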

Here's a little sample program to try it out:

let textPath = NSBundle.mainBundle().pathForResource("text.txt", ofType: nil)!
let inputStream = NSInputStream(fileAtPath: textPath)!
inputStream.open()

dispatch_async(dispatch_get_global_queue(0, 0)) {
    let streamGen = StreamGenerator(stream: inputStream)
    let unicodeGen = UnicodeScalarGenerator(byteGenerator: streamGen)
    var string = ""
    for c in GeneratorSequence(unicodeGen) {
        print(c)
        string += String(c)
    }
    print(string)
}

And a little text to read:


Here is some normalish álfa你好 text
And some Zalgo i̝̲̲̗̹̼n͕͓̘v͇̠͈͕̻̹̫͡o̷͚͍̙͖ke̛̘̜̘͓̖̱̬ composed stuff
And one more line with no newline

(That second line is some Zalgo encoded text, which is nice for testing.)

I haven't done any testing with this in a real blocking situation, like reading from the network, but it should work based on how NSInputStream works (i.e. it should block until there's at least one byte to read, but then should just fill the buffer with whatever's available).

I've made all of this match GeneratorType so that it plugs into other things easily, but error handling might work better if you didn't use GeneratorType and instead created your own protocol with next() throws -> Self.Element instead. Throwing would make it easier to propagate errors up the stack, but would make it harder to plug into for...in loops.

Swift 3: how to convert a UTF8 data stream (1,2,3 or 4 bytes per char) to String?

As you know, a single UTF-8 character takes 1, 2, 3 or 4 bytes.
In your case, you need to handle 1- and 2-byte characters, and the byte sequence you receive may not be aligned to a character boundary.
However, as rmaddy pointed out, a byte sequence decoded with String.Encoding.utf8 must start and end on a character boundary.

There are two ways to handle this situation.
One is, as rmaddy suggests, to send the length first and count the incoming data bytes.
The drawback is that you have to modify the transmitting (server) side as well, which may not be possible.

The other option is to scan the incoming sequence byte by byte, keep track of the character boundaries, and build up a legitimate UTF-8 byte sequence.
Fortunately, UTF-8 is designed so that you can identify where the character boundaries are
by looking at ANY byte in the stream. Specifically, the first byte of a 1-, 2-, 3- or 4-byte UTF-8 character has the bit pattern 0xxxxxxx, 110xxxxx, 1110xxxx or 11110xxx respectively, and the second through fourth bytes
all have the form 10xxxxxx. This makes your life a lot easier.
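As an illustration of those bit patterns, here is a minimal sketch in JavaScript (Node.js) rather than Swift. The function name completePrefixLength is hypothetical; it reports how many leading bytes of a buffer can be decoded safely, so the remaining tail can be kept until more data arrives. It only looks at the last character and leaves malformed input for the decoder to deal with.

```javascript
// Returns the number of leading bytes in `bytes` that end on a character
// boundary. Any bytes after that belong to a character that is still incomplete.
function completePrefixLength(bytes) {
  // Step back over at most three continuation bytes (10xxxxxx).
  let i = bytes.length - 1;
  let steps = 0;
  while (i >= 0 && steps < 3 && (bytes[i] & 0xc0) === 0x80) {
    i--;
    steps++;
  }
  if (i < 0) return bytes.length; // empty, or continuation bytes only

  // Work out how many bytes the last lead byte announces.
  const lead = bytes[i];
  let need;
  if ((lead & 0x80) === 0x00) need = 1;      // 0xxxxxxx
  else if ((lead & 0xe0) === 0xc0) need = 2; // 110xxxxx
  else if ((lead & 0xf0) === 0xe0) need = 3; // 1110xxxx
  else if ((lead & 0xf8) === 0xf0) need = 4; // 11110xxx
  else return bytes.length;                  // not a lead byte: malformed input

  const have = bytes.length - i;
  return have < need ? i : bytes.length;
}

const sample = Buffer.from('a你', 'utf8'); // 61 E4 BD A0
console.log(completePrefixLength(sample), completePrefixLength(sample.subarray(0, 3)));
```

A receiver would decode the first completePrefixLength(buffer) bytes and prepend the rest to the next chunk.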

If you choose a 1-byte UTF-8 character as your "end of message" marker,
you can detect EOM reliably without tracking character boundaries at all, since it is a single byte that never appears inside a 2- to 4-byte character.
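A sketch of that approach, again in JavaScript (Node.js) for illustration. It assumes a newline (0x0A) as the marker, and createMessageSplitter and onMessage are hypothetical names. Each message is decoded only once its marker has arrived, so every decode starts and ends on a character boundary.

```javascript
const EOM = 0x0a; // assumed end-of-message marker: '\n', a 1-byte character

// Returns a function that accepts raw chunks and calls onMessage(text)
// once for every complete message seen so far.
function createMessageSplitter(onMessage) {
  let pending = Buffer.alloc(0);
  return function push(chunk) {
    pending = Buffer.concat([pending, chunk]);
    let end;
    // 0x0A can never be part of a multi-byte character, so a plain byte
    // search is enough to find message boundaries.
    while ((end = pending.indexOf(EOM)) !== -1) {
      onMessage(pending.subarray(0, end).toString('utf8'));
      pending = pending.subarray(end + 1);
    }
  };
}

// The chunk boundary below falls in the middle of 你.
const received = [];
const push = createMessageSplitter((text) => received.push(text));
const wire = Buffer.from('你好\nab\n', 'utf8');
push(wire.subarray(0, 2));
push(wire.subarray(2));
console.log(received);
```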

convert streamed buffers to utf8-string

Single Buffer

If you have a single Buffer you can use its toString method that will convert all or part of the binary contents to a string using a specific encoding. It defaults to utf8 if you don't provide a parameter, but I've explicitly set the encoding in this example.

var req = http.request(reqOptions, function(res) {
    ...

    res.on('data', function(chunk) {
        var textChunk = chunk.toString('utf8');
        // process utf8 text chunk
    });
});

Streamed Buffers

If you have streamed buffers, as in the question above, where the first byte of a multi-byte UTF-8 character may arrive in one Buffer (chunk) and the second byte in the next, you should use a StringDecoder:

var StringDecoder = require('string_decoder').StringDecoder;

var req = http.request(reqOptions, function(res) {
    ...
    var decoder = new StringDecoder('utf8');

    res.on('data', function(chunk) {
        var textChunk = decoder.write(chunk);
        // process utf8 text chunk
    });
});

This way, the bytes of an incomplete character are buffered by the StringDecoder until all the required bytes have been written to the decoder.
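The buffering can be seen without a network request. In this small sketch the chunk boundary is placed by hand in the middle of a character; the variable names are illustrative only.

```javascript
const { StringDecoder } = require('string_decoder');

const decoder = new StringDecoder('utf8');
const bytes = Buffer.from('你好', 'utf8'); // two 3-byte characters, 6 bytes

// The first chunk ends one byte into 好. The decoder returns the complete
// character and keeps the dangling byte for later.
const first = decoder.write(bytes.subarray(0, 4));

// The second chunk supplies the rest of 好.
const second = decoder.write(bytes.subarray(4));

// end() flushes whatever is still buffered; nothing is left here.
const rest = decoder.end();

console.log(first, second, JSON.stringify(rest));
```

Calling end() when the stream finishes matters if the input can be truncated, since any incomplete trailing bytes are only reported at that point.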

Is there a memory efficient way to convert input stream encoding

Since you already use the commons-io library, this might be just what you're looking for:

InputStreamReader utf16Reader = new InputStreamReader(is, StandardCharsets.UTF_16);
ReaderInputStream utf8IS = new ReaderInputStream(utf16Reader, StandardCharsets.UTF_8);

This wraps the stream is twice: first in a UTF-16-decoding reader, and then in a UTF-8-encoding byte stream.

How do I convert List<String[]> values from UTF-8 to String?

The problem is your use of FileReader which only supports the "default" character set:

this.fileReader = new FileReader("D:\\Book1.csv");

The javadoc for FileReader is very clear on this:

The constructors of this class assume that the default character
encoding and the default byte-buffer size are appropriate. To specify
these values yourself, construct an InputStreamReader on a
FileInputStream.

The appropriate way to get a Reader with a character set specified is as follows:

this.fileStream = new FileInputStream("D:\\Book1.csv");
this.fileReader = new InputStreamReader(fileStream, "utf-8");

How to convert Strings to and from UTF8 byte arrays in Java

Convert from String to byte[]:

String s = "some text here";
byte[] b = s.getBytes(StandardCharsets.UTF_8);

Convert from byte[] to String:

byte[] b = {(byte) 99, (byte)97, (byte)116};
String s = new String(b, StandardCharsets.US_ASCII);

You should, of course, use the correct encoding name. My examples used US-ASCII and UTF-8, two commonly-used encodings.

Converting Stream to String and back

This is so common but so profoundly wrong. Protobuf data is not string data. It certainly isn't ASCII. You are using the encoding backwards. A text encoding transfers:

  • an arbitrary string to formatted bytes
  • formatted bytes to the original string

You do not have "formatted bytes". You have arbitrary bytes. You need to use something like a base-n (commonly: base-64) encode. This transfers

  • arbitrary bytes to a formatted string
  • a formatted string to the original bytes

Look at Convert.ToBase64String and Convert.FromBase64String.
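The two directions are easy to compare. The sketch below uses Node.js Buffers instead of the .NET Convert methods, as an illustration only, and the byte values are arbitrary sample data.

```javascript
// Arbitrary binary data; 0x96, 0xFF and 0xFE are not valid UTF-8 on their own.
const raw = Buffer.from([0x08, 0x96, 0x01, 0xff, 0xfe]);

// Wrong direction: treating arbitrary bytes as UTF-8 text replaces the
// invalid bytes, so the original data cannot be recovered.
const viaUtf8 = Buffer.from(raw.toString('utf8'), 'utf8');

// Right direction: base-64 maps arbitrary bytes to a formatted string
// and back without loss.
const text = raw.toString('base64');
const viaBase64 = Buffer.from(text, 'base64');

console.log(viaUtf8.equals(raw), viaBase64.equals(raw));
```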

How do I read / convert an InputStream into a String in Java?

A nice way to do this is using Apache commons IOUtils to copy the InputStream into a StringWriter... something like

StringWriter writer = new StringWriter();
IOUtils.copy(inputStream, writer, encoding);
String theString = writer.toString();

or even

// NB: does not close inputStream, you'll have to use try-with-resources for that
String theString = IOUtils.toString(inputStream, encoding);

Alternatively, you could use ByteArrayOutputStream if you don't want to mix your Streams and Writers
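For comparison with the Node.js answers earlier on this page, the same "collect all bytes first, decode once" idea might look like this in JavaScript. The function name streamToString is hypothetical. Because decoding happens only after every chunk has been concatenated, a character split across chunks is not a problem, at the cost of holding the whole stream in memory.

```javascript
const { Readable } = require('stream');

// Reads a byte stream to the end and decodes it in one step.
async function streamToString(stream, encoding = 'utf8') {
  const chunks = [];
  for await (const chunk of stream) {
    // Streams may yield strings if an encoding was set; normalize to Buffer.
    chunks.push(Buffer.isBuffer(chunk) ? chunk : Buffer.from(chunk));
  }
  return Buffer.concat(chunks).toString(encoding);
}

// Usage: the first chunk ends in the middle of the two-byte character á.
const data = Buffer.from('álfa你好', 'utf8');
const source = Readable.from([data.subarray(0, 1), data.subarray(1)]);
streamToString(source).then((text) => console.log(text));
```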

String from NSInputStream is not valid utf8. How to convert to utf8 more 'lossy'

It works! By combining the code snippet from Larme and the comment about the size of UTF-8 characters I managed to create a 'lossy' NSData to UTF-8 NSString conversion method.

+ (NSString *)data2UTF8String:(NSData *)data {

    // First try to do the 'standard' UTF-8 conversion
    NSString *bufferStr = [[NSString alloc] initWithData:data
                                                encoding:NSUTF8StringEncoding];

    // If it fails, do the 'lossy' UTF-8 conversion
    if (!bufferStr) {
        const Byte *buffer = [data bytes];

        NSMutableString *filteredString = [[NSMutableString alloc] init];

        NSUInteger i = 0;
        while (i < [data length]) {

            // Sequence length implied by the lead byte; a stray continuation byte counts as 1
            NSUInteger expectedLength = 1;

            if ((buffer[i] & 0b10000000) == 0b00000000) expectedLength = 1;
            else if ((buffer[i] & 0b11100000) == 0b11000000) expectedLength = 2;
            else if ((buffer[i] & 0b11110000) == 0b11100000) expectedLength = 3;
            else if ((buffer[i] & 0b11111000) == 0b11110000) expectedLength = 4;
            // 5- and 6-byte forms are no longer valid UTF-8; they fail to decode and are skipped
            else if ((buffer[i] & 0b11111100) == 0b11111000) expectedLength = 5;
            else if ((buffer[i] & 0b11111110) == 0b11111100) expectedLength = 6;

            NSUInteger length = MIN(expectedLength, [data length] - i);
            NSData *character = [NSData dataWithBytes:&buffer[i] length:length];

            // Decode through NSData: these bytes are not NUL-terminated, so
            // stringWithUTF8String: could read past the end of the buffer
            NSString *possibleString = [[NSString alloc] initWithData:character
                                                             encoding:NSUTF8StringEncoding];
            if (possibleString) {
                [filteredString appendString:possibleString];
            }
            i = i + expectedLength;
        }
        bufferStr = filteredString;
    }

    return bufferStr;
}

If you have any comments, please let me know.
Thanks Larme!
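For readers coming from the Node.js answers above, a comparable "drop what cannot be decoded" conversion can be sketched with the standard TextDecoder, which substitutes U+FFFD for invalid input when it is not in fatal mode. The function name lossyUtf8 is hypothetical. Note one difference in behavior: this version also strips any genuine U+FFFD characters that were present in the input.

```javascript
// Decode as UTF-8, silently dropping bytes that do not form valid characters.
function lossyUtf8(bytes) {
  // Non-fatal mode turns every invalid sequence into U+FFFD...
  const text = new TextDecoder('utf-8', { fatal: false }).decode(bytes);
  // ...which is then removed to get the "lossy" result.
  return text.replace(/\uFFFD/g, '');
}

// Valid text with a stray 0xFF in the middle and a truncated character at the end.
const damaged = Buffer.concat([
  Buffer.from('ab', 'utf8'),
  Buffer.from([0xff]),
  Buffer.from('你', 'utf8'),
  Buffer.from([0xe4, 0xbd]), // first two bytes of 你, third one missing
]);
console.log(lossyUtf8(damaged));
```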


