Encoding Byte Data into Digits

Encoding byte data into digits

You can think of a (single byte character) string as a base-256 encoded number where "\x00" represents 0, ' ' (space, i.e., "\x20") represents 32 and so on until "\xFF", which represents 255.

A representation only with numbers 0-9 can be accomplished simply by changing the representation to base 10.

Note that "base64 encoding" is not actually a base conversion. base64 breaks the input into groups of 3 bytes (24 bits) and does the base conversion on those groups individually. This works well because a number with 24 bits can be represented with four digits in base 64 (2^24 = 64^4).

This is more or less what el.pescado does – he splits the input data into 8-bit pieces and then converts the number into base 10. However, this technique has one disadvantage relatively to base 64 encoding – it does not align correctly with the byte boundary. To represent a number with 8-bits (0-255 when unsigned) we need three digits in base 10. However, the left-most digit has less information than the others. It can either be 0, 1 or 2 (for unsigned numbers).

A digit in base 10 stores log(10)/log(2) bits. No matter the chunk size you choose, you're never going to be able to align the representations with 8-bit bytes (in the sense of "aligning" I've described in the paragraph before). Consequently, the most compact representation is a base conversion (which you can see as if it were a "base encoding" with only one big chunk).

Here is an example with bcmath.

bcscale(0);
function base256ToBase10(string $string) {
//argument is little-endian
$result = "0";
for ($i = strlen($string)-1; $i >= 0; $i--) {
$result = bcadd($result,
bcmul(ord($string[$i]), bcpow(256, $i)));
}
return $result;
}
function base10ToBase256(string $number) {
$result = "";
$n = $number;
do {
$remainder = bcmod($n, 256);
$n = bcdiv($n, 256);
$result .= chr($remainder);
} while ($n > 0);

return $result;
}

For

$string = "Mary had a little lamb";
$base10 = base256ToBase10($string);
echo $base10,"\n";
$base256 = base10ToBase256($base10);
echo $base256;

we get


36826012939234118013885831603834892771924668323094861
Mary had a little lamb

Since each digit encodes only log(10)/log(2)=~3.32193 bits expect the number to tend to be 140% longer (not 200% longer, as would be with el.pescado's answer).

Convert byte array to numbers in JavaScript

This was the only way I could think of off the top of my head to do it.

function bytesToDouble(str,start) {
start *= 8;
var data = [str.charCodeAt(start+7),
str.charCodeAt(start+6),
str.charCodeAt(start+5),
str.charCodeAt(start+4),
str.charCodeAt(start+3),
str.charCodeAt(start+2),
str.charCodeAt(start+1),
str.charCodeAt(start+0)];

var sign = (data[0] & 1<<7)>>7;

var exponent = (((data[0] & 127) << 4) | (data[1]&(15<<4))>>4);

if(exponent == 0) return 0;
if(exponent == 0x7ff) return (sign) ? Number.POSITIVE_INFINITY : Number.NEGATIVE_INFINITY;

var mul = Math.pow(2,exponent - 1023 - 52);
var mantissa = data[7]+
data[6]*Math.pow(2,8*1)+
data[5]*Math.pow(2,8*2)+
data[4]*Math.pow(2,8*3)+
data[3]*Math.pow(2,8*4)+
data[2]*Math.pow(2,8*5)+
(data[1]&15)*Math.pow(2,8*6)+
Math.pow(2,52);

return Math.pow(-1,sign)*mantissa*mul;
}

var data = atob("AAAAAABsskAAAAAAAPmxQAAAAAAAKrF");

alert(bytesToDouble(data,0)); // 4716.0
alert(bytesToDouble(data,1)); // 4601.0

This should give you a push in the right direction, though it took me a while to remember how to deal with doubles.

One big caveats to note though:

This relies on the atob to do the base64 decoding, which is not supported everywhere, and aside from that probably isn't a great idea anyway. What you really want to do is unroll the base64 encoded string to an array of numbers (bytes would be the easiest to work with although not the most efficient thing on the planet). The reason is that when atob does its magic, it returns a string, which is far from ideal. Depending on the encoding the code points it maps to (especially for code points between 128 and 255) the resulting .charCodeAt() may not return what you expect.

And there may be some accuracy issues, because after all I am using a double to calculate a double, but I think it might be okay.

Base64 is fairly trivial to work with, so you should be able to figure that part out.

If you did switch to an array (rather than the str string now), then you would obviously drop the .charCodeAt() reference and just get the indices you want directly.

There is a functioning fiddle here

How to convert from []byte to int in Go Programming

You can use encoding/binary's ByteOrder to do this for 16, 32, 64 bit types

Play

package main

import "fmt"
import "encoding/binary"

func main() {
var mySlice = []byte{244, 244, 244, 244, 244, 244, 244, 244}
data := binary.BigEndian.Uint64(mySlice)
fmt.Println(data)
}

How to encode a 8 byte block using only digits (numeric characters)?

Treat the 8 bytes as a 64-bit unsigned integer and convert it to decimal and pad it to the left with zeroes. That should make for the shortest possible string, as it utilizes all available digits in all positions except the starting one.

If your data is not uniformly distributed there are other alternatives, looking into Huffman-coding so that the most commonly data patterns can be represented by shorter strings. One way is to use the first digit to encode the length of the string. All numbers except 1 in the first position can be treated as a length specifier. That way the maximum length of 20 digits will never be exceeded. (The 20th digit can only be 0 or 1, the highest 64-bit number is 18,446,744,073,709,551,615.) The exact interpretation mapping of the other digits into lengths should be based on the distribution of your patterns. If you have 10 patterns which are occuring VERY often you could e.g. reserv "0" to mean that one digit represents a complete sequence.

Any such more complicated encoding will however introduce the need for more complex packing/unpacking code and maybe even lookup tables, so it might not be worth the effort.

Problems converting byte array to string and back to byte array

It is not a good idea to store encrypted data in Strings because they are for human-readable text, not for arbitrary binary data. For binary data it's best to use byte[].

However, if you must do it you should use an encoding that has a 1-to-1 mapping between bytes and characters, that is, where every byte sequence can be mapped to a unique sequence of characters, and back. One such encoding is ISO-8859-1, that is:

    String decoded = new String(encryptedByteArray, "ISO-8859-1");
System.out.println("decoded:" + decoded);

byte[] encoded = decoded.getBytes("ISO-8859-1");
System.out.println("encoded:" + java.util.Arrays.toString(encoded));

String decryptedText = encrypter.decrypt(encoded);

Other common encodings that don't lose data are hexadecimal and base64, but sadly you need a helper library for them. The standard API doesn't define classes for them.

With UTF-16 the program would fail for two reasons:

  1. String.getBytes("UTF-16") adds a byte-order-marker character to the output to identify the order of the bytes. You should use UTF-16LE or UTF-16BE for this to not happen.
  2. Not all sequences of bytes can be mapped to characters in UTF-16. First, text encoded in UTF-16 must have an even number of bytes. Second, UTF-16 has a mechanism for encoding unicode characters beyond U+FFFF. This means that e.g. there are sequences of 4 bytes that map to only one unicode character. For this to be possible the first 2 bytes of the 4 don't encode any character in UTF-16.

base64 to reduce digits required to encode a decimal number

Each character in Base64 can represent 6 bits, so divide your ID length by 6 to see how many characters it will be. Binary data is 8 bits per byte so it will always be shorter, but the bytes won't all be readable.

Base64 will make the ID readable, but it still won't be good if the ID needs to be hand entered, like a key. For that you'll want to restrict the character set further.

How to convert a string of bytes into an int?

You can also use the struct module to do this:

>>> struct.unpack("<L", "y\xcc\xa6\xbb")[0]
3148270713L


Related Topics



Leave a reply



Submit