Base32 Decoding

I needed a Base32 encoder/decoder, so I spent a couple of hours this afternoon throwing this together. I believe it conforms to the standard described here: https://www.rfc-editor.org/rfc/rfc4648#section-6.

using System;

public class Base32Encoding
{
    public static byte[] ToBytes(string input)
    {
        if (string.IsNullOrEmpty(input))
        {
            throw new ArgumentNullException(nameof(input));
        }

        input = input.TrimEnd('='); //remove padding characters
        int byteCount = input.Length * 5 / 8; //this must be TRUNCATED
        byte[] returnArray = new byte[byteCount];

        byte curByte = 0, bitsRemaining = 8;
        int mask = 0, arrayIndex = 0;

        foreach (char c in input)
        {
            int cValue = CharToValue(c);

            if (bitsRemaining > 5)
            {
                mask = cValue << (bitsRemaining - 5);
                curByte = (byte)(curByte | mask);
                bitsRemaining -= 5;
            }
            else
            {
                mask = cValue >> (5 - bitsRemaining);
                curByte = (byte)(curByte | mask);
                returnArray[arrayIndex++] = curByte;
                curByte = (byte)(cValue << (3 + bitsRemaining));
                bitsRemaining += 3;
            }
        }

        //if we didn't end with a full byte
        if (arrayIndex != byteCount)
        {
            returnArray[arrayIndex] = curByte;
        }

        return returnArray;
    }

    public static string ToString(byte[] input)
    {
        if (input == null || input.Length == 0)
        {
            throw new ArgumentNullException(nameof(input));
        }

        int charCount = (int)Math.Ceiling(input.Length / 5d) * 8;
        char[] returnArray = new char[charCount];

        byte nextChar = 0, bitsRemaining = 5;
        int arrayIndex = 0;

        foreach (byte b in input)
        {
            nextChar = (byte)(nextChar | (b >> (8 - bitsRemaining)));
            returnArray[arrayIndex++] = ValueToChar(nextChar);

            if (bitsRemaining < 4)
            {
                nextChar = (byte)((b >> (3 - bitsRemaining)) & 31);
                returnArray[arrayIndex++] = ValueToChar(nextChar);
                bitsRemaining += 5;
            }

            bitsRemaining -= 3;
            nextChar = (byte)((b << bitsRemaining) & 31);
        }

        //if we didn't end with a full char
        if (arrayIndex != charCount)
        {
            returnArray[arrayIndex++] = ValueToChar(nextChar);
            while (arrayIndex != charCount) returnArray[arrayIndex++] = '='; //padding
        }

        return new string(returnArray);
    }

    private static int CharToValue(char c)
    {
        int value = (int)c;

        //65-90 == uppercase letters
        if (value < 91 && value > 64)
        {
            return value - 65;
        }
        //50-55 == numbers 2-7
        if (value < 56 && value > 49)
        {
            return value - 24;
        }
        //97-122 == lowercase letters
        if (value < 123 && value > 96)
        {
            return value - 97;
        }

        throw new ArgumentException("Character is not a Base32 character.", nameof(c));
    }

    private static char ValueToChar(byte b)
    {
        if (b < 26)
        {
            return (char)(b + 65);
        }

        if (b < 32)
        {
            return (char)(b + 24);
        }

        throw new ArgumentException("Byte is not a valid Base32 value.", nameof(b));
    }
}

Why does the Apache Commons Base32 decoding return an empty array here?

I think I had completely misunderstood your question at first.

Why does the Apache Commons Base32 decoding return an empty array
here?

Answer: It's not just the Apache Commons Base32 decoding library; any well-written Base32 decoding algorithm will return an empty value here. Why? Because it is simply not possible for the Base32 encoding algorithm to have generated the string "F=======" as its output.

Let's walk through the Base32 decoding algorithm with the example of the encoded string "F8======". Note that "=" is not a real Base32 character; it is only used for padding. So the actual encoded string here is "F8".

If you look at the base32hex character map, the decimal values of F and 8 are 15 and 8 respectively, which are expressed in binary as 00001111 and 00001000. As the term Base32 implies, the encoding works in sets of 5 bits (32 = 2^5), so the same values are expressed as the 5-bit groups 01111 and 01000. As per the algorithm, these 5-bit sets are placed together as "01111 01000", or "0111101000" without spaces. This bit string is then grouped into sets of 8 bits each from the left, which gives "01111010 00". The second set is incomplete, since it doesn't have all 8 bits, so it is discarded, leaving us with 01111010, which is 122 in decimal. The value 122 maps to the ASCII character 'z'. So the result of decoding "F8" is "z".

Now if you apply this algorithm to your example of "F=======", which is just "F" once you discard the padding, you get only the single set "01111", which is incomplete because it doesn't have all 8 bits. So an empty value is returned as a result.
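
You can reproduce this with Python's standard library (base64.b32hexdecode uses the base32hex alphabet from the example; it requires Python 3.10+). Note that Python is stricter than Apache Commons here and raises an error instead of returning an empty value:

import base64

print(base64.b32hexdecode("F8======"))  # b'z', exactly as in the walkthrough

# "F=======" leaves an incomplete 5-bit group, so the decoder rejects it:
try:
    base64.b32hexdecode("F=======")
except Exception as e:
    print(e)  # Incorrect padding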

Ignore a padding exception when base32 decoding using b32decode from base64 lib

Background

A Base32 character contains 5 bits of data, while the input to the encoder comes in the form of bytes (8 bits). This creates some awkwardness: when encoding one byte, you get 5+3 bits; with two bytes, you get 5+5+5+1 bits; and so on.

The only time things are not awkward is when there are 40 bits, because they fit exactly into 5 bytes of input and 8 Base32 characters of output.

And so, the RFC 4648 standard states that when things do not align, padding characters (=) are added until they do.

So, if the length of an unpadded string is evenly divisible by 8, no action needs to be taken. Otherwise, padding characters must be added so that it aligns with the 40-bit, i.e. 8-character, blocks.
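
A quick illustration of those padding amounts, using Python's standard base64 module (the byte value b'x' is just an arbitrary filler):

import base64

# 1..5 input bytes produce 6, 4, 3, 1 and 0 padding characters respectively
for n in range(1, 6):
    print(n, base64.b32encode(b'x' * n).decode())
# 1 PA======
# 2 PB4A====
# 3 PB4HQ===
# 4 PB4HQ6A=
# 5 PB4HQ6DY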

Solution

last_block_width = len(unpadded_str) % 8
if last_block_width != 0:
    unpadded_str += (8 - last_block_width) * '='
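
For example, re-padding a stripped string this way makes b32decode accept it again (the value of unpadded_str here is just an illustration):

import base64

unpadded_str = 'MZXW6'  # Base32 for b'foo' with the '===' padding stripped

last_block_width = len(unpadded_str) % 8
if last_block_width != 0:
    unpadded_str += (8 - last_block_width) * '='

print(base64.b32decode(unpadded_str))  # b'foo'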

How do I predict the required size of a Base32 Decode output?

Base32 encodes each 5 bits (since 32 = 2^5) as a single character.

That means for encoding you need an output buffer of size:

dst_size = src_size * 8 / 5 (1.6 times larger)

But since only a whole number of characters is possible, this is rounded up:

dst_size = (src_size * 8 + 4) / 5 (integer division)

and a padded Base32 string is then extended to a multiple of 8 characters (40 bits).

Thus, for decoding (base32 -> binary) the required buffer size is accordingly:

dst_size = ceil(src_size / 1.6)

(With padding stripped, the exact decoded size is src_size * 5 / 8, truncated, as in the C# decoder above.)
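
As a sketch, the two sizes can be computed like this in Python (the function names are mine, not from any library):

def b32_encoded_size(src_size):
    """Characters needed to encode src_size bytes, including '=' padding."""
    chars = (src_size * 8 + 4) // 5    # ceil(src_size * 8 / 5)
    return (chars + 7) // 8 * 8        # round up to a full 8-character block

def b32_decoded_size(src_size):
    """Bytes produced by decoding src_size Base32 characters (padding stripped)."""
    return src_size * 5 // 8           # truncated, as in the C# decoder above

print(b32_encoded_size(5))  # 8
print(b32_decoded_size(8))  # 5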

Python base32 decode result is different

Those strings are identical. The value of the ASCII character N is 0x4E, and the last four bytes (0x39 0x60 0x20 0x46) are "9", "`", a space and "F". When Python prints a byte string, it prints all printable characters as-is, and only the non-printable ones show up as hex escapes. Use

import codecs
print(codecs.encode(bstr, 'hex'))

to see that. The binascii module also has methods to convert to hex strings.
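
For instance, with binascii (bstr being the byte string in question):

import binascii
print(binascii.hexlify(bstr))  # each byte printed as two hex digits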

org.apache.commons.codec.binary.Base32 decodes to same byte array for different strings

Look here:

https://guava.dev/releases/16.0/api/docs/com/google/common/io/BaseEncoding.html

You can see that Base32 uses A-Z and 2-7 as its alphabet. The character 0 is not part of it, and the decoder ignores characters outside the alphabet, so zeros will not change anything.

How can I decode base32 to string in mysql

Base64 and Base32 are not encryption, they are just encodings. MySQL does not have a native function to encode/decode Base32 strings the way it does for Base64 with FROM_BASE64 and TO_BASE64.

As an alternative you can try the CONV mathematical function (depending on what is stored as Base32). Let's say you have UUID numbers stored as DECIMAL and need to show them in base 32, or vice versa:

SELECT uuid,
       CONV(uuid, 10, 32) AS uuid_b32,
       CONV(CONV(uuid, 10, 32), 32, 10) AS uuid_back
FROM database.table;

The answer above is for number conversion between distinct bases. If that is not your case, e.g. when you have a binary file stored in a BLOB column, then you'll probably need to do the encoding/decoding outside MySQL. You may use MIME::Base32 or the appropriate module in your preferred language. Either way, you'll need to know whether the field holds text or binary data encoded in Base32.

Why is the result from (number).toString(32) different from other Base32 encoder implementations?

The confusion probably stems from the fact that there are two different things (albeit closely related) you could mean when you say "base 32".

Thing 1: Number radix

The way a representation of a number is structured, defining how many different options of symbols a single "digit" has. We humans usually use base 10 to represent our numbers with 10 different symbols (0-9), but you can also have binary which is base 2 (using only symbols 0 and 1), octal which is base 8 (using symbols 0-7) and so on. See base/radix on Wikipedia. With bases higher than 10 you usually use letters, starting with A for the 11th symbol. For example, hexadecimal (base 16) uses symbols 0-9 and A-F.

Since we only have 26 distinct letters that can be used in addition to the 10 symbols 0-9, in most scenarios only representations up to base 36 are defined. If you try to run 12345..toString(40) you'll get a RangeError: toString() radix argument must be between 2 and 36 for this reason.

Now, representing the number 12345 in base 32 this way (using symbols 0-9 and A-V) will give you C1P, since C has the value 12, 1 has the value 1 and P has the value 25, and 12 * 32^2 + 1 * 32^1 + 25 * 32^0 equals 12345. This is just a way to write a number using 32 distinct symbols instead of 10; I wouldn't call it an "encoding" as such.
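
You can check this with Python, which parses radix-32 numbers (digits 0-9 and A-V) directly; converting back needs a small hand-rolled loop since there is no built-in for it:

print(int('C1P', 32))  # 12345

digits = '0123456789ABCDEFGHIJKLMNOPQRSTUV'
n, out = 12345, ''
while n:
    n, r = divmod(n, 32)
    out = digits[r] + out
print(out)             # C1P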

If the base is larger than 10, this will result in a shorter1 (or equally long) representation than regular base 10.

Thing 2: baseN encoding

The term of a "baseN" encoding as in "base64" (the most well-known such encoding) means the encoding of an octet stream (a stream of bytes, of 8-bit binary2 data) using an "alphabet" (a set of allowed symbols), where the N specifies how many symbols the alphabet has3. This is used to store or transmit any octet stream (regardless of its contents) on a medium that doesn't allow for the full range of possible values in a byte to be used (for instance, a medium such as an email which may only contain text, or an URL which doesn't allow for certain special characters such as / or ? because they have semantic meaning, etc. - or even a piece of paper because the symbols 0 and O as well as I and l and 1 can't be reliably used without danger for confusion between them when read back by a human).

Now comes the part that marks the relation to the first "thing": The way the conversion works can be imagined by turning the input bytes into a huge number, and changing its radix but using the alphabet the encoding defines, not necessarily digits followed by letters. A great visual explanation can be found here.

The "turning the input bytes into a huge number" part is where the charCodeAt comes into play that you mentioned: I can turn the string ABC into the number 4276803 for instance, which becomes more obvious when looking at the bytes in their hexadecimal representation because a byte can have 256 values this range fits neatly into the exact range of two hexadecimal "digits" (0x00-0xFF4). The three bytes5 in ABC have hexadecimal values of 0x65, 0x66 and 0x67 respectively, and if I put them next to each other I can look at them as a large number 0x656667 = 4276803.

An additional overlap with the first "thing" is that in cryptography, very large numbers come into play, and often those are also encoded using a mechanism like base32 or base58 or base64, but unless the programming language and/or processor have a data type/register that fits the large number, the number is at that point already represented as some sort of octet stream yet again (the inverse of what I just described, sometimes with the reverse byte order though).

Of course this is only conceptually how it's done, because otherwise the algorithm would have to cope with gigantic numbers once we are talking about encoding not 3 bytes but 3,000,000 bytes. In reality, clever ways involving bit shifting etc. are used to achieve the same result on any length of data sequentially.

Since a "string" as you are used to seeing (ignoring Unicode for a second) can be loosely compared to a byte's numerical number represented in sort of a base 256 (one symbol for each of the possible 256 values in a byte), this means that any such baseN encoding will make the output longer because the new "radix" is lower than 256. Note that putting 12345 into a base32 algorithm will mean the string 12345 which could be viewed as the number 211295614005 (or 0x3132333435) in my explanation above. Looking at it this way, 64t36d1n (what you got from base32) is definitely shorter than 211295614005 in base 10, so it all makes sense again.

Important note: This explanation isn't entirely correct if you have input data which can't be exactly mapped to its new representation without padding due to its length. For example, a 3-byte-long data chunk occupies 3*8=24 bits, and a base64 representation of it that uses 6 bits per symbol is easily possible because exactly four of these symbols would also occupy 4*6=24 bits. But a 4-byte-long data chunk occupies 4*8=32 bits and would therefore require 5.333... symbols in base64 (5.333...*6=32). To "fill up" the remaining data space, some sort of padding6 is used so we can round it up to 6 symbols. Normally, the padding is added to the end of the input data, and this is where reality differs from my "changing radix of huge number" concept above because in math you'd expect leading zeroes as padding.
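
You can see the padding behavior directly with Python's base64 module (these are the RFC 4648 test vectors):

import base64
print(base64.b64encode(b'foo'))   # b'Zm9v'     - 24 bits, exactly 4 symbols
print(base64.b64encode(b'foob'))  # b'Zm9vYg==' - 32 bits, 6 symbols + 2 pads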



Douglas Crockford's base32

To address your initial question:

Douglas Crockford's base32 algorithm is actually designed for numbers, but with a modified alphabet; it doesn't take an octet stream as input the way programmers are used to. So it's more like a middle ground between the two things described above. You are right that toString(32) goes half the way to what you need, but you'd have to map between the regular "alphabet" of radix 32 (0-9, A-V, case insensitive) and Crockford's (0-9 and A-Z but without I, L, O and U, case insensitive, mapping I and L to 1 and O to 0 when decoding).

Replacing those things back and forth is enough complexity that I guess it'd be cleaner (and more educational) to write the algorithm yourself from scratch instead of relying on toString.

(Also, Crockford proposes an additional "check symbol" at the end which goes beyond what was explained here anyway.)
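
As a minimal from-scratch sketch for non-negative integers (the function name is mine, and the optional check symbol is omitted):

CROCKFORD = '0123456789ABCDEFGHJKMNPQRSTVWXYZ'  # no I, L, O, U

def crockford_encode(n):
    if n == 0:
        return '0'
    out = ''
    while n:
        n, r = divmod(n, 32)
        out = CROCKFORD[r] + out
    return out

print(crockford_encode(12345))  # C1S, vs. 'c1p' from (12345).toString(32)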


Footnotes:

1: This is assuming integers. If you have fractions, then things are very different, because you can get recurring decimals in the new radix for numbers that didn't have recurring decimals in the old radix, or the other way round. For instance, 0.1 in base 32 is 0.36CPJ6CPJ6CPJ6CPJ... which is an infinitely long number (in that particular representation).

2: The term "binary" here doesn't refer to representation in radix 2 but to "any kind of data which can use the full range of values from 0-255 per byte, not restricted to values representing human-readable text in ASCII range 32-126".

3: Note that from the N alone, you can't infer what the alphabet exactly is, only how long it is. Well-known encodings have universally accepted conventions about which alphabet is used, such as base64 and base58 (the latter often being used for cryptocurrency addresses, and its alphabet is not even in alphabetical order, by the way). Note that even for base64 there are variations like base64url which change the alphabet slightly. Others such as base32 don't have a universally accepted alphabet yet, which is why the website that you linked mentions "this is a base32 encoding and not the base32 encoding" - notably it's not the same as Crockford's alphabet.

4: The prefix 0x is commonly used to denote that the following symbols are to be interpreted as a number in base 16 (hexadecimal) instead of base 10.

5: I'm talking about bytes here, because this is what the baseN algorithms work with, but in fact strings are based on characters and not bytes, and they may also contain Unicode characters with numerical values beyond 255, therefore not fitting into a single byte anymore. Normally, strings are first encoded using a character encoding like UTF-8 to bytes and then the baseN encoding is performed on those bytes.

6: base64 uses = as padding, and to retain the information how many padding characters were used, the same number of = characters is also appended to the output (= isn't in the alphabet of base64).

Python base32 data decode

The Python base64 module follows RFC 3548. For Base32 encoding,

Padding at the end of
the data is performed using the "=" character. Since all base 32
input is an integral number of octets, only the following cases can
arise:

(1) the final quantum of encoding input is an integral multiple of 40
bits; here, the final unit of encoded output will be an integral
multiple of 8 characters with no "=" padding,

(2) the final quantum of encoding input is exactly 8 bits; here, the
final unit of encoded output will be two characters followed by six
"=" padding characters,

(3) the final quantum of encoding input is exactly 16 bits; here, the
final unit of encoded output will be four characters followed by four
"=" padding characters,

(4) the final quantum of encoding input is exactly 24 bits; here, the
final unit of encoded output will be five characters followed by
three "=" padding characters, or

(5) the final quantum of encoding input is exactly 32 bits; here, the
final unit of encoded output will be seven characters followed by one
"=" padding character.

You can see that there is no valid case for RFC 3548 base32 encoding that would result in six characters and two padding characters.

Five characters gives you 25 bits total, so it is enough to encode three bytes with one extra bit. Six characters would give you 30 bits total, which is still not enough for four bytes. With seven characters you get 35 bits, which is enough for four bytes. Since six characters is no better than five for encoding an integral number of bytes, it is excluded from the standard for the final padded 40-bit input group of eight characters including padding.
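
You can confirm this with Python's b32decode, which is strict about padding per the RFC:

import base64

print(base64.b32decode('MZXW6==='))  # b'foo' - 5 characters + 3 pads is valid
try:
    base64.b32decode('MZXW6F==')     # 6 characters + 2 pads never occurs
except Exception as e:
    print(e)                         # Incorrect padding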


