Custom Radix Columns (+Special Characters)

How about using the basic base 10 to any base conversion, modified for custom digits:

func numberToCustomRadix(_ number: Int, alphabet: String) -> String {
    let base = alphabet.count
    var number = number
    var result = ""
    repeat {
        let idx = alphabet.index(alphabet.startIndex, offsetBy: number % base)
        result = [alphabet[idx]] + result
        number /= base
    } while number > 0
    return result
}

numberToCustomRadix(3, alphabet: "012") // 10
numberToCustomRadix(4, alphabet: "abc") // bb
numberToCustomRadix(5, alphabet: "%#9") // #9

Note that the problem with a custom alphabet is that it's hard to guarantee at compile time that the alphabet contains distinct characters. E.g. an "aaabbbccc" alphabet will cause all kinds of conversion problems, because several different numbers map to the same string.
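If a compile-time guarantee is out of reach, a run-time check is one way to catch a bad alphabet before converting. Below is only a sketch in JavaScript; the names hasDistinctSymbols and toCustomRadix are made up for illustration, and it assumes the input is a non-negative safe integer. It also rejects one-symbol alphabets, since dividing by a base of 1 would never terminate.

```javascript
// Hypothetical helper: true when no symbol occurs twice in the alphabet.
function hasDistinctSymbols(alphabet) {
  const symbols = Array.from(alphabet);        // split by code point
  return new Set(symbols).size === symbols.length;
}

// Hypothetical counterpart of numberToCustomRadix with the check built in.
// Assumes `number` is a non-negative safe integer.
function toCustomRadix(number, alphabet) {
  const symbols = Array.from(alphabet);
  if (symbols.length < 2 || !hasDistinctSymbols(alphabet)) {
    throw new RangeError("alphabet needs two or more symbols, all distinct");
  }
  let result = "";
  do {
    result = symbols[number % symbols.length] + result;
    number = Math.floor(number / symbols.length);
  } while (number > 0);
  return result;
}

console.log(toCustomRadix(5, "%#9"));          // #9
console.log(hasDistinctSymbols("aaabbbccc"));  // false
```

Throwing on a bad alphabet is a design choice made for the sketch; returning a null-like value would work just as well.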


Inputting multi-radix multi-digit signed numbers with DOS

DOS has several input functions but all deal with characters exclusively.

If the number involved is small, like say 1 or 2 digits, many (new) programmers use the DOS.GetCharacter function 01h resulting in code like this:

    ; 1-digit number
mov ah, 01h ; DOS.GetCharacter
int 21h ; -> AL=["0","9"]
sub al, "0" ; -> AL=[0,9]

; 2-digit number
mov ah, 01h ; DOS.GetCharacter
int 21h ; -> AL=["0","9"] (tens)
mov bl, al
mov ah, 01h ; DOS.GetCharacter
int 21h ; -> AL=["0","9"] (ones)
mov ah, bl
sub ax, "00" ; SIMD -> AH=[0,9] (tens), AL=[0,9] (ones)
aad ; AL = AH * 10 + AL -> AL=[0,99]

This is the most basic way of inputting small numbers, but it falls short in many ways. As an example, consider what would happen to your program if the user made a mistake and accidentally pressed a key for which DOS returns an extended ASCII character (a zero followed by a scancode).

Then think about the mess you would get if the above method were used to input numbers that have 3, 4, or 5 digits! Inputting a multi-digit number is best done using the DOS.BufferedInput function 0Ah. This function already gives your program a better chance at surviving, since it allows keyboard users to correct their mistakes. To allow for an input of at most 5 decimal digits, the buffer that you submit to DOS could be defined with buf db 6, 0, 6 dup 0 (room for 5 digits plus the terminating carriage return). The Q&A "How buffered input works" has the details. Once the string of characters that represents the number has been entered, the text must get converted into a numeric value. The following code shows this:

snippet 1a

    mov  dx, buf
mov ah, 0Ah ; DOS.BufferedInput
int 21h
xor ax, ax ; Result = 0
mov si, buf+1
xor cx, cx
mov cl, [si] ; -> CX is number of characters entered
jcxz .z ; Return zero for an 'empty' input
; Decimal
.a: inc si ; Next character
mov dx, 10
mul dx ; Result = Result * 10
mov dl, [si] ; -> DX = ["0","9"] (NewDigit)
sub dl, 48 ; Convert NewDigit from ["0","9"] to [0,9]
add ax, dx ; Result = Result + NewDigit
loop .a
.z:

Sometimes you will want to input numbers in the hexadecimal, octal, or binary formats, in which case you could use the following calculation loops:

snippet 2a

    ; Hexadecimal
.a: inc si ; Next character
shl ax, 1 ; Result = Result * 16
shl ax, 1
shl ax, 1
shl ax, 1
mov dl, [si] ; -> DL = {["0","9"],["A","F"]} (NewDigit)
cmp dl, "9"
jbe .b
sub dl, 7
.b: sub dl, 48
or al, dl ; Result = Result + NewDigit
loop .a

; Octal
.a: inc si ; Next character
shl ax, 1 ; Result = Result * 8
shl ax, 1
shl ax, 1
mov dl, [si] ; -> DL = ["0","7"] (NewDigit)
sub dl, 48
or al, dl ; Result = Result + NewDigit
loop .a

; Binary
.a: inc si ; Next character
cmp byte [si], "1" ; -> CF=1 for "0", CF=0 for "1"
cmc ; -> CF=0 for "0", CF=1 for "1"
rcl ax, 1 ; Result = Result * 2 + NewDigit
loop .a

Even with the editing facilities that the DOS.BufferedInput function 0Ah offers, it is not OK to just trust the user at the keyboard to supply your program with the correct data. It is you that has to validate the input, and if you find that something is amiss, there are a number of ways to deal with it. You could exit the program with (or without) an error message, you could have the user redo the input, you could choose to deliver some special value like the '8000h integer indefinite' that the FPU uses, or you could return a saturated result. The important thing is that you deal with the situation.

Building a better number input routine

To improve on the code that we have so far, we could

  • write the code such that the user can freely choose the number base that they want to use. All it will take is allowing the input to contain an additional numeric affix. I have always preferred the one character suffixes that Intel uses, so 'h' for hexadecimal, 'o' for octal, 'b' for binary, and 'd' or none for decimal.

  • add a further suffix in order to shorten long numbers that are multiples of 1000 ('K' for Kilo) or 1024 ('KB' for KiloByte), e.g. 60K is 60000 and 6KB is 6144.

  • allow the user to use the so-called 'thousands separator', which makes long runs of digits easier to read and write. A nice thing about it is that it need not separate at the thousands at all! We can apply this to any of the number bases. FASM uses the apostrophe ' for this.

  • allow the user to use any case for the suffixes and the hexadecimal digits A through F, making the text case-insensitive.

  • allow the user to have leading whitespace in their input. Sounds silly? Well, not so much if you have your inputs stored in a history of some kind, and later recall that list. You would appreciate the nice right alignment that you could get.

  • allow the user to have trailing whitespace in their input. Ask yourself whether you'd hate the program to disapprove of an input like "123 " (with a trailing space) or even "20 years".

  • allow the user to prefix the number with a minus sign -, so that they can start working with negative numbers in their code.

  • extend the range of numbers that we can process. Instead of storing the result in the 16-bit AX register, we will store it in the 32-bit EAX register. If the code is to run on the 8086 cpu, then we would store in the 32-bit DX:AX register pair!

but we must

  • verify that the input is composed of valid characters so as to not spend effort processing garbage

  • detect numeric overflow so as to not deliver bogus results to the program
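To make the wish list above concrete, here is a rough sketch of the same input rules in JavaScript. It is not a translation of any routine in this answer: the name parseNumber, the regular expression, and the { value } / { error } result shape are all assumptions made for the illustration. It accepts leading whitespace, a sign, digits with apostrophe separators, an optional h/o/b/d base suffix, an optional K/M/G or KB/MB/GB order suffix, and trailing whitespace, and it reports overflow past the unsigned 32-bit maximum.

```javascript
const MAX_UINT32 = 0xFFFFFFFF;

// Assumed grammar: bases are tried in the order hex, octal, binary, decimal.
const NUMBER_PATTERN = new RegExp([
  "^[ \\t]*([+-]?)",               // leading whitespace, optional sign
  "(?:('*[0-9A-F][0-9A-F']*)H",    // hexadecimal digits followed by 'h'
  "|('*[0-7][0-7']*)O",            // octal digits followed by 'o'
  "|('*[01][01']*)B",              // binary digits followed by 'b'
  "|('*[0-9][0-9']*)D?)",          // decimal digits, 'd' is optional
  "(?:([KMG])(B?))?",              // optional order suffix
  "(?:[ \\t].*)?$"                 // trailing whitespace, the rest is ignored
].join(""), "i");

// Hypothetical helper: returns { value } on success, { error } otherwise.
function parseNumber(text) {
  const m = NUMBER_PATTERN.exec(text);
  if (m === null) return { error: "syntax" };
  const [, sign, hex, oct, bin, dec, order, bytes] = m;
  const radix = hex ? 16 : oct ? 8 : bin ? 2 : 10;
  const digits = (hex || oct || bin || dec).replace(/'/g, "");

  let value = 0;
  for (const ch of digits) {
    value = value * radix + parseInt(ch, 16);   // every digit is also a hex digit
    if (value > MAX_UINT32) return { error: "overflow" };
  }
  if (order) {
    const factor = bytes ? 1024 : 1000;
    const steps = "KMG".indexOf(order.toUpperCase()) + 1;
    for (let i = 0; i < steps; i++) {
      value *= factor;
      if (value > MAX_UINT32) return { error: "overflow" };
    }
  }
  return { value: sign === "-" ? 0 - value : value };
}

console.log(parseNumber("60K").value);        // 60000
console.log(parseNumber("6KB").value);        // 6144
console.log(parseNumber("  -1'000").value);   // -1000
console.log(parseNumber("5G").error);         // overflow
```

A regular expression keeps the sketch short; a hand-written scanner that tries each base in turn would be closer to how assembly code has to do it.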

Applying validation and overflow detection turns snippet 1a into

snippet 1b

    mov  dx, buf
mov ah, 0Ah ; DOS.BufferedInput
int 21h
xor ax, ax ; Result = 0
mov si, buf+1
xor cx, cx
mov cl, [si] ; -> CX is number of characters entered
jcxz .z ; Return zero for an 'empty' input
; Decimal
.a: inc si ; Next character
xor bx, bx
mov bl, [si] ; -> BX = ["0","9"] (NewDigit) ?
sub bl, 48 ; Convert NewDigit from ["0","9"] to [0,9]
cmp bl, 9
ja .z ; Stop if not a digit
mov dx, 10
mul dx ; Result = Result * 10
jc .o
add ax, bx ; Result = Result + NewDigit
jc .o
loop .a
jmp .z
.o: mov ax, 65535 ; Saturated result is MAXUINT
.z:
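For readers who find the control flow easier to follow in a high-level language, the same three decisions (stop at the first non-digit, saturate on overflow, return zero for an empty input) can be sketched in JavaScript. The function name parseDecimalSaturated is hypothetical.

```javascript
const MAX_UINT16 = 0xFFFF;

// Hypothetical helper that follows the logic of snippet 1b.
function parseDecimalSaturated(text) {
  let result = 0;                                 // also the answer for empty input
  for (const ch of text) {
    const digit = ch.charCodeAt(0) - 48;          // "0".."9" -> 0..9
    if (digit < 0 || digit > 9) break;            // stop if not a digit
    result = result * 10 + digit;
    if (result > MAX_UINT16) return MAX_UINT16;   // saturated result is MAXUINT
  }
  return result;
}

console.log(parseDecimalSaturated("65535"));  // 65535
console.log(parseDecimalSaturated("65536"));  // 65535 (saturated)
console.log(parseDecimalSaturated("12x4"));   // 12
```

The single comparison after the multiply-and-add stands in for the two separate carry checks of the assembly version, which is possible here because JavaScript numbers do not wrap at 16 bits.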

For the hexadecimal, octal, or binary formats, substitute the following loops:

snippet 2b

    ; Hexadecimal
.a: inc si ; Next character
mov dl, [si] ; -> DL = {["0","9"],["A","F"]} (NewDigit) ?
cmp dl, "9"
jbe .b
cmp dl, "A"
jb .z ; Stop if in [":","@"], these are not digits
sub dl, 7
.b: sub dl, 48
cmp dl, 15
ja .z ; Stop if not a digit
rol ax, 1 ; Result = Result * 16
rol ax, 1
rol ax, 1
rol ax, 1
test al, 15
jnz .o
or al, dl ; Result = Result + NewDigit
loop .a

; Octal
.a: inc si ; Next character
mov dl, [si] ; -> DL = ["0","7"] (NewDigit) ?
sub dl, 48
cmp dl, 7
ja .z ; Stop if not a digit
rol ax, 1 ; Result = Result * 8
rol ax, 1
rol ax, 1
test al, 7
jnz .o
or al, dl ; Result = Result + NewDigit
loop .a

; Binary
.a: inc si ; Next character
mov dl, [si] ; -> DL = ["0","1"] (NewDigit) ?
sub dl, 48
cmp dl, 1
ja .z ; Stop if not a digit
shl ax, 1 ; Result = Result * 2
jc .o
or al, dl ; Result = Result + NewDigit
loop .a

The de luxe version of inputting a number applies everything that was mentioned above. It is important to note that the following program will not run on an 8086 CPU because it uses 32-bit registers and instructions introduced with later processors. (For sure a nice exercise to rewrite for 8086!) The program runs in a DOS window, and of course also in the true real address mode of an x86 CPU.

The InputEAX routine sets the carry flag if the input turns out to be syntactically wrong (EAX=0), or the input leads to a value that exceeds the 32-bit range [-(4GB-1),+(4GB-1)] (EAX=80000000h).

This inputting code does not pretend to be gospel! If you don't need a certain feature, then just remove it. And if for your particular use case some feature is missing, then just add it. Leave a comment if this happens...

        ORG     256

again: mov dx, msg1
mov ah, 09h ; DOS.PrintString
int 21h
call InputEAX ; -> EAX CF
; ignoring the CF for the purpose of the demo
push ax ; (1)
mov dx, msg2
mov ah, 09h ; DOS.PrintString
int 21h
pop ax ; (1)
call PrintEAX
cmp eax, 27 ; Arbitrarily chosen, but 27 == ESC
jne again

exit: mov ax, 4C00h ; DOS.TerminateWithReturnCode
int 21h
; --------------------------------------
msg1 db 13, 10, 'Input a number : $'
msg2 db 10, 'The number is $'
; --------------------------------------
; IN (eax) OUT ()
PrintEAX:
pushad
test eax, eax
jns .a
push ax ; (1)
mov dl, "-"
mov ah, 02h ; DOS.PrintCharacter
int 21h
pop ax ; (1)
neg eax
.a: mov ebx, 10
push bx ; (2a) Sentinel
.b: xor edx, edx
div ebx
push dx ; (2b) Remainder
test eax, eax
jnz .b
pop dx ; (2)
.c: add dl, "0"
mov ah, 02h ; DOS.PrintCharacter
int 21h
pop dx
cmp dx, bx
jb .c
popad
ret
; --------------------------------------
; IN () OUT (eax,CF)
InputEAX:
xor eax, eax ; In case of CF=1 on exit
pushad
sub sp, 44+44 ; 2 local buffers
mov bp, sp
push 44 ; Buffer header 44, 0
mov dx, sp
mov ah, 0Ah ; DOS.BufferedInput
int 21h
mov si, bp ; Where the string of characters begins

; Leading whitespace
.a: lodsb
call IsWhitespace ; -> ZF
je .a
dec si

; Unary
mov al, [si]
push ax ; Possible UNARY at [bp-4]
cmp al, "+"
je .b
cmp al, "-"
jne .c
.b: inc si

; Digits followed by base-suffix, in turn for Hex, Oct, Bin, and Dec
.c: mov cx, 16+256*'H'
call GetDigits ; -> SI DI CF (AX)
jnc .d
mov cx, 8+256*'O'
call GetDigits ; -> SI DI CF (AX)
jnc .d
mov cx, 2+256*'B'
call GetDigits ; -> SI DI CF (AX)
jnc .d
mov cx, 10+256*'D'
call GetDigits ; -> SI DI CF (AX)
jc .NOK
.d: call LodsUCasedChar ; -> AL SI

; [option] K, M, G, KB, MB, GB order-suffixes
mov ebx, 1 ; Multiplier
mov ch, 3 ; ORDER
cmp al, "G" ; Giga
je .e
mov ch, 2 ; ORDER
cmp al, "M" ; Mega
je .e
mov ch, 1 ; ORDER
cmp al, "K" ; Kilo
jne .f
.e: mov bx, 1000 ; Multiplier
call LodsUCasedChar ; -> AL SI
cmp al, "B"
jne .f
mov bx, 1024 ; Multiplier
lodsb

; Trailing whitespace or end-of-input
.f: call IsWhitespace ; -> ZF
je .OK
cmp al, 13 ; Terminating carriage return
je .OK

; Failed to extract any series of digits, or excess characters in string
.NOK: stc
jmp .END

; Building the integer in EAX
.OK: mov byte [bp+44+44+31], 80h ; pushad.EAX = 80000000h (Integer
xor si, si ; indefinite in case of overflow)
xor eax, eax ; Result
.g: movzx edx, cl ; CL is RADIX {16,8,2,10}
mul edx
jc .END
movzx edx, byte [bp+44+si] ; NewDigit [0,15]
add eax, edx
jc .END
inc si
cmp si, di ; DI is NumberOfDigits
jb .g

; [option] Applying the multipliers repeatedly
.h: mul ebx ; EBX={1,1000,1024}
jc .END
dec ch ; CH is ORDER [1,3]
jnz .h

; Negating as required
cmp byte [bp-4], "-" ; UNARY
jne .CLC
neg eax ; Valid range [-(4GB-1),+(4GB-1)]
.CLC: clc

; Returning the result
mov [bp+44+44+28], eax ; pushad.EAX
.END: lea sp, [bp+44+44]
popad
ret
; --------------------------------------
; IN (al) OUT (ZF)
IsWhitespace:
cmp al, " "
je .a
cmp al, 9 ; Tab
.a: ret
; --------------------------------------
; IN (si) OUT (al,si)
LodsUCasedChar:
lodsb
cmp al, "a"
jb .a
cmp al, "z"
ja .a
and al, 1101'1111b ; UCase
.a: ret
; --------------------------------------
; IN (cx,si) OUT (si,di,CF) MOD (ax)
GetDigits:
push si ; (1)
xor di, di ; NumberOfDigits
.a: call LodsUCasedChar ; -> AL SI
cmp al, "'" ; 'Thousands' separator (apostrophe)
je .a
mov ah, al
cmp al, "0"
jb .c
cmp al, "9"
jbe .b
cmp al, "A"
jb .c
cmp al, "F"
ja .c
sub al, 7
.b: sub al, 48 ; -> AL=[0,15]
cmp al, cl ; CL is RADIX {16,8,2,10}
jnb .c
mov [bp+44+di], al
inc di
jmp .a

.c: test di, di ; Any digits found ?
jz .NOK
cmp ah, ch ; CH is BASE-SUFFIX {HOBD}
je .OK
cmp ch, "D" ; Decimals need not be suffixed
jne .NOK
dec si
.OK: ;;clc
pop ax ; (1a) This throws away `push si`
ret ; CF=0
.NOK: stc
pop si ; (1b)
ret ; CF=1
; --------------------------------------

A word on segment registers

The ORG 256 directive on top tells you that this program is a .COM program for DOS where the segment registers are all set equal to each other. If you were to use the InputEAX routine in an .EXE program that you write, you would have to temporarily set the DS segment register equal to SS because the local buffers have been placed on the stack and normally SS will be different from DS.

; IN () OUT (eax,CF)
InputEAX:
push ds
push ss ; DS = SS
pop ds
xor eax, eax ; In case of CF=1 on exit
pushad

...

popad
pop ds
ret

What is the significance of radix in Character.forDigit() in Java?

This is tricky because the significance isn't as obvious as it first appears. When converting a string to an integer, of course the radix matters a lot. If you are converting "101" to an integer, you will get different answers depending on whether the radix (base) is binary (2), decimal (10), octal (8), hex (16), or any other base. Similarly, when converting an integer to a string, the results (when the source is >= MAX_RADIX) are all different for the different radices.

For forDigit, the answer isn't as clear. When you're converting a number to a single character representing a digit, the answer is always the same as long as the digit is valid for the radix. Thus, Character.forDigit(11,radix) always returns 'b' for all radices 12 and up. So the only significance is in how it handles the case when the digit is not valid for the radix. That is, for binary (radix=2), forDigit only works if the digit is 0 or 1; so what should it do if you say Character.forDigit(2,2), since 2 is not a valid binary digit?

There are a few things the language designers could have done: (1) get rid of the radix parameter and put the onus on the programmer to make sure the digit is in range (which in many cases will be a given anyway); (2) throw an exception; (3) return some special value. They chose (3): if you give it a digit that isn't valid for the radix, it returns '\0', the null character. This doesn't seem to be the best choice--you're unlikely to really want to use the null character for anything, which means you have to make your own check, which means they probably should have had the method throw an exception. But there it is.

But anyway, that's the significance of radix for this method: it performs a check to make sure the argument is in range, based on the radix.
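To see that behaviour without a Java compiler at hand, here is a small JavaScript imitation. This is not the Java implementation, only a sketch of the rule described above; the function forDigit and its constants are written for this illustration, with the radix limits 2 and 36 matching Character.MIN_RADIX and Character.MAX_RADIX.

```javascript
const MIN_RADIX = 2;
const MAX_RADIX = 36;

// Sketch of the rule: the radix only decides whether the digit is in range.
function forDigit(digit, radix) {
  if (radix < MIN_RADIX || radix > MAX_RADIX) return "\0";
  if (!Number.isInteger(digit) || digit < 0 || digit >= radix) return "\0";
  return digit.toString(36);       // 0..9 then a..z, independent of the radix
}

console.log(forDigit(11, 12));            // b
console.log(forDigit(11, 16));            // b
console.log(forDigit(2, 2) === "\0");     // true
```

Note how the radix never influences which character comes back, only whether one comes back at all.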

How to change an integer into a string with specified character set in JavaScript

This is a variant of a "Base64" conversion question that can be answered by "base n" libraries. However, these libraries may be "overkill" for this question, so below is modified code based on a simple & elegant solution by @Reb.Cabin. Credit also to editors @callum, @Philip Kaplan, @Oka on this code.

In this response, vowels and various "problem letters" commonly used to create curse words are removed, so a random integer hash will not create an offensive short URL.

// Based on Base64 code by @Reb.Cabin, edits by @callum, @philip Kaplan, @Oka available at https://stackoverflow.com/a/6573119/3232832
BaseN = {
    _Rixits :
//   0       8       16      24      32      40      48      56     63
//   v       v       v       v       v       v       v       v      v
    "0123456789BDGHJKLMNPQRTVWXYZbdghjklmnpqrtvwxyz-_",
//  original base64
//  "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz-_",
    // You have the freedom, here, to choose the glyphs you want for
    // representing your base-64 numbers.
    // This cannot handle negative numbers and only works on the
    //     integer part, discarding the fractional part.
    fromNumber : function(number) {
        if (isNaN(Number(number)) || number === null ||
            number === Number.POSITIVE_INFINITY)
            throw "The input is not valid";
        if (number < 0)
            throw "Can't represent negative numbers now";

        var rixit; // like 'digit', only in some non-decimal radix
        var residual = Math.floor(number);
        var result = '';
        var rixitLen = this._Rixits.length;
        while (true) {
            rixit = residual % rixitLen;
            result = this._Rixits.charAt(rixit) + result;
            residual = Math.floor(residual / rixitLen);

            if (residual === 0)
                break;
        }
        return result;
    },

    toNumber : function(rixits) {
        var result = 0;
        for (var e = 0; e < rixits.length; e++) {
            result = (result * this._Rixits.length) + this._Rixits.indexOf(rixits[e]);
        }
        return result;
    }
};

var i = 1234567890;
var encoded = BaseN.fromNumber(1234567890);
var decoded = BaseN.toNumber(encoded);
document.writeln('Given character set "' + BaseN._Rixits + '", the number ' + i + ' is encoded to ' + encoded + ' then back again to ' + decoded + '.');

How to convert AnyBase to Base10?

Based on the original algorithm, you need to iterate through each character of the encoded string, find the position of that character within the alphabet, and fold it into the running result (result = result * base + position).

Here are both methods and some test code:

func stringToCustomBase(encode: Int, alphabet: String) -> String {
    let base = alphabet.count
    var string = encode, result = ""
    repeat {
        let index = alphabet.index(alphabet.startIndex, offsetBy: (string % base))
        result = [alphabet[index]] + result
        string /= base
    } while (string > 0)
    return result
}

func customBaseToInt(encoded: String, alphabet: String) -> Int? {
    let base = alphabet.count
    var result = 0
    for ch in encoded {
        // firstIndex(of:) is the current name of the deprecated index(of:)
        if let index = alphabet.firstIndex(of: ch) {
            let mult = result.multipliedReportingOverflow(by: base)
            if (mult.overflow) {
                return nil
            } else {
                let add = mult.partialValue.addingReportingOverflow(alphabet.distance(from: alphabet.startIndex, to: index))
                if (add.overflow) {
                    return nil
                } else {
                    result = add.partialValue
                }
            }
        } else {
            return nil
        }
    }

    return result
}

let startNum = 234567
let alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
let codedNum = stringToCustomBase(encode: startNum, alphabet: alphabet)
let origNum = customBaseToInt(encoded: codedNum, alphabet: alphabet)

I made the customBaseToInt method return an optional result in case there are characters in the encoded value that are not in the provided alphabet, or in case the decoded value overflows Int.
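The same "no answer rather than a wrong answer" idea carries over to JavaScript, where a decoder that relies on indexOf would otherwise silently fold a -1 into the result for an unknown character. The following is a sketch only; customBaseToNumber is a hypothetical name, and using null plus the safe-integer limit as the failure conditions is an assumption of this example.

```javascript
// Hypothetical counterpart of customBaseToInt: returns null for a symbol that
// is not in the alphabet or when the value leaves the safe-integer range.
function customBaseToNumber(encoded, alphabet) {
  const symbols = Array.from(alphabet);
  let result = 0;
  for (const ch of encoded) {
    const digit = symbols.indexOf(ch);
    if (digit === -1) return null;                    // unknown symbol
    result = result * symbols.length + digit;
    if (!Number.isSafeInteger(result)) return null;   // too large to be exact
  }
  return result;
}

const letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
console.log(customBaseToNumber("FA91", letters));  // 234567
console.log(customBaseToNumber("FA9?", letters));  // null
```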

Radix sorting strings

Yes, strings can be sorted with radix sort. In fact, radix sort is extremely effective on strings!

Given a collection of strings, you can radix sort them by first sorting the strings by their first letter (use any sorting algorithm you'd like, like counting sort or even insertion sort), breaking the strings into groups by their first letter, then recursively sorting all of the strings in each group. (This would be a most-significant-digit radix sort). You could also do a least-significant-digit radix sort. Imagine that all strings are padded up to the length of the maximum string with some special character ❤ that lexicographically precedes all the other characters. Then just do a regular LSD radix sort. When you're done, everything will be in sorted order!
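A most-significant-digit radix sort along those lines can be sketched in a few lines of JavaScript. This is a minimal illustration, not a tuned implementation: the name msdRadixSort is made up, characters are compared by UTF-16 code unit, and the distinct bucket keys are ordered with the built-in sort for brevity where a counting sort over the character range would normally be used.

```javascript
// Hypothetical MSD radix sort: group by the character at `depth`, then recurse.
function msdRadixSort(strings, depth = 0) {
  if (strings.length <= 1) return strings.slice();
  const finished = [];            // strings that end here precede all others
  const buckets = new Map();      // code unit at `depth` -> strings in that group
  for (const s of strings) {
    if (s.length === depth) {
      finished.push(s);
      continue;
    }
    const key = s.charCodeAt(depth);
    if (!buckets.has(key)) buckets.set(key, []);
    buckets.get(key).push(s);
  }
  const keys = [...buckets.keys()].sort((a, b) => a - b);
  const sorted = finished;
  for (const key of keys) {
    sorted.push(...msdRadixSort(buckets.get(key), depth + 1));
  }
  return sorted;
}

console.log(msdRadixSort(["banana", "apple", "app", "cherry", "b", ""]));
```

Putting the strings that have run out of characters first plays the role of the padding character that precedes all others.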


