Simple Way to Get Filesize in X86 Assembly Language

File size in assembly

Simple code to display the length of a DOS file in hexadecimal (the file name is hardcoded in the source; change it to the name of an existing file):

.model small
.stack 100h

.data

fname DB "somefile.ext", 0
buffer DB 100 dup (?), '$'

.code

start:
; set up "ds" to point to data segment
mov ax,@data
mov ds,ax
; open file first, to get "file handle"
mov ax,3D00h ; ah = 3Dh (open file), al = 0 (read only mode)
lea dx,[fname] ; ds:dx = pointer to zero terminated file name string
int 21h ; call DOS service
jc fileError
; ax = file handle (16b number)

; now set the DOS internal "file pointer" to the end of opened file
mov bx,ax ; store "file handle" into bx
mov ax,4202h ; ah = 42h, al = 2 (END + cx:dx offset)
xor cx,cx ; cx = 0
xor dx,dx ; dx = 0 (cx:dx = +0 offset)
int 21h ; will set the file pointer to end of file, returns dx:ax
jc fileError ; something went wrong, just exit
; here dx:ax contains length of file (32b number)

; close the file, as we will not need it any more
mov cx,ax ; store lower word of length into cx for the moment
mov ah,3Eh ; ah = 3E (close file), bx is still file handle
int 21h ; close the file
; ignoring any error during closing, so not testing CF here

; BTW, int 21h modifies only the registers specified in documentation
; that's why keeping length in dx:cx registers is enough, avoiding memory/stack

; display dx:cx file length in hexa formatting to screen
; (note: yes, I used dx:cx for storage, not cx:dx as offset for 42h service)
; (note2: hexa formatting, because it's much easier to implement than decimal)
lea di,[buffer] ; hexa number will be written to buffer
mov word ptr [di],('0' + 'x'*256) ; with C-like "0x" prefix
add di,2 ; "0x" written at start of buffer
mov ax,dx
call AxTo04Hex ; upper word converted to hexa string
mov ax,cx
call AxTo04Hex ; lower word converted to hexa string
mov byte ptr [di],'$' ; string terminator

; output final string to screen
mov ah,9
lea dx,[buffer]
int 21h

; exit to DOS with exit code 0 (OK)
mov ax,4C00h
int 21h

fileError:
mov ax,4C01h ; exit with code 1 (error happened)
int 21h

AxTo04Hex: ; subroutine to convert ax into four ASCII hexadecimal digits
; input: ax = 16b value to convert, ds:di = buffer to write characters into
; modifies: di += 4 (points beyond the converted four chars)
push cx ; save original cx to preserve its value
mov cx,4
AxTo04Hex_singleDigitLoop:
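rol ax,4 ; rotate whole ax content by 4 bits "up" (ABCD -> BCDA)
; (rotate by an immediate count other than 1 needs 80186+; add .186/.286 if the assembler rejects it)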
push ax
and al,0Fh ; keep only lowest nibble (4 bits) value (0-15)
add al,'0' ; convert it to ASCII: '0' to '9' and 6 following chars
cmp al,'9' ; if result is '0' to '9', just store it, otherwise fix
jbe AxTo04Hex_notLetter
add al,'A'-(10+'0') ; fix value 10+'0' into 10+'A'-10 (10-15 => 'A' to 'F')
AxTo04Hex_notLetter:
mov [di],al ; write ASCII hexa digit (0-F) to buffer
inc di
pop ax ; restore other bits of ax back for next loop
dec cx ; repeat for all four nibbles
jnz AxTo04Hex_singleDigitLoop
pop cx ; restore original cx value back
ret ; ax is actually back to its input value here :)
end start

I tried to comment the code extensively and to use a more straightforward implementation, avoiding less common instructions and keeping the logic simple, so you should be able to fully comprehend how it works.
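For comparison, the same open, seek-to-end, close sequence can be sketched in C. This is only an illustration of the idea, not part of the original answer; the function name file_size is made up here, and it assumes a platform where seeking to the end of a binary stream is supported (true on DOS, Windows and POSIX systems).

```c
#include <assert.h>
#include <stdio.h>

/* Returns the file length in bytes, or -1 on any error.
   Mirrors the assembly above: open, seek to end with offset 0, close. */
long file_size(const char *path)
{
    assert(path != NULL);
    FILE *f = fopen(path, "rb");        /* like int 21h, ah = 3Dh, read only */
    if (f == NULL)
        return -1;
    long size = -1;
    if (fseek(f, 0L, SEEK_END) == 0)    /* like ah = 42h, al = 2, cx:dx = 0 */
        size = ftell(f);                /* position at end == file length */
    fclose(f);                          /* like ah = 3Eh */
    return size;
}
```

As in the assembly, the length is simply the file position reported after seeking zero bytes from the end.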

Again, I strongly advise you to use a debugger and step through the code slowly, instruction by instruction, watching how the CPU state changes and how that correlates with my comments. Note that I try not to comment what each instruction does, as that can be found in an instruction reference guide; I comment my human intention, why I wrote it there. In case of a mistake, this tells you what the correct output of the wrong code should have been, and how to fix it. If comments only say what the instruction does, you can't tell how it should be fixed.

Now, if you implement a 32b_number_to_decimal_ascii formatting function, you can replace the last part of this example to get the length in decimal, but that's too tricky for me to write off the top of my head, without proper debugging and testing.

Probably the simplest way that is reasonable to implement for somebody new to asm is to have a table of 32b divisors, one for each decimal digit of a 32b number, and then run a nested loop for each digit (probably skipping storage of leading zeroes, or just advancing the pointer before printing to skip over them, which is even simpler code).

Something like (pseudo code similar to C, hopefully showing the idea):

divisors  dd 1000000000, 100000000, 10000000, ... 10, 1

for (i = 0; i < divisors.length; ++i) {
    buffer[i] = '0';
    while (divisors[i] <= number) {
        number -= divisors[i];
        ++buffer[i];
    }
}
buffer[i] = '$';
// then printing as
ptr_to_print = buffer;
// eat leading zeroes
while ( '0' == ptr_to_print[0] ) ++ptr_to_print;
// but keep at least one zero, if the number itself was zero
if ('$' == ptr_to_print[0] ) --ptr_to_print;
print_it // dx = ptr_to_print, ah = 9, int 21h

And if you wonder how to subtract 32-bit numbers in 16-bit assembly, that's actually not that difficult (unlike 32b division):

; dx:ax = 32b number
; ds:si = pointer to memory to other 32b number (mov si,offset divisors)
sub ax,[si] ; subtract lower word, CF works as "borrow" flag
sbb dx,[si+2] ; subtract high word, using the "borrow" of SUB
; optionally: jc overflow
; you can implement the "while (divisors[i] <= number)" loop above
; by subtracting first and, on overflow (CF set), exiting the loop
; and adding the divisor back (add + adc) to restore "number"

Notes on the question update:

You don't convert hex to decimal (the hex string is stored in buffer, and you don't load anything from there). You convert the value in ax to decimal. ax contains the low word of the file length from the previous hex conversion call. So for files up to 65535 bytes long (0xFFFF = maximum 16b unsigned integer) it may work. For longer files it will not, as the upper word is in dx, which you destroy with mov dx,0.

If you actually kept dx as is, you would divide the file length by 10, but for a file of length 655360 or more it would crash with a divide error (overflow of the quotient). As I wrote in my answer above, doing 32b / 16b division on the 8086 is not trivial, and I'm not even sure what the efficient way is. I gave you a hint about using a table of 32b divisors and doing the division by subtraction, but you went for DIV instead. That would need some sophisticated split of the original 32b value into smaller parts, up to the point where you can use div with bx=10 to extract particular digits. For example, doing filelength/1e5 first, then calculating the 32b remainder (0..99999), which can actually be divided by 10 even in 16b (99999/10 = 9999 (fits 16b), remainder 9).

It looks like you didn't understand why a 128k file length needs 32 bits of storage, or what the effective ranges of the various variable types are. 2^16 = 65536 (= 64Ki) ... that's how big your integers can get before you run into problems. 128Ki is twice that, so 16 bits are not enough.

Funny thing... as you wrote "converting from hex to decimal", at first I thought: what, you convert that hexa string into a decimal string??? But actually that sounds doable with 16b math: go through the whole hexa number first picking up only the 10^0 values (extracted from each particular k*16^n value), then in the next iteration count the 10^1 values, etc...

But that division by subtracting 32bit numbers from my previous answer should be much easier to do, and especially to comprehend, how it works.

You write the decimal string at address si, but I don't see where you set si, so it's probably pointing into your MENU string by accident, and you overwrite that memory (again, using a debugger to check the ds:si values and see which address is used, and using the memory view to watch the content being written, would give you a hint about what the problem is).

Basically you wasted many hours by not following my advice (learning debugging, and understanding what I meant by the 32b - 32b loop doing division) and trying to copy some finished code from the Internet instead. At least it looks like you can now connect it to your own code somewhat better, but you are still missing obvious problems, like not setting si to point to the destination for the decimal string.

Maybe try first to print all the numbers from the file, and keep the size in hexa (at least try to figure out why conversion to hexa is easy and conversion to decimal is not). That way you will have most of the task done, and then you can play with the hardest part (32b to decimal in 16b asm).

BTW, just a day or so ago somebody had a problem with addition/subtraction of 64b numbers in 16b assembly, so this answer may give you further hints about why doing those conversions with sub/add loops is not such a bad idea; it's quite "simple" code once you get the idea of how it works: https://stackoverflow.com/a/42645266/4271923

Assembly and c-language - a comparison of filesizes

This overhead is likely caused by the compiler's default runtime-library being included, since you're using it to call printf(). Note that printf() is way more capable a function than the DOS interrupt you're calling. All that capability of course means it consists of way more code. You could try switching printf() to puts().

I'm not saying that printf() alone is 102 KB, it is probably far from that, but you also get the entire library and its support code (init/de-init, exit handlers, and so on), not just that one function.
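A minimal sketch of the puts() variant, assuming the program only prints a fixed string (the message and the function name print_greeting are placeholders, since the original program is not shown here). How much smaller the executable gets depends on the compiler and its runtime library, so measure rather than assume.

```c
#include <stdio.h>

/* Prints a fixed message without any format-string processing.
   Returns a non-negative value on success, EOF on failure. */
int print_greeting(void)
{
    return puts("Hello, world!");   /* puts appends the newline itself */
}
```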

Assembly x86 read from any file and escape all special characters and get size of file in Bytes

I found these problems in the code you presented:

mov buffer[bx],'$'          ;after last read byte put '$' into buffer

You should enlarge the buffer by 1 byte. As it stands, you write this $ past the end of the buffer when 32768 bytes were read!

add word ptr[filesiz],ax    ;add number of bytes read into filesiz variable

The previous line alone will not correctly update the dword variable filesiz, because the carry out of the low word is lost! Use the following:

add word ptr[filesiz],ax
adc word ptr[filesiz]+2,0

P.S. You never check whether DOS reports an error. You should not neglect this when accessing files!

read file in assembly x86 ia-32

For your first question, I can see a couple of fairly obvious possibilities, but the question doesn't contain enough information to be sure which is likely to be accurate.

The first possibility would be that the size in edi gets used for some other purpose later in the code, so the move to edi has accomplished something useful, but we can't see exactly what here, because we can't see that other code that uses it.

The other obvious possibility would be that it's simply a mistake.

There are a few less obvious possibilities, such as the mov ecx, edi being used as an entry point from some other code, so if you start from the beginning of this code, it uses the value from esp, but there's other code that loads some other value into edi then jumps to the mov ecx, edi, thus using a different value instead of what's in esp.

There are some other possibilities as well, such as somebody basically inserting the equivalent of some NOPs to (for example) get some part of the code aligned to some boundary, but without as many lines of distraction as if they'd written NOP (say) 5 times.

For your second question, 0xffff in edx basically means it'll read up to 65535 bytes from the file. Most likely they allocated a 65535-byte buffer, so they don't want to read any more than that in a single call.

Edit (after complete code was added to question). Okay, now that we can see the complete code, we can start with the fact that the code is (to be as nice about it as possible) quite unconventionally written [1].

He starts by jumping to call_rw, then (obviously enough) calling rw from there. This pushes the address immediately after call_rw onto the stack. Then at rw, he pops that return value off the stack into ebx. This loads the address of message into ebx, then uses it as a parameter in the next system call (function 5, which opens a file, expecting ebx to contain a pointer to the name of the file).

Offhand, I'd just about have to guess that the code is either a deliberate (but fairly ineffective) attempt at obfuscation, or else the result of a compiler that internally produces some sort of stack-oriented internal code, then does a really lousy job of translating that to register-oriented object code. Or perhaps my first impression (see footnote below) was correct.

After removing the cruft, the first couple of sys-calls work out to something on this general order:

; open the file
mov eax, 5
mov ebx, offset filename
xor ecx, ecx
int 0x80

; read the file
mov ebx, eax
mov eax, 3
mov ecx, esp
mov edx, 0xffff
int 0x80

Sorry, but I'm too lazy to sort out all the rest. At first glance, it looks like it goes into an infinite loop (the code before call_rw flows into call_rw, calling rw again). Some of its gymnastics may prevent that from actually happening, but without a convincing argument for the need to do so, I lack the motivation to sort out more of this particular mess.
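For readers more at home in C, the two system calls above correspond roughly to the POSIX calls below. This is a sketch, not the original author's code: the name read_first_chunk is hypothetical, and it reads into a caller-supplied buffer instead of the stack area that the assembly passes in ecx.

```c
#include <fcntl.h>
#include <unistd.h>

/* Opens `filename` read-only and reads up to 0xffff bytes into `buf`
   (which must hold at least 0xffff bytes). Returns the byte count
   reported by read(), or -1 if the file cannot be opened or read. */
long read_first_chunk(const char *filename, char *buf)
{
    int fd = open(filename, O_RDONLY);   /* eax = 5, ecx = 0 */
    if (fd < 0)
        return -1;
    ssize_t n = read(fd, buf, 0xffff);   /* eax = 3, edx = 0xffff */
    close(fd);
    return (long)n;
}
```

The value returned by read() is the number of bytes actually read, which is smaller than 0xffff for any file shorter than that.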


  1. I'm working really hard at being diplomatic here. Before revision, this referred to the author as "a certifiable psychotic."

