Reading from a File in Assembly


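You must declare your buffer in the .bss section and the buffer size in .data:

section .data
bufsize dd 1024 ; dword, so it can be loaded straight into a 32-bit register

section .bss
buf resb 1024 ; reserve 1024 uninitialized bytes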

read and write to file assembly

So, how I would do the code...
Thinking about it, I'm so used to Intel syntax that I'm unable to write AT&T source from memory on the web without bugs (and I'm too lazy to actually do the real thing and debug it), so I will try to avoid writing instructions completely and just describe the process, leaving you to fill in the instructions.

So let's decide you want to do it char by char, version 1 of my source:

start:
; verify the command line has enough parameters, if not jump to exitToOs
; open both input and output files at the start of the code
processingLoop:
; read single char
; if no char was read (EOF?), jmp finishProcessing
; process it
; write it
jmp processingLoop
finishProcessing:
; close both input and output files
exitToOs:
; exit back to OS
  • now "run" it in your mind and verify that all the major branch points make sense and handle all the major corner cases correctly.
  • make sure you understand how the code will work, where it will loop, and where and why it will break out of the loop.
  • make sure there's no infinite loop and no leaking of resources.

After going through my checklist, there's one subtle problem with this design: it's not rigorously checking file system errors, like failing to open either of the files or failing to write the character (but your source doesn't care either). Otherwise I think it should work well.
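For comparison, here is the version 1 flow as a minimal C sketch with the missing error checks added. The function name copy_incremented, the "+1" processing step and the O_TRUNC flag are my assumptions for illustration, not part of the original design.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: copies in_path to out_path one byte at a time,
 * adding 1 to each byte as the "process it" step.
 * Returns 0 on success, 1 on any error (same default as the exit code idea). */
int copy_incremented(const char *in_path, const char *out_path)
{
    int status = 1;                 /* assume failure until EOF is reached */
    int in_fd = -1, out_fd = -1;    /* -1 = not opened */
    unsigned char c;
    ssize_t n;

    in_fd = open(in_path, O_RDONLY);
    if (in_fd < 0)
        goto cleanup;
    /* O_TRUNC is an addition here: it empties an already existing output file */
    out_fd = open(out_path, O_WRONLY | O_CREAT | O_TRUNC, 0666);
    if (out_fd < 0)
        goto cleanup;

    for (;;) {
        n = read(in_fd, &c, 1);         /* read single char */
        if (n == 0) {                   /* EOF: finish cleanly */
            status = 0;
            break;
        }
        if (n < 0)                      /* read error */
            break;
        c++;                            /* process it */
        if (write(out_fd, &c, 1) != 1)  /* write it, stop on failure */
            break;
    }

cleanup:
    /* close only the files that were actually opened */
    if (out_fd >= 0)
        close(out_fd);
    if (in_fd >= 0)
        close(in_fd);
    return status;
}
```

Calling copy_incremented("in.txt", "out.txt") from main and returning its result as the exit code gives the same start / processingLoop / finishProcessing / exitToOs shape as the outline.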

So let's extend it in version 2 to be closer to real ASM instructions (32-bit Linux, AT&T syntax; I have not assembled or debugged it, so making the final version is up to you):

.text
.globl _start
_start:
# verify the command line has enough parameters, if not jump to exitToOs
popl %eax # Get the number of arguments
cmpl $3, %eax # "./binary fileinput fileoutput" should give 3 here. Debug!
jne exitToOs

# open both input and output files at the start of the code
movl $5, %eax # sys_open
popl %ebx # Get the program name (and ignore it)

# open input file first
popl %ebx # Get the first actual argument - file to read
movl $0, %ecx # read-only mode
int $0x80
testl %eax, %eax # valid file handle? (errors come back as negative -errno)
js exitToOs
movl %eax, varInputHandle # store input file handle to memory

# open output file, make it writable, create if not exists
movl $5, %eax # sys_open
popl %ebx # Get the second actual argument - file to write
# next two lines use octal numbers (leading 0)
movl $0101, %ecx # O_CREAT (0100) + O_WRONLY (01); add O_TRUNC (01000) to empty an existing file
movl $0666, %edx # permissions for out file as rw-rw-rw-
int $0x80
testl %eax, %eax # valid file handle?
js exitToOs
movl %eax, varOutputHandle # store output file handle to memory

processingLoop:

# read single char to varBuffer
movl $3, %eax # sys_read
movl varInputHandle, %ebx
movl $varBuffer, %ecx
movl $1, %edx
int $0x80

# if no char was read (EOF), jmp finishProcessing
testl %eax, %eax
jz finishProcessing # looks like total success, finish cleanly
js exitToOs # read error, exit code stays 1

# TODO process it
incb varBuffer # you wanted this IIRC?

# write it
movl $4, %eax # sys_write
movl varOutputHandle, %ebx # file_descriptor
movl $varBuffer, %ecx # BTW, still set from char read, so just for readability
movl $1, %edx # this one is still set from char read too
int $0x80

# done, go for the next char
jmp processingLoop

finishProcessing:
movl $0, varExitCode # everything went OK, set exit code to 0

exitToOs:
# close both input and output files, if any of them is opened
movl varOutputHandle, %ebx # file_descriptor
call closeFile
movl varInputHandle, %ebx
call closeFile

# exit back to OS
movl $1, %eax # sys_exit
movl varExitCode, %ebx
int $0x80

closeFile:
cmpl $-1, %ebx
je closeFileDone # file not opened, just ret
movl $6, %eax # sys_close
int $0x80
# returns 0 when OK, or a negative error code, but no handling here
closeFileDone:
ret

.data
varExitCode: .long 1 # default value for exit code is "1" (some error)
varInputHandle: .long -1 # default = invalid handle
varOutputHandle: .long -1 # default = invalid handle
varBuffer: .byte 0 # (single byte buffer)

Whoa, I actually wrote it fully? (Of course it still needs a syntax check by the real assembler, a debugging session, etc...)
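If you don't trust the octal immediates for the output file (0101 for the flags, 0666 for the permissions), you can let the C headers confirm them. This is a small sketch, assuming x86 Linux where O_WRONLY is 01 and O_CREAT is 0100; other platforms use different values. The helper name format_open_constants is made up for the example.

```c
#include <fcntl.h>
#include <stdio.h>

/* Hypothetical helper: formats the open() flags and mode as octal text,
 * so they can be compared with the immediates used in the assembly. */
int format_open_constants(char *out, size_t size)
{
    return snprintf(out, size, "flags=0%o mode=0%o",
                    (unsigned)(O_WRONLY | O_CREAT), 0666u);
}
```

Printing the resulting string on your own machine is a cheap way to double-check a constant before hard-coding it into a mov.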

But I mean, the comments from version 1 were already so detailed, that each required only handful of ASM instructions, so it was not that difficult (although now I see I did submit the first answer 53min ago, so this was about ~1h of work for me (including googling and a bit of other errands elsewhere)).

And I absolutely don't get why any human would want to use AT&T syntax, which is so ridiculously verbose. I can easily understand why GCC is using it; for compilers this is perfectly fine.

But maybe you should check out NASM, which is "human" oriented (it keeps syntactic sugar to a minimum and lets you focus on the instructions). The major problem (or advantage, in my opinion) with NASM is Intel syntax, e.g. MOV eax,ebx copies the value of ebx into eax, which is Intel's fault: they took the LD syntax from other microprocessor manufacturers, ignored the LD = load meaning, and changed it to MOV = move so as not to blatantly copy the instruction set.

Then again, I have absolutely no idea why ADD $1,%eax is the correct way in AT&T (instead of eax,1 order), and I don't even want to know, but it doesn't make any sense to me (the reversed MOV makes at least some sense due to LD origins of Intel's MOV syntax).

OTOH I can relate to cmp $number,%reg, since I started to use "yoda" formatting in C++ to avoid changing a variable's value by accident inside an if (compare if (0 = variable) vs if (variable = 0), both with the typo = instead of the intended ==: the "yoda" one will not compile even with warnings OFF).

But ... oh.. this is my last AT&T ASM answer for this week, it annoys the hell out of me. (I know this is personal preference, but all those additional $ and % annoy me just as much as the reversed order.)


Please, I spent a serious amount of time writing this. Try to spend serious time studying it and trying to understand it. If confused, ask in the comments, but it would be a pitiful waste of our time if you completely missed the point and didn't learn anything useful from this. :) So keep on.


Final note: search hard for a debugger and find one that suits you well (probably a visual one, like the old "TD" from Borland in the DOS days, would be super nice for a newcomer). To improve quickly, it's absolutely essential to be able to step over the code instruction by instruction and watch how the registers and memory contents change. Really, if you were able to debug your own code, you would soon realize you are reading the second character from the wrong file handle in %ebx... (at least I hope so).

Read file from a specific position in x86

Before performing the read, you should perform an lseek so that the file position is updated.

So something along these lines:

mov rdi, rax ; fd (assumed to still be in rax, e.g. just returned by open)
mov rax, SYS_LSEEK ; 8 on x86-64 Linux
mov rsi, <whatever offset you want>
mov rdx, 0 ; whence = SEEK_SET (0): offset counts from the beginning of the file
syscall

Note: RDI will still hold the same fd value after a syscall, so you don't need an extra save/restore for the fd across lseek / read / close.
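The same seek-then-read sequence written against the libc wrappers, as a minimal sketch. The helper name read_at is hypothetical; lseek, read and SEEK_SET are the standard POSIX ones.

```c
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read up to count bytes starting at byte `offset`
 * from the beginning of the file. Returns the number of bytes read
 * (0 at end of file), or -1 if the seek or the read failed. */
ssize_t read_at(int fd, off_t offset, void *buf, size_t count)
{
    /* SEEK_SET is the "0" passed in rdx in the assembly version */
    if (lseek(fd, offset, SEEK_SET) == (off_t)-1)
        return -1;
    return read(fd, buf, count);
}
```

For example, read_at(fd, 4, buf, 3) on a file containing 0123456789 should fill buf with the three bytes 456.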

Tip:
It might be easier to write the code in C, compile it with gcc -g -S -fverbose-asm -Og -c main.c and then look at main.s (see "How to remove "noise" from GCC/clang assembly output?"). But that will only show the compiler making calls to libc wrapper functions, unless you use inline system call macros like the ones MUSL libc provides.

AT&T - reading file

There are three errors:

First, you store eax to fd as if fd were 4 bytes long, but it's actually only 1 byte long. Then you put the address of fd into ebx, not the value loaded from fd. This passes some random pointer to the system call, which is of course an invalid file descriptor.

Note that adding proper error handling would have avoided at least the second problem.


In addition, as Wumpus Q. Wumbley said, you should not store a file descriptor in a single byte, as its value may exceed 255. To fix both issues, first make fd a 4-byte quantity by allocating its space with .int:

fd: .int 0

This immediately fixes the first issue. To fix the second issue, load ebx from fd, not $fd:

mov fd, %ebx

The final errors are in the "print" block, where you do movl $fd, %ecx and movl $len, %edx. This is equivalent to the C call write(1, &fd, &len) (again using the address of len as the length, instead of loading the value from memory). And you probably want to pass the address of buf (where you read the file data), not fd, where you stored an integer (not ASCII). So you want write(1, buf, len), not write(1, &fd, len).

Or better, you should use the return value of read as the length for write, if it's not an error code. You don't want to write() the filler bytes. But if you do, you could make len an assemble-time constant instead of storing it in memory. Use .equ to have the assembler calculate the buffer length at assemble time so you can use mov $len, %edx
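To illustrate using the return value of read as the length for write, here is a minimal C sketch. The helper name copy_chunk and the 1024-byte buffer size are assumptions for the example.

```c
#include <unistd.h>

/* Hypothetical helper: performs one read() from in_fd and writes exactly
 * the bytes that were read to out_fd, never the unused filler in buf.
 * Returns the number of bytes copied, 0 at EOF, or -1 on error. */
ssize_t copy_chunk(int in_fd, int out_fd)
{
    char buf[1024];
    ssize_t n = read(in_fd, buf, sizeof buf);
    if (n <= 0)
        return n;                   /* 0 = EOF, -1 = read error */

    ssize_t done = 0;
    while (done < n) {              /* write() may accept fewer bytes than asked */
        ssize_t w = write(out_fd, buf + done, (size_t)(n - done));
        if (w < 0)
            return -1;
        done += w;
    }
    return n;
}
```

Calling copy_chunk(fd, 1) in a loop until it returns 0 prints the whole file to stdout, whatever its size relative to the buffer.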


Note that after each system call, you should check whether it was successful. You can do so by checking if the result is negative (strictly speaking, negative with an absolute value smaller than 4096). If it is, an error occurred and the error code is the negated result. If you handle all your errors, it's very easy to see where and why your program failed.
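The "negative and smaller than 4096 in absolute value" rule can be written down as a small C sketch. It applies to the raw value left in eax/rax by the kernel; the libc wrappers instead return -1 and set errno. Both helper names are hypothetical.

```c
#include <errno.h>

/* Hypothetical helper: true if a raw Linux system call result is an error,
 * i.e. lies in the range -4095..-1. */
int is_raw_syscall_error(long ret)
{
    /* as unsigned values, -4095..-1 are the 4095 largest numbers */
    return (unsigned long)ret >= (unsigned long)-4095L;
}

/* Hypothetical helper: the errno-style code (comparable with the constants
 * from <errno.h>, e.g. ENOENT) for an error result, or 0 if it is not one. */
int raw_syscall_errno(long ret)
{
    return is_raw_syscall_error(ret) ? (int)-ret : 0;
}
```

In assembly the simple form of this check is a test of the sign bit right after the system call, followed by a conditional jump to your error handler.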

You can also use the strace utility to trace the system calls your program performs, decoding args.


