How to Include Data Object Files (Images, etc.) in Program and Access the Symbols

How to include data object files (images, etc.) in program and access the symbols?

After working around and testing different things, I came back to my original approach (linking) and it worked like magic. Here are the details:

In order to include data in the final executable's .data section, you first need to turn the data file (which can be an arbitrary binary file, anything!) into a linkable file format, also known as an object file.

The tool objcopy, which is included in GNU Binutils and is available on Windows through Cygwin or MinGW, takes a file and produces an object file. objcopy needs to know two things before generating the object file: the output file format and the output architecture.
To determine these two things, I inspect a valid linkable object file with objdump:

objdump -f main.o

This gives me the following information:

main.o:     file format pe-x86-64
architecture: i386:x86-64, flags 0x00000039:
HAS_RELOC, HAS_DEBUG, HAS_SYMS, HAS_LOCALS
start address 0x0000000000000000

With this information I can now create the object file:

objcopy -I binary -O pe-x86-64 -B i386 data_file.data data_file_data.o

To handle a large number of files, batch files could come in handy.

I then simply link the produced object file(s) together with my program's source and access the data through the symbols that objcopy generated, whose names can easily be queried with:

objdump -t data_file_data.o

Which results in:

data_file_data.o:     file format pe-x86-64

SYMBOL TABLE:
[ 0](sec 1)(fl 0x00)(ty 0)(scl 2) (nx 0) 0x0000000000000000 _binary_data_file_data_start
[ 1](sec 1)(fl 0x00)(ty 0)(scl 2) (nx 0) 0x0000000000000006 _binary_data_file_data_end
[ 2](sec -1)(fl 0x00)(ty 0)(scl 2) (nx 0) 0x0000000000000006 _binary_data_file_data_size

Practically speaking, the following code works with GCC/G++:

extern uint8_t data[]   asm("_binary_data_file_data_start");
extern uint8_t end[] asm("_binary_data_file_data_end");

And the following with MSVC++:

extern "C" uint8_t _binary_data_file_data_start[]; // Same name as symbol
extern "C" uint8_t _binary_data_file_data_end[]; // Same name as symbol

The size of each file is calculated with:

_binary_data_file_data_end - _binary_data_file_data_start

You could for example write the data back into a file:

FILE* file = fopen("data_file_reproduced.data", "wb");

if (file != NULL) {
    fwrite(_binary_data_file_data_start, //Pointer to data
           1, //Write block size
           _binary_data_file_data_end - _binary_data_file_data_start, //Data size
           file);
    fclose(file);
}
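
A reusable variant of the snippet above could look like the following sketch. The helper name write_blob is hypothetical (it is not part of any library); it takes the start and end pointers, checks every I/O call, and reports failure through its return value.

```c
#include <stdint.h>
#include <stdio.h>

/* Write the bytes in [start, end) to `path`.
 * Returns 0 on success, -1 on any I/O error.
 * write_blob is a hypothetical helper used for illustration only. */
int write_blob(const char *path, const uint8_t *start, const uint8_t *end)
{
    size_t size = (size_t)(end - start);
    FILE *file = fopen(path, "wb");
    if (file == NULL)
        return -1;

    size_t written = fwrite(start, 1, size, file);

    /* fclose can also report a failed flush, so check it as well */
    if (fclose(file) != 0 || written != size)
        return -1;
    return 0;
}
```

With the symbols from this answer it would be called as write_blob("data_file_reproduced.data", _binary_data_file_data_start, _binary_data_file_data_end).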

How do I add contents of text file as a section in an ELF file?

This is possible and most easily done using OBJCOPY found in BINUTILS. You effectively take the data file as binary input and then output it to an object file format that can be linked to your program.

OBJCOPY will even produce a start and end symbol as well as the size of the data area so that you can reference them in your code. The basic idea is to tell it that your input file is binary (even if it is text), that you are targeting an x86-64 object file, and what the input and output file names are.

Assume we have an input file called myfile.txt with the contents:

the
quick
brown
fox
jumps
over
the
lazy
dog

Something like this would be a starting point:

objcopy --input binary \
--output elf64-x86-64 \
--binary-architecture i386:x86-64 \
myfile.txt myfile.o

If you wanted to generate 32-bit objects you could use:

objcopy --input binary \
--output elf32-i386 \
--binary-architecture i386 \
myfile.txt myfile.o

The output would be an object file called myfile.o. If we were to review the headers of the object file using OBJDUMP and a command like objdump -x myfile.o we would see something like this:

myfile.o:     file format elf64-x86-64
myfile.o
architecture: i386:x86-64, flags 0x00000010:
HAS_SYMS
start address 0x0000000000000000

Sections:
Idx Name          Size      VMA               LMA               File off  Algn
  0 .data         0000002c  0000000000000000  0000000000000000  00000040  2**0
                  CONTENTS, ALLOC, LOAD, DATA
SYMBOL TABLE:
0000000000000000 l    d  .data  0000000000000000 .data
0000000000000000 g       .data  0000000000000000 _binary_myfile_txt_start
000000000000002c g       .data  0000000000000000 _binary_myfile_txt_end
000000000000002c g       *ABS*  0000000000000000 _binary_myfile_txt_size

By default it creates a .data section with contents of the file and it creates a number of symbols that can be used to reference the data.

_binary_myfile_txt_start
_binary_myfile_txt_end
_binary_myfile_txt_size

These are effectively the address of the first byte, the address just past the last byte, and the size of the data that was placed into the object from the file myfile.txt. OBJCOPY bases the symbol names on the input file name: myfile.txt is mangled into myfile_txt and used to create the symbols.
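
The name mangling can be sketched in C. This is an illustration rather than OBJCOPY's actual code: it assumes that every character of the input name (as given on the command line) that is not a letter or digit becomes an underscore, which matches the symbols shown above. The helper name embedded_symbol_name is hypothetical.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Build the symbol name derived from an input file name:
 * "_binary_" + name with every non-alphanumeric character replaced
 * by '_' + "_" + suffix ("start", "end" or "size").
 * Returns 0 on success, -1 if `out` is too small.
 * embedded_symbol_name is a hypothetical helper for illustration. */
int embedded_symbol_name(const char *input_name, const char *suffix,
                         char *out, size_t out_len)
{
    int n = snprintf(out, out_len, "_binary_%s_%s", input_name, suffix);
    if (n < 0 || (size_t)n >= out_len)
        return -1;

    /* Mangle only the file-name part, which follows the "_binary_" prefix */
    char *name = out + strlen("_binary_");
    size_t name_len = strlen(input_name);
    for (size_t i = 0; i < name_len; i++) {
        if (!isalnum((unsigned char)name[i]))
            name[i] = '_';
    }
    return 0;
}
```

Under that assumption, passing "myfile.txt" and "start" yields _binary_myfile_txt_start, and a path such as dir/myfile.txt would contribute its directory part to the symbol name as well.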

One problem is that a .data section is created which is read/write/data as seen here:

Idx Name          Size      VMA               LMA               File off  Algn
0 .data 0000002c 0000000000000000 0000000000000000 00000040 2**0
CONTENTS, ALLOC, LOAD, DATA

You specifically are requesting a .rodata section that would also have the READONLY flag specified. You can use the --rename-section option to change .data to .rodata and specify the needed flags. You could add this to the command line:

--rename-section .data=.rodata,CONTENTS,ALLOC,LOAD,READONLY,DATA

Of course if you want to call the section something other than .rodata with the same flags as a read only section you can change .rodata in the line above to the name you want to use for the section.

The final version of the command that should generate the type of object you want is:

objcopy --input binary \
--output elf64-x86-64 \
--binary-architecture i386:x86-64 \
--rename-section .data=.rodata,CONTENTS,ALLOC,LOAD,READONLY,DATA \
myfile.txt myfile.o

Now that you have an object file, how can you use it in C code (as an example)? The symbols generated are a bit unusual and there is a reasonable explanation on the OS Dev Wiki:

A common problem is getting garbage data when trying to use a value defined in a linker script. This is usually because they're dereferencing the symbol. A symbol defined in a linker script (e.g. _ebss = .;) is only a symbol, not a variable. If you access the symbol using extern uint32_t _ebss; and then try to use _ebss the code will try to read a 32-bit integer from the address indicated by _ebss.

The solution to this is to take the address of _ebss either by using it as &_ebss or by defining it as an unsized array (extern char _ebss[];) and casting to an integer. (The array notation prevents accidental reads from _ebss as arrays must be explicitly dereferenced)

Keeping this in mind we could create this C file called main.c:

#include <stdint.h>
#include <stdlib.h>
#include <stdio.h>

/* These are external references to the symbols created by OBJCOPY */
extern char _binary_myfile_txt_start[];
extern char _binary_myfile_txt_end[];
extern char _binary_myfile_txt_size[];

int main(void)
{
    char *data_start = _binary_myfile_txt_start;
    char *data_end = _binary_myfile_txt_end;
    size_t data_size = (size_t)_binary_myfile_txt_size;

    /* Print out the pointers and size */
    printf("data_start %p\n", (void *)data_start);
    printf("data_end %p\n", (void *)data_end);
    printf("data_size %zu\n", data_size);

    /* Print out each byte until we reach the end */
    while (data_start < data_end)
        printf("%c", *data_start++);

    return 0;
}

You can compile and link with:

gcc -O3 main.c myfile.o

The output should look something like:

data_start 0x4006a2
data_end 0x4006ce
data_size 44
the
quick
brown
fox
jumps
over
the
lazy
dog
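
The C program above reads the length from the absolute _binary_myfile_txt_size symbol. As an alternative sketch, the length can be derived from the start and end symbols instead. This may be more robust on toolchains that build position-independent executables by default, where the address of an absolute symbol can end up shifted by the load address; treat that as an assumption to verify on your own toolchain. The names struct blob and blob_from_range are hypothetical.

```c
#include <stddef.h>

/* A view of embedded data: pointer to the first byte plus its length. */
struct blob {
    const char *data;
    size_t size;
};

/* Build a view from the start/end symbols. The end symbol marks the
 * address just past the last byte, so the difference is the length.
 * blob_from_range is a hypothetical helper for illustration. */
struct blob blob_from_range(const char *start, const char *end)
{
    struct blob b = { start, (size_t)(end - start) };
    return b;
}
```

In main.c this would be used as struct blob b = blob_from_range(_binary_myfile_txt_start, _binary_myfile_txt_end); followed by fwrite(b.data, 1, b.size, stdout);.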

A NASM example of usage is similar in nature to the C code. The following assembly program called nmain.asm writes the same string to standard output using Linux x86-64 System Calls:

bits 64
global _start

extern _binary_myfile_txt_start
extern _binary_myfile_txt_end
extern _binary_myfile_txt_size

section .text

_start:
mov eax, 1 ; SYS_Write system call
mov edi, eax ; Standard output FD = 1
mov rsi, _binary_myfile_txt_start ; Address to start of string
mov rdx, _binary_myfile_txt_size ; Length of string
syscall

xor edi, edi ; Return value = 0
mov eax, 60 ; SYS_Exit system call
syscall

This can be assembled and linked with:

nasm -f elf64 -o nmain.o nmain.asm
gcc -m64 -nostdlib nmain.o myfile.o

The output should appear as:

the
quick
brown
fox
jumps
over
the
lazy
dog

Referencing symbols of code/data loaded separately to another part of memory

You have a number of possibilities. This answer focuses on a hybrid of 1 and 2. Although you can create a table of function pointers, we can use direct calls to the routines in a common library by symbol name without copying the common library routines into each program. The method I use is to utilize the power of LD and linker scripts to create a shared library that has a static location in memory and is accessed via FAR CALLs (segment and offset form the function address) from independent program(s) loaded elsewhere in RAM.

Most people when they start out create a linker script that produces a copy of all the input sections in the output. It is possible to create output sections that never appear (not LOADed) in the output file but the linker can still use the symbols of those nonloaded sections to resolve symbol addresses.

I've created a simple common library with print_banner and print_string functions that use BIOS functions to print to the console. Both are assumed to be called via FAR CALLs from other segments. You may have your common library loaded at 0x0100:0x0000 (physical address 0x01000) but called from code in other segments like 0x2000:0x0000 (physical address 0x20000). A sample commlib.asm file could look like:

bits 16

extern __COMMONSEG
global print_string
global print_banner
global _startcomm

section .text

; Function: print_string
; Display a string to the console on specified display page
; Type: FAR
;
; Inputs: ES:SI = Address of string to print
; BH = Display page (INT 0x10/AH=0x0E takes the page number in BH)
; Clobbers: AX, SI
; Return: Nothing

print_string: ; Routine: output string in SI to screen
mov ah, 0x0e ; BIOS tty Print
jmp .getch
.repeat:
int 0x10 ; print character
.getch:
mov al, [es:si] ; Get character from string
inc si ; Advance pointer to next character
test al,al ; Have we reached end of string?
jnz .repeat ; if not process next character
.end:
retf ; Important: Far return

; Function: print_banner
; Display a banner to the console to specified display page
; Type: FAR
; Inputs: BH = Display page
; Clobbers: AX, SI
; Return: Nothing

print_banner:
push es ; Save ES
push cs
pop es ; ES = CS
mov si, bannermsg ; SI = String to print
; Far call to print_string
call __COMMONSEG:print_string
pop es ; Restore ES
retf ; Important: Far return

_startcomm: ; Keep linker quiet by defining this

section .data
bannermsg: db "Welcome to this Library!", 13, 10, 0

We need a linker script that allows us to create a file that we can eventually load into memory. This code assumes the segment the library will be loaded at is 0x0100 and offset 0x0000 (physical address 0x01000):

commlib.ld:

OUTPUT_FORMAT("elf32-i386");
ENTRY(_startcomm);

/* Common Library at 0x0100:0x0000 = physical address 0x1000 */
__COMMONSEG = 0x0100;
__COMMONOFFSET = 0x0000;

SECTIONS
{
. = __COMMONOFFSET;

/* Code and data for common library at VMA = __COMMONOFFSET */
.commlib : SUBALIGN(4) {
*(.text)
*(.rodata*)
*(.data)
*(.bss)
}

/* Remove unnecessary sections */
/DISCARD/ : {
*(.eh_frame);
*(.comment);
}
}

It is pretty simple: it effectively links a file commlib.o so that it can eventually be loaded at 0x0100:0x0000. A sample program that uses this library could look like:

prog.asm:

extern __COMMONSEG
extern print_banner
extern print_string
global _start

bits 16

section .text
_start:
mov ax, cs ; DS=ES=CS
mov ds, ax
mov es, ax
mov ss, ax ; SS:SP=CS:0x0000
xor sp, sp

xor bx, bx ; BH = page 0 to display on (BL is zeroed as well)
call __COMMONSEG:print_banner; FAR Call
mov si, mymsg ; String to display ES:SI
call __COMMONSEG:print_string; FAR Call

cli
.endloop:
hlt
jmp .endloop

section .data
mymsg: db "Printing my own text!", 13, 10, 0

The trick now is to make a linker script that can take a program like this and reference the symbols in our common library without actually adding the common library code again. This can be achieved by using the NOLOAD type on an output section in a linker script.

prog.ld:

OUTPUT_FORMAT("elf32-i386");
ENTRY(_start);

__PROGOFFSET = 0x0000;

/* Load the commlib.elf file to access all its symbols */
INPUT(commlib.elf)

SECTIONS
{
/* NOLOAD type prevents the actual code from being loaded into memory
which means if you create a BINARY file from this, this section will
not appear */
. = __COMMONOFFSET;
.commlib (NOLOAD) : {
commlib.elf(.commlib);
}

/* Code and data for program at VMA = __PROGOFFSET */
. = __PROGOFFSET;
.prog : SUBALIGN(4) {
*(.text)
*(.rodata*)
*(.data)
*(.bss)
}

/* Remove unnecessary sections */
/DISCARD/ : {
*(.eh_frame);
*(.comment);
}
}

The common library's ELF file is loaded by the linker and the .commlib section is marked with a (NOLOAD) type. This will prevent a final program from including the common library functions and data, but allows us to still reference the symbol addresses.

A simple test harness can be created as a bootloader. The bootloader will load the common library to 0x0100:0x0000 (physical address 0x01000), and the program that uses it is loaded to 0x2000:0x0000 (physical address 0x20000). The program address is arbitrary; I just picked it because it is in free memory below 1MB.
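
The physical addresses quoted here follow from the real-mode rule that the segment is multiplied by 16 and the offset is added. A minimal sketch of that arithmetic, with real_mode_phys as a hypothetical helper name:

```c
#include <stdint.h>

/* Physical address of a real-mode segment:offset pair.
 * The segment is shifted left by 4 bits (multiplied by 16)
 * and the offset is added.
 * real_mode_phys is a hypothetical helper for illustration. */
uint32_t real_mode_phys(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 4) + offset;
}
```

For example, real_mode_phys(0x0100, 0x0000) gives 0x01000 and real_mode_phys(0x2000, 0x0000) gives 0x20000, matching the load addresses used by the bootloader.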

boot.asm:

org 0x7c00
bits 16

start:
; DL = boot drive number from BIOS

; Set up stack and segment registers
xor ax, ax ; DS = 0x0000
mov ds, ax
mov ss, ax ; SS:SP=0x0000:0x7c00 below bootloader
mov sp, 0x7c00
cld ; Clear direction flag so string instructions move forward

; Reset drive
xor ax, ax
int 0x13

; Read 2nd sector (commlib.bin) to 0x0100:0x0000 = phys addr 0x01000
mov ah, 0x02 ; Drive READ subfunction
mov al, 0x01 ; Read one sector
mov bx, 0x0100
mov es, bx ; ES=0x0100
xor bx, bx ; ES:BX = 0x0100:0x0000 = phys address 0x01000
mov cx, 0x0002 ; CH = Cylinder = 0, CL = Sector # = 2
xor dh, dh ; DH = Head = 0
int 0x13

; Read 3rd sector (prog.bin) to 0x2000:0x0000 = phys addr 0x20000
mov ah, 0x02 ; Drive READ subfunction
mov al, 0x01 ; Read one sector
mov bx, 0x2000
mov es, bx ; ES=0x2000
xor bx, bx ; ES:BX = 0x2000:0x0000 = phys address 0x20000
mov cx, 0x0003 ; CH = Cylinder = 0, CL = Sector # = 3
xor dh, dh ; DH = Head = 0
int 0x13

; Jump to the entry point of our program
jmp 0x2000:0x0000

times 510-($-$$) db 0
dw 0xaa55

After the bootloader loads the common library (the 2nd sector on the disk) and the program (the 3rd sector) into memory, it jumps to the entry point of the program at 0x2000:0x0000.


Putting it All Together

We can create the file commlib.bin with:

nasm -f elf32 commlib.asm -o commlib.o
ld -melf_i386 -nostdlib -nostartfiles -T commlib.ld -o commlib.elf commlib.o
objcopy -O binary commlib.elf commlib.bin

commlib.elf is also created as an intermediate file. You can create prog.bin with:

nasm -f elf32 prog.asm -o prog.o
ld -melf_i386 -nostdlib -nostartfiles -T prog.ld -o prog.elf prog.o
objcopy -O binary prog.elf prog.bin

Create the bootloader (boot.bin) with:

nasm -f bin boot.asm -o boot.bin

We can build a disk image (disk.img) that looks like a 1.44MB floppy with:

dd if=/dev/zero of=disk.img bs=1024 count=1440
dd if=boot.bin of=disk.img bs=512 seek=0 conv=notrunc
dd if=commlib.bin of=disk.img bs=512 seek=1 conv=notrunc
dd if=prog.bin of=disk.img bs=512 seek=2 conv=notrunc

This simple example can fit the common library and program in single sectors. I have also hard coded their locations on the disk. This is just a proof of concept, and not meant to represent your final code.
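
The sector numbers in boot.asm and the seek= values passed to dd describe the same disk locations in two numbering schemes: BIOS CHS sector numbers start at 1, while dd counts 512-byte blocks from 0. The following sketch shows the conversion, assuming the standard 1.44MB floppy geometry of 2 heads and 18 sectors per track; floppy_chs_to_lba is a hypothetical helper name.

```c
#include <stdint.h>

/* Assumed geometry of a 1.44MB floppy: 2 heads, 18 sectors per track */
enum { FLOPPY_HEADS = 2, FLOPPY_SECTORS_PER_TRACK = 18 };

/* Convert a CHS address (sector numbers start at 1) to a zero-based
 * logical block number, which is the value used for dd's seek=
 * when bs=512.
 * floppy_chs_to_lba is a hypothetical helper for illustration. */
uint32_t floppy_chs_to_lba(uint32_t cylinder, uint32_t head, uint32_t sector)
{
    return (cylinder * FLOPPY_HEADS + head) * FLOPPY_SECTORS_PER_TRACK
           + (sector - 1);
}
```

With this mapping, CHS sector 2 (commlib.bin) corresponds to seek=1 and CHS sector 3 (prog.bin) corresponds to seek=2.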

When I run this in QEMU (BOCHS will also work) using qemu-system-i386 -fda disk.img I get this output:

[Sample image: screenshot of the QEMU output]


Looking at prog.bin

In the example above we created a prog.bin file that wasn't supposed to have the common library code in it, but had the symbols to it resolved. Is that what happened? If you use NDISASM you can disassemble the binary file as 16-bit code with an origin point of 0x0000 to see what was generated. Using ndisasm -o 0x0000 -b16 prog.bin you should see something like:

; Text Section
00000000 8CC8 mov ax,cs
00000002 8ED8 mov ds,ax
00000004 8EC0 mov es,ax
00000006 8ED0 mov ss,ax
00000008 31E4 xor sp,sp
0000000A 31DB xor bx,bx
; Both calls are to functions in the common library, which is loaded
; in a different segment at 0x0100. The linker was able to resolve these
; locations for us.
0000000C 9A11000001 call word 0x100:0x11 ; FAR Call print_banner
00000011 BE2000 mov si,0x20
00000014 9A00000001 call word 0x100:0x0 ; FAR Call print_string
00000019 FA cli
0000001A F4 hlt
0000001B EBFD jmp short 0x1a ; Infinite loop
0000001D 6690 xchg eax,eax
0000001F 90 nop
; Data section
; String 'Printing my own text!', 13, 10, 0
00000020 50 push ax
00000021 7269 jc 0x8c
00000023 6E outsb
00000024 7469 jz 0x8f
00000026 6E outsb
00000027 67206D79 and [ebp+0x79],ch
0000002B 206F77 and [bx+0x77],ch
0000002E 6E outsb
0000002F 207465 and [si+0x65],dh
00000032 7874 js 0xa8
00000034 210D and [di],cx
00000036 0A00 or al,[bx+si]

I have annotated it with a few comments.
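
To check by hand that a far call in the listing targets the common library, the five instruction bytes can be decoded directly: opcode 0x9A is followed by the 16-bit offset and then the 16-bit segment, both little-endian. A small sketch of that decoding, with decode_far_call and struct far_target as hypothetical names:

```c
#include <stddef.h>
#include <stdint.h>

struct far_target {
    uint16_t segment;
    uint16_t offset;
};

/* Decode a 16-bit direct far call: opcode 0x9A, then the offset and
 * the segment, each stored as a little-endian 16-bit value.
 * Returns 0 on success, -1 if the bytes are not such an instruction.
 * decode_far_call is a hypothetical helper for illustration. */
int decode_far_call(const uint8_t *code, size_t len, struct far_target *out)
{
    if (len < 5 || code[0] != 0x9A)
        return -1;
    out->offset  = (uint16_t)(code[1] | (code[2] << 8));
    out->segment = (uint16_t)(code[3] | (code[4] << 8));
    return 0;
}
```

Feeding it the bytes 9A 00 00 00 01 from the print_string call yields segment 0x0100 and offset 0x0000, which is where the common library is loaded.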


Notes

  • Is it required to use FAR Calls? No, but if you don't then all of your code will have to fit in a single segment and the offsets won't be able to overlap. Using FAR Calls comes with some overhead but they are more flexible allowing you to better utilize memory below 1MB. Functions called via a FAR Call have to use FAR Returns (retf). Far functions that use pointers passed from other segments generally need to handle segment and offset of pointers (FAR pointers), not just the offset.
  • Using the method in this answer: anytime you make a change to the common library you have to re-link all the programs that rely on it, as the absolute memory addresses for exported (public) functions and data may shift.

