Binary Files and Os

Executable file formats for Windows (PE), Linux (ELF), macOS (Mach-O) and other platforms tend to be designed to solve common problems, so they share common features. However, each platform specifies a different standard, so the files are not compatible across platforms, even if the platforms use the same type of CPU.

Executable file formats are used not only for executable files but also for libraries, which also contain code but are never run directly by the user. They are only loaded into memory to satisfy the needs of directly executable binaries.

Common Features of an executable file format:

One or more blocks of executable code
One or more blocks of read-only data such as text and numbers
One or more blocks of read/write data
Instructions on where to place these blocks in memory when the application is run
Instructions on what libraries (which are also in an 'executable file format') need to be loaded as well, and how they connect (link) up to this executable file.
One or more tables mapping code and data locations to strings or ids that describe them, useful for linking and debugging.
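
Although the formats share these features, each one announces itself with a different signature in its first bytes. As a rough sketch (the function name is made up for illustration, and Mach-O fat binaries are deliberately ignored), a file's container format can be guessed like this:

```python
import struct

def identify_format(data: bytes) -> str:
    """Guess the executable container format from the first bytes of a file."""
    if data[:4] == b"\x7fELF":
        return "ELF"
    # PE files begin with a DOS stub ("MZ"); the 32-bit little-endian value at
    # offset 0x3C points to the "PE\0\0" signature.
    if data[:2] == b"MZ" and len(data) >= 0x40:
        (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
        if data[pe_offset:pe_offset + 4] == b"PE\0\0":
            return "PE"
    # Mach-O magic numbers (32-bit and 64-bit), compared as raw on-disk bytes
    # in either byte order.
    if data[:4] in (b"\xfe\xed\xfa\xce", b"\xce\xfa\xed\xfe",
                    b"\xfe\xed\xfa\xcf", b"\xcf\xfa\xed\xfe"):
        return "Mach-O"
    return "unknown"
```

Passing the first 4 KB or so of a file, read in 'rb' mode, is normally enough, assuming the PE signature sits within that range.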

It's interesting to compare such formats to more basic formats, such as the venerable DOS .com file, which is simply up to 64K of assorted 'stuff' to be loaded at the next available location, and has few of the features listed above.

'Binary' in this sense contrasts these files with 'source' files, which are written as text. It simply means that they are encoded in a non-text way, and doesn't really relate to the 0-and-1 sense of binary.

How does an OS execute binary files in virtual memory?

This question can't be answered in general because it's totally hardware and OS dependent. However, a typical answer is that the initially loaded program can be compiled as you say: because the VM hardware gives each program its own address space, all addresses can be fixed when the program is linked. No recalculation of addresses at load time is needed.

Things get much more interesting with dynamically loaded libraries, because two libraries used by the same initially loaded program might be compiled with the same base address, so their address spaces overlap.

One approach to this problem is to require Position Independent Code (PIC) in DLLs. In such code all addresses are relative to the code itself. Jumps are usually relative to the PC (though a code segment register can also be used). Data are also relative to some data segment or base register. The PIC code itself needs no change to choose the runtime location. Only the segment or base register(s) need to be set in the prelude of every DLL routine.

PIC tends to be a bit slower than position dependent code because there's additional address arithmetic and the PC and/or base registers can bottleneck the processor's instruction pipeline.

So the other approach is for the loader to rebase the DLL code when necessary to eliminate address space overlaps. For this the DLL must include a table of all the absolute addresses in the code. The loader computes an offset between the assumed code and data base addresses and the actual ones, then traverses the table, adding the offset to each absolute address as the program is copied into VM.

DLLs also have a table of entry points so that the calling program knows where the library procedures start. These must be adjusted as well.
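
The rebasing step can be sketched in a few lines. This is only a toy model, assuming 32-bit little-endian absolute addresses and a relocation table that is just a list of byte offsets into the image. Real formats such as PE store more elaborate relocation records, and the names here are invented for illustration.

```python
import struct

def rebase(image: bytes, reloc_offsets, assumed_base: int, actual_base: int) -> bytes:
    """Return a copy of image with every listed 32-bit absolute address adjusted."""
    delta = actual_base - assumed_base
    out = bytearray(image)
    for off in reloc_offsets:
        (addr,) = struct.unpack_from("<I", out, off)
        # Add the load offset, wrapping to 32 bits like the hardware would.
        struct.pack_into("<I", out, off, (addr + delta) & 0xFFFFFFFF)
    return bytes(out)

# One absolute address at offset 0, followed by a word that must not change.
image = struct.pack("<II", 0x10001234, 0xDEADBEEF)
rebased = rebase(image, [0], assumed_base=0x10000000, actual_base=0x20000000)
print([hex(w) for w in struct.unpack("<II", rebased)])
# prints: ['0x20001234', '0xdeadbeef']
```

Because the patched bytes differ for every load address, two processes that load the library at different addresses cannot share those pages.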

Rebasing is not great for performance either. It slows down loading. Moreover, it defeats sharing of DLL code. You need at least one copy per rebase offset.

For these reasons, DLLs that are part of Windows are deliberately compiled with non-overlapping VM address spaces. This speeds loading and allows sharing. If you ever notice that a 3rd party DLL crunches the disk and loads slowly, while MS DLLs like the C runtime library load quickly, you are seeing the effects of rebasing in Windows.

You can learn more about this topic by reading about object file formats.

Is there any difference in executable binary files between distributions?

All Linux distributions use the same binary format, ELF, but there are still some differences:

Different CPU architectures use different instruction sets.
The same CPU architecture may use different ABIs. The ABI defines how to use the register file and how to call and return from a routine. Binaries built for different ABIs cannot work together.
Even on the same architecture with the same ABI, you cannot necessarily copy a binary file from one distribution to another. Most binary files are not statically linked, so they depend on the libraries shipped with the distribution, and different distributions may use different versions or different build configurations of those libraries.
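
The first of these differences is visible directly in the ELF header. As a hedged sketch (the helper name and the small machine table are made up for illustration; the field offsets are the standard ones), the target of an ELF file can be read from its first 20 bytes:

```python
import struct

# A few e_machine values from the ELF specification.
MACHINES = {3: "x86", 40: "ARM", 62: "x86-64", 183: "AArch64"}

def elf_target(header: bytes):
    """Return (bits, byte order, machine) from the first 20 bytes of an ELF file."""
    if header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    bits = {1: 32, 2: 64}[header[4]]            # EI_CLASS
    order = {1: "little", 2: "big"}[header[5]]  # EI_DATA
    fmt = "<H" if order == "little" else ">H"
    (machine,) = struct.unpack_from(fmt, header, 18)  # e_machine
    return bits, order, MACHINES.get(machine, f"unknown ({machine})")
```

This only tells you the architecture. It says nothing about which shared libraries the binary expects to find, which is usually the reason a binary fails to move between distributions.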

So if you want your program to run on all distributions, you may have to statically link a version that depends only on the kernel's syscalls, and even then it runs only on one specific architecture.

If you really want to run a program on any architecture, then you have to compile binaries for all architectures, and use a shell script to start the right one.

How do binary files work? (From C++'s point of view)

Technically text files are binary, as all files are really binary files. Text files tend to store only text characters, while binary files can store any conceivable value: numbers, images, text, etc. Numbers, for example, are not stored in decimal notation like "1234"; they are stored in binary using 0s and 1s only. There are a few ways to do this (depending on the platform and the file format), so the same number could look like a different set of 0s and 1s, e.g. 0001110101011. If you open a binary file in Notepad, it tries to display everything as text, so what you see is mostly garbage: that is the other data, represented in binary.
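
To make the difference concrete, here is a small Python sketch (Python rather than C++ only for brevity) that encodes the number 1234 three ways: as text, and as a 4-byte binary integer in each byte order.

```python
import struct

value = 1234
as_text = str(value).encode("ascii")     # b'1234': four character codes
as_little = struct.pack("<I", value)     # b'\xd2\x04\x00\x00'
as_big = struct.pack(">I", value)        # b'\x00\x00\x04\xd2'

print(as_text.hex(), as_little.hex(), as_big.hex())
# prints: 31323334 d2040000 000004d2
```

All three are 4 bytes long here, but only the first would look like "1234" in a text editor.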

Cracking a binary file format means knowing exactly what information is stored in each byte of the file: sometimes text, numbers, arrays, classes, structures... anything really. Given experience one could slowly work out what is what, but that's pretty advanced stuff!

Sometimes the format is freely available and easy to follow, and sometimes it is a nightmare to follow, like the format of an MS Word document. (The MS Word format is freely available, but reputed to be insanely complicated due to backwards compatibility. Nonetheless, having the format documentation allows you to 'crack' the binary file format and know exactly what all the binary represents.)

It's one of the fundamentals of a computer system.

There is probably a great explanation at this link:

http://www.cs.umd.edu/class/sum2003/cmsc311/Notes/BitOp/asciiBin.html

Some text quoted:

Although ASCII files are binary files, some people treat them as
different kinds of files. I like to think of ASCII files as special
kinds of binary files. They're binary files where each byte is written
in ASCII code.

A full, general binary file has no such restrictions. Any of the 256
bit patterns can be used in any byte of a binary file.

We work with binary files all the time. Executables, object files,
image files, sound files, and many file formats are binary files. What
makes them binary is merely the fact that each byte of a binary file
can be one of 256 bit patterns. They're not restricted to the ASCII
codes.

How can binary files and libraries built with different C++ compilers be compatible?

Often times they are incompatible. Here are some of the incompatibilities that may be found:

Different data type sizes

For example, GCC on x86-64 uses 16 bytes for a long double, while VS gives it the same 8-byte representation as double.

This can be easily fixed if both sizes are known, since converting between sizes is trivial (but may result in loss of precision).
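
One quick way to see what a given platform does is to ask for the sizes. The sketch below uses Python's ctypes, which reports the sizes used by the C compiler that built the Python interpreter, so the second number will differ between, say, Linux and Windows builds.

```python
import ctypes

# Sizes as seen by the C ABI of the running Python interpreter.
print("double:     ", ctypes.sizeof(ctypes.c_double), "bytes")
print("long double:", ctypes.sizeof(ctypes.c_longdouble), "bytes")
```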

Name mangling

This is probably the best-known issue. C++ compilers need to decorate symbols so that (among many things) function overloading is possible. There is no standard name mangling scheme, so compilers choose their own, which causes incompatibility, since code compiled with VS, for example, might not recognize the names in a library compiled with GCC, or even an earlier version of VS.

This can be fixed by not decorating the names, and exporting them as C functions, which are undecorated. This, however, means that class methods must be wrapped in C functions in order for them to be exported (because C doesn't understand classes), and this is not really convenient.

Visual C++ name mangling scheme

GCC name mangling scheme
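
To give a feel for what decoration looks like, here is a deliberately tiny sketch of the GCC (Itanium C++ ABI) scheme. It is an illustration only: it handles just free functions in the global namespace whose parameters are a handful of fundamental types, and the helper name is invented. Classes, namespaces, templates, pointers and the substitution rules are all left out.

```python
# Codes for a few fundamental types in the Itanium C++ ABI used by GCC.
TYPE_CODES = {"void": "v", "bool": "b", "char": "c", "int": "i",
              "unsigned int": "j", "long": "l", "float": "f", "double": "d"}

def mangle(name: str, params=()) -> str:
    """Mangle a free function in the global namespace, e.g. foo(int) -> _Z3fooi."""
    # An empty parameter list is encoded as a single 'v' (void).
    codes = "".join(TYPE_CODES[p] for p in params) or "v"
    return f"_Z{len(name)}{name}{codes}"

print(mangle("foo", ["int"]), mangle("foo", ["int", "double"]), mangle("foo"))
# prints: _Z3fooi _Z3fooid _Z3foov
```

The three overloads of foo get three distinct symbols, which is exactly what makes overloading linkable, and also what a compiler with a different scheme fails to find.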

How are memory addresses placed in a binary file?

A specific operating system has a specific set of rules or possibly multiple sets of rules for where a compatible program can be loaded. The toolchain including default linker script (think gcc hello.c -o hello) made for that platform conforms to these rules.

So for example I decide to create an operating system, for a platform that has an MMU. Because it has an MMU I can create the operating system such that every program sees the same (virtual) address space. So I can decide that for applications on my operating system, the memory space starts at 0x00000000 but the entry point must be 0x00001000. The binary file format supported is a Motorola s-record let's say.

So take a simple program with a simple linker script

MEMORY
{
    ram : ORIGIN = 0x1000, LENGTH = 0x10000
}
SECTIONS
{
    .text : { *(.text*) } > ram
}

The disassembly of my simple program

00001000 <_start>:
    1000:   e3a0d902    mov sp, #32768  ; 0x8000
    1004:   eb000001    bl  1010 <main>
    1008:   e3a00000    mov r0, #0
    100c:   ef000000    svc 0x00000000

00001010 <main>:
    1010:   e3a00000    mov r0, #0
    1014:   e12fff1e    bx  lr

And the "binary" file happens to be human readable:

S00F00006E6F746D61696E2E737265631F
S3150000100002D9A0E3010000EB0000A0E3000000EF1E
S30D000010100000A0E31EFF2FE122
S70500001000EA

and you may or may not notice that the address is indeed in the binary describing where things go.
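
To see where the address sits, here is a small sketch of an S-record reader (the function name is made up; the record layout is the standard one: type, byte count, address, data, checksum). It is fed one of the lines above.

```python
# Number of address bytes for each S-record type.
ADDRESS_BYTES = {"0": 2, "1": 2, "2": 3, "3": 4, "5": 2, "7": 4, "8": 3, "9": 2}

def parse_srecord(line: str):
    """Split one S-record into (type, address, data); raise if the checksum is wrong."""
    line = line.strip()
    if not line.startswith("S"):
        raise ValueError("not an S-record")
    rec_type = line[1]
    raw = bytes.fromhex(line[2:])        # count, address, data, checksum
    count, body, checksum = raw[0], raw[1:-1], raw[-1]
    if count != len(raw) - 1:
        raise ValueError("byte count mismatch")
    # Checksum: ones' complement of the low byte of the sum of count, address and data.
    if (~(count + sum(body))) & 0xFF != checksum:
        raise ValueError("bad checksum")
    n = ADDRESS_BYTES[rec_type]
    return "S" + rec_type, int.from_bytes(body[:n], "big"), body[n:]

rec = parse_srecord("S3150000100002D9A0E3010000EB0000A0E3000000EF1E")
print(rec[0], hex(rec[1]), rec[2].hex())
# prints: S3 0x1000 02d9a0e3010000eb0000a0e3000000ef
```

The 16 data bytes are the four instructions of _start in little-endian order, and 0x1000 is the address the loader is being told to put them at. The S7 record at the end carries the entry point.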

Being an operating system based program that is loaded into RAM, we don't have to play too many games with memory. We can assume one flat, all-RAM (read/write) space, so if there were .data, .bss, etc. it could all be packed in there.

For a real operating system it is desirable to have the binary include additional information, perhaps the size of the program. You can google around the various common file formats and see how this is done: either a simple up-front "I need this much", or one or more sections individually defined. And yes, again, the "binary" is more than just opcodes and data; I assume you understand that.

The toolchain I used outputs elf formatted files by default, but objcopy can be used to create a number of different formats, one of which is a raw memory image (which does not contain any address/location information). Many/most of the rest contain the machine code and data as well as labels for the debugger/disassembler, or addresses for where chunks of this data want to live in the memory space, etc.

Now when you say embedded and use the words ROM and RAM, I assume you mean bare-metal, like a microcontroller for example, but even if you mean booting an x86 or full-sized ARM or whatever, the same things apply. In the case of an MCU the chip designers have determined the rules for the memory space, either per the rules of the processor or by their own choice, just like an operating system dictates the rules. We are cheating a bit, as a lot of the tools we use today (GNU-based) are not really designed for bare-metal, but because a generic compiler is a generic compiler and, more importantly, a toolchain lends itself to this kind of portability, we can use such tools. Ideally we use a cross compiler, meaning the output machine code is not necessarily meant to run on the computer generating that output machine code. The major difference that matters is that we want to control the linking and libraries: don't link in host operating system based libraries, and either control the linker script ourselves or have the toolchain's default linker script target our MCU.

So let's say I have an ARM7TDMI based MCU, and the chip designers say I need the binary such that the ROM starts at address 0x00000000 and is of some size, and RAM starts at 0x40000000 and is of some size. Being an ARM7, the processor starts execution by fetching the instruction at address 0x00000000, and the chip designers have mapped that 0x00000000 to the ROM.

So now my simple program

unsigned int xyz;
int notmain ( void )
{
    xyz=5;
    return(0);
}

linked like this

MEMORY
{
    bob : ORIGIN = 0x00000000, LENGTH = 0x1000
    ted : ORIGIN = 0x40000000, LENGTH = 0x1000
}
SECTIONS
{
    .text : { *(.text*) } > bob
    .bss : { *(.bss*) } > ted
}

gives a disassembly of this

Disassembly of section .text:

00000000 <_start>:
   0:   e3a0d101    mov sp, #1073741824 ; 0x40000000
   4:   e38dda01    orr sp, sp, #4096   ; 0x1000
   8:   eb000000    bl  10 <notmain>
   c:   eafffffe    b   c <_start+0xc>

00000010 <notmain>:
  10:   e3a02005    mov r2, #5
  14:   e59f3008    ldr r3, [pc, #8]    ; 24 <notmain+0x14>
  18:   e3a00000    mov r0, #0
  1c:   e5832000    str r2, [r3]
  20:   e12fff1e    bx  lr
  24:   40000000    andmi   r0, r0, r0

Disassembly of section .bss:

40000000 <xyz>:
40000000:   00000000    andeq   r0, r0, r0

And that would be a perfectly valid program, doesn't do much interesting, but still a perfectly valid program.

First and foremost if you leave out _start, the toolchain gives a warning but still functions just fine. (hmm, actually didn't warn that time, interesting).

arm-none-eabi-as --warn --fatal-warnings vectors.s -o vectors.o
arm-none-eabi-gcc -Wall -Werror -O2 -nostdlib -nostartfiles -ffreestanding -c notmain.c -o notmain.o
arm-none-eabi-ld vectors.o notmain.o -T memmap -o notmain.elf
arm-none-eabi-objdump -D notmain.elf > notmain.list
arm-none-eabi-objcopy --srec-forceS3 notmain.elf -O srec notmain.srec
arm-none-eabi-objcopy notmain.elf -O binary notmain.bin

And now you have the loading issue. Each MCU is different as to how you load it and what tools are available, and/or you make your own tools. Ihex and srec were popular for PROM programmers, where you had, say, a separate ROM next to your processor and/or the through-hole MCU would get plugged into the PROM programmer. Raw binary images work too but can quickly get large, as I will show in a second. As written above there is .bss but no .data so

ls -al notmain.bin
-rwxr-xr-x 1 user user 40 Oct 21 22:05 notmain.bin

40 bytes. But if I do this for demonstration purposes, even though it won't work correctly:

unsigned int xyz=5;
int notmain ( void )
{
    return(0);
}

with

MEMORY
{
    bob : ORIGIN = 0x00000000, LENGTH = 0x1000
    ted : ORIGIN = 0x40000000, LENGTH = 0x1000
}
SECTIONS
{
    .text : { *(.text*) } > bob
    .bss : { *(.bss*) } > ted
    .data : { *(.data*) } > ted
}

gives

Disassembly of section .text:

00000000 <notmain-0x10>:
   0:   e3a0d101    mov sp, #1073741824 ; 0x40000000
   4:   e38dda01    orr sp, sp, #4096   ; 0x1000
   8:   eb000000    bl  10 <notmain>
   c:   eafffffe    b   c <notmain-0x4>

00000010 <notmain>:
  10:   e3a00000    mov r0, #0
  14:   e12fff1e    bx  lr

Disassembly of section .data:

40000000 <xyz>:
40000000:   00000005    andeq   r0, r0, r5

and

-rwxr-xr-x  1 user user 1073741828 Oct 21 22:08 notmain.bin

OUCH! 0x40000004 bytes, which was expected: I asked for a memory image, and I defined stuff at one address (machine code) and a few bytes at another (0x40000000), so the raw memory image has to cover that whole range.

hexdump notmain.bin 
0000000 d101 e3a0 da01 e38d 0000 eb00 fffe eaff
0000010 0000 e3a0 ff1e e12f 0000 0000 0000 0000
0000020 0000 0000 0000 0000 0000 0000 0000 0000
*
40000000 0005 0000                              
40000004

Instead one would just use the elf file the toolchain generates or maybe an ihex or srecord.

S00F00006E6F746D61696E2E737265631F
S3150000000001D1A0E301DA8DE3000000EBFEFFFFEA79
S30D000000100000A0E31EFF2FE132
S3094000000005000000B1
S70500000000FA

all the info I need but not a huge file for so few bytes.
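
The size of the raw image follows directly from the addresses involved. As a sketch (the function name is invented, and it assumes the raw image simply spans from the lowest loaded address to the highest, with the gap filled in):

```python
def flat_image_size(sections):
    """Size of a raw memory image covering every (address, length) section."""
    start = min(addr for addr, _ in sections)
    end = max(addr + length for addr, length in sections)
    return end - start

# .text: 24 bytes at 0x00000000, .data: 4 bytes at 0x40000000
print(flat_image_size([(0x00000000, 24), (0x40000000, 4)]))
# prints: 1073741828
```

The addressed formats only store the 28 bytes of real content plus a little bookkeeping per record.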

Not a hard and fast rule, but moving data around is easier today (than carrying a floppy from one computer to another with the PROM programmer on it). And particularly if you have a bundled IDE, that vendor likely uses the toolchain's default format, but even if not, elf and other similar formats are supported and you don't have to go the route of a raw binary or an ihex or srec. But it still depends on the tool that takes the "binary" and programs it into the ROM(/FLASH) on the MCU.

Now I cheated to demonstrate the large file problem above; you have to do more work when it is not a RAM-only system. If you feel the need to have .data or want .bss zeroed, then you need to write or use a more complicated linker script that helps you out with the locations and boundaries. That linker script is married to the bootstrap, which uses linker-generated information to perform those tasks. Basically, a copy of .data needs to be preserved in non-volatile memory (ROM/FLASH), but it can't live there at runtime because .data is read/write. So ideally/typically you use the linker script's language/magic to state that the .data read/write space is blah, and the flash space is boo at this address and this size, so the bootstrap can copy that amount of data from flash at that address to RAM. And for .bss the linker script generates variables that we save into flash that tell the bootstrap to zero RAM from this address to this address.

So the operating system defines the memory space, and the linker script matches that if you want the program to work. The system designers or chip designers determine the address space for something embedded, and the linker script matches that. The bootstrap is married to the linker script for that build and target.
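
What the bootstrap has to do can be modelled in a few lines. This is only a simulation for illustration: a real bootstrap is a few instructions of assembly or C driven by linker-generated symbols, while here flash and RAM are plain byte buffers and every parameter name is made up.

```python
def bootstrap(flash: bytes, ram: bytearray, data_load: int, data_start: int,
              data_size: int, bss_start: int, bss_size: int) -> None:
    """Copy initialised .data from flash into RAM, then zero .bss.

    All addresses are offsets into the flash and ram buffers.
    """
    ram[data_start:data_start + data_size] = flash[data_load:data_load + data_size]
    ram[bss_start:bss_start + bss_size] = bytes(bss_size)

# Flash holds 24 bytes of code followed by the initial value of xyz (5).
flash = bytes(24) + (5).to_bytes(4, "little")
ram = bytearray(b"\xff" * 16)            # RAM contents are garbage at power-up
bootstrap(flash, ram, data_load=24, data_start=0, data_size=4,
          bss_start=4, bss_size=8)
print(ram.hex())
# prints: 050000000000000000000000ffffffff
```

The numbers passed in here play the role of the values the linker script would export.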

Edit

toolchain basics...

mov sp,#0x40000000
orr sp,sp,#0x1000
bl notmain
b .

unsigned int xyz;
int notmain ( void )
{
    xyz=5;
    return(0);
}

MEMORY
{
    bob : ORIGIN = 0x1000, LENGTH = 0x1000
    ted : ORIGIN = 0x2000, LENGTH = 0x1000
}
SECTIONS
{
    .text : { *(.text*) } > bob
    .bss : { *(.bss*) } > ted
}

My bootstrap, main program and linker script

arm-none-eabi-as --warn --fatal-warnings vectors.s -o vectors.o
arm-none-eabi-gcc -Wall -Werror -O2 -nostdlib -nostartfiles -ffreestanding -save-temps -c notmain.c -o notmain.o
arm-none-eabi-ld vectors.o notmain.o -T memmap -o notmain.elf
arm-none-eabi-objdump -D notmain.elf > notmain.list
arm-none-eabi-objcopy --srec-forceS3 notmain.elf -O srec notmain.srec
arm-none-eabi-objcopy notmain.elf -O binary notmain.bin

Some folks will argue, and it is sometimes true, that compilers don't generate assembly any more. It is still the sane way to do it and you will find it more often than not, as in this case...

The bootstrap makes an object which we can disassemble.

00000000 <.text>:
   0:   e3a0d101    mov sp, #1073741824 ; 0x40000000
   4:   e38dda01    orr sp, sp, #4096   ; 0x1000
   8:   ebfffffe    bl  0 <notmain>
   c:   eafffffe    b   c <.text+0xc>

It's not "linked" so the address this disassembler uses is zero based, and you can see the call to notmain is incomplete, not yet linked.

the compiler generated assembly for the C code

    .cpu arm7tdmi
    .fpu softvfp
    .eabi_attribute 20, 1
    .eabi_attribute 21, 1
    .eabi_attribute 23, 3
    .eabi_attribute 24, 1
    .eabi_attribute 25, 1
    .eabi_attribute 26, 1
    .eabi_attribute 30, 2
    .eabi_attribute 34, 0
    .eabi_attribute 18, 4
    .file   "notmain.c"
    .text
    .align  2
    .global notmain
    .type   notmain, %function
notmain:
    @ Function supports interworking.
    @ args = 0, pretend = 0, frame = 0
    @ frame_needed = 0, uses_anonymous_args = 0
    @ link register save eliminated.
    mov r2, #5
    ldr r3, .L2
    mov r0, #0
    str r2, [r3]
    bx  lr
.L3:
    .align  2
.L2:
    .word   xyz
    .size   notmain, .-notmain
    .comm   xyz,4,4
    .ident  "GCC: (15:4.9.3+svn231177-1) 4.9.3 20150529 (prerelease)"

that gets assembled into an object which we can also disassemble.

Disassembly of section .text:

00000000 <notmain>:
   0:   e3a02005    mov r2, #5
   4:   e59f3008    ldr r3, [pc, #8]    ; 14 <notmain+0x14>
   8:   e3a00000    mov r0, #0
   c:   e5832000    str r2, [r3]
  10:   e12fff1e    bx  lr
  14:   00000000    andeq   r0, r0, r0

Now not shown but that object also contains information for the global variable xyz and its size.

The linker's job is perhaps part of your confusion. It links the objects together such that the result will be sane or will work on the final destination (bare-metal or operating system).

Disassembly of section .text:

00001000 <notmain-0x10>:
    1000:   e3a0d101    mov sp, #1073741824 ; 0x40000000
    1004:   e38dda01    orr sp, sp, #4096   ; 0x1000
    1008:   eb000000    bl  1010 <notmain>
    100c:   eafffffe    b   100c <notmain-0x4>

00001010 <notmain>:
    1010:   e3a02005    mov r2, #5
    1014:   e59f3008    ldr r3, [pc, #8]    ; 1024 <notmain+0x14>
    1018:   e3a00000    mov r0, #0
    101c:   e5832000    str r2, [r3]
    1020:   e12fff1e    bx  lr
    1024:   00002000    andeq   r2, r0, r0

Disassembly of section .bss:

00002000 <xyz>:
    2000:   00000000    andeq   r0, r0, r0

I made this linker script so that you can see both .text and .bss moving around. The linker has placed all of the .text in the 0x1000 address space and has patched in the call to notmain() as well as how to reach xyz. It has also allocated/defined the space for the xyz variable in the 0x2000 address space.

And then to your next question or confusion. It is very much up to the tools that load the system, be it the operating system loading a program into memory to be run, or programming the flash of an MCU, or programming the RAM of some other embedded system (like a mouse, for example: you might not know that for some of them the firmware is downloaded from the operating system, from /lib/firmware or other locations, and not all of it is burned into a flash).
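
The patching of the branch instructions can be reproduced by hand. The sketch below (the helper name is invented) encodes an unconditional ARM B or BL the way the linker has to once it knows both addresses, and checks it against the listing above.

```python
def encode_branch(pc: int, target: int, link: bool = False) -> int:
    """Encode an unconditional ARM (A32) B or BL located at pc that jumps to target."""
    # The offset is counted in words from pc + 8, because in ARM state the PC
    # reads as two instructions ahead of the one being executed.
    offset = ((target - (pc + 8)) >> 2) & 0xFFFFFF
    return (0xEB000000 if link else 0xEA000000) | offset

print(hex(encode_branch(0x1008, 0x1010, link=True)))  # bl 1010 <notmain>
print(hex(encode_branch(0x100C, 0x100C)))             # b 100c (branch to self)
# prints: 0xeb000000 then 0xeafffffe
```

The word at 0x1024 is patched in a simpler way: the linker just stores the final address of xyz (0x2000) there.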

Binary files and cross platform compatibility

For the files to be binary compatible:

endianness must match (as it does for you)
bitfield packing order must be the same
sizes and signedness of types must be the same
the compiler must make the same decisions about padding and alignment

It's certainly possible for all of these conditions to be fulfilled, or for you to not happen to be hitting any cases for which they are not. At the very least, though, I'd add some sanity checks and/or sentinel members to detect problems.
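
One possible shape for such sanity checks, sketched in Python for brevity: put a signature, a known multi-byte value and the record size at the front of the file, and refuse to read the file if any of them looks wrong. The signature, the record layout and the function names are all hypothetical.

```python
import struct

MAGIC = b"DEMO"                  # hypothetical file signature
BYTE_ORDER_MARK = 0x01020304     # reads back as 0x04030201 if byte order differs
RECORD = struct.Struct("<ihd")   # explicit little-endian layout, no implicit padding

def write_header() -> bytes:
    """Build the header that precedes the records."""
    return MAGIC + struct.pack("<IH", BYTE_ORDER_MARK, RECORD.size)

def check_header(header: bytes) -> None:
    """Raise ValueError if the file was not written with the expected layout."""
    if header[:4] != MAGIC:
        raise ValueError("wrong file type")
    mark, size = struct.unpack_from("<IH", header, 4)
    if mark != BYTE_ORDER_MARK:
        raise ValueError("byte order mismatch")
    if size != RECORD.size:
        raise ValueError("record layout mismatch")

check_header(write_header())     # passes silently
```

In C or C++ the record size would come from sizeof on the struct, so a difference in padding or type sizes between the two compilers shows up as a layout mismatch instead of silently corrupt data.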

Cannot send large binary files over Python socket

As mentioned in the comments conn.recv(MAXSIZE) receives at most MAXSIZE but can return less. The code assumes it always receives the amount requested. There is also no reason to base64-encode the file data; it just makes the file data much larger. Sockets are a byte stream, so just send the bytes.
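
If you want to keep using recv() directly, the usual fix is a small loop that keeps reading until the requested number of bytes has arrived. A minimal sketch (the helper name recv_exact is my own, not part of the socket module):

```python
import socket

def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Keep calling recv() until exactly n bytes have arrived."""
    chunks = []
    remaining = n
    while remaining:
        chunk = sock.recv(min(65536, remaining))
        if not chunk:  # peer closed the connection early
            raise ConnectionError(f"expected {n} bytes, got {n - remaining}")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)
```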

The header can be delineated by a marker between it and the data. Below I've used CRLF and written the header as a single JSON line, and I also demonstrate sending a couple of files on the same connection:

client.py

import socket
import json

def transmit(sock, filename, author, content):
    msg = {'filename': filename, 'author': author, 'length': len(content)}
    data = json.dumps(msg, ensure_ascii=False).encode() + b'\r\n' + content
    sock.sendall(data)

client = socket.socket()
client.connect(('localhost',5000))
with client:
    with open('test.zip','rb') as f:
        content = f.read()
    transmit(client, 'test.zip', 'marc', content)
    content = b'The quick brown fox jumped over the lazy dog.'
    transmit(client, 'mini.txt', 'Mark', content)

server.py

import socket
import json
import os

os.makedirs('Downloads', exist_ok=True)

s = socket.socket()
s.bind(('',5000))
s.listen()

while True:
    c, a = s.accept()
    print('connected:', a)
    r = c.makefile('rb')   # wrap socket in a file-like object
    with c, r:
        while True:
            header_line = r.readline() # read in a full line of data
            if not header_line: break
            header = json.loads(header_line) # process the header
            print(header)
            remaining = header['length']
            with open(os.path.join('Downloads',header['filename']), 'wb') as f:
                while remaining :
                    # Unlike socket.recv() the makefile object won't return less
                    # than requested unless the socket is closed.
                    count = f.write(r.read(min(10240, remaining)))
                    if not count:  # socket closed?
                        if remaining:
                            print('Unsuccessful')
                        break
                    remaining -= count
                else:
                    print('Success')
    print('disconnected:', a)

Output:

connected: ('127.0.0.1', 14117)
{'filename': 'test.zip', 'author': 'marc', 'length': 52474063}
Success
{'filename': 'mini.txt', 'author': 'Mark', 'length': 45}
Success
disconnected: ('127.0.0.1', 14117)

Why are there so many zeros in the binary file?

A program file consists of some metadata, executable code, read-only data, and read/write data. Each of those is aligned to the page size of your system so that it can be mapped into memory. Those "large unused blocks" are just padding to bring everything into alignment. It only looks large because your program is basically nothing.
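
The amount of padding is easy to estimate. A small sketch, assuming a 4096-byte page (the real page size depends on the system, and the function name is made up):

```python
PAGE_SIZE = 4096  # a common page size; the real value depends on the system

def padding_to_page(offset: int, page_size: int = PAGE_SIZE) -> int:
    """Number of filler bytes needed to bring offset up to the next page boundary."""
    return -offset % page_size

print(padding_to_page(0x238))   # a block that ends 568 bytes into a page
# prints: 3528
```

So a block holding only a few hundred bytes of real content can be followed by several kilobytes of zeros.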
