How to Specify Base Addresses for Sections When Linking or Alternatively How to Rebase a Section After Linking

How to specify base addresses for sections when linking or alternatively how to rebase a section after linking?

Judging by the question you reference and the tag of Linux, I am going to assume that you are using GNU ld.

The short answer for GNU ld is yes, sections can be placed at specific addresses.

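The longer answer is that you will need to create a custom linker script to do that, which can be specified with the -T option to ld. If you are using gcc as a wrapper around ld, you will need to pass the linker script through to the linker via gcc's -Wl, option, for example -Wl,-T,myscript.ld (the file name here is only a placeholder).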

The linker script will have to include something like the following:

SECTIONS {
   .text 0x08049000 :
       {
       foo.o (.text)
       bar.o (.text)
       }
}

Something to watch out for, though, is that the -T option replaces the default linker script that ld uses. You may want to modify the default linker script to do what you want. The default linker script can be dumped by passing the --verbose option to ld without any other options.

More info about linker scripts is available in the LD Manual.
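The SECTIONS example above selects input sections by object file and section name. If only a single function needs to end up at a fixed address, one option is to put it into its own named input section that a linker script can then match. The following is a minimal sketch, assuming GCC or Clang on an ELF target; the section name .fast_text, the macro IN_FAST_TEXT and the function checksum are made up for illustration.

```c
/* The section attribute is a GCC/Clang extension (an assumption here,
 * not something the answer above relies on). On other compilers or
 * object formats the macro expands to nothing and the function simply
 * lands in the default code section. */
#if defined(__GNUC__) && defined(__ELF__)
#define IN_FAST_TEXT __attribute__((section(".fast_text")))
#else
#define IN_FAST_TEXT
#endif

/* Hypothetical function placed in the input section ".fast_text".
 * A linker script could then position it with an output section
 * description that matches *(.fast_text). */
IN_FAST_TEXT int checksum(const unsigned char *data, int length)
{
    int sum = 0;
    for (int i = 0; i < length; i++)
        sum += data[i];
    return sum;
}
```

Without a matching rule in the linker script, GNU ld treats .fast_text as an orphan section and places it on its own, so the program still links.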

Linking symbols to fixed addresses on Linux

The suggestion by litb to use --defsym symbol=address does work, but is a bit cumbersome when you have a few dozen such instances to map. However, --just-symbols=symbolfile does just the trick. It took me a while to find out the syntax of the symbolfile, which is

symbolname1 = address;
symbolname2 = address;
...

The spaces seem to be required, as otherwise ld reports "file format not recognized; treating as linker script".
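With a few dozen symbols, the symbol file could be generated rather than written by hand. Here is a minimal sketch that formats one line in the syntax shown above; the helper name format_symbol_line and the choice of hexadecimal output are my own assumptions, not anything ld requires.

```c
#include <stdio.h>

/* Hypothetical helper: writes one "name = 0xADDRESS;" line into buf.
 * The spaces around '=' are kept on purpose, because the answer above
 * reports that ld rejects the file without them.
 * Returns the number of characters written (excluding the terminating
 * NUL), or -1 if the buffer is too small or formatting fails. */
int format_symbol_line(char *buf, size_t size,
                       const char *name, unsigned long address)
{
    int n = snprintf(buf, size, "%s = 0x%lx;\n", name, address);
    if (n < 0 || (size_t)n >= size)
        return -1;
    return n;
}
```

Calling this once per symbol and writing the results to a file gives something that can be passed to --just-symbols.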

How does the linker generate final virtual memory addresses?

Can the initial address of main be random? As the addresses are virtual, I'm guessing it can be any value, since the virtual memory will take care of the rest? I.e., can I start literally from 0x1 (not 0x0, as it's reserved for null)?

The memory being virtual doesn’t mean that all of the virtual address space is yours to do with as you please. On most OSes, the executable modules (programs and libraries) need to use a subset of the address space or the loader will refuse to load them. That is highly platform-dependent of course.

So the address can be whatever you want as long as it is within the platform-specific range. I doubt that any platform would allow 0x1, not least because some platforms need the code to be aligned to something larger than a byte.
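The alignment point can be made concrete with a small check. This is only a sketch: the 4096-byte value used in the note below is a common page size but is an assumption here, and the helper names are made up.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if addr is a multiple of alignment.
 * alignment must be a non-zero power of two. */
bool is_aligned(uintptr_t addr, uintptr_t alignment)
{
    return (addr & (alignment - 1)) == 0;
}

/* Rounds addr up to the next multiple of alignment
 * (again a non-zero power of two). */
uintptr_t align_up(uintptr_t addr, uintptr_t alignment)
{
    return (addr + alignment - 1) & ~(alignment - 1);
}
```

With an alignment of 4096, is_aligned(0x1, 4096) is false, and align_up(0x1, 4096) gives 0x1000, the first address that would satisfy such a requirement.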

Furthermore, on many platforms the addresses are merely hints: if they can be used as-is, the loader doesn't have to relocate a given section in the binary. Otherwise, it'll move it to a block of the address space that is available. This is fairly common, e.g. on Windows, 32-bit binaries (e.g. DLLs) have preferred base addresses: if that address range is available, the loader can load the binary faster. So, in the hypothetical case of the "initial address" being 0x1, assuming that alignment wasn't a problem, the binary would just end up being moved elsewhere in the address space.
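What "moving" a binary means can be sketched roughly as follows. This is a deliberately simplified model, not the relocation format of any real loader: it assumes the binary carries a list of slots that each hold an absolute address computed for the preferred base, and the function name is made up.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified rebasing: every slot holds an absolute address that was
 * computed for preferred_base. If the image ends up at actual_base
 * instead, each slot is adjusted by the same delta. */
void rebase_addresses(uintptr_t *slots, size_t count,
                      uintptr_t preferred_base, uintptr_t actual_base)
{
    /* Unsigned arithmetic wraps, so this also works when the image
     * is moved to a lower address. */
    uintptr_t delta = actual_base - preferred_base;

    for (size_t i = 0; i < count; i++)
        slots[i] += delta;
}
```

If actual_base equals preferred_base, the delta is zero and nothing changes, which is the fast path the paragraph above describes.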

It's also worth noting that the "initial address" is a bit of an unspecific term. The binary modules that are loaded when an executable starts consist of something akin to sections. Each of the sections has its own base address, and possibly also internal (relative) addresses or address references that are tabulated. In addition, one or more of the executable sections will also have an "entry" address. Those addresses will be used by the loader to execute initialization code (e.g. the DllMain concept on Windows) - that code always returns quickly. Eventually, one of the sections, which nothing else depends on, will have a suitably named entry point and will be the "actual" program you wrote - that one will keep running and return only when the program has exited. At that point control may return to the loader, which will note that nothing else is to be executed, and the process will be torn down. The details of all this are highly platform-dependent - I'm only giving a high-level overview; it's not literally done that way on any particular platform.

How does the linker come up with the initial address? (again is the starting address random?)

The linker has no idea what to do by itself. When you link your program, the linker is fed several more files that come with the platform itself: linker scripts and the various static libraries needed for the code to be able to start up. The linker scripts give the linker the constraints within which it can assign addresses, so it's all highly platform-specific again. The addresses can be assigned in a completely deterministic fashion, i.e. the same inputs always produce identical output, or certain kinds of addresses can be chosen at random (in a non-overlapping fashion, of course), which in practice happens when the program is loaded rather than at link time. That's known as ASLR (address space layout randomization).
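One way to see which of the two cases applies on a given platform is to print a few addresses and compare them across runs. A minimal sketch follows; the names are made up, and which of the three addresses actually change depends on the platform and on how the program was built.

```c
#include <stdint.h>
#include <stdio.h>

static int sample_global = 42;

static void sample_function(void)
{
}

/* Prints the address of a function, a global and a local variable.
 * Returns the total number of characters written, or -1 on error. */
int print_sample_addresses(FILE *out)
{
    int sample_local = 0;
    /* Function pointers are converted via uintptr_t because a direct
     * cast to void * is not guaranteed by the C standard. */
    int a = fprintf(out, "function: %p\n",
                    (void *)(uintptr_t)&sample_function);
    int b = fprintf(out, "global:   %p\n", (void *)&sample_global);
    int c = fprintf(out, "local:    %p\n", (void *)&sample_local);

    if (a < 0 || b < 0 || c < 0)
        return -1;
    return a + b + c;
}
```

Calling print_sample_addresses(stdout) from main and running the program twice shows whether the values stay fixed or move between runs.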

What is the difference between load address and relocation address ?

The difference is crucial. The relocation address is the addend for all relocations in the section. So if it differs from the load address, nothing in that section will really work while the code still sits at its load address: all relocations inside the section will be resolved to the wrong values.

So why do we need a technique like this? There are not many applications, but suppose (from here) that your architecture has extremely fast memory at 0x1000.

Then you can give two sections the same relocation address, 0x1000:

.text0 0x1000 : AT (0x4000) { o1/*.o(.text) }
__load_start_text0 = LOADADDR (.text0);
__load_stop_text0 = LOADADDR (.text0) + SIZEOF (.text0);
.text1 0x1000 : AT (0x4000 + SIZEOF (.text0)) { o2/*.o(.text) }
__load_start_text1 = LOADADDR (.text1);
__load_stop_text1 = LOADADDR (.text1) + SIZEOF (.text1);
. = 0x1000 + MAX (SIZEOF (.text0), SIZEOF (.text1));

Now, at runtime, when you need text1, you have to copy it yourself from its actual load address to the right address:

#include <string.h>

extern char __load_start_text1, __load_stop_text1;
/* Copy .text1 from its load address to its relocation address, 0x1000. */
memcpy ((char *) 0x1000, &__load_start_text1,
      &__load_stop_text1 - &__load_start_text1);

And then use it as if it had been loaded there naturally. This technique is called overlays.

I think the example is pretty clear.
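The copy step can also be tried out on an ordinary host without a custom linker script. The following sketch only simulates the idea: run_region stands in for the fast memory at 0x1000, and the two arrays stand in for .text0 and .text1 at their load addresses. All names and sizes here are assumptions made for illustration.

```c
#include <stddef.h>
#include <string.h>

#define RUN_REGION_SIZE 16

/* Stand-in for the shared fast memory that both overlays run from. */
static unsigned char run_region[RUN_REGION_SIZE];

/* Stand-ins for the overlay contents stored at their load addresses. */
static const unsigned char overlay0[] = "overlay zero";
static const unsigned char overlay1[] = "overlay one";

/* Copies one overlay from [load_start, load_stop) into the shared run
 * region, replacing whatever was there before.
 * Returns 0 on success, -1 if the overlay does not fit. */
int load_overlay(const unsigned char *load_start,
                 const unsigned char *load_stop)
{
    size_t size = (size_t)(load_stop - load_start);

    if (size > sizeof run_region)
        return -1;
    memcpy(run_region, load_start, size);
    return 0;
}
```

Here load_overlay(overlay1, overlay1 + sizeof overlay1) plays the role of the memcpy call above, with the array bounds taking the place of __load_start_text1 and __load_stop_text1.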

Is rebasing DLLs (or providing an appropriate default load address) worth the trouble?

I'd like to provide one answer myself, although the answers of Hans Passant and others are describing the tradeoffs already pretty well.

After recently fiddling with DLL base addresses in our application, I will here give my conclusion:

I think that, unless you can prove otherwise, providing DLLs with a non-default Base Address is an exercise in futility. This includes rebasing my DLLs.

For the DLLs I control, given the average application, each DLL will be loaded into memory only once anyway, so the load on the paging file should be minimal. (But see the comment of Michal Burr in another answer about Terminal Server environment.)
If DLLs are provided with a fixed base address (without rebasing) it will actually increase address space fragmentation, as sooner or later these addresses won't match anymore. In our app we had given all DLLs a fixed base address (for other legacy reasons, and not because of address space fragmentation) without using rebase.exe and this significantly increased address space fragmentation for us because you really can't get this right manually.
Rebasing (via rebase.exe) is not cheap. It is another step in the build process that has to be maintained and checked, so it has to have some benefit.
A large application will always have some DLLs loaded where the base address does not match, because of some hook DLLs (AV) and because you don't rebase 3rd party DLLs (or at least I wouldn't).
If you're using a RAM disk for the paging file, you might actually be better off if loaded DLLs get paged out :-)

So to sum up, I think that rebasing isn't worth the trouble except for special cases like the system DLLs.
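Why hand-assigned base addresses "won't match anymore" sooner or later comes down to a simple range check: once a DLL grows, its range can run into the next one. A minimal sketch of that check follows; the function name and the example values in the note below are made up.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true if the half-open ranges [base_a, base_a + size_a) and
 * [base_b, base_b + size_b) share at least one address.
 * Assumes that neither range wraps around the end of the address space. */
bool ranges_overlap(uintptr_t base_a, size_t size_a,
                    uintptr_t base_b, size_t size_b)
{
    return base_a < base_b + size_b && base_b < base_a + size_a;
}
```

For example, two images based at 0x10000000 and 0x10010000 do not collide while the first is 0x10000 bytes long, but they do as soon as it grows to 0x20000 bytes.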

I'd like to add a historical piece that I found on Old New Thing: How did Windows 95 rebase DLLs? --

When a DLL needed to be rebased, Windows 95 would merely make a note
of the DLL's new base address, but wouldn't do much else. The real
work happened when the pages of the DLL ultimately got swapped in. The
raw page was swapped off the disk, then the fix-ups were applied on
the fly to the raw page, thereby relocating it. The fixed-up page was
then mapped into the process's address space and the program was
allowed to continue.

Looking at how this process is done (read the whole thing), I personally suspect that part of the "rebasing is evil" stance dates back to the olden days of Win9x and low memory conditions.

Look, now there's a non-historical piece on Old New Thing:

How important is it nowadays to ensure that all my DLLs have non-conflicting base addresses?

Back in the day, one of the things you were exhorted to do was rebase
your DLLs so that they all had nonoverlapping address ranges, thereby
avoiding the cost of runtime relocation. Is this still important
nowadays?

...

In the presence of ASLR, rebasing your DLLs has no effect because ASLR is going to ignore your base address anyway and relocate the DLL into a location of its pseudo-random choosing.

...

Conclusion: It doesn't hurt to rebase, just in case, but understand
that the payoff will be extremely rare. Build your DLL with
/DYNAMICBASE enabled (and with /HIGHENTROPYVA for good measure)
and let ASLR do the work of ensuring that no base address collision
occurs. That will cover pretty much all of the real-world scenarios.
If you happen to fall into one of the very rare cases where ASLR is
not available, then your program will still work. It just may run a
little slower due to the relocation penalty.

... ASLR actually does a better job of avoiding collisions than manual
rebasing, since ASLR can view the system as a whole, whereas manual
rebasing requires you to know all the DLLs that are loaded into your
process, and coordinating base addresses across multiple vendors is
generally not possible.
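Whether a given DLL was built with these options is recorded in the DllCharacteristics field of its PE optional header. The sketch below only decodes that 16-bit value; reading it out of a file is left out. The macro names are my own so that the code compiles anywhere (in winnt.h the same bits are called IMAGE_DLLCHARACTERISTICS_HIGH_ENTROPY_VA and IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE), and the function names are made up.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit values of the DllCharacteristics field in the PE optional header. */
#define DLLCHAR_HIGH_ENTROPY_VA 0x0020u /* set by /HIGHENTROPYVA */
#define DLLCHAR_DYNAMIC_BASE    0x0040u /* set by /DYNAMICBASE */

/* True if the image allows the loader to relocate it (ASLR). */
bool has_dynamic_base(uint16_t dll_characteristics)
{
    return (dll_characteristics & DLLCHAR_DYNAMIC_BASE) != 0;
}

/* True if the image opts in to 64-bit high-entropy ASLR. */
bool has_high_entropy_va(uint16_t dll_characteristics)
{
    return (dll_characteristics & DLLCHAR_HIGH_ENTROPY_VA) != 0;
}
```

A value of 0x0160, for instance, has both bits set, while 0x0100 has neither.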
