Why Didn't I Get Segmentation Fault When Storing Past the End of the Bss

Why didn't I get segmentation fault when storing past the end of the BSS?

x86, like most other modern architectures, uses paging / virtual memory for memory protection. On x86 (again like many other architectures), the page granularity is 4 KiB.

A 4-byte store to val won't fault unless the linker happens to place it in the last 3 bytes of a page, and the next page is unmapped.

What actually happens is that you just overwrite whatever is after val. In this case, it's just unused space to the end of the page. If you had other static storage locations in the BSS, you'd step on their values. (Call them "variables" if you want, but the high-level concept of a "variable" doesn't just mean a memory location, a variable can be live in a register and never needs to have an address.)

Besides the Wikipedia article linked above, see also:

How does x86 paging work? (internals of the page-table format, and how the OS manages it and the CPU reads it).
What is the state of the art in Memory Protection?
Is it safe to read past the end of a buffer within the same page on x86 and x64?
About the memory layout of programs in Linux

but actually put 2 bytes (charcode for 1 and newline symbol) into the memory location.

mov [val], eax is a 4-byte store. The operand-size is determined by the register. If you wanted to do a 2-byte store, use mov [val], ax.

Fun fact: MASM would warn or error about an operand-size mismatch, because it magically associates sizes with symbol names based on the declaration that reserves space after them. NASM stays out of your way and doesn't attach a size to val, so if you wrote mov [val], 0x0A31, it would be an error: neither operand implies a size, so you need mov dword [val], 0x0A31 (or word or byte).

Placing `val` at the end of a page to get a segfault

The BSS for some reason doesn't start at the beginning of a page in a 32-bit binary, but it is near the start of a page. You're not linking with anything else that would use up most of a page in the BSS. nm bss-no-segfault shows that it's at 0x080490a8, and a 4k page is 0x1000 bytes, so the last byte in the BSS mapping will be 0x08049fff.

It seems that the BSS start address changes when I add an instruction to the .text section, so presumably the linker's choices here are related to packing things into an ELF executable. It doesn't make much sense, because the BSS isn't stored in the file, it's just a base address + length. I'm not going down that rabbit hole; I'm sure there's a reason that making .text slightly larger results in a BSS that starts at the beginning of a page, but IDK what it is.

Anyway, if we construct the BSS so that val is right before the end of a page, we can get a fault:

... same .text

section .bss
dummy:  resb 4096 - 0xa8 - 2
val:    resb 1

;; could have done this instead of making up constants
;; ALIGN 4096
;; dummy2: resb 4094
;; val2:   resb

Then build and run:

$ asm-link -m32 bss-no-segfault.asm
+ yasm -felf32 -Worphan-labels -gdwarf2 bss-no-segfault.asm
+ ld -melf_i386 -o bss-no-segfault bss-no-segfault.o

peter@volta:~/src/SO$ nm bss-no-segfault
080490a7 B __bss_start
080490a8 b dummy
080490a7 B _edata
0804a000 B _end         <---------  End of the BSS
08048080 T _start
08049ffe b val          <---------  Address of val

 gdb ./bss-no-segfault

 (gdb) b _start
 (gdb) r
 (gdb) set disassembly-flavor intel
 (gdb) layout reg

 (gdb) p &val
 $2 = (<data variable, no debug info> *) 0x8049ffe
 (gdb) si    # and press return to repeat a couple times

mov [val], eax segfaults because it crosses into the unmapped page. mov [val], ax would work (because I put val 2 bytes before the end of the page).

At this point, /proc/<PID>/smaps shows:

... the r-x private mapping for .text
08049000-0804a000 rwxp 00000000 00:15 2885598                            /home/peter/src/SO/bss-no-segfault
Size:                  4 kB
Rss:                   4 kB
Pss:                   4 kB
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:         4 kB
Referenced:            4 kB
Anonymous:             4 kB
...
[vvar] and [vdso] pages exported by the kernel for fast gettimeofday / getpid

Key things: rwxp means read/write/execute, and private. Even stopped before the first instruction, somehow it's already "dirty" (i.e. written to). So is the text segment, but that's expected from gdb changing the instruction to int3.

The 08049000-0804a000 (and 4 kB size of the mapping) shows us that the BSS only has 1 page mapped. There's no data segment, just text and BSS.

Why does my program throw a segmentation fault while using heap-allocated memory?

Even though I have allocated only 1 byte of memory to the char pointer str (calloc(1,'\0')), and I copied an 18-byte string "mystring0123456789" into it, it didn't throw any error and the program worked fine without any SEGFAULT.

Your code had a bug -- of course it's not going to do what you expect. Fix the bug and the mystery will go away.

If I replace the statement

strcpy(str,"mystring0123456789");

with

str="mystring0123456789\0";

the program gives a segmentation fault even though I have allocated enough memory for str (malloc(100)).

Because when you finish this, str points to a constant. This throws away the previous value of str, a pointer to memory you allocated, and replaces it with a pointer to that constant.

You cannot modify a constant, that's what makes it a constant. The strcpy function copies the constant into a variable which you can then modify.

Imagine if you could do this:

int* h = &2;

Now, if you did *h = 1; you'd be trying to change that constant 2 in your code, which of course you can't do.

That's effectively what you're doing with str="mystring0123456789\0";. It makes str point to that constant in your source code which, of course, you can't modify.

NASM Subroutine Call Segmentation Fault

Well, you will need a debugger, as there are several problems in your code, and it's a bit too large for me to run in my head accurately (like 100% guarding stack/etc), so here are only a few things I see:

In the CHAR_CHECK: loop, test the length during the loop, so you don't overwrite .bss memory when somebody gives you too long a string. You can move the length check right under CHAR_LOOP:, so that when edi is out of bounds, you stop looping.

Also add the null character before storing N (swap those two mov lines), as N is stored right after X in memory, so with 31 (?) long input string you will overwrite N to 0 (this particularly is not exploitable, but the copy of long string may be).

jl/jg are used in the length check, but length is unsigned, so jb/ja would make more sense to me (not a bug, the signed test >=1 && <= 30 will fail at the same time as the unsigned one, it just doesn't feel right if you have programming OCD).

good/bad char test - you can make it a bit shorter by doing only two tests ('0' <= char && char <= '2'), as ['0', '1', '2'] are values [48, 49, 50].

And now more serious stuff follows.

In I/J loop you don't reset J, so logic of your inner loop will be flawed.

push dword [X] I don't think this does what you think it does. The address of the string is X, [X] is the content of memory (chars of the string). (This will make the sufcmp code segfault early, when it tries to access "address" '0010', which is not legal.)

In the swap, for example mov edx, dword [y + edi] ... you increment edi by 1, but Y is defined as array of dwords, so everywhere the indexing should be edi*4.

cmp esi, dword [N-1] ; if i = N-1 uhm, nope, it will compare esi with value at address N-1, so if [N] contains 16 and ahead of it is single zero byte, the cmp will compare esi with value 4096 (memory at N-1 is: 00 10 00 00 00, so [N] == 0x00000010 and [N-1] == 0x00001000).

mov eax, dword [X] ; move address of X to eax - no, lea would do what the comment says. mov will fetch the content of memory at address X.

add eax, [y + esi] - again using +-1 indexing with dword array.

And you forget to call print_string, only new line is called.

You can rewrite that part as:

mov eax,[y + esi*4]   ; eax = Y[i]
lea eax,[X + eax]     ; eax = address X + Y[i]

And, as I'm cruel and tired, I kept my biggest worry for the last note.

I don't think this will work at all. Your bubble sort is iterating over original X string (well, it's not, but once you fix the argument issues with correct addresses, it will).

Every time. So you keep shuffling content of Y array according to original string in every iteration.

The most important part of my answer is the first sentence. You absolutely need a debugger. If you felt like the language made some sense to you up till now, your source doesn't prove that. Actually I can see glimpses of understanding, so you are basically right, but you would have to be a total prodigy whizz kid to pull this off without a debugger within reasonable time. I would grade you only as above-average, maybe good, but far from prodigious.

If you still want to go without debugger, change technique. Don't write so much of code without compiling + running it. Do it by much much much smaller steps, and keep displaying all sort of things, to be sure your new 3 lines of code do what they should. For example if you would create empty stub for sufcmp just printing the string from pointer, it would segfault right after trying to access the string.

That would maybe give you better hint, than when almost final app code is segfaulting, so instead of hunting problem on recent 10 lines you have 50+ to reason about.

EDIT: algorithm suggestion:

Unless you really must use bubble sort, I would avoid that, and do the brute-force dumb "count" variant of sort.

i:[0,N): count[i] = countif(j:[0,N): X[j] < X[i])
i:[0,N): j:[0,N): if (i == count[j]) print X[j]

I hope you will be able to decipher it... it means that I would calculate for every suffix how many suffixes are "smaller" lexicographically, i.e. full O(N²) loopy loop (which is in reality O(N³), because comparing strings is another O(N) ... but who cares with N=30, even N⁵ would be bearable).

Then to print suffixes in correct order you simply search the count array again and again, first time for 0 smaller-count (that's the smallest one), then for 1, ... etc. Till you print all of them.

Actually you may loop through all suffixes, calculate how many are smaller, and put index of that suffix into sorted[smaller_count], so for printing you will just loop through sorted array from 0 to N-1, no searching involved.

Why do I get triple fault when trying to handle an exception on 286 but not on a modern CPU nor Bochs?

In your protected mode code you have:
```
lidt     [idtr]
mov      cx, DATA_SELECTOR
mov      es, cx
mov      ss, cx
mov      ds, cx
```
This relies on DS being set to 0x0000 before entering protected mode (and the corresponding base address being 0 in the DS descriptor cache) at the point where lidt [idtr] executes. That instruction has an implicit DS segment. Place the lidt instruction after you set the segment registers with 16-bit selectors, not before.
Although it didn't manifest itself as a bug on your hardware, in real mode your code also relies on CS being set to 0x0000 for the instruction lgdt [cs:gdtr]. CS being 0x0000 isn't guaranteed, as it is very possible for some BIOSes to use a nonzero CS to reach your bootloader. For example 0x07c0:0x0000 would also reach physical address 0x07c00 ((0x07c0<<4)+0x0000 = 0x07c00). In the real mode code I'd recommend setting DS to zero and using lgdt [gdtr].
Once in protected mode and before using the stack you should set SP. Interrupts will require the stack pointer being somewhere valid. Initializing it to 0x0000 would have the stack grow down from the top of the 64KiB segment. You shouldn't rely on it happening to point somewhere that won't interfere with your running system once in protected mode (ie. on top of your bootloader code/data).
Before using any of the string instructions like STOS/SCAS/CMPS/LODS you should ensure that the Direction Flag is set as you expect it. Since you rely on forward movement you should clear the Direction Flag with CLD. You shouldn't assume that the Direction Flag is clear upon entry to your bootloader.

Many of these issues are captured in my General Bootloader Tips in another Stack Overflow answer.

What is the simplest standard conform way to produce a Segfault in C?

A segmentation fault is implementation-defined behavior. The standard does not define how the implementation should deal with undefined behavior, and in fact the implementation could optimize out undefined behavior and still be compliant. To be clear, implementation-defined behavior is behavior which is not specified by the standard but which the implementation should document. Undefined behavior is code that is non-portable or erroneous and whose behavior is unpredictable and therefore cannot be relied on.

If we look at the C99 draft standard §3.4.3 undefined behavior which comes under the Terms, definitions and symbols section in paragraph 1 it says (emphasis mine going forward):

behavior, upon use of a nonportable or erroneous program construct or of erroneous data, for which this International Standard imposes no requirements

and in paragraph 2 says:

NOTE Possible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message), to terminating a translation or execution (with the issuance of a diagnostic message).

If, on the other hand, you simply want a method defined in the standard that will cause a segmentation fault on most Unix-like systems then raise(SIGSEGV) should accomplish that goal. Although, strictly speaking, SIGSEGV is defined as follows:

SIGSEGV an invalid access to storage

and §7.14 Signal handling <signal.h> says:

An implementation need not generate any of these signals, except as a result of explicit calls to the raise function. Additional signals and pointers to undeclarable functions, with macro definitions beginning, respectively, with the letters SIG and an uppercase letter or with SIG_ and an uppercase letter, may also be specified by the implementation. The complete set of signals, their semantics, and their default handling is implementation-defined; all signal numbers shall be positive.

About the memory layout of programs in Linux

I'm assuming you're building this with gcc -m32 -nostartfiles segment-bounds.S or similar, so you have a 32-bit dynamic binary. (You don't need -m32 if you're actually using a 32-bit system, but most people that want to test this will have 64-bit systems.)

My 64-bit Ubuntu 15.10 system gives slightly different numbers from your program for a few things, but the overall pattern of behaviour is the same. (Different kernel, or just ASLR, explains this. The brk address varies wildly, for example, with values like 0x9354001 or 0x82a8001)

1) Why is my program starting at address 0x8048190 instead of 0x8048000?

If you build a static binary, your _start will be at 0x8048000.

We can see from readelf -a a.out that 0x8048190 is the start of the .text section. But it isn't at the start of the text segment that's mapped to a page. (pages are 4096B, and Linux requires mappings to be aligned on 4096B boundaries of file position, so with the file laid out this way, it wouldn't be possible for execve to map _start to the start of a page. I think the Off column is position within the file.)

Presumably the other sections in the text segment before the .text section are read-only data that's needed by the dynamic linker, so it makes sense to have it mapped into memory in the same page.

## part of readelf -a output
Section Headers:
  [Nr] Name              Type            Addr     Off    Size   ES Flg Lk Inf Al
  [ 0]                   NULL            00000000 000000 000000 00      0   0  0
  [ 1] .interp           PROGBITS        08048114 000114 000013 00   A  0   0  1
  [ 2] .note.gnu.build-i NOTE            08048128 000128 000024 00   A  0   0  4
  [ 3] .gnu.hash         GNU_HASH        0804814c 00014c 000018 04   A  4   0  4
  [ 4] .dynsym           DYNSYM          08048164 000164 000020 10   A  5   1  4
  [ 5] .dynstr           STRTAB          08048184 000184 00001c 00   A  0   0  1
  [ 6] .gnu.version      VERSYM          080481a0 0001a0 000004 02   A  4   0  2
  [ 7] .gnu.version_r    VERNEED         080481a4 0001a4 000020 00   A  5   1  4
  [ 8] .rel.plt          REL             080481c4 0001c4 000008 08  AI  4   9  4
  [ 9] .plt              PROGBITS        080481d0 0001d0 000020 04  AX  0   0 16
  [10] .text             PROGBITS        080481f0 0001f0 0000ad 00  AX  0   0  1         ########## The .text section
  [11] .eh_frame         PROGBITS        080482a0 0002a0 000000 00   A  0   0  4
  [12] .dynamic          DYNAMIC         08049f60 000f60 0000a0 08  WA  5   0  4
  [13] .got.plt          PROGBITS        0804a000 001000 000010 04  WA  0   0  4
  [14] .data             PROGBITS        0804a010 001010 0000d4 00  WA  0   0  1
  [15] .bss              NOBITS          0804a0e8 0010e4 0002f4 00  WA  0   0  8
  [16] .shstrtab         STRTAB          00000000 0010e4 0000a2 00      0   0  1
  [17] .symtab           SYMTAB          00000000 001188 0002b0 10     18  38  4
  [18] .strtab           STRTAB          00000000 001438 000123 00      0   0  1
Key to Flags:
  W (write), A (alloc), X (execute), M (merge), S (strings)
  I (info), L (link order), G (group), T (TLS), E (exclude), x (unknown)
  O (extra OS processing required) o (OS specific), p (processor specific)

2) Why is there a gap between the end of the text section and the start of the data section?

Why not? They have to be in different segments of the executable, so mapped to different pages. (Text is read-only and executable, and can be MAP_SHARED. Data is read-write and has to be MAP_PRIVATE. BTW, in Linux the default is for data to also be executable.)

Leaving a gap makes room for the dynamic linker to map the text segment of shared libraries next to the text of the executable. It also means an out-of-bounds array index into the data section is more likely to segfault. (Earlier and noisier failure is always easier to debug).

3) The bss start and end addresses are the same. I assume that the two buffers are stored somewhere else, is this correct?

That's interesting. They're in the bss, but IDK why the current position isn't affected by .lcomm labels. Probably they go in a different subsection before linking, since you used .lcomm instead of .comm. If I use .skip or .zero to reserve space, I get the results you expected:

.section .bss
start_bss:
#.lcomm buffer, 500
#.lcomm buffer2, 250
buffer:  .skip 500
buffer2: .skip 250
end_bss:

.lcomm puts things in the BSS even if you don't switch to that section. i.e. it doesn't care what the current section is, and maybe doesn't care about or affect what the current position in the .bss section is. TL:DR: when you switch to the .bss manually, use .zero or .skip, not .comm or .lcomm.

4) If the system break point is at 0x83b4001, why I get the segmentation fault earlier at 0x804a000?

That tells us that there are unmapped pages between the text segment and the brk. (Your loop starts with ebx = $start_text, so it faults on the first unmapped page after the text segment.) Besides the hole in virtual address space between text and data, there are probably also other holes beyond the data segment.

Memory protection has page granularity (4096B), so the first address to fault will always be the first byte of a page.
