How to Read a File Using Shellcode Without Explicitly Mentioning Syscall (0X0F05) with Write Permissions Disabled

How to use a syscall in a shellcode without use `syscall` or `sysenter` for Linux x86-64, avoiding 0x0F bytes?

After some work here, I find one solution.
The idea behind was to construct a shellcode that has the capability to change himself, modify some bytes of machine code before they execute.

So what I did was to load rip into a register and put some bytes after. Then I change those bytes to \x0f\x05 and in this way, I finally executed my shellcode. I could have use a RIP-relative store instead of a RIP-relative LEA, after getting the desired bytes into a register (with mov + xor or shift, or various other ways.)

Linux Shellcode Hello, World!

When you inject this shellcode, you don't know what is at message:

mov ecx, message

in the injected process, it can be anything but it will not be "Hello world!\r\n" since it is in the data section while you are dumping only the text section. You can see that your shellcode doesn't have "Hello world!\r\n":

"\xb8\x04\x00\x00\x00"
"\xbb\x01\x00\x00\x00"
"\xb9\x00\x00\x00\x00"
"\xba\x0f\x00\x00\x00"
"\xcd\x80\xb8\x01\x00"
"\x00\x00\xbb\x00\x00"
"\x00\x00\xcd\x80";

This is common problem in shellcode development, the way to work around it is this way:

global _start

section .text

_start:
    jmp MESSAGE      ; 1) lets jump to MESSAGE

GOBACK:
    mov eax, 0x4
    mov ebx, 0x1
    pop ecx          ; 3) we are poping into `ecx`, now we have the
                     ; address of "Hello, World!\r\n" 
    mov edx, 0xF
    int 0x80

    mov eax, 0x1
    mov ebx, 0x0
    int 0x80

MESSAGE:
    call GOBACK       ; 2) we are going back, since we used `call`, that means
                      ; the return address, which is in this case the address 
                      ; of "Hello, World!\r\n", is pushed into the stack.
    db "Hello, World!", 0dh, 0ah

section .data

Now dump the text section:

$ nasm -f elf shellcode.asm
$ ld shellcode.o -o shellcode
$ ./shellcode 
Hello, World!
$ objdump -d shellcode

shellcode:     file format elf32-i386

Disassembly of section .text:

08048060 <_start>:
 8048060:   e9 1e 00 00 00   jmp    8048083 <MESSAGE>

08048065 <GOBACK>:
 8048065:   b8 04 00 00 00   mov    $0x4,%eax
 804806a:   bb 01 00 00 00   mov    $0x1,%ebx
 804806f:   59               pop    %ecx
 8048070:   ba 0f 00 00 00   mov    $0xf,%edx
 8048075:   cd 80            int    $0x80
 8048077:   b8 01 00 00 00   mov    $0x1,%eax
 804807c:   bb 00 00 00 00   mov    $0x0,%ebx
 8048081:   cd 80            int    $0x80

08048083 <MESSAGE>:
 8048083:   e8 dd ff ff ff   call   8048065 <GOBACK>
 8048088:   48               dec    %eax                    <-+
 8048089:   65               gs                               |
 804808a:   6c               insb   (%dx),%es:(%edi)          |
 804808b:   6c               insb   (%dx),%es:(%edi)          |
 804808c:   6f               outsl  %ds:(%esi),(%dx)          |
 804808d:   2c 20            sub    $0x20,%al                 |
 804808f:   57               push   %edi                      |
 8048090:   6f               outsl  %ds:(%esi),(%dx)          |
 8048091:   72 6c            jb     80480ff <MESSAGE+0x7c>    |
 8048093:   64               fs                               |
 8048094:   21               .byte 0x21                       |
 8048095:   0d               .byte 0xd                        |
 8048096:   0a               .byte 0xa                      <-+

$

The lines I marked are our "Hello, World!\r\n" string:

$ printf "\x48\x65\x6c\x6c\x6f\x2c\x20\x57\x6f\x72\x6c\x64\x21\x0d\x0a"
Hello, World!

$

So our C wrapper will be:

char code[] = 

    "\xe9\x1e\x00\x00\x00"  //          jmp    (relative) <MESSAGE>
    "\xb8\x04\x00\x00\x00"  //          mov    $0x4,%eax
    "\xbb\x01\x00\x00\x00"  //          mov    $0x1,%ebx
    "\x59"                  //          pop    %ecx
    "\xba\x0f\x00\x00\x00"  //          mov    $0xf,%edx
    "\xcd\x80"              //          int    $0x80
    "\xb8\x01\x00\x00\x00"  //          mov    $0x1,%eax
    "\xbb\x00\x00\x00\x00"  //          mov    $0x0,%ebx
    "\xcd\x80"              //          int    $0x80
    "\xe8\xdd\xff\xff\xff"  //          call   (relative) <GOBACK>
    "Hello wolrd!\r\n";     // OR       "\x48\x65\x6c\x6c\x6f\x2c\x20\x57"
                            //          "\x6f\x72\x6c\x64\x21\x0d\x0a"

int main(int argc, char **argv)
{
    (*(void(*)())code)();

    return 0;
}

Lets test it, using -z execstack to enable read-implies-exec (process-wide, despite "stack" in the name) so we can executed code in the .data or .rodata sections:

$ gcc -m32 test.c -z execstack -o test
$ ./test 
Hello wolrd!

It works. (-m32 is necessary, too, on 64-bit systems. The int $0x80 32-bit ABI doesn't work with 64-bit addresses like .rodata in a PIE executable. Also, the machine code was assembled for 32-bit. It happens that the same sequence of bytes would decode to equivalent instructions in 64-bit mode but that's not always the case.)

Modern GNU ld puts .rodata in a separate segment from .text, so it can be non-executable. It used to be sufficient to use const char code[] to put executable code in a page of read-only data. At least for shellcode that doesn't want to modify itself.

Why does the amount of NOPs seem to impact whether shellcode is executed successfully?

Jester's guess that the shellcode's push operations overwrite the instructions at the far end of the shell code regarding my second example was correct:

Checking the current instruction after receiving the SIGILL by setting set disassemble-next-line on and repeating the second example yields

Program received signal SIGILL, Illegal instruction.
0x00007fffffffd8ea in ?? ()
=> 0x00007fffffffd8ea:  ff  (bad)

The NOP (90) which was at this address previously has been overwritten by ff.

How does this happen? Repeat the second example again and additionally set b 8. At this point in time, the buffer has not been overflown yet.

(gdb) info frame 0
[...]
Saved registers:
  rbp at 0x7fffffffd8f0, rip at 0x7fffffffd8f8

The bytes starting at 0x7fffffffd8f8 contain the address which will be returned to after having left the function function. Then, this 0x7fffffffd8f8 address will also be the address from which stack will continue to grow again (there, the first 8 bytes will be stored). Indeed, continuing with gdb and using the si command shows that before the first push instruction of the shellcode the stack pointer points to 0x7fffffffd900:

(gdb) si
0x00007fffffffd8da in ?? ()
=> 0x00007fffffffd8da:  53      push   %rbx
(gdb) x/8x $sp
0x7fffffffd900: 0xf8    0xd9    0xff    0xff    0xff    0x7f    0x00    0x00

... and when the push instruction is executed the bytes are stored at address 0x7fffffffd8f8:

(gdb) si
0x00007fffffffd8db in ?? ()
=> 0x00007fffffffd8db:  48 89 e7        mov    %rsp,%rdi
(gdb) x/8bx $sp
0x7fffffffd8f8: 0x2f    0x62    0x69    0x6e    0x2f    0x73    0x68    0x00

Continuing with this, one can see that after the last push instruction of the shellcode the content of push is pushed on the stack at address 0x7fffffffd8e8:

0x00007fffffffd8e3 in ?? ()
=> 0x00007fffffffd8e3:  57      push   %rdi
0x00007fffffffd8e4 in ?? ()
=> 0x00007fffffffd8e4:  48 89 e6        mov    %rsp,%rsi
(gdb) x/8bx $sp
0x7fffffffd8e8: 0xf8    0xd8    0xff    0xff    0xff    0x7f    0x00    0x00

However, this is also the place where the last byte for the instruction of syscall is stored (see the x/80bx buff output in the question for the second example). Therefore, the syscall and thus the shellcode cannot be executed successfully. This doesn't happen in the first example since then the bytes pushed onto the stack grow right til the end of the shellcode (without overriding a byte of it): 8 bytes for the 8 NOPs ("\x90"x8) + 8 bytes for the saved base pointer + 8 bytes for the return address provide enough space for the 3 push operations.

How do I compile C without anything but my code in the binary?

Shortly:

strip -s does not remove the sections but only overrides them with 0 (and thus the file size remains the same.
There are a lot of program headers that we do not need in this case (in order to handle exceptions etc.)
There is a default alignment in the binary, which makes the start of it be at least 4000 (and we do not need it).

Detailed

First, we can improve it slightly if we compile the binary statically:

$ gcc -nostdlib -static nolib.c -o static_output
$ strip -s static_output  # strip -s in order to strip all (not helping here)
$ ls -lh static_output
-rwxrwxrwx 1 graul graul 8.7K Jan 17 22:59 static_output

Lets look over our elf now:
$ readelf -h static_output
ELF Header:
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
Class: ELF64
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: Advanced Micro Devices X86-64
Version: 0x1
Entry point address: 0x401000
Start of program headers: 64 (bytes into file)
Start of section headers: 8368 (bytes into file)
Flags: 0x0
Size of this header: 64 (bytes)
Size of program headers: 56 (bytes)
Number of program headers: 7
Size of section headers: 64 (bytes)
Number of section headers: 7
Section header string table index: 6

Looks like there is more than 8kn before the start of sections header!
Let's look at what this is made of:

$ readelf -e static_output
ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              EXEC (Executable file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           0x1
  Entry point address:               0x401000
  Start of program headers:          64 (bytes into file)
  Start of section headers:          8368 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         7
  Size of section headers:           64 (bytes)
  Number of section headers:         7
  Section header string table index: 6

Section Headers:
  [Nr] Name              Type             Address           Offset
       Size              EntSize          Flags  Link  Info  Align
  [ 0]                   NULL             0000000000000000  00000000
       0000000000000000  0000000000000000           0     0     0
  [ 1] .note.gnu.propert NOTE             00000000004001c8  000001c8
       0000000000000020  0000000000000000   A       0     0     8
  [ 2] .note.gnu.build-i NOTE             00000000004001e8  000001e8
       0000000000000024  0000000000000000   A       0     0     4
  [ 3] .text             PROGBITS         0000000000401000  00001000
       000000000000001b  0000000000000000  AX       0     0     1
  [ 4] .eh_frame         PROGBITS         0000000000402000  00002000
       0000000000000038  0000000000000000   A       0     0     8
  [ 5] .comment          PROGBITS         0000000000000000  00002038
       000000000000002a  0000000000000001  MS       0     0     1
  [ 6] .shstrtab         STRTAB           0000000000000000  00002062
       000000000000004a  0000000000000000           0     0     1
Key to Flags:
  W (write), A (alloc), X (execute), M (merge), S (strings), I (info),
  L (link order), O (extra OS processing required), G (group), T (TLS),
  C (compressed), x (unknown), o (OS specific), E (exclude),
  l (large), p (processor specific)

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  LOAD           0x0000000000000000 0x0000000000400000 0x0000000000400000
                 0x000000000000020c 0x000000000000020c  R      0x1000
  LOAD           0x0000000000001000 0x0000000000401000 0x0000000000401000
                 0x000000000000001b 0x000000000000001b  R E    0x1000
  LOAD           0x0000000000002000 0x0000000000402000 0x0000000000402000
                 0x0000000000000038 0x0000000000000038  R      0x1000
  NOTE           0x00000000000001c8 0x00000000004001c8 0x00000000004001c8
                 0x0000000000000020 0x0000000000000020  R      0x8
  NOTE           0x00000000000001e8 0x00000000004001e8 0x00000000004001e8
                 0x0000000000000024 0x0000000000000024  R      0x4
  GNU_PROPERTY   0x00000000000001c8 0x00000000004001c8 0x00000000004001c8
                 0x0000000000000020 0x0000000000000020  R      0x8
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000  RW     0x10

 Section to Segment mapping:
  Segment Sections...
   00     .note.gnu.property .note.gnu.build-id
   01     .text
   02     .eh_frame
   03     .note.gnu.property
   04     .note.gnu.build-id
   05     .note.gnu.property
   06

This is weird, as we called the strip function which was supposed to remove this section from our elf. If we look over the response in https://unix.stackexchange.com/questions/267070/why-doesnt-strip-remove-section-headers-from-elf-executables
We can see that though not mentioned specifically, strip does not remove these parts from our binary, but only removes their content (which does not help for our case).

We can use strip -R in order to remove these sections completely, the biggest one here is the ".eh_frame" segment (which is not needed for our case, look over Why GCC compiled C program needs .eh_frame section? to look over it).

$ strip -R .eh_frame static_output
$ ls -lh static_output
-rwxrwxrwx 1 graul graul 4.6K Jan 17 23:22 static_output*

Just to be clear, there is no reason to not strip the rest of the unwanted sections as well:

$ strip -R .eh_frame -R .note.gnu.property -R .note.gnu.build-id -R .note.gnu.property static_output
-rwxrwxrwx 1 graul graul 4.4K Jan 17 23:31 static_output

Half the size! But still not good enough. looks like there is a big program header we need to remove.

looks like gcc inserts these sections without our desire:

$ gcc -c -nostdlib -static nolib.c -o nolib.o
$ ls -l nolib.o
-rwxrwxrwx 1 graul graul 1376 Jan 17 23:40 nolib.o
$ strip -R .data -R .bss -R .comment -R .note.GNU-stack -R .note.GNU-stack -R .note.gnu.propery -R .eh_frame -R .real.eh_frame -R .symtab -R.strtab -R.shstrtab nolib.o
$ ls -l nolib.o
-rwxrwxrwx 1 graul graul 424 Jan 17 23:41 nolib.o

But this is not an elf, if we run now

$ld nolib.o -o ld_output
$ls -l ld_output
-rwxrwxrwx 1 graul graul 4760 Jan 17 23:55 ld_output

In the program ld there is a flag to remove the alignment between our sections (which is almost all of our size).

$ ld -n -static nolib.o -o ld_output
$ls -l ld_output
-rwxrwxrwx 1 graul graul 928 Jan 17 23:57 ld_output
$strip -R .note.gnu.property ld_output
$ls -l ld_output
-rwxrwxrwx 1 graul graul 472 Jan 17 23:58 ld_output

Which is a drastic improvement (though of course a lot of more work could be done).

Why do x86-64 Linux system calls modify RCX, and what does the value mean?

The system call return value is in rax, as always. See What are the calling conventions for UNIX & Linux system calls on i386 and x86-64.

Note that sys_brk has a slightly different interface than the brk / sbrk POSIX functions; see the C library/kernel differences section of the Linux brk(2) man page. Specifically, Linux sys_brk sets the program break; the arg and return value are both pointers. See Assembly x86 brk() call use. That answer needs upvotes because it's the only good one on that question.

The other interesting part of your question is:

I do not quite understand the value in the rcx register in this case

You're seeing the mechanics of how the syscall / sysret instructions are designed to allow the kernel to resume user-space execution but still be fast.

syscall doesn't do any loads or stores, it only modifies registers. Instead of using special registers to save a return address, it simply uses regular integer registers.

It's not a coincidence that RCX=RIP and R11=RFLAGS after the kernel returns to your user-space code. The only way for this not to be the case is if a ptrace system call modified the process's saved rcx or r11 value while it was inside the kernel. (ptrace is the system call gdb uses). In that case, Linux would use iret instead of sysret to return to user space, because the slower general-case iret can do that. (See What happens if you use the 32-bit int 0x80 Linux ABI in 64-bit code? for some walk-through of Linux's system-call entry points. Mostly the entry points from 32-bit processes, not from syscall in a 64-bit process, though.)

Instead of pushing a return address onto the kernel stack (like int 0x80 does), syscall:

sets RCX=RIP, R11=RFLAGS (so it's impossible for the kernel to even see the original values of those regs before you executed syscall).
masks RFLAGS with a pre-configured mask from a config register (the IA32_FMASK MSR). This lets the kernel disable interrupts (IF) until it's done swapgs and setting rsp to point to the kernel stack. Even with cli as the first instruction at the entry point, there'd be a window of vulnerability. You also get cld for free by masking off DF so rep movs / stos go upward even if user-space had used std.
Fun fact: AMD's first proposed syscall / swapgs design didn't mask RFLAGS, but they changed it after feedback from kernel developers on the amd64 mailing list (in ~2000, a couple years before the first silicon).
jumps to the configured syscall entry point (setting CS:RIP = IA32_LSTAR). The old CS value isn't saved anywhere, I think.
It doesn't do anything else, the kernel has to use swapgs to get access to an info block where it saved the kernel stack pointer, because rsp still has its value from user-space.

So the design of syscall requires a system-call ABI that clobbers registers, and that's why the values are what they are.