Notes on binary exploitation

This page discusses how to write an exploit for a stack overflow vulnerability that circumvents ASLR and stack canaries.
Not required for the exam.
I think it may be very useful for understanding in more (yet not full) depth what exploiting a memory corruption vulnerability implies in practice.

Linux Facts

In 64-bit Linux, the first six input arguments to a function must be placed in specific registers, not on the stack: rdi (First argument), rds (Second argument), rdx (Third argument)...and so on.
For example, the system library function receives the address of the input string in the rdi register, not on the stack.
Every process using shared libraries contains in its memory:
A GOT (Global Offset Table). The GOT is a table of pointers to external library functions. The values for these pointers are usually filled in dynamically upon the first invocation of each function (Lazy Binding). The executable file specifies which functions need an entry in the GOT.

The GOT's location is fixed relative to the base address of the code region. This base address may be either static (i.e., identical in all instances of the executable) or randomized (i.e., different in each execution of the executable).
A PLT (Procedure Linkage Table). As a first approximation, it is a table with one entry for each GOT entry. A PLT entry is a short code sequence (a “stub”) that merely jumps to the address stored in the corresponding GOT entry and then returns.

A call to an external function is implemented by jumping to the PLT entry of that function (after preparing the input arguments expected by the function).
ASLR randomizes the starting address of data, stack, heap; it also randomizes the starting address of libraries.

The position of the starting address of the code region depends on how the executable was compiled: if PIE (Position Independent Executable) is enabled, then it is randomized; otherwise it is defined statically, i.e., it is always the same in every execution. By default, PIE is enabled.

Assumptions

Vulnerability

The input triggers a stack overflow.
The overflown buffer is large enough to hold the entire ROP chain (one single gadget in this case).

Adversary objective

The adversary wants to invoke system(“/bin/sh”) (code reuse with the “return to libc” technique).
In order to prepare the input argument for the function, the adversary must place the address of an existing instance of “/bin/sh” in the rdi register. This is done with the ROP gadget pop rdi; ret; (code reuse with the “ROP gadgets” technique; there are many such ROP gadgets in an executable).

Thus, the adversary must overwrite the stack of the vulnerable function so that it will contain (read it bottom-up):

BEFORE	AFTER
	Address of system()	…then, the ret from the gadget will jump to this address
Input arguments of vulnerable function	Address of "/bin/sh"	The pop rdi instruction in the ROP gadget will grab this value and put it in the RDI register…
Return address	Address of ROP gadget
Local variables	Junk data

The problem is determining the 3 addresses required.

If ASLR was not enabled, then the adversary could determine the 3 addresses in advance, by examining a copy of the vulnerable program. Then, the adversary could construct a byte sequence to inject in the vulnerable program once and for all: the same byte sequence would work in all the instances of the vulnerable program.

With ASLR it is not possible to construct a single byte sequence once and for all. The reason is because one or more of the addresses to be placed on the stack can only be discovered by interacting with a specific instance of the vulnerable program, as those addresses will be different in every execution. In detail:

The offset of the system function from the beginning of libc can be found in advance. The starting address of libc must be determined at run time.
The same applies to an instance of “bin/sh” and to the ROP gadget.

We outline below how an exploit that circumvents ASLR may be constructed. First, with PIE not enabled, then with PIE enabled.

In a nutshell, construction of the byte sequence to inject in the vulnerable process requires one or more prior interactions with that process. Those interactions are required for discovering the necessary information and are executed by a code developed for that purpose.

Assumptions for exploitation

Adversary capabilities

The adversary may send text lines on the standard input of the vulnerable process and may receive the response from the standard output as a text line.
The adversary may execute several iterations of this kind, i.e., it can send a text line that overflows the buffer, receive a response, then send another text line that will trigger the same buffer overflow and so on.

If the standard input is associated with a network connection, then the adversary may operate from a remote location. Otherwise, the adversary must execute code locally, i.e., on the same machine as the vulnerable process.

As an aside, the technical literature is sometimes confusing as to what an exploit actually “is”. Sometimes it is the piece of data that is input into the vulnerable program, some other times it is the code mentioned here. Certain exploits are files containing scripts that interact with the vulnerable program to determine the missing information for the actual exploitation (e.g., macros or JavaScript). One must understand from the context, what is exactly the nature of the exploit. These considerations are largely orthogonal to the topic of this document, though.

Vulnerable program

Stack canary is not enabled.
PIE is not enabled.
It follows that the base address of the GOP and of the PLT are known.
It also follows that the address of the pointer to any given external function is also known (their offset from the start of the GOP / PLT is specified in the executable).

Exploitation outline (PIE not enabled)

The adversary needs to find A-ROP, A-SYSTEM, A-CMDSTRING.

A-ROP may be found in advance, assuming there is a suitable ROP gadget in the code region (i.e., not in libc).

A-SYSTEM and A-CMDSTRING must be found at run time. The code of the adversary may be structured in two stages, each stage exploiting the stack overflow vulnerability.

Stage 1: code reuse for invoking a library function that will print on the standard output the address A(F) of a function F of libc.
Stage 2: computation of A-SYSTEM and A-CMDSTRING from the value A(F) read in the standard output (details below) and then actual exploitation.

The offset of function F in libc is known statically. If the address A(F) of this function was known, then the starting address of libc would be derived easily: address \= base + offset. With the starting address of libc, A-SYSTEM and A-CMDSTRING can be derived in the same way (their offset is known statically).

The value A(F) can be obtained as follows:

A(F) is stored in the PLT/GOT, provided that: F is an external function actually used by the vulnerable program; and, F has been invoked at least once at the time the GOT is accessed (because of lazy binding).
The address of the entry of the GOT associated with F (i.e., the memory address containing A(F)) is obtained as follows: the offset of the GOT is known statically (because PIE is not enabled); the offset of F in the GOT is also known statically, from the content of the executable.
In order to print the value A(F), the adversary may invoke the library function puts, that takes a pointer and prints its value on the standard output; the input argument must be the address of the entry of the GOT associated with F (so that puts will print A(F)).
Finally, In order to invoke puts, the adversary needs to know its address. In practice, puts satisfies the requirements in the first bullet. Thus, there will be an entry in the PLT that allows invoking puts. The address of this entry is obtained as follows: the offset of the PLT is known statically (because PIE is not enabled); the offset of puts in the PLT is also known statically, from the content of the executable.

Stage 1 may thus be structured as follows.

Construct a byte sequence that overwrites the stack this way (read it bottom-up):

BEFORE	AFTER
	Address of main()	We want puts to return to the main function, so that the process memory layout remains unchanged. Eventually, the execution will reach again the vulnerable function (look at the assumptions about the vulnerable program).
	Address of PLT entry for puts()	…then, the ret from the gadget will jump to this address
Input arguments of vulnerable function	Address of GOT entry for F	The pop rdi instruction in the ROP gadget will grab this value and put it in the RDI register…
Return address	Address of ROP gadget
Local variables	Junk data

Send the byte sequence above to the vulnerable process.
Read the response.
Compute the base address of libc. Compute A-SYSTEM and A-CMDSTRING.

Stage 2

Construct a byte sequence that overwrites the stack as shown in a previous section.
Send the byte sequence above to the vulnerable process.

Remarks

There are many tools for simplifying the writing of the code that executes the steps described above (the “exploit”). One such tool is the pwntools Python library. An example is in the next section
Even minor changes in the libc version, compiler version, o.s. version, configurations will let the exploitation fail. Practical exploitation is very hard. Real world exploits may or may not be "robust". Their robustness may depend on the specific vulnerability, the specific vulnerable program, the ability to change behavior at run-time based on the specific environment in which the vulnerable process runs.
Function F may be puts itself, it looks like a sort of circular reasoning. But if one focuses on the requirements of F, it is simple to realize that puts satisfies those requirements. So an exploit based on the above ideas would let puts print its own address at stage 1.

Exploit written with pwntools

WARNING: generated by Gemini. I have not checked it works. It looks correct and sufficient for understanding the essence of the topic, however.

from pwn import *

# 1. Setup the binary and process
elf = ELF('./vulnerable_binary')
p = process('./vulnerable_binary')

# 2. Find our tools
rop_gadget = 0x400733        # Address of 'pop rdi; ret'
plt_puts = elf.plt['puts']   # Location of puts in the PLT
got_read = elf.got['read']   # The GOT entry for read we want to leak
addr_main = elf.symbols['main'] 

# 3. Build the Leak Chain
payload = b"A" * 72          # Buffer overflow padding
payload += p64(rop_gadget)   # Jump to pop rdi
payload += p64(got_read)     # This goes into RDI
payload += p64(plt_puts)     # Call puts(got_read)
payload += p64(addr_main)    # Return to main to keep the exploit going

p.sendline(payload)

# 4. Receive and parse the leaked address
leaked_bytes = p.recvline()[:8].strip().ljust(8, b"\x00")
leaked_read = u64(leaked_bytes)

print(f"[*] Leaked read() address: {hex(leaked_read)}")

Exploitation outline (PIE enabled)

To do

Exploitation outline (Stack canary enabled)

To do

Differences in Windows (outline)

In 64-bit Windows, the first four input arguments are placed in rcx, rdx, r8, and r9. Furthermore and most importanly, the caller must reserve 32 bytes on the stack immediately before calling a function (so called “shadow space”). The reason for this requirement is irrelevant here. If this requirement is not satisfied, then the caller function will most likely provoke a process crash.
Every process using DLLs contains in its memory data structures analogous to PLT and GOT in Linux, but structured differently:
- The IAT (Import Address Table) is the conceptual equivalent of the GOT. It holds the absolute addresses of functions inside DLLs.
- Import Stubs: are the conceptual equivalent of the PLT. These are the small jumps located within the executable's code section that redirect execution to the address stored in the IAT.
In most cases, ASLR randomizes also the starting address of the code region. In old Windows versions, there was a compiler option (/DYNAMICBASE) that was the equivalent of PIE in Linux. Today this option is enabled by default and Windows may force ASLR on the code region even in binaries not compiled with /DYNAMICBASE.

Key implications

Execution of the Stage 1 of the exploit (aimed at finding the address of the system library function) must be always preceded by some information leak step (because the starting address of the code region, in practice, is always randomized).

The information leak step requires an additional vulnerability. That is, in order to exploit a stack overflow vulnerability, there must be another vulnerability that:

can be exploited by the adversary; and,
leaks an address from which the adversary can derive the starting address of the code region.

Examples of those addresses are:

A saved return address on the stack.
A function pointer stored on the heap or stack.

Assume the adversary can force the vulnerable process to output one of those addresses, say AX. Of course, based on the knowledge of the vulnerability, the adversary knows what is the symbol in the program associated with address AX. The adversary knows the offset of that symbol in the code region (by analyzing the executable). The exploit can thus derive the base address of the code region at run time easily (address = base + offset).

At this point, the logical structure of the exploit is analogous to what described above.

Exploitation with partial overwrite (outline)

In certain cases the preliminary information leak may be obtained with a single stack overflow vulnerability that is exploited twice, in a different way: first, for leaking an address that will be used for deriving the base address of the code region; second, for the steps already outlined.

This possibility might apply both to Linux and to Windows.

Assumptions:

There is a code snippet CX that prints the content of a certain pointer.
The value of that pointer is the address of a function at a known offset in libc (so that the actual value of that pointer allows deriving the base address of libc).
CX is within the same 4096-byte (0x1000) alignment as the original return address (i.e., the address of CX and the return address must differ only in the 12 least significant bits).
After executing CX, execution eventually returns to the vulnerable function.

When the operating system loads a program in memory, the start of each section must be aligned on a page boundary. In almost all operating systems today, pages are 4096 bytes thus the starting address must be a multiple of 4096 (the 12 least significant bits are zero).

The offset of CX is known and can be obtained from the executable. The actual address of CX at run time is not known, but its least 12 significant bits are known (they will be the offset).

Said that, the preliminary information leak occurs as follows:

The first stack overflow exploitation will overwrite only the 12 least significant bits of the return address already present on the stack, so that the (overwritten) resulting address will point to CX. This way, when the vulnerable function returns, CX will be executed and this execution will leak an address. This address will be used for deriving the base address of the code region and for the subsequent exploitation steps, as indicated above.
If the stack can only be overwritten at the byte level (which is the case when exploiting a stack overflow), then the value to overwrite as the second least significant byte of the return address is only partially known (only the 4+8=12 least significant bits of CX are known).:
- Bits 0–7: Known.
- Bits 8–11: Known.
- Bits 12–15: Randomized (The 4-bit "Nibble").
To solve the previous problem, a brute force approach may be used because there are only 16 different possibilities. Usually a wrong value will provoke a crash of the vulnerable process. As long as the process is always restarted, eventually the process will leak the necessary information. All that is needed is for the exploitation code to wait for a while before injecting the next attempt.
Once the correct value for the address of CX will have been found (one that does not provoke a crash), the exploit will derive the base address of the code segment at run time in the usual way.

Practical remarks on brute force

Regarding the brute force step, these are the three most common scenarios:

The Forking Server (The Ideal Case). Many Linux network services (like older versions of Apache or simple socket servers) use the fork() system call to handle incoming connections. When a new connection arrives, the "Parent" process forks a "Child" to handle it. The child is an exact memory copy of the parent. This includes the ASLR base addresses. If the attacker crashes a child process with a wrong brute-force guess, the parent doesn't care. It simply waits for the next connection and forks a new child with the exact same memory layout. You can brute-force the 1-in-16 nibble in milliseconds.
The Auto-Restart (Systemd / Watchdogs). In both Linux and Windows, critical services are often managed by a supervisor (like systemd or Windows Service Control Manager). If the service crashes (Segmentation Fault), the supervisor detects the exit code and restarts the process. Each restart generates a new randomized base address. The attacker just keeps sending the same "guess" (e.g., guessing the nibble is 0x1). Eventually, by pure statistical luck, the OS will load the binary at a base address where that nibble actually is 0x1. You might have to wait a few seconds between crashes, but you'll eventually "hit" the target. Modern systems have "Exponential Backoff." If a service crashes repeatedly, systemd might wait 1 second, then 5, then 30 before restarting it again.
The "One-Shot" Client. If you are attacking a standalone binary (like a command-line tool or a PDF reader), a crash is fatal. The process dies, and it won't run again until the user manually opens it. Brute-forcing is almost impossible here unless the attacker can trick the user into opening the file hundreds of times. In these cases, attackers will do need an additional vulnerability for obtaining the necessary information.