3. Transforming DLLs into Shellcode
#srdi #golang #assembly #x64 #shellcode #shellcodedevelopment
A common technique in offensive security is reflectively loading DLLs in memory, either to spawn a beacon or to add functionality to an existing implant. In some ways this is a stealthier way of executing code, since we don't have to write a DLL to disk and we don't generate a kernel notification indicating that a new module has been loaded into the process.
Turning the DLL into shellcode also gives us additional flexibility, since we can use our favourite shellcode runner or injection technique to execute the DLL code.
Fortra recently released an article announcing that Cobalt Strike users can define their own reflective loader to help evade security solutions. Bobby Cooke from IBM X-Force Red has released a post describing how a user-defined loader was implemented for Cobalt Strike.
Let's do a deep dive into how to write a reflective loader in assembly that turns any DLL into position-independent shellcode.
In a previous article I went through the code for creating a reflective loader in Go. Turning a DLL into shellcode involves taking the bytes of a DLL and appending (or prepending) the loader code described in that article to the DLL bytes.
For this project we will structure the shellcode as follows: a jmp instruction, then the DLL bytes, then the DLL size, and finally the reflective loader.
Once we run the code we will hit the jmp instruction. It jumps over the DLL bytes and the DLL size to the shellcode where all the magic happens. We could have placed the reflective loader at the very top instead of using a jmp instruction, but that would make development a bit harder, as I would have to constantly adjust the offsets as my shellcode grew.
Our shellcode will then allocate some memory with VirtualAlloc, copy over the headers and sections, patch a few values, and, with the memory marked executable, hopefully the DLL will run.
Anyone attempting to follow the next section should be familiar with the following subjects:
Understanding of PE Headers (PE-Bear is a great tool)
Basic use of WinDbg (or any Windows debugger capable of debugging x64 code)
The full code can be found here:
As per the structure we defined, the first instruction in our shellcode will be a jmp instruction that takes us to the first line of the reflective loader.
The above code reads the contents of the DLL file into a byte slice, dllBytes. It then converts the size into a 64-bit unsigned integer.
The last line creates the instruction string in the format that keystone-engine expects in order to turn it into opcodes.
So if the size of the DLL is 20 bytes, the jmp instruction will land 33 bytes from the start of the shellcode:
20 bytes (dll bytes)
8 bytes (the size value at the end of the bytes)
5 bytes for the jmp instruction
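The arithmetic above can be sketched in Go as follows. This is a minimal illustration, not the code from the repository: `loaderOffset` and `encodeJmp` are hypothetical helper names, and the jmp is hand-encoded as a near `jmp rel32` (opcode `E9`) instead of being assembled with keystone-engine. It also assumes the DLL is small enough for a 32-bit displacement.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const (
	jmpLen  = 5 // E9 opcode + 32-bit displacement
	sizeLen = 8 // the DLL size is stored as a 64-bit value
)

// loaderOffset returns where the reflective loader starts, counted from
// the first byte of the shellcode: jmp + DLL bytes + size field.
func loaderOffset(dllSize uint64) uint64 {
	return jmpLen + dllSize + sizeLen
}

// encodeJmp hand-assembles `jmp rel32` (hypothetical helper). The
// displacement is relative to the end of the jmp itself, so it is the
// loader offset minus the 5 bytes of the instruction.
func encodeJmp(dllSize uint64) []byte {
	out := []byte{0xE9, 0, 0, 0, 0}
	binary.LittleEndian.PutUint32(out[1:], uint32(loaderOffset(dllSize)-jmpLen))
	return out
}

func main() {
	fmt.Println(loaderOffset(20))      // 33
	fmt.Printf("% x\n", encodeJmp(20)) // e9 1c 00 00 00
}
```

Note the difference between the two numbers: the loader sits at offset 33, while the encoded displacement is 28 (0x1c) because the CPU adds it to the address of the instruction that follows the jmp.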
Lines 2-9: Push all non-volatile registers to the stack
Line 10: This is not best practice for x64, but I find it easier to have rbp as a reference for my local variables.
Line 11: Setting the lowest 4 bits of rsp to 0 ensures that our stack stays 16-byte aligned. We could face random crashes if our stack is not aligned.
Line 12: We create space on the stack for our local variables.
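The alignment step can be checked with a few lines of Go. This is only a sketch of the arithmetic; the exact instruction is not reproduced here, so an `and rsp, 0xFFFFFFFFFFFFFFF0` style mask is an assumption, and `alignDown16` is a hypothetical helper name.

```go
package main

import "fmt"

// alignDown16 clears the lowest 4 bits of a stack pointer value,
// rounding it down to the nearest multiple of 16.
func alignDown16(rsp uint64) uint64 {
	return rsp &^ 0xF
}

func main() {
	fmt.Printf("%#x\n", alignDown16(0x14fe38)) // 0x14fe30
	fmt.Printf("%#x\n", alignDown16(0x14fe30)) // 0x14fe30 (already aligned)
}
```

Rounding down is safe because the stack grows towards lower addresses, so the mask can only move rsp further into unused stack space.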
Line 1: Get the address where our DLL size is kept
Line 2: Get the DLL size into rax
Line 3: Subtract the DLL size from that address to get the base address of the DLL
Line 4: Push the DLL size to the stack
Line 5: Push the base address of the DLL to the stack
A quick check in WinDbg confirms we have the right values: rax holds 0xb55e, which is 46430 in decimal.
Cross-checking against the size of the DLL on disk confirms that this is the right value.
And rdi points to the beginning of the DLL.
The actual reflective loader code starts below:
The aim of the above code is to identify the size of the dll in order to allocate the right size in the upcoming VirtualAlloc API call.
Line 3: Move base address of dll in rdi
Line 4: Get the NT header offset (e_lfanew) into eax.
A quick check to ensure we have the right value in eax:
We can see that "File address of new exe header" is 0x80, so we have the right value.
Lines 5-10: A series of calculations to work out the addresses of the NT, file and optional headers. We also store them on the stack in case they are needed in the upcoming code.
Let's check if rdi on line 10 holds the address of the optional header. We expect to see the value 0x20B, the optional header magic for PE32+.
Cross-checking in WinDbg shows the right value. Great.
Lines 11 & 13: Move the size of the image to eax and the image base to rax, each followed by a push instruction
We can see that at file offsets 0xB0 and 0xD0 we have the desired values. When we check the stack we should find those values stored.
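The same header walk can be sketched in Go to cross-check the offsets. This is a minimal illustration under stated assumptions: `parseHeaders` and `demoHeader` are hypothetical names, the input is a PE32+ (x64) file, and the demo buffer is synthetic, filled only with the values seen above (e_lfanew = 0x80, magic 0x20B) plus made-up ImageBase and SizeOfImage values.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

type peInfo struct {
	NtHeader, FileHeader, OptionalHeader uint32 // file offsets
	Magic                                uint16
	ImageBase                            uint64
	SizeOfImage                          uint32
}

// parseHeaders follows the same chain as the assembly:
// e_lfanew -> NT header -> file header -> optional header.
func parseHeaders(dll []byte) peInfo {
	le := binary.LittleEndian
	nt := le.Uint32(dll[0x3c:]) // e_lfanew
	file := nt + 4              // skip the "PE\0\0" signature
	opt := file + 0x14          // IMAGE_FILE_HEADER is 20 bytes
	return peInfo{
		NtHeader:       nt,
		FileHeader:     file,
		OptionalHeader: opt,
		Magic:          le.Uint16(dll[opt:]),
		ImageBase:      le.Uint64(dll[opt+0x18:]),
		SizeOfImage:    le.Uint32(dll[opt+0x38:]),
	}
}

// demoHeader builds a synthetic header with illustrative values only.
func demoHeader() []byte {
	le := binary.LittleEndian
	buf := make([]byte, 0x200)
	le.PutUint32(buf[0x3c:], 0x80)        // e_lfanew
	le.PutUint16(buf[0x98:], 0x20b)       // PE32+ magic
	le.PutUint64(buf[0xb0:], 0x180000000) // ImageBase (made up)
	le.PutUint32(buf[0xd0:], 0x9000)      // SizeOfImage (made up)
	return buf
}

func main() {
	info := parseHeaders(demoHeader())
	fmt.Printf("%#x %#x\n", info.OptionalHeader, info.Magic) // 0x98 0x20b
	fmt.Printf("%#x %#x\n", info.ImageBase, info.SizeOfImage) // 0x180000000 0x9000
}
```

With e_lfanew at 0x80 the optional header lands at 0x98, which is why ImageBase (offset 0x18) and SizeOfImage (offset 0x38) show up at file offsets 0xB0 and 0xD0.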
lpAddress will be set to the ImageBase (if available)
dwSize will be equal to the Size of Image
flAllocationType = MEM_RESERVE | MEM_COMMIT = 0x3000
flProtect = PAGE_EXECUTE_READWRITE = 0x40
We now have to assign these values to rcx, rdx, r8 and r9 before making the function call.
Lines 2-4: Use parse_module to get the address of VirtualAlloc:
r9 -> kernel32 base address
Line 5: Pop the image base from the stack into rcx (first argument)
Line 7: Pop the image size from the stack into rdx (second argument)
Line 8: r8 = 0x3000 (third argument)
Line 9: r9 = 0x40 (fourth argument)
Lines 10 & 12: Allocate and release the shadow space around the call, as per the x64 calling convention.
Line 13: Store the allocated address on the stack
Line 14: Calculate the difference between the desired address and the allocated address (if different). It will be useful later on when we relocate hardcoded addresses.
Line 15: Save the address difference to the stack
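The flag values and the delta calculation can be written out in Go. This is a sketch with hypothetical names (`imageDelta`) and made-up addresses; the constants are the documented Windows values. The subtraction deliberately wraps modulo 2^64, which is the same behaviour a 64-bit `sub` has in assembly.

```go
package main

import "fmt"

const (
	memCommit            = 0x1000
	memReserve           = 0x2000
	pageExecuteReadWrite = 0x40
)

// imageDelta is the value later added to every hardcoded address.
// If the allocation landed below the preferred base the result wraps
// around, and adding it still yields the right relocated address.
func imageDelta(allocated, preferred uint64) uint64 {
	return allocated - preferred
}

func main() {
	fmt.Printf("%#x\n", memReserve|memCommit)    // 0x3000
	fmt.Printf("%#x\n", pageExecuteReadWrite)    // 0x40

	// Made-up addresses: preferred base 0x180000000, actual base 0x20000.
	delta := imageDelta(0x20000, 0x180000000)
	fmt.Printf("%#x\n", 0x180001234+delta) // 0x21234
}
```

When VirtualAlloc returns the preferred ImageBase the delta is zero and the relocation pass becomes a no-op.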
To simplify the proof of concept we are using WriteProcessMemory to copy the headers to the destination address. WriteProcessMemory is monitored by most EDRs, so it might cause our payload to be flagged. It should be easy enough to write a memcpy function in assembly instead.
In this section we use two Windows APIs: GetCurrentProcess and WriteProcessMemory.
GetCurrentProcess doesn't take any arguments and always returns the pseudo handle -1 (0xFFF..). We could hardcode this value, but Microsoft does not recommend it.
Line 2: As with the previous functions, we have the kernel32 base address in r9
Line 3: The hash of the function in r8
Line 4: Call parse_module to get the function address
Line 5: Call rax, where the function address is stored
Line 6: We then save the handle to the stack
Let's have a quick look at what the WriteProcessMemory arguments should be:
hProcess = The pseudo handle output from the GetCurrentProcess() function
lpBaseAddress = The output from the VirtualAlloc() function
lpBuffer = Raw DLL bytes
nSize = Size of headers from optional header
lpNumberOfBytesWritten = Pointer to the stack.
Let's see how this translates into assembly code.
As always, lines 1-4: We pass the kernel32 base address to r9 and the WriteProcessMemory hash to r8d, and we call parse_module.
Now is a good time to refer to our index to find where the desired values are on the stack
Line 5: rcx = the pseudo handle
Line 6: rdx = the destination base address (returned by VirtualAlloc)
Line 7: r8 = the raw DLL bytes
Line 10: Move SizeOfHeaders, found at offset 0x3c of the optional header
Lines 9-11: Push a zero qword onto the stack, get a pointer to that location and push it to the stack, since this is the 5th argument.
Lines 13-15: Create shadow space and call the function
We have done quite a lot, so let's see if we get the values we expect before and after the call instruction in WinDbg.
Everything looks as expected before the call. Let's see if the 0x600 bytes are written to the destination memory after the call.
The variable lpNumberOfBytesWritten is set to 0x600, but let's double-check that the destination memory has the same contents as the buffer.
All is looking good :)
Let's dive into the assembly
Before copying the sections to the destination address we need to identify the section header address and the number of sections
Lines 2-3: The section headers are located at offset +0xf0 from the optional header (the size of a PE32+ optional header)
Line 4: Move the file header address to rdi
Line 5: Move the number of sections to rax
Line 6: Store the number of sections in rdi
The above code might look scary at first but let's break it down:
Line 2: Checks if rdi is zero. rdi holds the number of sections and decrements by 1 with each iteration.
Line 3: If rdi == 0, all sections have been copied and we break out of the loop by jumping to line 32 (copy_sections_loop_finished).
Lines 5-7: Identify WriteProcessMemory using parse_module
Let's have a quick look at what arguments we will be passing to WriteProcessMemory
hProcess = The pseudo handle output from the GetCurrentProcess() function
lpBaseAddress = Base address + RVA of the section
lpBuffer = Base address of Raw DLL bytes + Raw Address
nSize = Size of Raw Data
lpNumberOfBytesWritten = Pointer to the stack.
Line 12: Moves the Relative Virtual Address from the section header to r12
Line 13: Calculates the Virtual Address by adding the base address (from VirtualAlloc)
Line 16: Moves the pointer to raw data to r12
Line 17: Adds the base address of the raw bytes
Line 20: Moves the size of raw data
Let's have a quick look at the data when the WriteProcessMemory function is called for the first time.
The data matches the .text section data we see in PE-bear
Line 27: Decrements the sections counter by 1
Line 28: Adds 0x28 (the size of one section header) to r13, so that it points to the beginning of the next section header
Line 29: Jumps to the beginning of the loop
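The section loop can be sketched in Go with a plain `copy` standing in for WriteProcessMemory. This is an illustration under assumptions: `copySections` and `demoImage` are hypothetical names, and the two sections in the demo are synthetic. The field offsets (0x0c, 0x10, 0x14) and the 0x28 stride are those of IMAGE_SECTION_HEADER.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const sectionHeaderSize = 0x28 // sizeof(IMAGE_SECTION_HEADER)

// copySections copies each section's raw data from the on-disk layout
// (raw) to its virtual address inside the allocated image.
// sectionTable is the file offset of the first section header.
func copySections(image, raw []byte, sectionTable uint32, count uint16) {
	le := binary.LittleEndian
	hdr := sectionTable
	for n := count; n != 0; n-- { // counts down like the assembly loop
		rva := le.Uint32(raw[hdr+0x0c:])    // VirtualAddress
		size := le.Uint32(raw[hdr+0x10:])   // SizeOfRawData
		rawPtr := le.Uint32(raw[hdr+0x14:]) // PointerToRawData
		copy(image[rva:rva+size], raw[rawPtr:rawPtr+size])
		hdr += sectionHeaderSize
	}
}

// demoImage builds a tiny fake file with two sections and maps it.
func demoImage() []byte {
	le := binary.LittleEndian
	raw := make([]byte, 0x400)
	const table = 0x100
	// Section 0: RVA 0x1000, 4 raw bytes at file offset 0x200.
	le.PutUint32(raw[table+0x0c:], 0x1000)
	le.PutUint32(raw[table+0x10:], 4)
	le.PutUint32(raw[table+0x14:], 0x200)
	copy(raw[0x200:], []byte{0xDE, 0xAD, 0xBE, 0xEF})
	// Section 1: RVA 0x2000, 2 raw bytes at file offset 0x300.
	le.PutUint32(raw[table+0x28+0x0c:], 0x2000)
	le.PutUint32(raw[table+0x28+0x10:], 2)
	le.PutUint32(raw[table+0x28+0x14:], 0x300)
	copy(raw[0x300:], []byte{0x41, 0x42})

	image := make([]byte, 0x3000)
	copySections(image, raw, table, 2)
	return image
}

func main() {
	image := demoImage()
	fmt.Printf("% x\n", image[0x1000:0x1004]) // de ad be ef
	fmt.Printf("% x\n", image[0x2000:0x2002]) // 41 42
}
```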
With all our sections in place, the next task is to find all the hardcoded addresses in our code and add the deltaImageBase we calculated earlier.
Let's dive into the assembly.
The code is fairly long but let's break it down.
We have the outer loop (memory_relocations_loop) that loops through the relocation blocks, and within each block we have the inner loop that loops through the entries.
For the mydll.dll example this is what the relocations look like:
The outer loop will run 4 times, as can be seen at the very top. The inner loop for the block at offset 0x360C will iterate over the 0x10 (16) entries.
Line 3: The relocation blocks RVA is located at offset 0x98 from the beginning of the optional header.
Line 4: RVA value stored in eax
Line 5: Add the base address to get the relocation_table address in memory
Line 6: Zero rdi to use as a loop counter in the next section
Line 2: Move the relocation_table address to rsi
Line 3: Adds rdi to rsi, in order to point to the next relocation block
Line 4: Move the page RVA into r8
Line 5: Move the block size into r9
Lines 6-9: relocationsCount := (relocation_block.BlockSize - 8) / 2
We are essentially turning the relocation block size into the number of relocation entries in the block. This will be used later in the inner loop.
Line 7: Subtracts 8 (the size of the block header) from the block size in rcx
Line 8: Performs a right shift on rcx by 1 bit, essentially dividing the value in rcx by 2 (each entry is 2 bytes)
Lines 10-12: Test if the page RVA or the block size is zero
Line 13: If either of those values is zero the loop exits
Line 14: If neither of them is zero we add 0x8 to rsi to get the address of the first relocation entry
Line 15: This is where our inner loop, which iterates through the entries, goes
Line 18: Adds the block size to rdi in order to reach the next block on the next iteration
Line 19: Jumps back to the beginning of the loop
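The outer loop can be sketched in Go as a function that walks the blocks and reports how many entries each one holds. This is a minimal illustration: `walkBlocks` and `demoTable` are hypothetical names, and the table is synthetic (one block with 16 entries, one with 2, then a zero terminator). The sketch stops on a block size below 8 rather than only on zero, to avoid an underflow on malformed input.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

type relocBlock struct {
	PageRVA   uint32
	BlockSize uint32
	Entries   uint32
}

// walkBlocks mirrors the outer loop: read PageRVA and BlockSize, derive
// the entry count, then advance by BlockSize to the next block.
func walkBlocks(table []byte) []relocBlock {
	le := binary.LittleEndian
	var blocks []relocBlock
	off := uint32(0)
	for int(off)+8 <= len(table) {
		page := le.Uint32(table[off:])
		size := le.Uint32(table[off+4:])
		if page == 0 || size < 8 {
			break // terminator (or malformed block)
		}
		blocks = append(blocks, relocBlock{page, size, (size - 8) / 2})
		off += size
	}
	return blocks
}

// demoTable builds a synthetic relocation table.
func demoTable() []byte {
	le := binary.LittleEndian
	table := make([]byte, 0x3c)
	le.PutUint32(table[0x00:], 0x3000) // block 1: PageRVA
	le.PutUint32(table[0x04:], 0x28)   // block 1: BlockSize -> 16 entries
	le.PutUint32(table[0x28:], 0x4000) // block 2: PageRVA
	le.PutUint32(table[0x2c:], 0x0c)   // block 2: BlockSize -> 2 entries
	return table                       // last 8 bytes stay zero
}

func main() {
	for _, b := range walkBlocks(demoTable()) {
		fmt.Printf("%#x %d\n", b.PageRVA, b.Entries)
	}
	// 0x3000 16
	// 0x4000 2
}
```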
A quick check confirms that our loop performs as expected:
Line 2: rcx holds the number of entries in the block and decrements with every iteration. Here we compare it to 0, which essentially checks if we have already looped through all entries.
Line 3: If rcx is zero it jumps out of the inner loop back into the outer loop
Line 4: Moves the entry value to r11d
Let's assume r11d now has the value 0xA558.
Line 5: Zeroes the lowest 12 bits, leaving the value 0xA000 in r11d
Line 6: Shifts right by 12 bits, turning r11d into 0x000A (the relocation type)
Line 7: Checks if the remaining value is zero. A type of zero (IMAGE_REL_BASED_ABSOLUTE) is padding and needs no fix-up.
Line 8: Continues to the next entry by jumping to the end of the function where our counters are adjusted
Line 10: Identical to line 4, moving the entry value to r11d
Once again let's assume the value is 0xA558
Line 11: Zeroes the top 4 bits, leaving the value 0x558 in r11d. This value is the offset from the beginning of the page.
Line 12: Move the page RVA to r13
Line 13: Add the page RVA to the offset to get the relocation RVA relative to the DLL base address
Line 15: We add the DLL base address to the relocation RVA to get the absolute address
Line 16: We now read the actual hardcoded address from memory into r12.
Line 17: We add the delta, calculated and stored on the stack previously, to the hardcoded address
Line 18: Patch the address in memory
Line 21: Decrease the relocation entries counter
Line 22: Point to the next relocation entry in the block
The last step in our shellcode is to import all external dependencies. Once again we will need 2 loops just like we did for the relocations. As we can see from the Import tab in PE-bear, we have an entry for each DLL and then a list of functions for each dll.
Our outer loop will loop through the dlls, and the inner loop will loop through the functions and import them as required.
In the outer loop we will need to use an api such as LoadLibrary or LdrLoadDll to load the required dlls and then we can use our parse_module function to get the address of each function.
Let's see how the code looks in assembly.
This is the longest part of the code, so let's break it into smaller sections.
Here we need to capture the address of the Import Descriptor.
Line 2: Move the optional header address into r13
Line 3: The import descriptor RVA is located at offset 0x78 in the optional header
Line 4: Move the RVA value to r12
Line 5: Add the DLL base address to the RVA in r12
Line 5: The Name RVA is located at offset 0x0c from the beginning of the import descriptor
Line 6: Move the RVA value into r13
Line 7: Compares the RVA value to 0
Line 8: If the value is zero it breaks out of the loop
Line 9: Calculates the absolute address by adding the DLL base address to the RVA
Line 10: Move the absolute address to rsi
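The descriptor walk can be sketched in Go as a function that lists the DLL names an image imports. It is only an illustration: `importedDLLs`, `cString` and `demoImports` are hypothetical names and the image is synthetic. The 0x14 stride is the size of IMAGE_IMPORT_DESCRIPTOR, and the sketch works on an image whose sections are already mapped, so RVAs can be used directly as indexes.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// cString reads a null-terminated ASCII string.
func cString(b []byte) string {
	if i := bytes.IndexByte(b, 0); i >= 0 {
		return string(b[:i])
	}
	return string(b)
}

// importedDLLs walks the import descriptors until it finds one whose
// Name RVA is zero, collecting the DLL names on the way.
func importedDLLs(image []byte, optHeader uint32) []string {
	le := binary.LittleEndian
	desc := le.Uint32(image[optHeader+0x78:]) // import directory RVA
	var names []string
	for {
		nameRVA := le.Uint32(image[desc+0x0c:]) // Name field
		if nameRVA == 0 {
			break
		}
		names = append(names, cString(image[nameRVA:]))
		desc += 0x14 // next IMAGE_IMPORT_DESCRIPTOR
	}
	return names
}

// demoImports builds a synthetic mapped image with two descriptors.
func demoImports() []byte {
	le := binary.LittleEndian
	image := make([]byte, 0x400)
	const opt = 0x98
	le.PutUint32(image[opt+0x78:], 0x200)   // import directory RVA
	le.PutUint32(image[0x200+0x0c:], 0x300) // descriptor 0: Name RVA
	le.PutUint32(image[0x214+0x0c:], 0x310) // descriptor 1: Name RVA
	copy(image[0x300:], "KERNEL32.dll\x00")
	copy(image[0x310:], "USER32.dll\x00")
	return image // descriptor 2 is all zeros and ends the list
}

func main() {
	fmt.Println(importedDLLs(demoImports(), 0x98)) // [KERNEL32.dll USER32.dll]
}
```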
It would be easier to just use LoadLibrary, since all we have to provide to the API is a pointer to the name of the DLL, which is already stored in rsi.
Instead we use LdrLoadDll, which makes use of the UNICODE_STRING structure and is located in ntdll. A few years back it would even have been considered more stealthy, but I am not sure that's the case nowadays.
Let's take a look at the function definition.
In order to understand the arguments passed to this (undocumented) function I ran LoadLibrary and set a breakpoint on LdrLoadDll
PathToFile was set to 1
Flags was a pointer pointing to 0
ModuleFileName is a pointer to the UNICODE_STRING struct holding the name of the DLL
ModuleHandle is a pointer to the location where we would like the DLL's base address to be returned
The arguments that LdrLoadDll takes are straightforward, except for the UNICODE_STRING struct. Let's have a quick look at the struct definition
Firstly we need to transform the DLL name from a null-terminated byte array to a wide character array. This essentially means that every character becomes a word, where the first byte is the one we already had, followed by a 0 byte. The null terminator will be two null bytes.
Let's take kernel32.dll as an example. In memory we currently have
4B 45 52 4E 45 4C 33 32 2E 64 6C 6C 00
When this is transformed to a Wide string this is how it will look in memory
4B 00 45 00 52 00 4E 00 45 00 4C 00 33 00 32 00 2E 00 64 00 6C 00 6C 00 00 00
We then have to calculate the length of the wide DLL string.
The Length will be 24 (0x18), which is the size in bytes of the wide string without the null termination bytes
The MaximumLength will be 26 (0x1a), which is the size in bytes of the wide string including the termination bytes
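The widening and the two length fields can be checked with a short Go sketch. `widen`, `lengths` and `packLengths` are hypothetical helper names; the packing puts Length in the low 16 bits and MaximumLength in the next 16 bits, matching the field order of UNICODE_STRING. The sketch assumes a plain ASCII name, as import names are.

```go
package main

import "fmt"

// widen turns an ASCII name into a null-terminated UTF-16LE byte string
// by placing a zero byte after every character.
func widen(name string) []byte {
	out := make([]byte, 0, 2*len(name)+2)
	for i := 0; i < len(name); i++ {
		out = append(out, name[i], 0)
	}
	return append(out, 0, 0) // two-byte null terminator
}

// lengths returns the UNICODE_STRING Length (bytes, no terminator) and
// MaximumLength (bytes, terminator included).
func lengths(wide []byte) (length, maxLength uint16) {
	maxLength = uint16(len(wide))
	return maxLength - 2, maxLength
}

// packLengths merges both fields into one value, as the assembly does
// with a shift and an or before pushing it.
func packLengths(length, maxLength uint16) uint32 {
	return uint32(maxLength)<<16 | uint32(length)
}

func main() {
	wide := widen("KERNEL32.dll")
	length, maxLength := lengths(wide)
	fmt.Printf("% x\n", wide[:4])                            // 4b 00 45 00
	fmt.Println(length, maxLength)                           // 24 26
	fmt.Printf("%#x\n", packLengths(length, maxLength)) // 0x1a0018
}
```

"KERNEL32.dll" is 12 characters, so the wide string is 24 bytes plus a 2-byte terminator.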
Let's dive into the code to see how we construct the UNICODE_STRING
Lines 1-2: Zero rax and r11 to be used by the loop
Lines 3-13: Create a space of zeros on the stack
Line 16: Copy the next byte at address rsi to al
Line 17: Check if it's the null terminator
Line 18: Break the loop by jumping to line 24 (end_loop_through_DLL)
Line 19: Write the byte to the stack
Line 20: Point to the next location on the stack, leaving a 0 byte in between
Line 21: Iterate
When we reach this point, our whole string has been turned into a wide string in memory
Line 25: Calculates the MaximumLength by adding the 2 null terminating bytes to the size
At this point we start constructing the struct on the stack
Line 26: We move the MaximumLength value into ax from r11w
Line 27: We shift rax left by 16 bits
Line 28: Subtract 2 to get the Length
Line 29: We merge the MaximumLength and Length in rax by using or
Line 30: We get a pointer to the wide string on the stack
Line 31: We push the pointer to the wide string to the stack
Line 32: We push the lengths to the stack
We now have the struct on the stack.
We now have all the values we need to call LdrLoadDll
Line 1: Pointer to the UNICODE_STRING in rsi
Line 2: Store rsi to the stack
Line 3: Move the base address of ntdll in r9
Line 4: Move the hash of LdrLoadDll to r8
Line 6: Call parse_module to get the function address in rax
Lines 7-8: Set the first argument by setting rcx to 1
Line 9: Set the third argument by popping the address of the structure into r8
Lines 10-12: Set rdx (second argument) to a pointer that points to 0
Line 13: Set r9 (fourth argument) to a pointer that points to 0. The DLL base address will be stored here
Line 14: Move r9 to rsi to use after the function call
Lines 16 & 18: Add and remove shadow space before and after the call
Line 17: Call LdrLoadDll
Line 19: Store the import descriptor address on the stack before the inner loop
Line 21: First thunk RVA
Line 22: Absolute address of the first thunk
We have now loaded the DLL in memory using LdrLoadDll. The next step is to import the functions from that DLL using parse_module
Line 2: Get the RVA of the function name into r13
Line 3: Checks if the RVA equals 0
Line 4: Break out of the loop
Line 5: Get the absolute address of the function name entry
Line 6: Add 0x2 to the absolute address in r13 to skip the 2-byte Hint field that precedes the name
Line 23: Move the DLL base address to r9
Line 24: Move the hash from edx to r8
Line 25: Call parse_module
Line 27: Overwrite the thunk with the function address
Line 28: Point to the next function
Line 29: Iterate.
We are now ready to execute our code
All we have to do now is call the entry point (DllMain). The RVA of the entry point can be found in the optional header at offset 0x10
Line 1: Move the optional header address to r13
Line 2: Add the 0x10 offset to r13
Line 3: Move the RVA value to r13
Line 4: Add the DLL base address to get the absolute address of the entry point
Let's have a quick look at the Entry point (DllMain) definition:
hinstDLL is the base dll address
fdwReason we will set to 0x1 for DLL_PROCESS_ATTACH
lpvReserved will be set to 0
Line 5: the base address is moved to rcx ( 1st argument)
Line 6: rdx set to 0x1 (2nd argument)
Line 7: r8 set to 0 (3rd argument, lpvReserved)
Lines 10&12: Add and remove shadow space before and after the call
Line 11: Call Entry point
At this point, if everything went well, we will see a popup window from our DLL
Great :)
In the epilogue we restore rsp and all non-volatile registers. This is especially important if we are planning to resume execution in the calling program.
In order to check that our shellcode resumes execution without crashing our main program, we have to modify the shellcode runner to run the code inline instead of creating a new thread.
In our shellcode runner we can replace the CreateThread function with the syscallN function.
This is what our new shellcode runner function looks like.
Also, in our main function we add another print function after calling the shellcode runner function
If we successfully restore all the registers and the stack pointers, we should see the above message printed in the console after we press "OK" on the message box.
Our shellcode returns cleanly to the main function.
Let's test our code when it's injected into a remote process. We switch our function from the shellcode runner to the shellcode injection function
We then give it the PID of a running notepad
The code is meant to be used for educational purposes. It's not ready to be used in production or as part of a real red team engagement. Here is a list of a few future improvements:
Remove all null bytes from the shellcode
Ability to generate shellcode that calls any exported function and pass arguments to it.
Add indirect syscall functionality
Remove unnecessary winapi calls such as WriteProcessMemory
Clean memory after execution
Avoid the use of rwx regions
I recommend following along in your debugger to truly understand the implementation. It's hard to follow assembly in a blog post.
If we want our main program to continue executing after running our shellcode, we should preserve all non-volatile registers as per the x64 calling convention.
These functions are explained in detail in me .
r8d -> VirtualAlloc hash calculated using
The background knowledge of what we are trying to achieve can be found and the Go code .
The background knowledge can be found and the go code .
Lines 9-20: Turn the function name into a function hash as described