We repeated the startup of the binary, this time also running tcpdump on the host machine. As no network traffic could be observed immediately after the binary was executed, we let the virtual machine run for 24 hours and recorded any network traffic. As there wasn't any, we decided to move on to static code analysis of the binary.
We used the Linux 'file' command to determine characteristics for the-binary. Here is the output of this command:
    $ file the-binary
    the-binary: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), statically linked, stripped

Given that the-binary was in ELF format, we next set out to determine what system calls were being made. We first used 'objdump' to generate an assembly code listing, then wrote a Perl script to find the type and location of the system calls. The script operated by maintaining the current state of the eax register and looking for 'int $0x80' (Linux system call trap) instructions. The value present in the eax register at the time of the instruction is the index into the system call table as defined in /usr/include/asm/unistd.h. The script showed the following system calls were being made:
    80480b4: 0x88 personality
    8048105: 0x1  exit
    8056a11: 0x72 wait4
    8056a54: 0x66 socketcall
    8056a9c: 0x66 socketcall
    8056ae4: 0x66 socketcall
    8056b26: 0x66 socketcall
    8056b72: 0x66 socketcall
    8056bcc: 0x66 socketcall
    8056c1e: 0x66 socketcall
    8056c78: 0x66 socketcall
    8056cd1: 0x66 socketcall
    8056d1c: 0x66 socketcall
    8057140: 0xc  chdir
    805716c: 0x6  close
    805716c: 0x6  close
    805719b: 0x3f dup2
    80571ca: 0xb  execve
    80571f0: 0x2  fork
    8057214: 0x31 geteuid
    8057238: 0x14 getpid
    8057263: 0x4e gettimeofday
    8057292: 0x36 ioctl
    80572bf: 0x25 kill
    80572ee: 0x5  open
    805731e: 0x3  read
    8057344: 0x42 setsid
    8057372: 0x7e sigprocmask
    805739c: 0x7a uname
    80573c8: 0xa  unlink
    80573fa: 0x4  write
    8057424: 0x1b alarm
    8057450: 0xd  time
    8057482: 0x92 writev
    80574ac: 0x52 select
    80574f7: 0x43 sigaction
    8057530: 0x48 sigsuspend
    8057560: 0x1  exit
    8065d23: 0x5a mmap
    8065d65: 0x6a stat
    8065da1: 0x6c fstat
    8066106: 0x37 fcntl
    8066136: 0x13 lseek
    8066163: 0x5b munmap
    8066192: 0x91 readv
    80661c6: 0xa3 mremap
    8066206: 0x2d brk
    8066244: 0x2d brk

With the information that the binary was statically linked and the location of the system calls, we next began to look for the libraries used in creating the-binary.
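The original Perl script is not reproduced in this write-up. The following Python sketch illustrates the same eax-tracking idea under simplifying assumptions: it only recognizes immediate loads of the form 'mov $0x..,%eax', it ignores branches, and it assumes the usual 'objdump -d' line layout. The function name and the small name table are made up for illustration; a real scan would take the full table from unistd.h.

```python
import re

# Small excerpt of the i386 system call table; extend from asm/unistd.h as needed.
SYSCALL_NAMES = {0x01: "exit", 0x02: "fork", 0x03: "read", 0x04: "write",
                 0x05: "open", 0x06: "close", 0x66: "socketcall"}

# "address:  hex bytes  instruction" as printed by objdump -d (AT&T syntax).
LINE_RE = re.compile(r"^\s*([0-9a-f]+):\s+(?:[0-9a-f]{2}\s)+\s*(.*)$")
MOV_EAX_RE = re.compile(r"^movl?\s+\$0x([0-9a-f]+),\s*%eax$")
INT80_RE = re.compile(r"^int\s+\$0x80$")

def find_syscalls(listing):
    """Return (address, number, name) for each 'int $0x80' in the listing."""
    eax = None          # last immediate value seen going into eax
    found = []
    for line in listing.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue    # section headers, labels, blank lines
        address = int(match.group(1), 16)
        insn = match.group(2).strip()
        mov = MOV_EAX_RE.match(insn)
        if mov:
            eax = int(mov.group(1), 16)
        elif INT80_RE.match(insn):
            found.append((address, eax, SYSCALL_NAMES.get(eax, "unknown")))
    return found
```

Because eax can also be set through other instructions (xor, pop, register moves), any hit where the tracked value looks implausible would still need a manual look at the surrounding assembly.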
Florian then continued the reverse engineering process:
Rajeev had been looking for a free decompiler we could use and pointed out that the Reverse Engineering Compiler (REC) was available for Linux. I downloaded the decompiler and executed it on "the-binary" with the default settings. The result of the decompilation was the file "the-binary.rec".
A look at the decompilation quickly showed that this was only a small improvement over the assembly code. As the binary had been stripped of all its symbols, all variable and function names still looked very assembly-like. However, it was now much easier to follow the control flow of the code. A brief examination of the code revealed that functions as well as global variables were named after their absolute address in the assembly code, prefixed with "L0" (functions) or "*L0" (variables). Examples:
    L08048088()    a function, such as main()
    *L0806D228     a global variable, such as environ
    0x0806D228     address of a variable (&environ)
Local variables are allocated from the stack and therefore are denoted as an offset from the base pointer (ebp), so the assignment
    *(ebp + -17616) = ebp + -2048;

makes ebp + -17616 a pointer that holds the address of some variable that starts at ebp + -2048.
Function parameter names start with A8, and their numeric suffix increases by 4 in hexadecimal notation (Ac, A10, A14, A18, A1c, ...). Furthermore, there are also local variable names such as Vffffffbc, which also seem to be some sort of offset from the stack pointer. The size of variables can only be determined by looking at context and neighboring values. For example, there are variables
    ebp + -2048
    ebp + -4096
    ebp + -4536

without any values in between those. Thus ebp + -4096 is 2048 bytes and ebp + -4536 is 440 bytes. However, even though we have
    *(ebp + -17616) = ebp + -2048;
    *(ebp + -17620) = ebp + -2028;
    *(ebp + -17624) = ebp + -2026;

I concluded later on that ebp + -2048 is a buffer of 2048 bytes and that ebp + -17620 and ebp + -17624 are merely pointers into that buffer.
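The size reasoning above amounts to taking differences between neighboring ebp-relative offsets. A small Python sketch of that arithmetic follows; the function name is made up for illustration, and it assumes that nothing else lives between the listed offsets and that the topmost variable extends up to ebp itself.

```python
def sizes_from_offsets(offsets, upper_bound=0):
    """Map each ebp-relative offset to the gap up to the next higher offset."""
    sizes = {}
    ceiling = upper_bound                 # ebp + 0 bounds the topmost variable
    for off in sorted(offsets, reverse=True):   # closest to ebp first
        sizes[off] = ceiling - off
        ceiling = off
    return sizes

# Offsets of the three buffers discussed in the text.
print(sizes_from_offsets([-2048, -4096, -4536]))
```

Feeding in ebp + -2028 and ebp + -2026 as well would split the first buffer into tiny pieces, which is exactly the pitfall described above: an offset that is only ever used as a pointer into a buffer must not be treated as the start of a new variable.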
As Jim had prepared a list of system call addresses from the assembly file, I decided to start putting those into the code. Basically, wherever a function contained the line

    asm("int 0x80");

it was likely to be a system call. For example, the function:
    L08056A2C(A8, Ac, A10)
    /* unknown */ void  A8;
    /* unknown */ void  Ac;
    /* unknown */ void  A10;
    {
        /* unknown */ void  ebx;
        /* unknown */ void  Vfffffff4;
        /* unknown */ void  Vfffffff8;
        /* unknown */ void  Vfffffffc;

        Vfffffff4 = A8;
        Vfffffff8 = Ac;
        Vfffffffc = A10;
        ecx = & Vfffffff4;
        eax = 102;
        ebx = 5;
        asm("int 0x80");
        edx = eax;
        if(edx < 0) {
            *L08078B14 = ~edx;
            edx = -1;
        }
        return(edx);
    }

calls system call 102 (or 0x66) and 5 is being passed as a parameter. System call 0x66 is the "socketcall" system call, and a "man 2 socketcall" revealed that the first argument is the call number. A look at <linux/net.h> (I actually did a "find . | xargs grep -d skip LISTEN" in the "/usr/include" directory to find the correct header file) revealed that SYS_ACCEPT was equal to 5. Thus I could conclude that the above function was the "accept" system call. I then replaced all invocations of "L08056A2C" with accept. I proceeded like that with all the system calls from Jim's list. If applicable, I also replaced integer numbers with the constant names:
    (save)11;
    (save)3;
    (save)2;
    L08056CF4();

became

    socket(AF_INET, SOCK_RAW, NVP);

where NVP is not really a constant, but I put it in for better readability. A complete mapping for the system call functions can be found in the file system_functions.txt.
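The naming step described above can be sketched in Python as a lookup followed by a textual rename. The call numbers are the SYS_* values from <linux/net.h>; the two helper functions and their names are assumptions made for this illustration and are not the procedure actually used, which was done by hand.

```python
import re

# socketcall sub-call numbers as defined in <linux/net.h> (SYS_SOCKET .. SYS_RECVMSG).
SOCKETCALL_NAMES = {
    1: "socket", 2: "bind", 3: "connect", 4: "listen", 5: "accept",
    6: "getsockname", 7: "getpeername", 8: "socketpair", 9: "send",
    10: "recv", 11: "sendto", 12: "recvfrom", 13: "shutdown",
    14: "setsockopt", 15: "getsockopt", 16: "sendmsg", 17: "recvmsg",
}

def name_socket_wrapper(eax, ebx):
    """Name a wrapper from the constants it loads before 'int 0x80'."""
    if eax != 102:                      # 102 (0x66) is socketcall on i386
        return None
    return SOCKETCALL_NAMES.get(ebx)

def rename_functions(code, mapping):
    """Replace decompiler-generated names with identified ones, whole words only."""
    for old, new in mapping.items():
        code = re.sub(r"\b" + re.escape(old) + r"\b", new, code)
    return code

name = name_socket_wrapper(102, 5)      # constants taken from L08056A2C
print(rename_functions("eax = L08056A2C(A8, Ac, A10);", {"L08056A2C": name}))
```

Matching on word boundaries keeps a rename of one address from touching a longer identifier that happens to share the same prefix.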
After putting every system call in, the code looked a little better (see file decompile_with_syscalls.c), but it was still too early to really learn anything from it.
Since the binary had been statically linked, large parts of the standard C library were likely to be included in it. Jim and I agreed that the next step would be to identify the functions from the standard library. Once identified, they could be properly named and their code removed from the code file. As I had noticed earlier, some of the functions never got called, so I wrote a simple Perl script that counted the number of occurrences of each function in the code (proc_check). Another script would remove functions that I specified in a file from the code (pruneit). As the removal of one function could leave other functions without any callers, this process needed to be iterated until no more "dead" functions could be pruned out. This technique reduced the number of lines from 37,228 to 22,894.
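The iteration described above can be sketched as follows. This is a Python illustration rather than the proc_check and pruneit scripts themselves; it assumes the call relationships have already been extracted into a dictionary, and the function and parameter names are hypothetical.

```python
def prune_dead(calls, roots):
    """Repeatedly drop functions that no other remaining function calls.

    calls maps a function name to the set of names it calls;
    roots are entry points that must be kept even without callers.
    """
    live = {name: set(callees) for name, callees in calls.items()}
    removed = []
    while True:
        called = set()
        for caller, callees in live.items():
            called |= callees - {caller}      # a self-call does not keep a function alive
        dead = sorted(f for f in live if f not in called and f not in roots)
        if not dead:
            return live, removed
        for f in dead:
            del live[f]                       # its callees may become dead next round
            removed.append(f)
```

Each pass only removes functions without callers, so a group of unused functions that call each other in a cycle would survive; those would have to be spotted by hand.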
There are several ways to identify functions from the standard library. Given the library's source code, one can try to identify functions based on other functions they call, strings they contain, or constants they contain. Another method is to look at the context where a given function is called, make an educated guess as to what function it might be, and then compare the function's code with the library's source code. The easiest way is definitely using plaintext strings that are contained in the functions. However, I had started using the glibc-2.2.5 library source code as the base for my comparison, and many of the strings I found in the decompiled code were nowhere to be found in it. My first suspicion was that some library other than the standard C library was compiled in as well. I also had a very hard time matching up the code, as the glibc source code contains plenty of macros and #ifdefs.
Fortunately, Jim pointed out that the binary was compiled using the libc-5.3.12 library. I downloaded the source code (from ftp.linux.org.uk/pub/linux/libc/). Suddenly, my work got much easier. Once a function had been identified, I first checked whether it was calling other, unidentified functions, noted what they were, and then replaced the "L080..." name with the proper one for all newly discovered functions.
For example, for the following line (found in function L0804D744):

    eax = L0804F620("/etc/resolv.conf", "r");

a 'find . | xargs grep -d skip "/etc/resolv.conf"' in the root directory of the library source code didn't give any exact matches, but the macro _PATH_RESCONF was defined as that string. The same kind of search for that name then revealed the following line, the only one that matches the above:

    ./inet/res_init.c: if ((fp = fopen(_PATH_RESCONF, "r")) != NULL) {

Thus L0804D744 had to be res_init and L0804F620 fopen. To show the degree of code similarity, here is the code for L0804F620 and for fopen to compare. The res_init function looks equally similar to its L0804D744 counterpart, and from there more functions, such as fgets and strncpy, can be derived. From the fopen function we can then derive the malloc (L0805BD74) and free (L0805C290) calls, and so forth.
L0804F620 from the decompile:

    L0804F620(A8, Ac)
    /* unknown */ void  A8;
    /* unknown */ void  Ac;
    {
        /* unknown */ void  ebx;

        ebx = L0805BD74(84);
        if(ebx == 0) {
            eax = 0;
        } else {
            (save)0;
            (save)ebx;
            L08061F34();
            *(ebx + 80) = 0x807902c;
            (save)ebx;
            L08060D24();
            (save)Ac;
            (save)A8;
            (save)ebx;
            esp = esp + 24;
            if(L08060E20() == 0) {
                L08061788();
                L0805C290(ebx, ebx);
                eax = 0;
            } else {
                eax = ebx;
            }
        }
    }

fopen from libc-5.3.12:

    _IO_FILE *
    DEFUN(_IO_fopen, (filename, mode),
          const char *filename AND const char *mode)
    {
      struct _IO_FILE_plus *fp =
        (struct _IO_FILE_plus*)malloc(sizeof(struct _IO_FILE_plus));
      if (fp == NULL)
        return NULL;
      _IO_init(&fp->file, 0);
      _IO_JUMPS(&fp->file) = &_IO_file_jumps;
      _IO_file_init(&fp->file);
    #if !_IO_UNIFIED_JUMPTABLES
      fp->vtable = NULL;
    #endif
      if (_IO_file_fopen(&fp->file, filename, mode) != NULL)
        return (_IO_FILE*)fp;
      _IO_un_link(&fp->file);
      free (fp);
      return NULL;
    }

    weak_alias (_IO_fopen, fopen);

Comparison of the L0804F620 function from the decompile with the fopen function from libc-5.3.12
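The two-step search used for "/etc/resolv.conf" (look for the literal, then for the macro that stands in for it) could be automated roughly like this. It is a Python sketch, not the shell pipeline actually used; the function name and the restriction to .c and .h files are assumptions.

```python
import re
from pathlib import Path

def find_string_users(root, literal):
    """Find source lines that use a string literal directly or via a #define."""
    root = Path(root)
    quoted = '"' + literal + '"'
    define_re = re.compile(r'#\s*define\s+(\w+)\s+' + re.escape(quoted))

    # Read every C source and header file below root once.
    texts = {}
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix in (".c", ".h"):
            texts[path] = path.read_text(errors="replace").splitlines()

    # Step 1: collect macro names defined as exactly this literal.
    macros = set()
    for lines in texts.values():
        for line in lines:
            match = define_re.search(line)
            if match:
                macros.add(match.group(1))

    # Step 2: report lines that use the literal or one of those macros.
    hits = []
    for path, lines in texts.items():
        for line in lines:
            if define_re.search(line):
                continue                      # skip the definitions themselves
            uses_macro = any(re.search(r'\b' + re.escape(m) + r'\b', line)
                             for m in macros)
            if quoted in line or uses_macro:
                hits.append((path.relative_to(root).as_posix(), line.strip()))
    return macros, hits
```

Called on the root of an unpacked library source tree with a string taken from the decompile, it returns the macro names found and the candidate source lines to compare against the decompiled function.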
Comparing the two code snippets, you might notice that there is a discrepancy with the parameters of the functions that are called. We have:
    L08061788();
    L0805C290(ebx, ebx);

but

    _IO_un_link(&fp->file);
    free (fp);

This is one of a few decompiler glitches. The assembly code for this looks like this:
    804f664:  53              push %ebx
    804f665:  e8 1e 21 01 00  call 0x8061788
    804f66a:  53              push %ebx
    804f66b:  e8 20 cc 00 00  call 0x805c290

but for some reason, the decompiler associates the first ebx with the second function call. Once aware of this, I could quickly identify those glitches and rectify them. Sometimes, parameters were missing as well, but a look at the assembly code always cleared up the confusion.
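When cross-checking the decompile against the assembly like this, it helps to be able to confirm a call target from the raw bytes. On the i386, opcode e8 is a near call whose 4-byte little-endian operand is relative to the address of the next instruction. A minimal Python sketch of that calculation (the function name is made up for illustration):

```python
def call_target(address, encoding):
    """Compute the destination of a 5-byte 'e8 rel32' near call.

    address is where the call instruction starts; encoding is its hex bytes.
    """
    raw = bytes.fromhex(encoding)
    if len(raw) != 5 or raw[0] != 0xE8:
        raise ValueError("not an e8 rel32 call")
    rel = int.from_bytes(raw[1:], "little", signed=True)
    return (address + 5 + rel) & 0xFFFFFFFF   # relative to the next instruction

# The two calls from the listing above.
print(hex(call_target(0x804f665, "e8 1e 21 01 00")))   # 0x8061788
print(hex(call_target(0x804f66b, "e8 20 cc 00 00")))   # 0x805c290
```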
For some reason, the decompiler also can't handle the modulus function if one of the operands is a function result. This results in code like:
    rand();
    ecx = 10;
    asm("cdq");
    edi = ecx / ecx % ecx / ecx;

The assembly code looks like this:

    8048440:  e8 13 dc 00 00  call 0x8056058
    8048445:  b9 0a 00 00 00  mov  $0xa,%ecx
    804844a:  99              cltd
    804844b:  f7 f9           idiv %ecx,%eax
    804844d:  89 d7           mov  %edx,%edi

so the code should read:

    edi = rand() % 10;

The identification of the standard C library calls and the removal of their code was a long and tedious task. After I had identified all that I could (basically, there were no more distinguishing strings, constants, function calls or context left), I pruned the code of "dead" functions once again, and the resulting file (decompile_with_syscalls.c) was down to 4217 lines of code.
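The reason the corrected line is a modulus and not a division is that idiv leaves the quotient in eax and the remainder in edx, and it is edx that gets copied to edi. A small Python model of the cltd/idiv pair makes this explicit; the function name is made up, and the model ignores the overflow trap that the real instruction raises for quotients that do not fit in 32 bits.

```python
def cdq_idiv(eax, ecx):
    """Model 'cltd; idiv %ecx' on signed values: returns (eax, edx) afterwards."""
    if ecx == 0:
        raise ZeroDivisionError("idiv by zero raises a divide error")
    quotient = abs(eax) // abs(ecx)
    if (eax < 0) != (ecx < 0):
        quotient = -quotient              # idiv truncates toward zero
    remainder = eax - quotient * ecx      # takes the sign of the dividend
    return quotient, remainder

quotient, remainder = cdq_idiv(1234567, 10)
print(remainder)                          # the value that ends up in edi
```

Since rand() never returns a negative number, the remainder here is the same as what the C expression rand() % 10 yields.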
My next task was to translate the C code that was left into a more readable form. Hence I went through the code, starting at the entry point, and rewrote most of it. For the most part, the "original" code was left as a comment below the rewritten one.
Ben did an analysis of what was going on at startup and he concluded that this was probably standard system initialization and that function L08048134 was "main", so I started my analysis there. The biggest challenge was understanding how the variables are used and giving them proper names. Here is a mapping of the most important variables:
    char buffer[2048]            : *(ebp + -17616) = ebp + -2048;
    char buffer2[2048]           : *(ebp + -17632) = ebp + -4096;
    char buffer3[440]            : *(ebp + -17636) = ebp + -4536;
    unsigned char r              : *(ebp + -17648);
    int offset                   : *(ebp + -17644);
    FILE fstream                 : *(ebp + -17628);
    char *buffer4                : *(ebp + -17640);   // turned out to be a pointer
    char buffer5[504]            : ebp + -17596;
    struct sockaddr_in cli_sock  : ebp + -4568;
    char buffer6[19]             : ebp + -17340;

The final version of the interpretation can be found in the file decompile_final.c. It is a result of reading C code, reading up on network programming literature, and plenty of assistance from Ben and Jim. This is not working C code, but a person familiar with C and UNIX network programming shouldn't have any trouble following it. While interpreting, I found a few other functions from the standard C library (such as inet_addr) and removed their code. I wasn't able to identify two functions that actually get called in the code. I named them precise_sleep and signal_action, but their purpose should be clear.
I did not interpret the function I named more_udp_stuff (an initial name I gave it that I never changed), as its functionality is the same as that of dos_dns_udp with the additional option of specifying a destination address.
The function dos_dns_udp sends DNS requests with a spoofed source IP address to a destination address and can therefore be used as a reflector DoS client. During its analysis, I discovered that the function reads data from the read-only data section of the binary starting at address 0x8067698. These turned out to be buffer lengths followed by DNS query packets. I wrote a Perl script to extract the data (dns_extract); the extracted data for each packet is commented and presented as pseudo C code in the file dns_data.c.
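The dns_extract script is not reproduced here. As an illustration of how length-prefixed records can be pulled out of a data section, here is a Python sketch. The record layout is an assumption made for this example (a 4-byte little-endian length followed by that many packet bytes, ending at a zero length); the actual layout in the binary should be taken from dns_data.c.

```python
import struct

def extract_length_prefixed(blob, start=0):
    """Split blob into records of the assumed form <uint32 length><payload>."""
    packets = []
    pos = start
    while pos + 4 <= len(blob):
        (length,) = struct.unpack_from("<I", blob, pos)
        pos += 4
        if length == 0 or pos + length > len(blob):
            break                         # terminator or truncated record
        packets.append(blob[pos:pos + length])
        pos += length
    return packets
```

The blob would be the bytes of the read-only data section beginning at the file offset that corresponds to 0x8067698; mapping the virtual address to a file offset via the ELF section headers is left out of this sketch.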
Furthermore, for the destination of the DNS packets, a list of IP addresses is used that resides in the .data portion of the binary starting at address 0x806d22c (in the .asm file). It seems a random address is picked from the first 8000 entries of that list. The list itself, however, is larger than that. Again, I wrote a Perl script that extracted those addresses (ip_extract). The first 8000 are contained in the file ip_addresses.txt.
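A Python sketch of such an address extraction follows. It assumes that each entry is a 4-byte IPv4 address stored in network byte order, which is how inet_addr results are normally kept, and it models the selection as a uniform pick among the first 8000 entries; both points are assumptions for illustration, and the function names are made up.

```python
import ipaddress
import random

def extract_ipv4(blob, count):
    """Decode up to count consecutive 4-byte entries as dotted-quad strings."""
    addresses = []
    for i in range(0, 4 * count, 4):
        chunk = blob[i:i + 4]
        if len(chunk) < 4:
            break                         # ran off the end of the data
        addresses.append(str(ipaddress.IPv4Address(chunk)))
    return addresses

def pick_destination(addresses, limit=8000, rng=random):
    """Pick one address at random from the first limit entries."""
    return rng.choice(addresses[:limit])
```

Here blob would be the bytes of the .data section starting at the file offset that corresponds to 0x806d22c.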
This concludes the analysis portion. The answers to the questions were derived from looking at the C code that we reverse-engineered.
"TCP/IP Illustrated Vol. 1", W. Richard Stevens, Addison Wesley, 1994
"Advanced Programming in the UNIX Environment", W. Richard Stevens, Addison Wesley, 1993
Intel i386 instruction manual
libc-5.3.12 source code