Initial testing

As a first step in analyzing the binary, we set aside an unused lab machine and installed VMware on it. Inside the virtual machine we installed Tripwire so that we could monitor any changes to system or log files. Once the virtual machine was configured and "the-binary" had been transferred, we changed the hard disk mode from persistent to non-persistent, which let us always start the system from the initial configuration. We then disconnected the lab machine from our network and started the binary. After typing "the-binary" at the command prompt as user root, the prompt returned immediately. A "ps -aux" revealed a new process, "[mingetty]", running on the system. A "ps -auxc" showed that the process was actually running as "the-binary". A "netstat -a" showed a new raw IP socket open and listening, using the Network Voice Protocol (NVP), a transport-layer protocol carried over IP. An analysis of the Tripwire logs showed that no system files or logs had been modified.

We repeated the startup of the binary, this time also running tcpdump on the host machine. When no network traffic could be observed right after the binary was executed, we let the virtual machine run for 24 hours and recorded any network traffic. As there wasn't any, we decided to move on to static code analysis of the binary.

Static Analysis

Jim did the first testing of the binary:

We used the Linux 'file' command to determine characteristics for the-binary. Here is the output of this command:

$ file the-binary
the-binary: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV),
statically linked, stripped
Given that the-binary was in ELF format, we next set out to determine what system calls were being made. We first used 'objdump' to generate an assembly code listing, then wrote a Perl script to find the type and location of the system calls. The script operated by tracking the current state of the eax register and looking for 'int $0x80' (Linux system call trap) instructions. The value present in the eax register at the time of the instruction is the index into the system call table as defined in /usr/include/asm/unistd.h. The script showed that the following system calls were being made:
80480b4: 0x88 personality
8048105: 0x1 exit
8056a11: 0x72 wait4
8056a54: 0x66 socketcall
8056a9c: 0x66 socketcall
8056ae4: 0x66 socketcall
8056b26: 0x66 socketcall
8056b72: 0x66 socketcall
8056bcc: 0x66 socketcall
8056c1e: 0x66 socketcall
8056c78: 0x66 socketcall
8056cd1: 0x66 socketcall
8056d1c: 0x66 socketcall
8057140: 0xc chdir
805716c: 0x6 close
805716c: 0x6 close
805719b: 0x3f dup2
80571ca: 0xb execve
80571f0: 0x2 fork
8057214: 0x31 geteuid
8057238: 0x14 getpid
8057263: 0x4e gettimeofday
8057292: 0x36 ioctl
80572bf: 0x25 kill
80572ee: 0x5 open
805731e: 0x3 read
8057344: 0x42 setsid
8057372: 0x7e sigprocmask
805739c: 0x7a uname
80573c8: 0xa unlink
80573fa: 0x4 write
8057424: 0x1b alarm
8057450: 0xd time
8057482: 0x92 writev
80574ac: 0x52 select
80574f7: 0x43 sigaction
8057530: 0x48 sigsuspend
8057560: 0x1 exit
8065d23: 0x5a mmap
8065d65: 0x6a stat
8065da1: 0x6c fstat
8066106: 0x37 fcntl
8066136: 0x13 lseek
8066163: 0x5b munmap
8066192: 0x91 readv
80661c6: 0xa3 mremap
8066206: 0x2d brk
8066244: 0x2d brk
With the information that the binary was statically linked and the location of the system calls, we next began to look for the libraries used in creating the-binary.
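The scan described above can be sketched roughly as follows. This is not the Perl script we used, only a minimal Python illustration of the same idea: remember the last immediate value moved into eax and report it whenever an 'int $0x80' is seen. The regular expressions assume AT&T-syntax objdump output, the name table is only a small subset of the listing above, and the sample lines are made up for illustration. Anything that changes eax by other means (for example 'xor %eax,%eax') is not tracked in this sketch.

```python
import re

# Small subset of the i386 system call table, taken from the listing above.
SYSCALL_NAMES = {0x1: "exit", 0x2: "fork", 0x3: "read", 0x4: "write",
                 0x5: "open", 0x6: "close", 0x66: "socketcall"}

ADDRESS = re.compile(r"^\s*([0-9a-f]+):")
MOV_EAX = re.compile(r"\bmovl?\s+\$0x([0-9a-f]+),%eax")
INT_80 = re.compile(r"\bint\s+\$0x80")


def find_syscalls(disassembly_lines):
    """Return (address, number, name) for each 'int $0x80' with a known eax."""
    eax = None  # last immediate loaded into eax, if any
    found = []
    for line in disassembly_lines:
        addr = ADDRESS.match(line)
        if not addr:
            continue  # not an instruction line
        mov = MOV_EAX.search(line)
        if mov:
            eax = int(mov.group(1), 16)
        elif INT_80.search(line) and eax is not None:
            found.append((addr.group(1), eax, SYSCALL_NAMES.get(eax, "unknown")))
    return found


# Hypothetical objdump lines, shaped like the real output but invented here.
sample = [
    " 8056a4a:\tb8 66 00 00 00       \tmov    $0x66,%eax",
    " 8056a4f:\tbb 05 00 00 00       \tmov    $0x5,%ebx",
    " 8056a54:\tcd 80                \tint    $0x80",
]
for address, number, name in find_syscalls(sample):
    print("%s: %s %s" % (address, hex(number), name))
```

The output format mirrors the listing above (address, system call number in hex, name).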

Florian then continued the reverse engineering process:

Rajeev had been looking for a free decompiler we could use and suggested the Reverse Engineering Compiler (REC), which was available for Linux. I downloaded the decompiler and ran it on "the-binary" with the default settings. The result of the decompilation was the file "the-binary.rec".

A look at the decompilation quickly showed that this was only a small improvement over the assembly code. As the binary had been stripped of all its symbols, all variable and function names still looked very assembly-like. However, it was now much easier to follow the control flow of the code. A brief examination of the code revealed that functions as well as global variables were named after their absolute address in the assembly code, prefixed with "L0" (functions) or "*L0" (variables). Examples:

L08048088()    a function, such as main()
*L0806D228    a global variable, such as environ
0x0806D228    address of a variable (&environ)

Local variables are allocated from the stack and therefore are denoted as an offset from the base pointer (ebp), so the assignment

*(ebp + -17616) = ebp + -2048;
makes ebp + -17616 a pointer that holds the address of some variable that starts at ebp + -2048.

Function parameter names start with A8, and the numeric suffix increases by 4 in hexadecimal notation (Ac, A10, A14, A18, A1c, ...). Furthermore, there are also local variable names such as Vffffffbc, which also seem to be some sort of stack offset (0xffffffbc is -68 in two's complement). The size of variables can only be determined by looking at context and neighboring values. For example, there are variables

ebp + -2048
ebp + -4096
ebp + -4536
without any values in between those. Thus ebp + -4096 is 2048 bytes and ebp + -4536 is 440 bytes. However, even though we have
*(ebp + -17616) = ebp + -2048;
*(ebp + -17620) = ebp + -2028;
*(ebp + -17624) = ebp + -2026;
I concluded later on that ebp + -2048 is a buffer of 2048 bytes and that ebp + -17620 and ebp + -17624 are merely pointers into that buffer.
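The size estimate above is plain offset arithmetic, which can be sketched as follows. This is only an illustration under the assumption that nothing else lives between neighboring offsets, so each result is an upper bound on the size, not a confirmed declaration. The helper name sizes_from_offsets is made up for this example.

```python
def sizes_from_offsets(offsets):
    """Map each ebp-relative offset to the distance up to the next higher offset.

    The highest offset is measured against 0, which stands for ebp itself.
    Each value is an upper bound on the size of the variable at that offset.
    """
    ordered = sorted(offsets)      # most negative offset first
    uppers = ordered[1:] + [0]     # neighbor above each offset
    return {low: high - low for low, high in zip(ordered, uppers)}


sizes = sizes_from_offsets([-2048, -4096, -4536])
print(sizes)

# The two extra pointers land 20 and 22 bytes into the buffer at ebp + -2048.
print(-2028 - -2048, -2026 - -2048)
```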

As Jim had prepared a list of system call addresses from the assembly file, I decided to start putting those into the code. Basically, wherever a function contained the line

asm("int 0x80");
it was likely to be a system call. For example, the function:
L08056A2C(A8, Ac, A10)
/* unknown */ void  A8;
/* unknown */ void  Ac;
/* unknown */ void  A10;
{
    /* unknown */ void  ebx;
    /* unknown */ void  Vfffffff4;
    /* unknown */ void  Vfffffff8;
    /* unknown */ void  Vfffffffc;
    Vfffffff4 = A8;
    Vfffffff8 = Ac;
    Vfffffffc = A10;
    ecx = & Vfffffff4;
    eax = 102;
    ebx = 5;
    asm("int 0x80");
    edx = eax;
    if(edx < 0) {
        *L08078B14 = ~edx;
        edx = -1;
    }
    return(edx);
}
calls system call 102 (0x66) with 5 passed in ebx. System call 0x66 is the "socketcall" system call, and a "man 2 socketcall" revealed that the first argument is the call number. A look at <linux/net.h> (I actually did a "find . | xargs grep -d skip LISTEN" in the "/usr/include" directory to find the correct header file) revealed that SYS_ACCEPT is equal to 5. Thus I could conclude that the above function was the "accept" system call. I then replaced all invocations of "L08056A2C" with accept. I proceeded in the same way with all the system calls from Jim's list. Where applicable, I also replaced integer literals with the constant names:
(save)11;
(save)3;
(save)2;
L08056CF4();
became
socket(AF_INET, SOCK_RAW, NVP);
where NVP is not really a constant, but I put it in for better readability. A complete mapping for the system call functions can be found in the file system_functions.txt.
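The renaming step can be illustrated with a small lookup sketch. The socketcall numbers are the SYS_* values from <linux/net.h>; the family, type and protocol tables are deliberately tiny subsets, and "NVP" is used as a readable label for IP protocol 11 in the same way as above, not as a real header constant. The function names are invented for this example.

```python
# SYS_* call numbers from <linux/net.h>, passed in ebx to socketcall.
SOCKETCALL_NAMES = {
    1: "socket", 2: "bind", 3: "connect", 4: "listen", 5: "accept",
    6: "getsockname", 7: "getpeername", 8: "socketpair", 9: "send",
    10: "recv", 11: "sendto", 12: "recvfrom", 13: "shutdown",
    14: "setsockopt", 15: "getsockopt", 16: "sendmsg", 17: "recvmsg",
}

# Small subsets, just enough for the example in the text.
ADDRESS_FAMILIES = {2: "AF_INET"}
SOCKET_TYPES = {1: "SOCK_STREAM", 2: "SOCK_DGRAM", 3: "SOCK_RAW"}
IP_PROTOCOLS = {1: "ICMP", 6: "TCP", 11: "NVP", 17: "UDP"}  # labels only


def name_socketcall(ebx):
    """Translate the socketcall sub-call number into a function name."""
    return SOCKETCALL_NAMES.get(ebx, "socketcall_%d" % ebx)


def describe_socket_args(pushed):
    """Render a socket() call from values in push order (last argument first)."""
    protocol, sock_type, family = pushed
    return "socket(%s, %s, %s)" % (
        ADDRESS_FAMILIES.get(family, str(family)),
        SOCKET_TYPES.get(sock_type, str(sock_type)),
        IP_PROTOCOLS.get(protocol, str(protocol)),
    )


print(name_socketcall(5))
print(describe_socket_args([11, 3, 2]))
```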

After putting every system call in, the code looked a little better (see file decompile_with_syscalls.c), but it was still too early to really learn anything from it.

Since the binary had been statically linked, large parts of the standard C library were likely to be included in it. Jim and I agreed that the next step would be to identify the functions from the standard library. Once identified, they could be properly named and their code removed from the code file. As I had noticed earlier that some of the functions never got called, I wrote a simple Perl script that counted the occurrences of each function in the code (proc_check). Another script removed the functions that I listed in a file from the code (pruneit). As removing a function could leave other functions without any callers, this process had to be repeated until no more "dead" functions could be pruned. This technique reduced the number of lines from 37,228 to 22,894.
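The iteration can be sketched as follows. This is not proc_check or pruneit, just a minimal Python illustration that folds both steps into one loop: given a call graph (assumed here to be a dictionary from function name to the set of functions it calls), it repeatedly drops functions that nobody calls, keeping the entry points. The function names in the example are hypothetical. A group of dead functions that only call each other would survive this simple counting approach.

```python
def prune_dead_functions(calls, roots):
    """Repeatedly remove functions without callers until nothing changes.

    calls: dict mapping a function name to the set of names it calls.
    roots: names that must be kept even without callers (entry points).
    """
    live = dict(calls)
    while True:
        called = set()
        for caller, callees in live.items():
            # A function calling itself does not count as being used.
            called.update(c for c in callees if c != caller)
        dead = [name for name in live if name not in called and name not in roots]
        if not dead:
            return live
        for name in dead:
            del live[name]


# Hypothetical call graph: "b" is never called, and "c" is only called by "b".
graph = {"main": {"a"}, "a": set(), "b": {"c"}, "c": set()}
print(sorted(prune_dead_functions(graph, roots={"main"})))
```

In the example, "c" only becomes dead after "b" has been removed, which is why a single pass is not enough.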

There are several ways to identify functions from the standard library. Given the library's source code, one can try to identify functions based on other functions they call, strings they contain, or constants they contain. Another method is to look at the context where a given function is called, make an educated guess as to what function it might be, and then compare the function's code with the library's source code. The easiest way is definitely to use plaintext strings contained in the functions. However, I had started out using the glibc-2.2.5 source code as the basis for my comparison, and many of the strings I found in the decompiled code were nowhere to be found in it. My first suspicion was that some library other than the standard C library had been compiled in as well. I also had a very hard time matching up the code, as the glibc source code contains plenty of macros and #ifdefs.

Fortunately, Jim pointed out that the binary was compiled using the libc-5.3.12 library. I downloaded the source code (from ftp.linux.org.uk/pub/linux/libc/). Suddenly, my work got much easier. Once a function had been identified, I first checked whether it called other, unidentified functions, noted what they were, and then replaced the "L080..." name with the proper one for all newly discovered functions.

Example

L0804F620 is the fopen function. I found this out by searching the library source code for the string "/etc/resolv.conf". The string itself appeared in function L0804D744 as
eax = L0804F620("/etc/resolv.conf", "r");
A 'find . | xargs grep -d skip "/etc/resolv.conf"' in the root directory of the library source code didn't give any exact matches, but the macro _PATH_RESCONF was defined as that string. The same kind of search for the macro name then revealed the following line, the only one that matches the call above:
./inet/res_init.c:    if ((fp = fopen(_PATH_RESCONF, "r")) != NULL) {
Thus L0804D744 had to be res_init and L0804F620 fopen. To show the degree of code similarity, here is the code for L0804F620 and for fopen to compare. The res_init function looks equally similar to its L0804D744 counterpart and from there more functions, such as fgets and strncpy can be derived. From the fopen function we can then derive the malloc (L0805BD74) and free (L0805C290) calls, and so forth.


L0804F620(A8, Ac)
/* unknown */ void  A8;
/* unknown */ void  Ac;
{
    /* unknown */ void  ebx;
    ebx = L0805BD74(84);

    if(ebx == 0) {
        eax = 0;
    } else {
        (save)0;
        (save)ebx;
        L08061F34();
        *(ebx + 80) = 0x807902c;
        (save)ebx;
        L08060D24();
        (save)Ac;
        (save)A8;
        (save)ebx;
        esp = esp + 24;
        if(L08060E20() == 0) {

            L08061788();
            L0805C290(ebx, ebx);
            eax = 0;
        } else {
            eax = ebx;
        }
    }
}

_IO_FILE *
DEFUN(_IO_fopen, (filename, mode),
      const char *filename AND const char *mode)
{


  struct _IO_FILE_plus *fp =
    (struct _IO_FILE_plus*)malloc(sizeof(struct _IO_FILE_plus));
  if (fp == NULL)
    return NULL;



  _IO_init(&fp->file, 0);
  _IO_JUMPS(&fp->file) = &_IO_file_jumps;

  _IO_file_init(&fp->file);
#if  !_IO_UNIFIED_JUMPTABLES
  fp->vtable = NULL;
#endif

  if (_IO_file_fopen(&fp->file, filename, mode) != NULL)
        return (_IO_FILE*)fp;
  _IO_un_link(&fp->file);
  free (fp);
  return NULL;
}




weak_alias (_IO_fopen, fopen);
Comparison of the L0804F620 function from the decompile with the fopen function from libc-5.3.12

Comparing the two code snippets, you might notice that there is a discrepancy with the parameters of the functions that are called. We have:

L08061788();
L0805C290(ebx, ebx);
but
_IO_un_link(&fp->file);
free (fp);
This is one of a few decompiler glitches. The assembly code for this looks like this:
804f664:       53                      push   %ebx
804f665:       e8 1e 21 01 00          call   0x8061788
804f66a:       53                      push   %ebx
804f66b:       e8 20 cc 00 00          call   0x805c290
but for some reason, the decompiler associates the first ebx with the second function call. Once aware of this, I could quickly identify those glitches and rectify them. Sometimes, parameters were missing as well, but a look at the assembly code always cleared up the confusion.
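When checking such glitches against the assembly, it helps to confirm which function a call instruction really targets. A near call (opcode e8) carries a signed 32-bit little-endian displacement relative to the address of the next instruction. The following Python sketch recomputes the targets from the raw bytes shown above; the helper name call_target is made up for this example.

```python
import struct


def call_target(address, encoding):
    """Return the absolute target of a 5-byte 'call rel32' instruction."""
    if len(encoding) != 5 or encoding[0] != 0xE8:
        raise ValueError("not a 5-byte e8 call instruction")
    (displacement,) = struct.unpack("<i", encoding[1:5])
    # The displacement is relative to the instruction that follows the call.
    return (address + 5 + displacement) & 0xFFFFFFFF


print(hex(call_target(0x804F665, bytes.fromhex("e81e210100"))))
print(hex(call_target(0x804F66B, bytes.fromhex("e820cc0000"))))
```

Both results agree with the targets that objdump prints, 0x8061788 and 0x805c290.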

For some reason, the decompiler also can't handle the modulus operator if one of the operands is a function result. This results in code like:

rand();
ecx = 10;
asm("cdq");
edi = ecx / ecx % ecx / ecx;
The assembly code looks like this:
8048440:       e8 13 dc 00 00          call   0x8056058
8048445:       b9 0a 00 00 00          mov    $0xa,%ecx
804844a:       99                      cltd
804844b:       f7 f9                   idiv   %ecx,%eax
804844d:       89 d7                   mov    %edx,%edi
so the code should read:
edi = rand() % 10;
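The reasoning behind this reading is that cltd (cdq) sign-extends eax into edx:eax, and idiv then leaves the quotient in eax and the remainder in edx, so copying edx to edi keeps the remainder. A small Python model of that instruction pair, written only as an illustration (the function name cdq_idiv is invented, and register widths and overflow traps are ignored):

```python
def cdq_idiv(eax, divisor):
    """Model 'cdq; idiv divisor': return (quotient in eax, remainder in edx).

    idiv truncates toward zero, like the C '/' and '%' operators, which
    differs from Python's floor division for negative operands.
    """
    quotient = abs(eax) // abs(divisor)
    if (eax < 0) != (divisor < 0):
        quotient = -quotient
    remainder = eax - quotient * divisor
    return quotient, remainder


# With a sample return value of 1234 and ecx = 10, edx ends up holding 4.
quotient, remainder = cdq_idiv(1234, 10)
print(quotient, remainder)
```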
The identification of the standard C library calls and the removal of their code was a long and tedious task. After I had identified all that I could (basically, there were no more distinguishing strings, constants, function calls or context left), I pruned the code of "dead" functions once again, and the resulting file (decompile_with_syscalls.c) was down to 4217 lines of code.

My next task was to translate the remaining C code into a more readable form. Hence I went through the code, starting at the entry point, and rewrote most of it. For the most part, the "original" code was left as a comment below the rewritten version.

Ben did an analysis of what was going on at startup and he concluded that this was probably standard system initialization and that function L08048134 was "main", so I started my analysis there. The biggest challenge was understanding how the variables are used and giving them proper names. Here is a mapping of the most important variables:

char buffer[2048]:             *(ebp + -17616) = ebp + -2048;
char buffer2[2048]:            *(ebp + -17632) = ebp + -4096;
char buffer3[440]:             *(ebp + -17636) = ebp + -4536;
unsigned char r:               *(ebp + -17648);
int offset:                    *(ebp + -17644);
FILE fstream:                  *(ebp + -17628);
char *buffer4:                 *(ebp + -17640); // turned out to be a pointer
char buffer5[504]:             ebp + -17596;
struct sockaddr_in cli_sock:   ebp + -4568;
char buffer6[19]:              ebp + -17340;
The final version of the interpretation can be found in the file decompile_final.c. It is a result of reading C code, reading up on network programming literature, and plenty of assistance from Ben and Jim. This is not working C code, but a person familiar with C and UNIX network programming shouldn't have any trouble following it. While interpreting, I found a few other functions from the standard C library (such as inet_addr) and removed their code. I wasn't able to identify two functions that actually get called in the code. I named them precise_sleep and signal_action, and their purpose should be clear from those names.

I did not interpret the function I named more_udp_stuff (an initial name I gave it and never changed), as its functionality is the same as that of dos_dns_udp with the additional option of specifying a destination address.

The function dos_dns_udp sends DNS requests with a spoofed source IP address to a destination address and can therefore be used as a reflector DoS client. During its analysis, I discovered that the function reads data from the read-only data section of the binary starting at address 0x8067698. These data turned out to be buffer lengths followed by DNS query packets. I wrote a Perl script to extract the data (dns_extract); the extracted data is in the file dns_data.c, commented and laid out as pseudo-C code for each packet.
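An extraction of this kind can be sketched as follows. This is not dns_extract. The layout is an assumption made purely for illustration: each record is taken to be a 32-bit little-endian length immediately followed by that many packet bytes. The actual layout in the binary may differ (for example, all lengths could be stored first). The sample bytes are invented and are not taken from the binary.

```python
import struct


def extract_records(blob, count):
    """Split blob into `count` records of the assumed form <u32 length><bytes>."""
    records = []
    position = 0
    for _ in range(count):
        if position + 4 > len(blob):
            raise ValueError("truncated length field")
        (length,) = struct.unpack_from("<I", blob, position)
        position += 4
        if position + length > len(blob):
            raise ValueError("record runs past the end of the data")
        records.append(blob[position:position + length])
        position += length
    return records


# Made-up data: two records of 3 and 2 bytes.
sample = struct.pack("<I", 3) + b"abc" + struct.pack("<I", 2) + b"de"
for record in extract_records(sample, 2):
    print(len(record), record.hex())
```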

Furthermore, for the destination of the DNS packets, a list of IP addresses is used that resides in the .data portion of the binary starting at address 0x806d22c (in the .asm file). It seems a random address is picked from the first 8000 entries of that list. The list itself, however, is larger than that. Again, I wrote a Perl script that extracted those addresses (ip_extract). The first 8000 are contained in the file ip_addresses.txt.
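The address extraction and the random pick can be sketched like this. This is not ip_extract. The sketch assumes that each list entry is four bytes stored in network byte order, which is a guess about the layout and not something stated above. The sample bytes use documentation-only addresses, not entries from the binary, and the function names are invented.

```python
import random
import socket


def extract_addresses(blob, limit):
    """Decode up to `limit` IPv4 addresses from consecutive 4-byte entries."""
    usable = min(limit, len(blob) // 4)
    return [socket.inet_ntoa(blob[i * 4:i * 4 + 4]) for i in range(usable)]


def pick_target(addresses, rng=random):
    """Pick one address uniformly at random, as the binary seems to do."""
    return addresses[rng.randrange(len(addresses))]


# Invented sample data (addresses reserved for documentation).
sample = bytes([192, 0, 2, 1, 198, 51, 100, 7])
addresses = extract_addresses(sample, 8000)
print(addresses)
print(pick_target(addresses))
```

Passing 8000 as the limit matches the observation that only the first 8000 entries appear to be used.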

This concludes the analysis portion. The answers to the questions were derived from looking at the C code that we reverse-engineered.

Tools used for reverse engineering

gdb
Reverse Engineering Compiler (REC)
less
find
grep
man

Other resources

"Unix Network Programming Vol. 1", W. Richard Stevens, Prentice Hall, 1998

"TCP/IP Illustrated Vol. 1", W. Richard Stevens, Addison Wesley, 1994

"Advanced Programming in the UNIX Environment", W. Richard Stevens, Addison Wesley, 1993

Intel i386 instruction manual

libc-5.3.12 source code