My InfoSec Ramblings: Reverse Engineering

Once you've got a fair idea of how to perform dynamic analysis, it's a good idea to start understanding how the exe/binary in question was actually built in the first place. That is what is popularly called static analysis. The reason we did dynamic analysis first is that it gives you a great high-level view of what the binary actually does and what its "real purpose" is. Plus it's simpler to do: many a time you'll get what you wanted by simply running an exe in a contained environment and studying its behavior. Other times you won't, and you'll need to understand more about the binary. In such cases static analysis, or reversing the binary, will definitely help.

Reversing = reading the code in which a binary was written. In some languages you can get the code back very easily, e.g. Java, where there are tools you can just feed the binary to and get the code back. In others it isn't so easy, e.g. the code for an EXE written in C can't be recovered that easily. Hence it's good to learn the art of using disassemblers and debuggers and attempt to understand the assembly version of the binary.

NOTE: Don't look too much at Chapters 3,4 and 5. They're slightly out of flow and we'll refer back to them when needed. For now stick with me :D

Jump over to Chapter 6 now and read only the first 2 sections there, and then we'll look at examples and learn. A concept IMO is best learnt through examples, instead of sitting and reading a 1000-page manual. The manual is best referred to when you feel the need to brush up on your concepts; this is especially true of Rev Engg, where there are so many concepts all merged into one. Let's look at a little HelloWorld and its assembly. Note that I'm going really slow deliberately so we understand everything that takes place when a binary is run.

#include <stdio.h>
int main(){
printf("hello");
}

We compile the code and run it and get the desired output:
gcc -o hello -g hello.c
./hello

hello

Now let's see what happened underneath.
[arvind@dilby ~]$ gdb -q hello
Using host libthread_db library "/lib/libthread_db.so.1".
(gdb) disass main
Dump of assembler code for function main:
0x080483c4 <main+0>: lea 0x4(%esp),%ecx
0x080483c8 <main+4>: and $0xfffffff0,%esp
0x080483cb <main+7>: pushl 0xfffffffc(%ecx)
0x080483ce <main+10>: push %ebp
0x080483cf <main+11>: mov %esp,%ebp
0x080483d1 <main+13>: push %ecx
0x080483d2 <main+14>: sub $0x4,%esp
0x080483d5 <main+17>: movl $0x80484c0,(%esp)
0x080483dc <main+24>: call 0x80482dc
0x080483e1 <main+29>: add $0x4,%esp
0x080483e4 <main+32>: pop %ecx
0x080483e5 <main+33>: pop %ebp
0x080483e6 <main+34>: lea 0xfffffffc(%ecx),%esp
0x080483e9 <main+37>: ret
End of assembler dump.

We compiled the code into a binary named "hello" and passed the -g option so that gcc includes debugging information, which lets us view the source inside gdb when we want to. We then start gdb with the -q (quiet) option and disassemble main to get the assembly version of Hello World. Clear enough? Now let's understand the assembly.

The number 0x080483c4 is the address in memory where main starts. The second column, which gdb prints as <main+N>, is the offset of each instruction from that starting address. The third column is the actual instruction itself.

Now if you read the first 2 sections of Chapter 6 carefully, you'd know about something called the function prolog. Effectively, before the body of any function starts to run, some setup MUST happen on the stack. The prolog saves the current value of ebp so it can be restored later, and points ebp at the new stack frame. The compiler has already worked out how many local variables are declared inside the function, main() in this case, and how much space each of them needs, so the prolog then pre-allocates that amount on the stack by decreasing the value of esp (the stack pointer). The value of esp will change all the time, while ebp WILL stay constant for as long as the function's body runs.

Quoting now a few golden lines from Chapter 6:
Reading Assembly - Keep track of the stack and registers --- The secret to understanding assembly code is to always work with a sheet of paper and a pencil. When you first sit down, draw out a table for all 6 registers A, B, C, D, SI, and DI. Keep track of the high and low portions as well. Each new line of this table should represent a modification of a register, so the last value in each register column is the current value of that register. Next, draw out a long column for the stack, and leave space on the sides to place the BP and SP registers as they move down. Be sure to write all values into the stack as they are placed there, including ret and the stored BP.

If you're just starting off with rev engg like me, I'm quite sure this is still confusing to you. No problem, it'll get cleared up as we go along.

Now back to the program. We want to see what happens at every single point, so we put a breakpoint inside gdb on the first line itself, as follows.
(gdb) br 1
Breakpoint 2 at 0x80483c4: file hello.c, line 1.
(gdb)

Now we run the program and it breaks as expected immediately:
(gdb) r
Starting program: /home/arvind/hello
Breakpoint 2, main () at hello.c:2
2 int main(){
(gdb)

Now the first instruction in the assembly code is lea 0x4(%esp),%ecx. This means: compute the address esp+4 and load that address (not the value stored there) into the ecx register. Immediately we go to our pen and paper, look at what esp+4 and ecx currently point to, and write it down.
(gdb) x/xw $esp+4
0xbfd3ffd0: 0x00000001
(gdb) x/xw $ecx
0xa2bffcc4: Cannot access memory at address 0xa2bffcc4

When we want to refer to a register in gdb we put a $ in front of its name, and x/xw prints the 4-byte word stored at the address we give it. Now if you notice, the word at esp+4 is 1, but ecx doesn't point to the same place as expected: it still holds a junk address that gdb can't even read. That's because we had a breakpoint at the very start of the program and the first instruction hasn't executed yet. So let's execute that and check ecx again.

(gdb) nexti
0x080483c8 2 int main(){
(gdb) x/xw $ecx
0xbfd3ffd0: 0x00000001
(gdb)

Bingo! ecx now holds the address 0xbfd3ffd0, and the word stored there is 1. The gdb command nexti executes exactly 1 instruction and then reports that the next instruction is at 0x080483c8. Clear enough? Let's go on to the next instruction now. Oh wait.. did you write down the values for esp and ecx on paper? ;) Let's do that for a while till we're really clear on what we're doing. Again:
(gdb) x/xw $esp
0xbfd3ffcc: 0x004d6390
(gdb) x/xw $ecx
0xbfd3ffd0: 0x00000001

The current value of esp is bfd3ffcc and that of ecx is bfd3ffd0 (because the address esp+4 was loaded into it). You can use a hex calculator to cross-check the hex side by side: esp+4 = bfd3ffcc+4 = bfd3ffd0. So all is well. Moving on then.. we run nexti to execute the current instruction, which is

and $0xfffffff0,%esp

This translates to 0xbfd3ffcc AND 0xfffffff0, which comes out to bfd3ffc0. Note the $ just before 0xfffffff0? That says it's a literal value and not an address. A binary AND with that mask clears the lowest 4 bits, rounding esp down to a 16-byte boundary, and the result bfd3ffc0 is then stored in esp. Moving on..

(gdb) nexti
0x080483cb 2 int main(){
(gdb) x/xw $esp
0xbfd3ffc0: 0x004bcca0

The next instruction is pushl 0xfffffffc(%ecx). The displacement 0xfffffffc is -4 as a signed 32-bit number, so this pushes the 4-byte word stored at ecx-4, which is main's return address, onto the newly aligned stack. It doesn't change ecx. pushl takes 4 bytes, so you should now see esp go down by 4....

(gdb) x/xw $esp
0xbfd3ffbc: 0x004d6390
(gdb)

Yep.. It's bfd3ffbc, 4 less than its previous address bfd3ffc0, and the word there is the same 0x004d6390 we saw earlier at bfd3ffcc. Similarly the next instruction, push %ebp, saves the caller's ebp in another 4 bytes, taking esp to 0xbfd3ffb8. Do you have 4 values for esp on your paper now??

(gdb) nexti
0x080483cf 2 int main(){
(gdb) x/xw $esp
0xbfd3ffb8: 0xbfd40028

The next is a mov instruction, mov %esp,%ebp, which copies the address in esp into ebp (the base pointer); so after this executes the values of esp and ebp must be the same. Let's see..
(gdb) nexti
0x080483d1 2 int main(){
(gdb) x/xw $esp
0xbfd3ffb8: 0xbfd40028
(gdb) x/xw $ebp
0xbfd3ffb8: 0xbfd40028
(gdb)

Yep.. no problem. All normal. The next is another push, push %ecx, which brings esp down by 4 and stores ecx (bfd3ffd0) there. Note that ebp doesn't change..
(gdb) nexti
0x080483d2 2 int main(){
(gdb) x/xw $esp
0xbfd3ffb4: 0xbfd3ffd0
(gdb) x/xw $ebp
0xbfd3ffb8: 0xbfd40028

Then there's a sub instruction which further decreases esp by 4 -- sub $0x4,%esp ..to bfd3ffb0
(gdb) nexti
Breakpoint 1, main () at hello.c:3
3 printf("hello");
(gdb) x/xw $esp
0xbfd3ffb0: 0x004af940

Hmm.. notice that the next line to run is the printf?? Let's now look at our code inside gdb..
(gdb) list
1 #include <stdio.h>
2 int main(){
3 printf("hello");
4 }
Yes... our code is only now.. STARTING TO EXECUTE... notice all those little things that happen in the background before even your first line of code executes? Very exciting to know.. for me anyway :) . Well let's go on... you're still writing things down, right? Moving on then.. the next inst is another move..

movl $0x80484c0,(%esp)
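The $ signifies that 0x80484c0 is a literal value, NOT the value at the address 0x80484c0. The parentheses around %esp mean the destination is the memory esp points to, so the value is written to the top of the stack and esp itself doesn't change. That value is the address of our "hello" string, the argument being handed to printf. Important that we understand this.. A look at the stack confirms this..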
(gdb) x/xw 0x80484c0
0x80484c0 <__dso_handle+4>: 0x6c6c6568
(gdb) x/xw $esp
0xbfd3ffb0: 0x080484c0

Now... if you look at your disassembly you'll see that the next instruction is a call to the actual printf function. Whenever you see this, remember that you don't need to step through all the instructions in the printf call itself; nexti steps over the call, so just stick to your own program. This'll get clearer when we look at some little code with functions in it.

printf has executed successfully and returned to our code. The next inst is an add instruction and then two pop instructions, which means esp is incremented by 4 three times. Keep hitting nexti till the next instruction shows up as 0x080483e6 and then let's check the value of esp. It should be bfd3ffb0 + 12 (decimal) = bfd3ffb0 + C (hex) = bfd3ffbc...

(gdb) nexti
0x080483e4 4 }
(gdb) nexti
0x080483e5 4 }
(gdb) nexti
0x080483e6 in main () at hello.c:4
4 }
(gdb) x/xw $esp
0xbfd3ffbc: 0x004d6390
(gdb)

Great. Notice we're going up and reaching the earlier addresses again? That's a sign that the program is completing. The two pops also restored ecx and ebp to the values saved in the prolog. The next is lea 0xfffffffc(%ecx),%esp, which loads ecx-4 = bfd3ffd0-4 = bfd3ffcc into esp, putting the stack pointer back exactly where it was when main started..

(gdb) nexti
0x080483e9 4 }
(gdb) x/xw $esp
0xbfd3ffcc: 0x004d6390

We then close off with a ret, which pops the return address sitting at bfd3ffcc and jumps to it, and the program then exits after printing the hello, which is what it was supposed to do. Note that ebp stayed constant between the prolog and the final pop, and the stack slot for printf's argument was addressed wrt esp. There will still be a few questions in your minds I guess.. hopefully the future posts, where we take a look at more programs from Chapter 6, will clear those up. Hope you enjoyed and understood this very basic introduction to assembly :)

Sunday, April 4, 2010

Reverse Engineering - 3