Sunday, August 22, 2010

Reverse Engineering - 5

Okay, so let's look at programs with a few control structures (if, if-else, while and for) and see how they look in assembly. We won't go through the prolog again, and we'll assume that you've understood the previous 2 posts.

The first bit of code we look at is called ifelse.c on that site. We compile it as follows:

gcc -o ifelse -ggdb -O0 -fno-builtin-printf ifelse.c

Let's step through the assembly. First the prolog and space allocation:
0x08048384 : lea 0x4(%esp),%ecx
0x08048388 : and $0xfffffff0,%esp
0x0804838b : pushl 0xfffffffc(%ecx)
0x0804838e : push %ebp
0x0804838f : mov %esp,%ebp
0x08048391 : push %ecx
0x08048392 : sub $0x14,%esp

Then the comparison of a with 0:
0x08048395 : cmpl $0x0,0xfffffff8(%ebp)
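
A quick aside on the odd-looking 0xfffffff8(%ebp): gdb prints the displacement as an unsigned 32-bit number, but it is really a small negative offset in two's complement, so the operand is simply the local variable 8 bytes below ebp. Here is a minimal C sketch of that conversion; the function name signed_displacement is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Interpret a 32-bit displacement, as printed by gdb, as the signed
 * offset it encodes (two's complement). Hypothetical helper. */
int32_t signed_displacement(uint32_t raw)
{
    int64_t wide = (int64_t)raw;

    if (raw & 0x80000000u)
        wide -= 4294967296LL;            /* subtract 2^32 */

    assert(wide >= INT32_MIN && wide <= INT32_MAX);
    return (int32_t)wide;
}
```

Feeding it 0xfffffff8 gives -8, and the 0xfffffffc seen in the prolog gives -4.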

Then the bit which has the if-else logic. The instruction jns stands for "Jump if Not Sign": the jump is taken when the sign flag is clear, which here means a is zero or positive. So if a is not negative you jump straight to 0x080483a9; otherwise you go through the instructions from 0x0804839b to 0x080483a7 and then jump over the other branch.
0x08048399 : jns 0x80483a9
0x0804839b : movl $0x80484a0,(%esp)
0x080483a2 : call 0x8048298
0x080483a7 : jmp 0x80483b5
0x080483a9 : movl $0x80484b4,(%esp)

That leaves just the final printf. The if-else block ends just 1 instruction before it. Which means...
0x080483b5 : movl $0x80484d4,(%esp)
0x080483bc : call 0x8048298
0x080483c1 : add $0x14,%esp
0x080483c4 : pop %ecx
0x080483c5 : pop %ebp
0x080483c6 : lea 0xfffffffc(%ecx),%esp
0x080483c9 : ret

...everything here comes after the if-else block. The program prints the 'Leaving main' bit and exits. Simple, eh?
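
To see how those jumps map back to C, here is a small sketch that mirrors the disassembly with gotos. It is not the ifelse.c source; branch_for and the strings it returns are made up, and the condition is inferred from the jns alone.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the if-else in main: reports which branch ran. */
const char *branch_for(int a)
{
    const char *taken = NULL;

    if (a >= 0)             /* cmpl $0x0,a ; jns 0x80483a9 */
        goto else_part;
    taken = "if";           /* stands in for the first printf */
    goto after;             /* jmp 0x80483b5 */
else_part:
    taken = "else";         /* stands in for the second printf */
after:
    assert(taken != NULL);
    return taken;           /* the final printf would sit here */
}
```

Calling branch_for(-1) takes the "if" side, while 0 and any positive value take the "else" side, which is exactly what the sign flag decides.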
-------------------------------------------------------------------------------------------------------------------------------------
Now for an example on while. We use this code.
We compile this as usual - gcc -o while -O0 -fno-builtin-printf -ggdb while.c and then open it up with gdb.

You see the usual prolog which I won't go into. You then see a line allocating space on the stack as follows.
0x08048392 : sub $0x24,%esp

At this point I'd just like to make one small point about these numbers (0x24 in this case). You will see that there's just one local variable (int i) in the code. So why should 0x24 be allocated? Well, the most likely explanation is the printf statement below: its arguments (the "%d\n" format string and i) also need room on the stack. The moment you comment that out and recompile, the line changes to 0x10, meaning I now allocate only for int i. Something like this:
0x08048362 : sub $0x10,%esp

Although I'm still not sure why it's 0x10. Logically it should be less, since an int needs only 4 bytes. But maybe that's how gcc deals with stuff (padding for alignment, perhaps). More clarity in later blogs maybe, once I'm clearer ;). Moving on then..

The next line is for the i=0 bit.
0x08048395 : movl $0x0,0xfffffff8(%ebp)

Now the part for the while starts. Notice the jmp straight away? You thought there should be a cmpl instruction, didn't you? Me too. Let's see what is happening. Here are all the relevant assembly lines:
0x0804839c : jmp 0x80483b5
0x0804839e : mov 0xfffffff8(%ebp),%eax
0x080483a1 : mov %eax,0x4(%esp)
0x080483a5 : movl $0x80484a0,(%esp)
0x080483ac : call 0x8048298
0x080483b1 : addl $0x1,0xfffffff8(%ebp)
0x080483b5 : cmpl $0x9,0xfffffff8(%ebp)
0x080483b9 : jle 0x804839e
0x080483bb : add $0x24,%esp

There's a jump to 0x80483b5, where it checks if i<=9. This is the same as asking: is i less than 10? The jle stands for Jump if Less than or Equal. So if the variable i is indeed less than 10, execution jumps back up to 0x0804839e and goes through the body of the while loop. The arguments for the printf statement are placed on the stack (recompile the code without arguments here if you want to understand exactly why 0x080483a1 and 0x080483a5 work). The printf is called at 0x080483ac and i is then incremented by 1 (that's the addl instruction). Once this is done, we need to compare the new value of i again; is it still <10? Guess what the next instruction is? You guessed it, it's the very instruction we jumped to at first, when we thought there should be a cmpl. The rest of the code is the usual stuff.
----------------------------------------------------------------------------------------------------------------------------------------
Now lets look at a for loop. We use this code.

Compile as usual using gcc -o for -O0 -fno-builtin-printf -ggdb for.c and open it up in gdb. If you remember the assembly for the while, you'll notice that the assembly for this program is practically identical to it! That's because the for and the while loops are just two different ways of doing the same thing. I'm not explaining anything here because all that I said about the while loop is true here as well. Let's look at a do-while now.
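
Before that, here is a minimal C sketch of the jump-to-the-test layout, next to an ordinary while and for that do the same job. The function names are made up, and a counter stands in for the printf call, so treat it as an illustration of the shape rather than the actual while.c or for.c.

```c
#include <assert.h>
#include <limits.h>

/* i = 0; while (i <= limit) { body; i++; } -- returns how often the body ran. */
int count_with_while(int limit)
{
    int i = 0, passes = 0;
    assert(limit < INT_MAX);     /* keeps i++ from overflowing */
    while (i <= limit) {
        passes++;
        i++;
    }
    return passes;
}

/* The same loop written as a for. */
int count_with_for(int limit)
{
    int passes = 0;
    assert(limit < INT_MAX);
    for (int i = 0; i <= limit; i++)
        passes++;
    return passes;
}

/* The same loop laid out the way the disassembly is. */
int count_with_gotos(int limit)
{
    int i = 0, passes = 0;       /* movl $0x0,i */
    assert(limit < INT_MAX);
    goto test;                   /* jmp straight to the cmpl */
body:
    passes++;                    /* stands in for the printf call */
    i++;                         /* addl $0x1,i */
test:
    if (i <= limit)              /* cmpl $0x9,i ; jle body */
        goto body;
    return passes;
}
```

With a limit of 9 all three run the body 10 times, and with a limit below 0 none of them runs it at all, because the test always comes first.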
----------------------------------------------------------------------------------------------------------------------------------------
We use the following code for a do-while

Now let's have a little bit of fun. Try and think of what a do-while does. How is it different from a while? It guarantees at least one pass of the loop, because the comparison happens after the first run through. So that means, unlike the while loop example where we jumped to a cmpl $0x9 immediately after initializing i to zero, we will call printf and increment i at least once before comparing. Makes sense? Let's look at an assembly snippet and confirm our thoughts. Here it is:
0x08048395 : movl $0x0,0xfffffff8(%ebp)
0x0804839c : mov 0xfffffff8(%ebp),%eax
0x0804839f : mov %eax,0x4(%esp)
0x080483a3 : movl $0x80484a0,(%esp)
0x080483aa : call 0x8048298 < printf@plt >
0x080483af : addl $0x1,0xfffffff8(%ebp)
0x080483b3 : cmpl $0x9,0xfffffff8(%ebp)
0x080483b7 : jle 0x804839c

Bullseye! The printf at 0x080483aa comes first and the cmpl at 0x080483b3 only after it. Give yourself a pat on the back if you guessed that right :)
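
The difference shows up nicely if you start with a value that already fails the test. In this sketch (hypothetical function names, a counter in place of printf) the do-while layout has no initial jump to the comparison, so the body runs once no matter what.

```c
#include <assert.h>
#include <limits.h>

/* do { body; i++; } while (i <= limit); laid out like the disassembly. */
int passes_do_while(int start, int limit)
{
    int i = start, passes = 0;
    assert(start < INT_MAX && limit < INT_MAX);   /* keeps i++ in range */
body:
    passes++;                /* stands in for the printf call */
    i++;                     /* addl $0x1,i */
    if (i <= limit)          /* cmpl ; jle body */
        goto body;
    return passes;
}

/* while (i <= limit) { body; i++; } for comparison. */
int passes_while(int start, int limit)
{
    int i = start, passes = 0;
    assert(start < INT_MAX && limit < INT_MAX);
    while (i <= limit) {
        passes++;
        i++;
    }
    return passes;
}
```

Starting at 0 with a limit of 9 both give 10 passes; starting at 20 the while gives 0 passes and the do-while still gives 1.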

We'll wrap up this basic control structure series right here. Feeling more comfortable now? Good. If not, then just step back and re-read each part. I am sure you will understand eventually.

Before moving on though, a quick note on the gcc arguments. The -O0 stands for "Don't use any gcc optimizations". That didn't impact this program as such, but I'll use it going forward, just so gcc does not cause funny problems. The -fno-builtin-printf tells gcc not to treat printf as one of its built-in functions. Without it, gcc is free to swap a simple printf for a "puts" function call, which caused me a lot of pain. A nice blog over here speaks about similar problems.

The immediate next thought obviously is: what other gcc optimizations are there? How many of them are relevant? Well, here's a list. I really don't know which ones are relevant at this point. I'm learning as much as you, as we go along. As and when I find something relevant, I'll introduce it.

Thought about looking through more sample programs on all the topics over at our parent site. However I then thought that we now know the basics and will be able to step through newer and more complex data structures as and when they come along. We just won't take each and every data structure right now. You can have a look at Chapter 7; it talks about GUI Debuggers and its advantages. I've used the free version of IDA a little..but since I hadn't any clue about any basics - I couldn't understand how to use it well. We'll use the GUI debuggers when we have a genuine need for them; i.e when we look at more complex programs.

The rest of the content online sadly is kinda incomplete so I won't be referring to that any more and will start taking up small vulnerable C programs to understand things better.

Until next time...So long!

Friday, August 6, 2010

XSS vs CSRF vs ClickJacking

Obviously there's tons of material out there explaining all three of those topics. So I'm not going to sit and talk about each of them in detail. This is just a short summary about all three, illustrating the key differences in one single post. I'm assuming that people who read this already have a fairly good idea about how all these 3 attacks work and just want a quick refresher. Here goes:

NOTE: It is assumed that you are logged in with a valid user account into the application, for all these attacks to be fully successful.

XSS - You have a website. The website accepts user input and processes it, either on the client or on the server. However, it neither filters user input (disallowing special characters) nor ensures that the content is encoded safely before displaying it in the browser. This results in attackers being able to inject their own scripts into:
a) Public pages that the whole world will see - Persistent
b) Pages that specific users will see - Reflected

The malicious script lets the attacker steal data or gain complete control of the user's browser.. and if things work out well.. maybe the user's machine as well. User interaction IS needed. Even viewing an infected website IS user interaction.

CSRF - So you now protect your website against XSS using the OWASP XSS Prevention Cheat Sheet. You still might be vulnerable to CSRF. If you have pages on your website which change data (edit/modify/delete), check whether those requests contain parameters whose values are unpredictable. If not, then your application is vulnerable.

The aim of CSRF is NOT to inject scripts and steal information - like XSS. It is to make you perform an operation on your application, without you wanting to. For eg. Delete the Entire Administrators group in your application. You obviously don't want to do that.. right?

A CSRF request which is sent by an attacker is a perfectly normal request, hence the XSS defenses are not applicable here. The reason CSRF happens is because the attacker can predict the values of all the parameters in the "Delete Admin Group" request. So to protect yourself, you have to ensure that all your requests contain something that the attacker can't predict. Add a random token to all your requests. The attacker shouldn't be able to guess its value. You're then safe from CSRF.
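
As a rough illustration of "something the attacker can't predict", here is a minimal sketch, assuming a Linux box where /dev/urandom is available. make_csrf_token and tokens_match are hypothetical helper names, not part of any framework; a real application would normally rely on the token support built into its web framework.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define TOKEN_BYTES   16
#define TOKEN_HEX_LEN (2 * TOKEN_BYTES)

/* Fill out with TOKEN_HEX_LEN hex characters plus a NUL, using bytes read
 * from /dev/urandom (assumed to exist). Returns 0 on success, -1 on error. */
int make_csrf_token(char out[TOKEN_HEX_LEN + 1])
{
    unsigned char raw[TOKEN_BYTES];
    FILE *f;
    size_t got;

    assert(out != NULL);
    f = fopen("/dev/urandom", "rb");
    if (f == NULL)
        return -1;
    got = fread(raw, 1, sizeof raw, f);
    fclose(f);
    if (got != sizeof raw)
        return -1;

    for (size_t i = 0; i < sizeof raw; i++)
        sprintf(out + 2 * i, "%02x", (unsigned)raw[i]);
    return 0;
}

/* Compare the token stored for the session with the one submitted in the
 * request, without stopping at the first differing character. */
int tokens_match(const char *expected, const char *submitted)
{
    unsigned char diff = 0;

    if (strlen(expected) != TOKEN_HEX_LEN || strlen(submitted) != TOKEN_HEX_LEN)
        return 0;
    for (size_t i = 0; i < TOKEN_HEX_LEN; i++)
        diff |= (unsigned char)(expected[i] ^ submitted[i]);
    return diff == 0;
}
```

The idea is that the application stores the token with the user's session, embeds it in its own forms, and rejects any state-changing request where tokens_match fails.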

Clickjacking - Appending a random token to all your requests means that the attacker can't guess them. To carry out a clickjacking attack though, he doesn't need to guess it. That's because you will voluntarily load a page WITH a valid token into your browser and then shoot yourself in the foot by authorizing the "operation", just like in CSRF. So Clickjacking = CSRF + Nullifying CSRF defenses.

An attacker will create a page on his own website with a cleverly crafted IFRAME. You need to visit this page. The moment you do, the "Delete all admins" page will load inside this IFRAME. How? The attacker has coded that into his page with an IFRAME element that points at it. Note that the page loads WITH the random CSRF token which the application assigned to it. That's because YOU as a user visited some random website while still logged in to the application. Since you're logged in, the application gave you that random token as well; the attacker does NOT have to craft a request like in CSRF. The attacker now cleverly positions buttons on his own page so that the buttons confirming "Delete all admins" sit exactly underneath them. So when you click a button on the attacker's website, you also click a button confirming the "Delete all admins" operation.

So as you see - Despite protecting against XSS and CSRF, you could still be vulnerable to Clickjacking. Here are good reads on how to protect from all three attacks:

XSS - http://www.owasp.org/index.php/XSS_%28Cross_Site_Scripting%29_Prevention_Cheat_Sheet
CSRF - http://www.owasp.org/index.php/Category:OWASP_CSRFGuard_Project
Clickjacking - http://www.owasp.org/index.php/Clickjacking

Thursday, August 5, 2010

Network Mapping tool

Wrote another little tool while I was preparing for an exam. It's called Nwmap, or Network Mapper. You just have to start up a sniffer and save your packet captures in a .pcap file. Feed this to nwmap and it'll give you a list of subnets that are being used internally. You can then sit and probe all of these manually.

You can download NWmap at http://sourceforge.net/projects/nwmap/.

Sunday, April 18, 2010

Reverse Engineering - 4

We looked at the assembly version of a very simple program in the last post. Hopefully you understood most of it. Over the next 2 posts we'll take up more examples to reinforce these basics, because they'll be used all along.. all the time. We now pick up the example called functions.c from Chapter 6 but strip it a little so just 1 function is used. We'll try and understand how a single function looks on stack and then look at multiple functions. I'm using the following gcc compiler - so if you want to follow this step by step try and get the exact same version:

gcc version 4.1.2 20070925 (Red Hat 4.1.2-33)

The reason I mention this is purely because multiple versions of code are on the Chapter 6 page; meaning that different gcc versions with different switches generate slightly different assembly. While all that is important no doubt, it isn't right now when we're taking small steps towards understanding the basics. Lets go on..Here's the edited function code that I'm using:
---------------------------------------------------------
#include <stdio.h>

void function3args(int a, int b, int c)
{
    printf("%d %d %d\n", a, b, c);
}

int main(int argc, char **argv)
{
    int a;
    int *ptr;
    function3args(1, 2, 3);
}

---------------------------------------------------------
Like last time, let's compile it with gdb support and open up the disassembly in gdb. Oh, and you have that pen and paper with those columns too, right? ;)
[arvind@dilby ~]$ gcc -ofunc1 -ggdb functions.c
[arvind@dilby ~]$ gdb -q func1
Using host libthread_db library "/lib/libthread_db.so.1".
0x080483ed : lea 0x4(%esp),%ecx
0x080483f1 : and $0xfffffff0,%esp
0x080483f4 : pushl 0xfffffffc(%ecx)
0x080483f7 : push %ebp
0x080483f8 : mov %esp,%ebp
0x080483fa : push %ecx

Here's the status of esp for the first 6 instructions:

lea 0x4(%esp),%ecx -- No change in esp
and $0xfffffff0,%esp -- Logical and changes esp to bfc49ec0
Then there are 3 push instructions, which decrease the value of esp by 12. So after the first 6 instructions the value of esp is bfc49eb4 (bfc49ec0 - 12). Just before the last push, esp is saved into ebp. This value in ebp will not change at all till it is popped and the function main ends. You can check the memory at esp and ebp after each instruction by using x/xw $esp and x/xw $ebp (the address printed on the left is the register's value). To advance one instruction, type nexti.

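Then there is a sub $0x24,%esp, which allocates space on the stack for main's local variables and for the arguments of the call it is about to make. Why 0x24? Let's look at the code in main(): there are two 4-byte locals (a and ptr) and three 4-byte arguments for function3args, and gcc most likely pads the total so that the stack stays aligned.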
0x080483fb : sub $0x24,%esp

The 3 arguments are then placed on the stack. Note that they are written in reverse order (3 first, then 2, then 1), so the first argument ends up at the lowest address, right at esp.
0x080483fe : movl $0x3,0x8(%esp)
0x08048406 : movl $0x2,0x4(%esp)
0x0804840e : movl $0x1,(%esp)

Note down the values for esp and ebp carefully just before executing this instruction.
0x08048415 : call 0x80483c4

Now get the disassembly for the function - function3args and lets see what happens there:
0x080483c4 : push %ebp
0x080483c5 : mov %esp,%ebp

Notice that the stored value of ebp which had remained constant during the lifetime of main is pushed on to the stack? And the current stack pointer made the current value of ebp? If there's another function after this, ebp will be pushed on to the stack again and so on. Once the last function completes the ebp's of each function are popped off till you reach the ebp of main at which point the program exits.

0x080483c7 : sub $0x18,%esp
Space is allocated on the stack for function3args; here it is used for the arguments passed on to printf.

0x080483ca : mov 0x10(%ebp),%eax
0x080483cd : mov %eax,0xc(%esp)
0x080483d1 : mov 0xc(%ebp),%eax
0x080483d4 : mov %eax,0x8(%esp)
0x080483d8 : mov 0x8(%ebp),%eax
0x080483db : mov %eax,0x4(%esp)
Copy the three arguments of the function (found at 0x8, 0xc and 0x10 from ebp) into place on the stack for the printf call.
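
Why 0x8(%ebp), 0xc(%ebp) and 0x10(%ebp)? After push %ebp and mov %esp,%ebp, the saved ebp sits at 0(%ebp) and the return address pushed by the call sits at 0x4(%ebp), so the first argument starts at 0x8 and each further 4-byte argument is 4 bytes higher. A tiny sketch of that arithmetic, assuming 32-bit code with 4-byte arguments like the ones here (the function name is made up):

```c
#include <assert.h>

/* Offset from ebp of the index-th argument (0-based) in a 32-bit frame:
 * 4 bytes of saved ebp + 4 bytes of return address come first. */
int arg_offset_from_ebp(int index)
{
    assert(index >= 0);
    return 8 + 4 * index;
}
```

For function3args that gives 0x8 for a, 0xc for b and 0x10 for c, matching the three mov instructions above.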

0x080483df : movl $0x8048500,(%esp)
0x080483e6 : call 0x80482dc
Call the printf function with the arguments.

0x080483eb : leave
If you look at the value of ebp just after this instruction , you'd see its value change back to its earlier value which means this function has exited.

0x080483ec : ret
Exit from function3args

0x0804841a : add $0x24,%esp
0x0804841d : pop %ecx
0x0804841e : pop %ebp
0x0804841f : lea 0xfffffffc(%ecx),%esp
0x08048422 : ret
Exit from main.

Hope that clarified things a little better. Next post we won't go so much into detail, we'll make a couple of assumptions based on the previous 2 posts and learn a little more. Have fun :)

Sunday, April 4, 2010

Reverse Engineering - 3

Once you've got a fair idea on how to perform dynamic analysis its a good idea to try and start understanding how the exe/binary in question was actually built in the first place. Popularly that is what is called static analysis. The reason why we did dynamic analysis first is that it helps in you getting a great high level view of what the binary actually does and what its "real purpose" is. Plus its simpler to do..many a time you'd be able to get what you wanted by simply running an exe in a contained environment and studying its behavior. Other times you wont and you need to understand more about the binary. In such cases static analysis or reversing the binary will definitely help.

Reversing = Reading the code in which a binary was written. In some languages you can get the code back very easily for eg. Java .. where there are tools you can just feed the binary to..and get back the code. In others it isn't so easy. For eg. The code for an EXE written in C can't be got back that easily. Hence its good to learn the art of using disassemblers and debuggers and attempt to understand the assembly version of the binary.

NOTE: Don't look too much at Chapters 3,4 and 5. They're slightly out of flow and we'll refer back to them when needed. For now stick with me :D

Jump over to Chapter 6 now and read only the first 2 sections there and then we look at examples and learn. A concept IMO is best learnt through examples; instead of sitting and reading a 1000 page manual. The manual is best referred to when you feel the need to brush up on your concepts; specially true of Rev Engg where there are so many concepts all merged into one. Lets look at a little HelloWorld in Assembly. Note that I'm going really slow deliberately so we understand everything that takes place when a binary is run.

#include <stdio.h>
int main(){
    printf("hello");
}

We compile the code and run it and get the desired output:
gcc -o hello -ggdb hello.c
./hello

hello

Now lets see what happened underneath.
[arvind@dilby ~]$ gdb -q hello
Using host libthread_db library "/lib/libthread_db.so.1".
(gdb) disass main
Dump of assembler code for function main:
0x080483c4 : lea 0x4(%esp),%ecx
0x080483c8 : and $0xfffffff0,%esp
0x080483cb : pushl 0xfffffffc(%ecx)
0x080483ce : push %ebp
0x080483cf : mov %esp,%ebp
0x080483d1 : push %ecx
0x080483d2 : sub $0x4,%esp
0x080483d5 : movl $0x80484c0,(%esp)
0x080483dc : call 0x80482dc
0x080483e1 : add $0x4,%esp
0x080483e4 : pop %ecx
0x080483e5 : pop %ebp
0x080483e6 : lea 0xfffffffc(%ecx),%esp
0x080483e9 : ret
End of assembler dump.

We compiled the code, gave the binary the name "hello" and asked gcc to include debugging information for gdb using the -ggdb option, so we can view the source inside gdb when we want to. We then start gdb with the -q (quiet) option and get the assembly version of Hello World. Clear enough? Now let's understand the assembly.

The number 0x080483c4 is the address in memory where main starts. In gdb's own output each address is followed by its offset from the start of the function (for example <main+4>), and then comes the actual instruction itself.

Now if you read the first 2 sections of Chapter 6 carefully, you'd know that there is something called the function prolog mentioned there. Effectively, before the body of any function starts to execute, some housekeeping MUST happen on the stack. The prolog saves the current value of ebp so it can be restored later. The compiler has already worked out how much space the local variables declared inside the function (main() in this case) need, and the prolog pre-allocates that amount on the stack by decreasing the value of esp (the stack pointer). The value of esp will change all the time, while ebp WILL stay constant for the duration of the function.

Quoting now a few golden lines from Chapter 6:
Reading Assembly - Keep track of the stack and registers --- The secret to understanding assembly code is to always work with a sheet of paper and a pencil. When you first sit down, draw out a table for all 6 registers A, B, C, D, SI, and DI. Keep track of the high and low portions as well. Each new line of this table should represent a modification of a register, so the last value in each register column is the current value of that register. Next, draw out a long column for the stack, and leave space on the sides to place the BP and SP registers as they move down. Be sure to write all values into the stack as they are placed there, including ret and the stored BP.

If you're just starting off with rev engg like me I'm quite sure this is still confusing to you. No problem - it'll get cleared up as we go along.

Now back to the program, we want to see what happens at every single point so we put a breakpoint inside gdb on the first line itself as follows.
(gdb) br 1
Breakpoint 2 at 0x80483c4: file hello.c, line 1.
(gdb)

Now we run the program and it breaks as expected immediately:
(gdb) r
Starting program: /home/arvind/hello
Breakpoint 2, main () at hello.c:2
2 int main(){
(gdb)

Now the first instruction in the assembly code is lea 0x4(%esp),%ecx. This means: compute the address esp+4 and load that address (not the value stored there) into the ecx register. Immediately now we go to our pen and paper, look at the values for both esp and ecx and write them down.
(gdb) x/xw $esp+4
0xbfd3ffd0: 0x00000001
(gdb) x/xw $ecx
0xa2bffcc4: Cannot access memory at address 0xa2bffcc4

When we want to refer to a register in gdb we put a $ in front of its name; x/xw then shows the word stored at the address the register holds. Now if you notice, the word at esp+4 is 1, but ecx doesn't point there yet as we'd expect. That's because we had a breakpoint at the very start of the program and the first instruction has not executed yet. So let's execute that and check ecx again.

(gdb) nexti
0x080483c8 2 int main(){
(gdb) x/xw $ecx
0xbfd3ffd0: 0x00000001
(gdb)

Bingo! ecx now holds the address 0xbfd3ffd0, and the word stored there is 1. The gdb command nexti just says: I've executed 1 instruction, and the next instruction is at 0x080483c8. Clear enough? Let's go on to the next instruction now. Oh wait.. did you write down the values for esp and ecx on paper? ;) Let's do that for a while till we're really clear on what we're doing. Again:
(gdb) x/xw $esp
0xbfd3ffcc: 0x004d6390
(gdb) x/xw $ecx
0xbfd3ffd0: 0x00000001

The current value of esp is bfd3ffcc and that of ecx is bfd3ffd0 (because the address of esp+4 was loaded into it). You can use a hex calculator to cross check the hex as well ; side by side. esp +4 = bfd3ffcc+4 = bfd3ffd0. So all is well. Moving on then..we run nexti to execute the current instruction ...which is

and $0xfffffff0,%esp, which translates to and 0xfffffff0, 0xbfd3ffcc and comes out to bfd3ffc0. Note the $ just before 0xfffffff0? That says it's a value and not an address. A bitwise and gives the result bfd3ffc0, which is then stored in esp. Moving on..

(gdb) nexti
0x080483cb 2 int main(){
(gdb) x/xw $esp
0xbfd3ffc0: 0x004bcca0

The next instruction is pushl 0xfffffffc(%ecx). The displacement 0xfffffffc is -4, so this pushes the word stored at ecx-4 = bfd3ffcc, which is the value that was on top of the stack when main started (0x004d6390, main's return address). It doesn't change ecx. pushl takes 4 bytes, so you should now see esp go down by 4....

(gdb) x/xw $esp
0xbfd3ffbc: 0x004d6390
(gdb)

Yep.. It's bfd3ffbc, 4 less than its previous value bfd3ffc0. Similarly the next instruction, push %ebp, saves the old value of ebp and takes esp down another 4 bytes to 0xbfd3ffb8. Do you have 4 values for esp on your paper now??

(gdb) nexti
0x080483cf 2 int main(){
(gdb) x/xw $esp
0xbfd3ffb8: 0xbfd40028

The next is a mov instruction - mov %esp,%ebp ... which moves the address from esp into ebp(the base pointer) ; so after this executes the values for esp and ebp must be the same. Lets see..
(gdb) nexti
0x080483d1 2 int main(){
(gdb) x/xw $esp
0xbfd3ffb8: 0xbfd40028
(gdb) x/xw $ebp
0xbfd3ffb8: 0xbfd40028
(gdb)

Yep.. no problem. All normal. The next is another push which will bring down esp's value..note that ebp doesn't change. ..
(gdb) nexti
0x080483d2 2 int main(){
(gdb) x/xw $esp
0xbfd3ffb4: 0xbfd3ffd0
(gdb) x/xw $ebp
0xbfd3ffb8: 0xbfd40028

Then there's a sub instruction which further decreases esp by 4 -- sub $0x4,%esp ..to bfd3ffb0
(gdb) nexti
Breakpoint 1, main () at hello.c:3
3 printf("hello");
(gdb) x/xw $esp
0xbfd3ffb0: 0x004af940
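
As a cross-check for the pen-and-paper work so far, this sketch recomputes esp after each step of the prolog. It only models the arithmetic (the and, three 4-byte pushes and the sub); the struct and function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* esp after each prolog step; the field names are hypothetical labels. */
struct prolog_trace {
    uint32_t after_and;        /* and $0xfffffff0,%esp       */
    uint32_t after_pushl;      /* pushl 0xfffffffc(%ecx)     */
    uint32_t after_push_ebp;   /* push %ebp (copied to ebp)  */
    uint32_t after_push_ecx;   /* push %ecx                  */
    uint32_t after_sub;        /* sub $local_bytes,%esp      */
};

struct prolog_trace trace_prolog(uint32_t esp, uint32_t local_bytes)
{
    struct prolog_trace t;

    assert(local_bytes % 4 == 0);
    t.after_and      = esp & 0xfffffff0u;       /* round down to 16 bytes */
    t.after_pushl    = t.after_and - 4;
    t.after_push_ebp = t.after_pushl - 4;
    t.after_push_ecx = t.after_push_ebp - 4;
    t.after_sub      = t.after_push_ecx - local_bytes;
    return t;
}
```

Starting from esp = 0xbfd3ffcc with 4 bytes of locals, the fields come out as bfd3ffc0, bfd3ffbc, bfd3ffb8, bfd3ffb4 and bfd3ffb0, the same values gdb showed above.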

Hmm..notice that the next instruction is a printf?? Lets now look at our code..inside gdb..
(gdb) list
1 #include <stdio.h>
2 int main(){
3 printf("hello");
4 }
Yes...our code is only now..STARTING TO EXECUTE... notice all those little things that happen in the background..before even your first line of code executes? Very exciting to know..for me anyway :) . Well lets go on...you're still writing down..right? Moving on then.. the next inst is another move..

movl $0x80484c0,(%esp)
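The $ signifies that the value 0x80484c0 itself is stored, NOT the value at the address 0x80484c0. And the parentheses around %esp mean it is stored at the address esp points to (the top of the stack); esp itself doesn't change. It's important that we understand this. A look at the stack confirms it (0x80484c0 is where our "hello" string lives):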
(gdb) x/xw 0x80484c0
0x80484c0 <__dso_handle+4>: 0x6c6c6568
(gdb) x/xw $esp
0xbfd3ffb0: 0x080484c0
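
If the 0x6c6c6568 at 0x80484c0 looks like gibberish, remember that x86 is little-endian: the lowest byte of the word is the first byte in memory. This small sketch (word_to_chars is a made-up helper) unpacks a word the way gdb's x/xw prints it back into its four characters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Unpack a 32-bit word into its four bytes in memory order on a
 * little-endian machine (lowest byte first), NUL-terminated. */
void word_to_chars(uint32_t word, char out[5])
{
    assert(out != NULL);
    for (int i = 0; i < 4; i++)
        out[i] = (char)((word >> (8 * i)) & 0xffu);
    out[4] = '\0';
}
```

For 0x6c6c6568 the bytes are 0x68, 0x65, 0x6c, 0x6c, which spell the first four characters of our string.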

Now, if you look at your disassembly you'll see that the next instruction is a call to the actual printf function. Whenever you see this, remember that you don't need to step through all the instructions inside the printf call itself; just stick to your own program. This'll get clearer when we look at some little code with functions in it.

printf has executed successfully and returned to our code. Next come an add instruction and then two pop instructions, which means esp is incremented by 4 three times. Keep hitting nexti till the next instruction shows up as 0x080483e6 and then let's check the value of esp. It should be bfd3ffb0 + 12 (decimal) = bfd3ffb0 + C (hex) = bfd3ffbc...

(gdb) nexti
0x080483e4 4 }
(gdb) nexti
0x080483e5 4 }
(gdb) nexti
0x080483e6 in main () at hello.c:4
4 }
(gdb) x/xw $esp
0xbfd3ffbc: 0x004d6390
(gdb)

Great. Notice we're going up and reaching the earlier addresses again? That's a sign that the program is completing. The next is lea 0xfffffffc(%ecx),%esp, which loads the address ecx-4 into esp. Since the pop restored ecx to bfd3ffd0, esp jumps to bfd3ffcc, exactly where it was when main started..

(gdb) nexti
0x080483e9 4 }
(gdb) x/xw $esp
0xbfd3ffcc: 0x004d6390

We then close off with a ret, and the program exits after printing the hello, which is what it was supposed to do. Note that ebp stayed the same from the mov %esp,%ebp in the prolog until the pop %ebp at the end; only esp kept moving, and all the addresses in between were referred to relative to esp. There will still be a few questions in your minds I guess.. hopefully the future posts where we take a look at more programs from Chapter 6 will clear those up. Hope you enjoyed and understood this very basic introduction to assembly :)

Wednesday, January 27, 2010

Reverse Engineering - 2

The last post , we just felt around a little bit . The main things we understood were:

--Dynamic Runtime Program Analysis
--What Rev Engg was
--Compiling a program and the steps involved

Effectively, when you compile a program , you convert the code you wrote into a form which you can use to do something you couldn't do manually or which would have taken far too much time. When reversing you only have the final form ; the final binary/executable and need to find out exactly what it did. Assuming that you already did the dynamic analysis that Lenny Zeltser discussed ; the next step is to find out as much as you can about the program , the environment it is running in and the other components that make it run. During the first few blog posts I will be referring only to Linux as I'm far more comfortable with it than Windows.

When a Linux binary is run , it becomes a process which consumes resources on the host. While doing so it receives something called a PID(Process ID). The details about the various resources that the binary consumes are stored in the /proc folder on Linux. Lets look at one process entry for a running process ; say sshd (The SSH Daemon). Here is what a ps aux listing for ssh gives:
root 1501 0.0 0.2 6064 1080 ? Ss Jan25 0:06 /usr/sbin/sshd

The number 1501 will be a directory in /proc . Inside /proc/1501 will be all the resources that sshd consumes.
cmdline: Contains the command that started the process, with all its parameters. If its malware that's running this is a good place where you can get all the options the malware was started with.
[root@dilby 1501]# more cmdline
/usr/sbin/sshd

environ: Shows the environment variables that the process was started with.
[root@dilby 1501]# more environ
SELINUX_INIT=YESCONSOLE=/dev/console
The environment variables aren't separated clearly here because the entries are separated by NUL bytes, which more does not display. The variables are SELINUX_INIT and CONSOLE; YES and /dev/console are their values. These can be listed clearly as follows:
[root@dilby 1501]# cat /proc/1501/environ | tr '\0' '\n'
SELINUX_INIT=YES
CONSOLE=/dev/console
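
If you'd rather do the same thing from a program, the splitting is simple: walk the buffer and treat every NUL byte as the end of one NAME=value entry. A minimal sketch, assuming the contents of the environ file have already been read into memory (print_environ_entries is a made-up name):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Print each NUL-terminated entry of an environ-style buffer on its own
 * line and return how many entries were printed. */
size_t print_environ_entries(const char *buf, size_t len, FILE *out)
{
    size_t count = 0, pos = 0;

    assert(buf != NULL && out != NULL);
    while (pos < len) {
        const char *start = buf + pos;
        const char *end = memchr(start, '\0', len - pos);
        size_t n = end ? (size_t)(end - start) : len - pos;

        if (n > 0) {
            fprintf(out, "%.*s\n", (int)n, start);
            count++;
        }
        pos += n + 1;            /* skip the entry and its NUL */
    }
    return count;
}
```

Given a buffer holding "SELINUX_INIT=YES", a NUL, "CONSOLE=/dev/console" and a final NUL, it prints the two entries on separate lines and returns 2.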

fd: File descriptors for input, output and error for each process. In case a process is redirecting output somewhere, you know where. Here's a sample listing for the 1501 process. 0 (input), 1 (output) and 2 (error) all point to /dev/null (the black hole), which is typical of a daemon. It's also doing something over the network, as can be seen from 3 (socket:some number).
lrwx------. 1 root root 64 2010-01-29 15:12 0 -> /dev/null
lrwx------. 1 root root 64 2010-01-29 15:12 1 -> /dev/null
lrwx------. 1 root root 64 2010-01-29 15:12 2 -> /dev/null
lrwx------. 1 root root 64 2010-01-29 15:12 3 -> socket:[5715]


If you want to confirm that 5715 is something(socket) that actually does belong to SSH you can run netstat as follows.
[root@dilby ~]# netstat -ae | grep -v -i unix
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State User Inode
tcp 0 0 *:ssh *:* LISTEN root 5715

Ah, it's the inode number. The number in socket:[5715] is the inode of the socket, and netstat -e prints it in the Inode column. So anytime you want to find out if a process is doing something over the network, look for socket fd's in here.
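
Matching the two by eye gets tedious when a process has many descriptors. Here is a small sketch of pulling the inode number out of a link target such as socket:[5715], so it can be compared against netstat's Inode column. parse_socket_inode is a hypothetical helper; reading the link targets themselves (for example with readlink) is left out.

```c
#include <assert.h>
#include <stdio.h>

/* Returns 1 and stores the inode if target looks like "socket:[N]",
 * otherwise returns 0. */
int parse_socket_inode(const char *target, unsigned long *inode)
{
    char closing = 0;

    assert(target != NULL && inode != NULL);
    if (sscanf(target, "socket:[%lu%c", inode, &closing) == 2 && closing == ']')
        return 1;
    return 0;
}
```

For the listing above, the target of fd 3 yields 5715, while /dev/null is rejected.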

maps: Deals with the memory in use by the process, i.e. the areas addressable by the process and its dependencies. This will not make much sense just now; when we actually get to looking at ASM it'll help.

status: Provides information about the status of the process. Here's a sample:
Name: sshd
State: S (sleeping)
Tgid: 1501
Pid: 1501
PPid: 1

Apart from this, there's plenty of other information that you can get in the /proc directory. Discussing it at this point though, won't be too beneficial so I'll skip it.

What type of file is it? Is it a known file format? Does it have any dependencies?
Use file or ldd to find out. Here's an example:
[root@dilby 1501]# file ~arvind/a.out
/home/arvind/a.out: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.9, not stripped

ldd , if it gives you a long list is saying - This is a dynamic file and needs these libraries on your system to function properly. Here's an example:
[root@dilby 1501]# ldd ~arvind/a.out
linux-gate.so.1 => (0x00110000)
libc.so.6 => /lib/libc.so.6 (0x004c0000)
/lib/ld-linux.so.2 (0x004a1000)

If it were statically compiled (all libraries prepackaged into the binary) then you'd get very different messages.
[root@dilby arvind]# gcc -static a.c
[root@dilby arvind]# file a.out
a.out: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), statically linked, for GNU/Linux 2.6.9, not stripped
[root@dilby arvind]# ldd a.out
not a dynamic executable

For an initial analysis of the file on disk, that'll do. You probably want to check if the file is communicating with the network around it. The socket inode which we discussed above is one way. Another way is to look at lsof and netstat to look at active connections. Key options for netstat are:
-a - All connections, listening or not
-n - Show addresses and ports as numbers instead of resolving names
-p - PID and name of the program using that connection
-e - Extended information, including the inode number used by that connection
-l - Only listening connections
-r - Current routing table
-c - Repeat netstat every second. Useful when you want to check for new connections.
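On Linux the inode that netstat -e prints can also be found in /proc/net/tcp, which is handy for tying a socket inode back to a port. Below is a rough sketch that picks the port, state, uid and inode out of one line of that file. The column positions are what I see on my systems, and the sample line in the usage note is made up to match the sshd example; parse_proc_net_tcp_line is my own name.

```python
TCP_LISTEN = 0x0A  # state value the kernel uses for listening sockets

def parse_proc_net_tcp_line(line):
    """Pick the interesting columns out of one data line of /proc/net/tcp."""
    fields = line.split()
    # Column 1 is "hexaddress:hexport" for the local end of the socket.
    local_port_hex = fields[1].split(":")[1]
    return {
        "local_port": int(local_port_hex, 16),
        "state": int(fields[3], 16),
        "uid": int(fields[7]),
        "inode": int(fields[9]),
    }

def find_by_inode(lines, inode):
    """Return the parsed entry whose socket inode matches, or None."""
    for line in lines:
        entry = parse_proc_net_tcp_line(line)
        if entry["inode"] == inode:
            return entry
    return None
```

Feeding it a line such as "0: 00000000:0016 00000000:0000 0A 00000000:00000000 00:00000000 00000000 0 0 5715 1" and looking up inode 5715 gives local port 22 (0x16) in the listen state, owned by uid 0, which lines up with the netstat output for sshd.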

Running tcpdump or wireshark while the program runs is also helpful for viewing the traffic in greater detail. Here is a great cheat sheet for the same. We'll start dipping our feet into Assembly language next time.

Tuesday, January 26, 2010

Reverse Engineering - 1

We started discussing Rev Engg here. What we will do in this first post is take a gentle look at a little terminology that we'll encounter down the road. I won't touch Windows right now, because the basics are best learnt by using all of the open source tools that are available on Linux systems. The only requirement hence is a Linux system - Ubuntu works well, although all the necessary tools can be found on RedHat or probably any other Unix system as well.

Before doing that, however, what I'd like you guys to do is think about how you could analyze a trojan. The first thing that comes to mind is: run it and see what it does. After all, there's nothing like seeing it in action, right? There are a couple of problems with that which even a beginner like me can think of:

a) You need to be very careful that it doesn't damage any other systems at all.
b) There are numerous hidden mechanisms that might not be activated by just running it.

Problem a) could possibly be solved by carefully creating an isolated environment and ensuring that the system doesn't interact at all with the outside world. Problem b) is a toughie though - unless you have the code of the malware in front of you, you can't be sure that you found everything.

The advantage, though, is that you get a bird's-eye view of a lot of the key features of a trojan - something that would have taken much longer had you sat down with a million lines of assembly code. This kind of runtime study of a trojan is called dynamic analysis. While this series will primarily focus on understanding malware through assembly language, it is a great idea to run through Lenny Zeltser's Introduction to Malware course first. Once you're done, continue reading the rest of this post.

Caught your eye, didn't it? Not surprised at all ;). Great, now that you have a fair idea of what to expect with malware, let's get down to understanding actual reversing via assembly language. The only structured free work I could find online was over here. That guide, while very cool, is a little difficult to follow at times. So what I'm going to do is use it as a base and elaborate wherever needed, so we get the maximum possible benefit and learn as much as we can. I'm going to shamelessly link there (like I did above) wherever it's needed and I feel that I cannot put things any better than they already have. The whole idea really is to get the flow of learning this subject absolutely right. Well, let's go now!

Chapters 1 and 2 are very well written; they are great introductions to the nuts and bolts of the subject itself. Nothing to add here, just go ahead and read the whole of those and drop back here.

Ok great - at this point I'm just going to go over what we must be clear on before we move forward.
--- What reverse engineering is and what you are in for.

--- An understanding of the compilation process of a C program, including all the terminology used there. Since you don't want to keep referring back to all those basic definitions, which are very important nonetheless, I made a glossary sheet which I will keep adding to as I learn more and more.

Chapter 3 talks about getting a lot of information about the processes that run on your system. I will discuss that in the next part, going into just a little bit more detail than Chapter 3 does. Stick around.

Reverse Engineering - Introduction

Reverse Engineering - Series

I've been trying to learn Reverse Engineering for quite a while now. Granted, it's one of the tougher subjects to learn, and the literature that is out there is not very well organized. I have invariably found myself giving up on it somewhere down the line due to the lack of direction on how to proceed. What I am trying to do now is start right from the basics yet again - this time I plan to document the approach much better than I have done before, so that at least the next time I have some kind of reference point to start from. I am not sure how long this will take or how many parts it will contain. All I plan to do here is put down my learnings in an organized fashion, so that people new to this field do not struggle as much as I have and do not go down all the wrong paths of learning.

There are a few things that I have always got out of all those Reverse Engineering tutorials I have read. Here's the list:
a) RTFM - Politely tells you to read a lot
b) Learn how to debug - Here people will rave a lot about SoftICE, OllyDbg and W32Dasm and give examples
c) Learn assembly programming using NASM or something else - Will point you to a book in Assembly programming
d) Understand all the Intel syntax for instructions - Will point you to an Intel site
e) Solve crackme's - Little executables put together with a little bit of protection which you have to break
f) Examples - Many people will show you how they cracked something

Well, all of this is no doubt correct. But for a person like me it's still all too directionless, and there is no one best way to learn all this. What to take first? How to begin? I know I always had those questions in my mind and still do. However, I have now started on a path that I hope is correct. Over the next few articles I hope to blog as I learn. I'm still a novice, so do point out the mistakes I make and I'll correct them as I go on.