ClamAV® blog: malware


Friday, November 4, 2011

Bytecode signatures for polymorphic malware

About one year ago Alain presented the LLVM-based ClamAV bytecode. We've realised that, besides that initial introduction, we've never shown any real life use case, nor did we ever demonstrate the incredible power and flexibility of the ClamAV bytecode engine. I'll try to fix that today.

I decided to target the Xpaj virus because it's a polymorphic file infector, which means that it is not easily detected with plain signatures.
Please note that I'm just focusing on the detection of Xpaj via bytecode signatures, not on Xpaj itself which was already thoroughly reviewed and explained.

Pic.1: Clean file

Pic.2: Same file as above, but infected with Xpaj

For the scope of this blog post, it suffices to say that Xpaj is a file infector targeting 32-bit Windows executables and DLLs which employs entry-point obfuscation (EPO) capabilities in order to make the detection harder. In particular, the virus code hijacks a few API calls in the .text section of the file, diverting them to its own routine.

This routine is located within the .text section and consists of a series of small chunks of code connected by jumps. Most of that is “garbage”. The only thing this preliminary block of code does is compute the code address for the next stage and jump to it. The actual viral code, as well as the overwritten blocks, are stored, in encrypted form, inside the data section.

Well... enough technical info already. From now on I'll just focus on the Xpaj detection, or rather, the detection of a rather simplified version of it in order to keep this blog post small and readable. The geeks can find the full source code here.

Let's start with a look at the virus entry point code:

push   ebp
mov    ebp, esp
sub    esp, XX

While these are technically enough bytes to create a signature based on the opcodes, such a signature would be a really bad idea. What we have there, in fact, is just a pretty standard function entry point.

After that we have some optional trash (do nothing) code, and then the virus saves the content of 3 random registers, which will be clobbered later by both the virus code and the trash engine too.

So far we can still get away with a signature that makes use of a wildcard. However, we still don't have much: stack allocation and 3 registers saved. That's still not enough.

Next, we've got the trash engine in all its glory, and eventually we reach a function call.
The trash code may or may not jump to another chunk of code. And that effectively kills our ability to use a normal (ndb or ldb) signature.

Not all is lost, though. We can still write a small piece of bytecode signature which follows the code through the trash and checks for specific fingerprints.

In particular we plan to scan the code section for something that looks like the following:

mov         edi, edi
push        ebp
mov         ebp, esp
sub         esp, $STACKSIZE
[optional trash]
push        eax (*)
push        edx (*)
push        edi (*)


(*) note, the registers are chosen randomly among the 32 bit general purpose registers except esp and ebp

[optional trash]
call        $DELTA

Here we are inside "$DELTA":

[optional trash]
mov         register, [ebp-stacksize]
[optional trash]
ret

Back outside the call we have a couple of other less interesting fingerprints and eventually the virus will jump to some runtime computed location. There are two ways by which this is achieved:

jmp         local_var

push        local_var
ret

Ok let's code...

First we look for the 5 static bytes at the virus entry point (EP):

seek(begin_of_the_code_section, SEEK_SET);
cur = file_find_limit("\x55\x89\xe5\x83\xec", 5, end_of_the_code_section);
if(cur < 0) return 0;
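To make the search step concrete, here is a minimal stand-alone C sketch of a bounded pattern search. It works on an in-memory buffer rather than on the scanned file, and find_limit() is a hypothetical helper written for this example, not the bytecode API's file_find_limit(). It assumes the whole match has to end before the limit.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the offset of the first occurrence of
 * pat inside buf[start..limit), or -1 if there is none.
 * The match must end at or before limit. */
long find_limit(const unsigned char *buf, size_t start, size_t limit,
                const unsigned char *pat, size_t patlen)
{
    assert(buf != NULL && pat != NULL);
    if (patlen == 0 || limit < patlen)
        return -1;
    for (size_t i = start; i + patlen <= limit; i++) {
        if (memcmp(buf + i, pat, patlen) == 0)
            return (long)i;
    }
    return -1;
}
```

Called with the five bytes 55 89 e5 83 ec and the bounds of the code section, it returns the offset of the first candidate entry point, which is where the disassembly would start.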

Then we set ourselves in a disassembly loop and we check if we got what we expect. Something along the lines of:

while(1) {
 struct DIS_fixed d;
 int next = DisassembleAt(&d, cur, space_remaining);
 if(next == -1) break; /* disasm error */
 cur = next; /* cur now points at the next op */
 [here we check the op]
}

As for the actual opcode matching, here are a few examples. The first thing we are interested in is the 3 pushes. In terms of bytecode we need to check that:

1. the opcode is OP_PUSH
2. the argument is a register
3. the register is one of (eax, ebx, ecx, edx, esi, edi)

In BC that'd be:

d.x86_opcode == OP_PUSH
d.arg[0].access_type == ACCESS_REG
d.arg[0].u.reg == REG_EAX || d.arg[0].u.reg == REG_ECX || d.arg[0].u.reg == REG_EDX || d.arg[0].u.reg == REG_EBX || d.arg[0].u.reg == REG_ESI || d.arg[0].u.reg == REG_EDI

Altogether:

if(d.x86_opcode == OP_PUSH && d.arg[0].access_type == ACCESS_REG && (d.arg[0].u.reg == REG_EAX || d.arg[0].u.reg == REG_ECX || d.arg[0].u.reg == REG_EDX || d.arg[0].u.reg == REG_EBX || d.arg[0].u.reg == REG_ESI || d.arg[0].u.reg ==  REG_EDI))
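A long condition like this is easier to read when wrapped in a small predicate. The sketch below does that in plain C. The enums and structs are simplified stand-ins declared only so that the example compiles on its own (the real ones come with the bytecode API), and the names mock_insn, mock_arg and is_reg_push are made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the disassembler types (assumption). */
enum x86_op  { OP_PUSH, OP_CALL, OP_OTHER };
enum access  { ACCESS_REG, ACCESS_IMM, ACCESS_OTHER };
enum x86_reg { REG_EAX, REG_ECX, REG_EDX, REG_EBX,
               REG_ESP, REG_EBP, REG_ESI, REG_EDI };

struct mock_arg  { enum access access_type; union { enum x86_reg reg; } u; };
struct mock_insn { enum x86_op x86_opcode; struct mock_arg arg[3]; };

/* True for the six general purpose registers the virus may save
 * (everything except esp and ebp). */
static bool is_saveable_reg(enum x86_reg r)
{
    switch (r) {
    case REG_EAX: case REG_ECX: case REG_EDX:
    case REG_EBX: case REG_ESI: case REG_EDI:
        return true;
    default:
        return false;
    }
}

/* True if the instruction is "push <one of the six registers>". */
bool is_reg_push(const struct mock_insn *d)
{
    assert(d != NULL);
    return d->x86_opcode == OP_PUSH
        && d->arg[0].access_type == ACCESS_REG
        && is_saveable_reg(d->arg[0].u.reg);
}
```

Inside the disassembly loop, a counter incremented each time the predicate holds would tell when all 3 pushes have been seen.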

Then we need to check for the call $DELTA. In other words we check that:

1. the opcode is a call
i.e.: d.x86_opcode == OP_CALL
2. the argument is an immediate relative value
i.e.: d.arg[0].access_type == ACCESS_REL

Then we pick the call target and we "jump" to it, but not before saving the return address:

int32_t target_address, return_address;
seek(cur-4, SEEK_SET); /* we position onto the call argument */
read(&target_address, sizeof(target_address)); /* we read the relative jump value */
target_address = le32_to_host(target_address); /* we handle big endian machines */
return_address = cur; /* we save the address to return to */
target_address = cur + target_address; /* we compute the address to jump to */
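The same arithmetic can be tried outside the bytecode engine. The following stand-alone C sketch decodes the 4-byte operand of a near call (opcode E8 followed by a little-endian rel32) from a plain buffer and adds it to the offset of the next instruction. read_le32() and call_target() are names invented for this example, and decoding byte by byte is just one way to stay independent of the host's endianness.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode a little-endian 32-bit value byte by byte, so the result
 * is the same on little and big endian hosts. */
static int32_t read_le32(const unsigned char *p)
{
    uint32_t v = (uint32_t)p[0]
               | ((uint32_t)p[1] << 8)
               | ((uint32_t)p[2] << 16)
               | ((uint32_t)p[3] << 24);
    return (int32_t)v;
}

/* next is the offset of the instruction that follows the 5-byte call,
 * i.e. the return address. The rel32 operand sits in the 4 bytes
 * right before it and is relative to next. */
int64_t call_target(const unsigned char *buf, size_t next)
{
    assert(buf != NULL && next >= 5);
    return (int64_t)next + read_le32(buf + next - 4);
}
```

A negative displacement simply yields a target below the return address, which is why the operand has to be treated as signed.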

Another interesting example is the trash code parser. There can be 3 types of trash ops:

A. Arithmetic or logic operation on a stack allocated DWORD based on an immediate or register value. Eg:

mov [ebp-xx], immed
add [ebp-xx], register

B. Arithmetic or logic operation on a 32bit register based on a stack allocated DWORD, another register or an immediate value. Eg:

mov register, [ebp-xx]
sub register, other_register

C. A jump to the next chunk of code. Eg:

jmp next_chunk

In more detail, for case A we check that:

1. d.x86_opcode is one of (OP_ADD, OP_ADC, OP_AND, OP_MOV, OP_OR, OP_SBB, OP_SUB, OP_XOR), i.e.:

d.x86_opcode == OP_ADD || d.x86_opcode == OP_ADC || d.x86_opcode == OP_AND || d.x86_opcode == OP_MOV || d.x86_opcode == OP_OR || d.x86_opcode == OP_SBB || d.x86_opcode == OP_SUB || d.x86_opcode == OP_XOR

2. the dest argument is a mem region:

d.arg[0].access_type == ACCESS_MEM

3. the access size is a DWORD:

d.arg[0].u.mem.access_size == SIZED

4. the dest argument is in the form [ebp-displacement]:

d.arg[0].u.mem.scale_reg == REG_EBP && d.arg[0].u.mem.scale == 1 && d.arg[0].u.mem.add_reg == REG_INVALID

5. the displacement fits within the local function stack:

d.arg[0].u.mem.displacement <= -4 && d.arg[0].u.mem.displacement >= -(int32_t)stacksize

6. the source argument can be anything (i.e. a register or an immediate value): nothing to check!
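Put together, the case A checks fit in two short predicates. This is a self-contained C sketch: the enums and the mock_* structs are simplified stand-ins that only mirror the field names used above, and is_trash_case_a() is a name chosen for this example rather than something taken from the full source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the disassembler types (assumption). */
enum x86_op  { OP_ADD, OP_ADC, OP_AND, OP_MOV, OP_OR,
               OP_SBB, OP_SUB, OP_XOR, OP_PUSH };
enum access  { ACCESS_REG, ACCESS_IMM, ACCESS_MEM };
enum opsize  { SIZEB, SIZEW, SIZED };
enum x86_reg { REG_EAX, REG_EBP, REG_INVALID };

struct mock_mem {
    enum opsize access_size;
    enum x86_reg scale_reg, add_reg;
    uint8_t scale;
    int32_t displacement;
};
struct mock_arg {
    enum access access_type;
    union { struct mock_mem mem; enum x86_reg reg; } u;
};
struct mock_insn { enum x86_op x86_opcode; struct mock_arg arg[3]; };

/* Check 1: the opcode is one of the eight trash operations. */
static bool is_trash_op(enum x86_op op)
{
    switch (op) {
    case OP_ADD: case OP_ADC: case OP_AND: case OP_MOV:
    case OP_OR:  case OP_SBB: case OP_SUB: case OP_XOR:
        return true;
    default:
        return false;
    }
}

/* Checks 2 to 5: a DWORD at [ebp-displacement] inside the local stack. */
static bool is_stack_dword(const struct mock_arg *a, uint32_t stacksize)
{
    return a->access_type == ACCESS_MEM
        && a->u.mem.access_size == SIZED
        && a->u.mem.scale_reg == REG_EBP
        && a->u.mem.scale == 1
        && a->u.mem.add_reg == REG_INVALID
        && a->u.mem.displacement <= -4
        && a->u.mem.displacement >= -(int32_t)stacksize;
}

/* Case A: trash operation whose destination is a stack DWORD. */
bool is_trash_case_a(const struct mock_insn *d, uint32_t stacksize)
{
    assert(d != NULL);
    return is_trash_op(d->x86_opcode) && is_stack_dword(&d->arg[0], stacksize);
}
```

Keeping the stack DWORD test in its own function pays off for the other cases, which need the same test on a different argument.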

Case B is very similar, except the arguments are reversed:

1. The dest argument is a register:

d.arg[0].access_type == ACCESS_REG

2a. The src arg is either another reg:

d.arg[1].access_type == ACCESS_REG

2b. Or it is an immediate:

d.arg[1].access_type == ACCESS_IMM

2c. Or it is a stack based DWORD:

d.arg[1].access_type == ACCESS_MEM && d.arg[1].u.mem.access_size == SIZED && d.arg[1].u.mem.scale_reg == REG_EBP && d.arg[1].u.mem.scale == 1 && d.arg[1].u.mem.add_reg == REG_INVALID && d.arg[1].u.mem.displacement <= -4 && d.arg[1].u.mem.displacement >= -(int32_t)stacksize

Finally, case C... Here we:

1. Check that the op is a jmp:

d.x86_opcode == OP_JMP

2. Check that it's got a relative immediate argument:

d.arg[0].access_type == ACCESS_REL

3. Then we can "jump" to the next position:

int32_t rel;
seek(cur-4, SEEK_SET); /* move onto the jmp argument */
read(&rel, sizeof(rel)); /* read it */
rel = le32_to_host(rel); /* make it big endian safe */
cur += rel; /* "jump" to it */
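When following jumps like this it is prudent to bound the number of hops, so that a crafted file with a jump loop cannot keep the scanner busy forever. The stand-alone C sketch below shows the idea on a raw buffer. It recognises only the near jmp encoding (opcode E9 followed by a rel32) by looking at the first byte instead of using the disassembler, and follow_jumps() and its max_hops parameter are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode a little-endian 32-bit value independently of host endianness. */
static int32_t read_le32(const unsigned char *p)
{
    uint32_t v = (uint32_t)p[0]
               | ((uint32_t)p[1] << 8)
               | ((uint32_t)p[2] << 16)
               | ((uint32_t)p[3] << 24);
    return (int32_t)v;
}

/* Follows consecutive near jumps starting at pos.
 * Returns the offset of the first byte that is not a jmp, or -1 if a
 * jump is truncated, leaves the buffer, or more than max_hops jumps
 * would have to be followed. */
long follow_jumps(const unsigned char *buf, size_t len, size_t pos, int max_hops)
{
    assert(buf != NULL);
    for (int hops = 0; hops <= max_hops; hops++) {
        if (pos >= len)
            return -1;
        if (buf[pos] != 0xe9)
            return (long)pos;          /* not a jmp: the chunk body starts here */
        if (len - pos < 5)
            return -1;                 /* truncated jmp */
        int64_t next = (int64_t)pos + 5;
        int64_t target = next + read_le32(buf + pos + 1);
        if (target < 0 || target >= (int64_t)len)
            return -1;                 /* jump out of the buffer */
        pos = (size_t)target;
    }
    return -1;                         /* too many hops, possibly a loop */
}
```

The same bound can be applied to the disassembly loop as a whole, for instance by capping the total number of decoded instructions.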

Blog post by Alberto Wu.

Tuesday, January 20, 2009

New Clamav-milter for ClamAV 0.95

ClamAV 0.95, which is currently scheduled for release by Sourcefire in March 2009, will include a redesigned and completely rewritten clamav-milter.

Developers and keen users of ClamAV may have noticed that the version of clamav-milter within the SVN repository has changed a lot. We want to let you know what we’ve done and why.

The most notable difference in the new clamav-milter is that the internal mode has been dropped which means that now you will need to run clamd. This has not only allowed us to keep clamav-milter compact and readable, but also it avoids a lot of code duplication. With the old clamav-milter, internal mode was almost the same as having an outdated clamd with a milter interface because we were not keeping the code up-to-date with clamd’s API.

The second important difference is that now clamav-milter has its own configuration and log files that replace the large number of command line switches in the previous version. To ease the difficulty of learning another configuration file, the new clamav-milter comes with a program that will generate a configuration file from your existing command line options and clamd.conf file.

Some features are no longer supported:

Notifications
Black-listing
Phish false positive prevention by use of a subset of SPF
Scanning information is no longer added to the email headers by default
Scanning and other information can no longer be added to message bodies

So Why Has This Been Done?

Nigel Horne, the program’s previous author, is no longer a member of ClamAV’s engineering team – he is now ClamAV’s product manager. The milter program did not support many new features included within ClamAV and hence clamav-milter was starting to lag behind and bugs were not being addressed. The code was over 7500 lines and we felt it was a great opportunity to rewrite the code from scratch to be more closely coupled with the rest of ClamAV.

As a result we have been able to support new features, including:

Clamav-milter can now run as a completely unprivileged user (e.g. nobody)
Quarantine has been reworked to use the native milter interface on later versions of Sendmail and Postfix that support it
White-listing now uses regular expressions, replacing strict matching
Support for Postfix has been added in addition to Sendmail
Full IPv6 support.

The new milter’s configuration file is designed to be consistent with the configuration file for clamd, allowing you to fine-tune specific configurations and to route log messages to a dedicated file.

The new milter supports load balancing to copies of clamd in a round-robin fashion. Should one instance of clamd temporarily go down, clamav-milter will issue probe requests every few minutes and the instance will be re-entered into the pool as soon as it becomes available again. Scan requests to remote clamds are performed via the STREAM command, while requests to a local scanner are (preferably) sent via a FILDES command (file descriptor passing over a UNIX socket). This allows system administrators to run clamav-milter and clamd as different users.
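For readers who want a picture of the selection logic, here is a minimal C sketch of round-robin picking that skips instances currently marked as down. It is an illustration only and not the code in clamav-milter: the scanner and pool structs and pool_pick() are invented for this example, and the periodic probing is reduced to a flag that some other part of the program would flip back once the instance answers again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical pool of clamd instances. */
struct scanner { const char *name; bool up; };
struct pool    { struct scanner *items; size_t count; size_t next; };

/* Returns the index of the next available scanner in round-robin
 * order, or -1 if every scanner is down (or the pool is empty). */
int pool_pick(struct pool *p)
{
    assert(p != NULL);
    for (size_t tried = 0; tried < p->count; tried++) {
        size_t i = p->next;
        p->next = (p->next + 1) % p->count;
        if (p->items[i].up)
            return (int)i;
    }
    return -1;
}
```

With three scanners and the second one down, successive calls return 0, 2, 0, 2 and so on. Once the second is marked up again it rejoins the rotation at its usual place.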

And last, but not least, if you prefer, you can continue to use the old version which is kept under …/contrib/old-clamav-milter.
The new clamav-milter will supersede the old one in ClamAV 0.95. Whilst the previous version will still be available, it will no longer be supported.

Installation Instructions

Run ./configure --enable-milter and make as usual.

There are two ways to configure the new clamav-milter:

Use the example clamav-milter.conf that we have provided as a template for your configuration. It's well documented, but if something's not clear please report it to bugs.clamav.net and we’ll fix it.
Run the make-clamav-milter-conf.pl script with the same run-time arguments you currently pass to clamav-milter and a configuration file will be generated automatically.

Known Issues

The white-list format now uses regular expressions, whereas the old format was a list of strings wrapped in “<>”, so the white-list file will need to be edited. We plan to add automated conversion of the file to later versions of the configuration converter script; in the meantime the file will need to be edited by hand.

The round-robin clamd selection requires more work. It works well with either one or a high number of instances of clamd, but the round-robin strategy is limited when the number of scanners is as low as two or three.

And Finally…

The new milter is currently a work in progress. Although we’ve tested it with several hundred GBs of emails, real-life situations are usually more complex than inside the lab. We’re working hard to ensure that clamav-milter is portable to more operating systems.

Please send us your feedback on the new program by adding a comment to this blog; we’re really interested to know what you think!

Monday, December 8, 2008

Catching Swizzor

Users of ClamAV’s cutting edge SVN release may have noticed that on 2nd December we added heuristic support to catch the Swizzor Trojan.

Released in late 2004, Swizzor downloads Adware and other Trojans and installs them on the infected machine. Just browsing websites can infect your PC if it is not properly patched or protected.

You may have been wondering why we’ve decided to include the heuristic algorithm in the engine rather than continue writing signatures to catch it. Swizzor is clever in the way that it changes itself so often and can mimic standard (and therefore clean) Windows programs. There are nearly 1000 signatures for Swizzor in the ClamAV signature database, yet nearly four years after it was written we are still receiving undetected samples.

By writing an algorithm to detect Swizzor and including that algorithm into the anti-virus engine of ClamAV we will save a lot of effort writing signatures, and new variants will be caught as soon as they are created. So far we’ve found no sign of false positives from the algorithm.

The variant of Swizzor that has been in the wild since early in 2008 has proven particularly difficult to find because it adds strings throughout itself that are almost random. We hit upon the idea of detecting Swizzor’s variants by analyzing these strings in the program. Although the strings are gibberish, they somehow looked to us as though they were automatically generated. At first we thought these strings, looking almost random, would be impossible to detect; but after some careful examination of the strings’ n-grams we were quickly able to generate a heuristic rule by building a decision tree using data mining.
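To give an idea of what examining n-grams means in practice, the C sketch below counts letter bigrams (n-grams with n = 2) in a string. It only illustrates the kind of raw feature a decision tree can be trained on. The post does not say which n-gram sizes, features or thresholds the engine uses, so the choice of bigrams and the count_bigrams() helper are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Map an ASCII letter to 0..25 (case-insensitive), anything else to -1. */
static int letter_index(char c)
{
    if (c >= 'a' && c <= 'z') return c - 'a';
    if (c >= 'A' && c <= 'Z') return c - 'A';
    return -1;
}

/* Adds the letter bigrams found in s to counts and returns how many
 * were added. A non-letter breaks the sequence, so "a-b" has none. */
size_t count_bigrams(const char *s, unsigned counts[26][26])
{
    assert(s != NULL && counts != NULL);
    size_t total = 0;
    int prev = -1;
    for (; *s != '\0'; s++) {
        int cur = letter_index(*s);
        if (prev >= 0 && cur >= 0) {
            counts[prev][cur]++;
            total++;
        }
        prev = cur;
    }
    return total;
}
```

Natural-language strings concentrate their counts on a small set of common bigrams, while generated gibberish tends to spread them differently, and that difference is the sort of signal a learned rule can pick up.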

The algorithm built into ClamAV shows a detection rate of over 83% on Trojan.Swizzor.Gen with no false positives, but the battle carries on to improve the detection rates even further.
