Wednesday, February 12, 2014

Generating ClamAV Signatures with IDAPython and MySQL

Covering malware is a constant fight and the more automation you can integrate, the easier life becomes. This post will go over a relatively easy setup for generating ClamAV signatures based on a set of samples.

I chose to work with OSX malware, specifically targeting Mach-O files. This would give me a relatively small sample set to work with. I downloaded the files from VirusTotal using the search type:macho positives:5+. At the time of download, this yielded 239 samples.

The first problem was grouping samples. Grouping the samples would allow me to generate a single signature for multiple samples. One signature per sample is costly and leads to a bloated signature set. For this, I set up three MySQL tables.

    binaries - stores information about each sample seen
    | Field | Type        | Null | Key | Default | Extra          |
    | id    | int(11)     | NO   | PRI | NULL    | auto_increment |
    | md5   | varchar(32) | NO   |     | NULL    |                |
    | size  | int(11)     | NO   |     | NULL    |                |

    functions - stores information about each function seen
    | Field | Type        | Null | Key | Default | Extra          |
    | id    | int(11)     | NO   | PRI | NULL    | auto_increment |
    | md5   | varchar(32) | NO   |     | NULL    |                |
    | size  | int(11)     | NO   |     | NULL    |                |

    link_table - associates each binary with a set of functions
    | Field   | Type    | Null | Key | Default | Extra |
    | prog_id | int(11) | NO   | PRI | NULL    |       |
    | fn_id   | int(11) | NO   | PRI | NULL    |       |

The table binaries stores a hash of each program, a unique id, and the program's size. The table functions stores the md5sum of the bytes comprising the function, a unique id, and the size of the function. The table link_table links each binary to the functions it contains. The grouping is done based on common functions between binaries.

To populate these tables I wrote an IDAPython script. It iterates through the functions of the program, calculates the md5sum of each one, and inserts that information into the functions table if the function is longer than 19 bytes. The value 19 was selected after some light analysis in order to filter out functions that consist of only a few instructions. Here is the snippet that populates functions and link_table.

    # assumes idaapi, idautils.Functions, idc.GetManyBytes and hashlib.md5 are imported
    # for all function offsets
    for fn_ea in Functions():
        if fn_ea is None:
            continue

        # get function from offset
        f = idaapi.get_func(fn_ea)

        # get function bytes
        start = f.startEA
        size = f.endEA - start
        bytes = GetManyBytes(start, size)

        # if the function is sufficiently long
        if bytes != None and len(bytes) > 19:
            fn_hash = md5(bytes).hexdigest().upper()
            fn_size = str(len(bytes))
            fn_data = (fn_hash, fn_size)

            # get function id, or insert and get function id
            fn_id = get_fn_id(cursor, cnx, fn_data)

            # link binary to function
            link_query = 'REPLACE INTO link_table (prog_id, fn_id) VALUES (%s, %s)'
            link_data = (prog_id, fn_id)
            cursor.execute(link_query, link_data)
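
The helper get_fn_id is not shown in the snippet above. What follows is only a minimal sketch of what such a helper might look like, assuming it looks up a row in functions by md5 and size and inserts one when none exists. To keep it runnable without a server it is written against Python's built-in sqlite3 module, which uses ? placeholders where MySQL Connector uses %s; the make_demo_db helper is hypothetical and exists only for the demonstration.

```python
import sqlite3


def get_fn_id(cursor, cnx, fn_data):
    """Return the id of the functions row for (md5, size), inserting it if missing."""
    fn_hash, fn_size = fn_data
    params = (fn_hash, int(fn_size))

    # look for an existing row first
    cursor.execute('SELECT id FROM functions WHERE md5 = ? AND size = ?', params)
    row = cursor.fetchone()
    if row is not None:
        return row[0]

    # not seen before: insert it and return the new auto-increment id
    cursor.execute('INSERT INTO functions (md5, size) VALUES (?, ?)', params)
    cnx.commit()
    return cursor.lastrowid


def make_demo_db():
    """Hypothetical stand-in for the MySQL connection: an in-memory functions table."""
    cnx = sqlite3.connect(':memory:')
    cursor = cnx.cursor()
    cursor.execute(
        'CREATE TABLE functions ('
        'id INTEGER PRIMARY KEY AUTOINCREMENT, '
        'md5 VARCHAR(32) NOT NULL, '
        'size INTEGER NOT NULL)'
    )
    return cnx, cursor
```

Calling get_fn_id twice with the same (md5, size) pair returns the same id, so a function shared by several binaries ends up as a single row that each binary links to.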

IDA and this script are called by a batch script for every target binary. Once these tables are populated, another script is run. This script uses MySQL's GROUP_CONCAT function to group binaries into lists based on their common functions. The problem with this approach is that if binaries A, B, and C share functions x, y, and z, and binaries A and C also share function w, the rows returned will contain overlapping lists: A, B, C for x, y, and z, and A, C for w. To remedy this, the script loops through the rows returned and removes any list of binaries that is completely contained in another list. Any binary not in these groupings is marked to get its own signature.
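
The containment check can be sketched in a few lines of Python. This is an illustration rather than the actual grouping script: it assumes each row has already been split into a list of binary identifiers (for example from a query along the lines of SELECT fn_id, GROUP_CONCAT(prog_id) FROM link_table GROUP BY fn_id), and the name prune_contained_groups is hypothetical. It also assumes that lists with a single binary are not groups at all, since those binaries get their own signatures.

```python
def prune_contained_groups(groups):
    """Drop duplicate groups and groups fully contained in a larger group."""
    # a set of frozensets removes exact duplicates; single-binary lists are skipped
    unique = {frozenset(group) for group in groups if len(group) > 1}

    # keep a group only if it is not a proper subset of any other group
    kept = [g for g in unique if not any(g < other for other in unique)]

    # return plain sorted lists so the result is stable and easy to read
    return sorted(sorted(g) for g in kept)


# rows for the example above: x, y, z are in A, B, C and w is only in A, C
rows = [['A', 'B', 'C'], ['A', 'B', 'C'], ['A', 'B', 'C'], ['A', 'C']]
groups = prune_contained_groups(rows)
```

For the example rows, the list A, C is discarded because it is contained in A, B, C, leaving a single group.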

Next, the md5sums of the functions common to each group are added to the table communicate. This was the best way for me to pass this information between scripts. Once this table is populated, another IDAPython script is called on the first binary in a group. This script iterates through the functions in the binary and if the function's md5sum matches one in the list of shared functions, its basic blocks are loaded into a table basic_blocks. This table stores the parent function's md5sum, the bytes that comprise the basic block, the basic block's md5sum, the size of the basic block, and its entropy. The byte_ prefix is used to differentiate between attributes of the raw data and the hex encoded version used in the ClamAV signatures.
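
The post does not say how the entropy value is calculated. One common choice, assumed here, is the Shannon entropy of the byte distribution, measured in bits per byte. The function name shannon_entropy is hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(data):
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    values = bytearray(data)  # yields integers on both Python 2 and Python 3
    if not values:
        return 0.0

    total = float(len(values))
    counts = Counter(values)

    # sum of -p * log2(p) over every byte value that occurs
    return -sum((n / total) * math.log(n / total, 2) for n in counts.values())
```

A block made of one repeated byte, such as a run of NOPs, scores 0, while a block with a wide mix of byte values scores closer to 8. That makes low-entropy padding an unattractive choice for a signature.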

    communicate - used to pass the md5s of common functions
    | Field  | Type        | Null | Key | Default | Extra |
    | fn_md5 | varchar(32) | NO   | PRI | NULL    |       |

    basic_blocks - stores basic block information from functions
    | Field        | Type        | Null | Key | Default | Extra |
    | fn_md5       | varchar(32) | NO   | PRI | NULL    |       |
    | hex_bytes    | mediumtext  | NO   |     | NULL    |       |
    | bb_md5       | varchar(32) | NO   | PRI | NULL    |       |
    | byte_size    | int(11)     | NO   |     | NULL    |       |
    | byte_entropy | double      | NO   |     | NULL    |       |

Once the basic blocks are stored, the IDAPython script completes and returns control to the signature generation script. The basic blocks are then queried, sorted by parent function and by the metric entropy * size. The script iterates through the functions and selects the best basic block from each based on that metric. It continues to do this until it has a sufficient number of bytes. It then constructs an LDB signature.
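
The selection step and the signature assembly might look roughly like the sketch below. Several things here are assumptions rather than details from the actual script: the row layout (fn_md5, hex_bytes, byte_size, byte_entropy), the round-robin order over functions, the byte threshold, the signature name, and the choice to require every subsignature to match (an expression such as 0&1&2). The only parts taken from ClamAV's documented logical signature format are the field order Name;TargetDescriptionBlock;LogicalExpression;Subsig0;Subsig1;... and target type 9 for Mach-O files.

```python
from collections import defaultdict


def select_blocks(rows, min_bytes):
    """Pick the best-scoring block per function until min_bytes are collected."""
    by_fn = defaultdict(list)
    for fn_md5, hex_bytes, byte_size, byte_entropy in rows:
        by_fn[fn_md5].append((byte_entropy * byte_size, byte_size, hex_bytes))
    for blocks in by_fn.values():
        blocks.sort(reverse=True)  # highest entropy * size first

    chosen, total, rank = [], 0, 0
    while total < min_bytes:
        took_any = False
        for fn_md5 in sorted(by_fn):
            blocks = by_fn[fn_md5]
            if rank < len(blocks) and total < min_bytes:
                _score, size, hex_bytes = blocks[rank]
                chosen.append(hex_bytes)
                total += size
                took_any = True
        if not took_any:
            break  # every block has been used; return what we have
        rank += 1  # next pass takes each function's next-best block
    return chosen


def build_ldb(name, subsigs):
    """Assemble a logical signature that requires all subsignatures to match."""
    logic = '&'.join(str(i) for i in range(len(subsigs)))
    return ';'.join([name, 'Target:9', logic] + list(subsigs))


rows = [
    ('f1', '9090', 2, 0.0),
    ('f1', '554889e5', 4, 2.0),
    ('f2', 'c3cc', 2, 1.0),
]
signature = build_ldb('Example.Group-1', select_blocks(rows, 6))
```

With these made-up rows the result is Example.Group-1;Target:9;0&1;554889e5;c3cc. Real basic blocks would be much longer than these toy values.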

With my newly created signatures, I ran a test on all the samples I had downloaded.

----------- SCAN SUMMARY -----------
Known viruses: 107
Engine version: 0.98.1
Scanned directories: 1
Scanned files: 239
Infected files: 190

Data scanned: 78.35 MB
Data read: 81.36 MB (ratio 0.96:1)
Time: 2.332 sec (0 m 2 s)

The interesting lines are highlighted. Since this script should give near total coverage, a detection rate of 190/239, while impressive, did not meet my expectations. Something was amiss. My colleague Shaun Hurley noticed that 64-bit Mach-O files were being neglected. Thinking about it, this made sense: IDA has different versions for 32-bit and 64-bit files. I modified the scripts to use idaw64.exe and reran them on the 64-bit binaries. The combined signature set was more impressive.

----------- SCAN SUMMARY -----------
Known viruses: 155
Engine version: 0.98.1
Scanned directories: 1
Scanned files: 239
Infected files: 232

Data scanned: 78.82 MB
Data read: 81.36 MB (ratio 0.97:1)
Time: 2.535 sec (0 m 2 s)

Great success!

This method does have some drawbacks. Since I was running it in a VM, concerns about hard disk space led me to group on functions rather than on basic blocks. This will be fixed by offloading MySQL to a dedicated machine whose hard drive I can fill up. In addition, only functions common to the binaries are considered when selecting basic blocks. This was an oversight on my part, since other functions may not be exact matches but could still share a lot of code. With the extra database space, I do not think grouping on basic blocks is unreasonable for these relatively small sample sets. Building in automatic identification of 32-bit and 64-bit files would also remove some manual effort from the process.
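
As a rough idea of how the 32-bit versus 64-bit identification could work, the first four bytes of a Mach-O file hold a magic number that distinguishes the two. The sketch below is not part of the current scripts. The function name macho_kind and its return values are hypothetical, and a batch script could use the result to choose between the 32-bit and 64-bit IDA executables.

```python
import struct

MH_MAGIC = 0xfeedface     # 32-bit Mach-O
MH_MAGIC_64 = 0xfeedfacf  # 64-bit Mach-O
FAT_MAGIC = 0xcafebabe    # fat (universal) binary, stored big-endian


def macho_kind(header):
    """Classify a file by its first four bytes: '32', '64', 'fat' or 'unknown'."""
    if len(header) < 4:
        return 'unknown'

    # the magic may be stored in either byte order, so read it both ways
    big = struct.unpack('>I', header[:4])[0]
    little = struct.unpack('<I', header[:4])[0]

    if MH_MAGIC in (big, little):
        return '32'
    if MH_MAGIC_64 in (big, little):
        return '64'
    if big == FAT_MAGIC:
        # Java class files share this magic, so a real script would check further
        return 'fat'
    return 'unknown'
```

In practice the header would come from something like open(path, 'rb').read(4). Fat binaries contain several architectures, so they would need to be split or processed once per slice.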

A good example of a signature generated for multiple samples is this one for Flashback:


While that signature is just extracted x86, it alerts on the following 15 samples:


Overall, I'm very happy with these results. Since IDA Pro is used to extract everything, this work will translate well to the other binary types that IDA is capable of parsing, most importantly portable executables.