Friday, March 28, 2014

ClamAV 0.95 Engine End of Life Announcement

ClamAV Community,

This notice is to inform you that effective immediately ClamAV 0.95 (and all minor versions) is no longer supported in accordance with ClamAV's EOL policy which can be found here:

https://github.com/vrtadmin/clamav-faq/blob/master/faq/faq-eol.md

While the current CVD's being distributed will still work on ClamAV 0.95, and we are not enabling the functionality to actually make those versions be able to update, this does serve as notice that we are no longer going to be testing against that version in our regression tests.

We will also be EOL'ing 0.96 in coming months, so if either of those versions is currently in use, it is highly suggested that you upgrade to the most current version.

Thank you for using ClamAV!

Wednesday, March 5, 2014

Programmatic Boolean Simplification and ClamAV Signatures

ClamAV uses boolean logic in its LDB signatures. Each LDB signature has a set of subsignatures that, when present, evaluate to True in its logical statement. When the logical statement evaluates to True, the signature alerts. My goal was to create something to clean up the logical statement in these signatures.

Logical & (and) and logical | (or) are the primary operators in ClamAV LDB signatures. An example representative of most LDB logical blocks would look something like 0&(1|2). This would indicate that for this virus, subsignature 0 must be present, and one of subsignature 1 or subsignature 2 must also be present for an alert to be triggered. ClamAV does allow you to specify how many times a subsignature can alert with the operators =, <, and >. For example, 0=0&1 would indicate that subsignature 0 should occur zero times in a file and subsignature 1 must be present. These conditional statements (example, 0>3) are treated in the same manner as a single subsignature (example, 0) in that they evaluate to True or not. This simplifies our problem because negations (N=0) are treated as positive matches and can simply be substituted with a dummy variable. For a more detailed description of ClamAV LDB signatures, please refer to signatures.pdf.

My goal starting out was to take a boolean equation and represent it in a form with minimal bytes. This means that unnecessary logic would need to be eliminated, as well, any boolean identity to reduce the number of bytes used should be applied. Some examples:

    Original              Reduced
    0|(0&1)               0 
    (0&1)|(0&2)           0&(1|2)

Over some months, I kept seeing cases where this capability would be useful. For example, if you had some set of subsignatures and some set of files, automatically construct an LDB signature that alerts on all those files. Another situation would be when LDB signatures are submitted. With an algorithm like this, a tool to could be written to automatically verify that the logical block is in a minimal representation. Signature verification and improvement is what I will focus on in this post.

Algorithms

With the utility for this growing, I began. The biggest difficulty I had starting out was that I had two assumptions. First, that there would be a single algorithm able to achieve this. That was incorrect. As well, I thought that the method for doing this would be available online. That was also incorrect.

The first step was to parse the logical block into something that is more easily manipulated programmatically. For this I implemented the Shunting Yard Algorithm to parse the equation into polish notation. Instead of rebuilding a prefix string, the logical block is translated into an expression tree. This program was implemented in Python, so the expression tree consisted of dicts and lists. For example:

    Statement:
    0&1&(2|3)

    Python:
    {'&':[0, {'&': [1, {'|': [2, 3]}]}]}

Once the expression tree is built, it is converted to disjunctive normal form. This makes the statement an "or (|) of ands (&)". If that description is hard to digest, it can be thought of as the boolean equivalent of a sum of products. This form is reached by recursively applying the boolean identity for distribution in order to expand the equation. The algorithm for this is to identify a the situation where an and (&) sub-tree has an or (|) child. You then expand and distribute the children. The difficult thing to figure out with this part was that once a branch was changed, the algorithm must be run again on that branch until it stops changing. An example of the input and output:

    Statement:
    0&(1|0)&2

    Distributed:
    (0&1&2)|(0&0&2)

Any duplicate and (&) sub-statement is eliminated. This prepares the logical block for the Quine-McCluskey algorithm. This algorithm is the programmatic equivalent of a Karnaugh Map. That is, it will minimize a boolean expression. However, it will only do this minimization in a logical sense. In terms of representation, the equation will likely still not be optimized. In the Quine-McCluskey algorithm, a truth table is constructed. This was represented by a list of lists in Python. Using the above example, we would get:

    Truth table:
    | 0 1 2 |
    | ===== |
    | 1 1 1 |    // 0&1&2 
    | 1 0 1 |    // 0&2

    Python:
    [[1,1,1], [1,0,1]]

Already, the redundant 0&0 has been reduced to a single 0 in the second statement. The algorithm then sorts the truth table entries based on their count of 1s. Any two terms that differ only by one digit can have that digit marked with a dash (-) to indicate that the term does not matter. For instance:
    Truth table:
    | 0 1 2 |
    | ===== |
    | 1 - 1 |
    | 1 - 1 |
I used 2 in place of a dash, since this was a list of integers it would keep processing simple.
    Python:
    [[1,2,1], [1,2,1]]
This shows that we can actually eliminate subsignature 1. Since the full signature will fire with or without that subsignature, it is not necessary to keep. Since this simple example does not demonstrate it, I will skip over the selection portion of Quine-McCluskey. The full algorithm is detailed on Wikipedia (linked above). We now have deleted subsignature 1 and have changed subsignature 2 to 1, leaving us with:

    Result: 
    0&1 

Now that we have a logically efficient equation we must minimize it. For this, the script recursively applies boolean identity for distribution. However, this time, it applies it in a way that extracts common terms from statements. If the current key is or (|), the function finds the most common value in this branch at a depth of 1. This means that it will return if it reaches another or (|) after passing an and (&).

The previous case is completely reduced, so I will use a different example here.

    Statement:
    (0&1&2)|(0&1&3)

                 |
             /       \
            &         &
          /   \     /   \
         &     0   0     &
       /   \           /   \
      3     1         1     2

It would start at the root node (|), then find the most common value in its subtree. Both 0 and 1 occur twice in the subtree, so it selects whichever it saw first. It then extracts that value and rebuilds the tree.

              &
          /       \
         1         |
               /       \
              &         &
            /   \     /   \
           3     0   0     2 

It continues to do this until no more values can be extracted.

    Result:
    0&1&(2|3)


This algorithm was not producing the results I wanted on larger equations until I added the capability for it to select subtrees as the most common value in addition to integers (leaf nodes).

The program then rebuilds the logical statement in prefix notation from the expression tree. It checks the difference in length between the original and simplified signature. If there is an improvement, the original and simplified equations are converted to a form that Z3 Python can read and equivalence is verified. Z3 has a slightly different syntax, but the biggest consideration is that ClamAV's method of representing variables as number is incompatible, so each number must be converted to a letter.

    slvr.add(Not(simp_z3 == orig_z3))
    
    if slvr.check() == unsat:
        if opt_demo:
            print 'Equivalence proven!'
        return True
    else:
        if opt_demo:
            print 'Not equivalent!'
            print 'Counterexample:'
            print s.model()
        return False

You ask Z3 to prove that the simplified equation does not equal the original equation. If this request is unsatisfiable, then you have proven the two equations are equivalent. Once equivalence is verified, the LDB signature is reconstructed with the improved logic and printed out.

Examples


This first example demonstrates the scripts ability to extract common values and reduce a signature via the distribution identity.

    ====================
    Test.Signature;Engine:51-255,Target:0;(0&2&3&4)|(1&2&3&4);41414141;42424242;43434343;45454545;46464646
    
    ORIG:  (0&2&3&4)|(1&2&3&4)
    SIMP:  (0|1)&2&3&4
    Equivalence proven!
    Reduced size by 8 bytes.
    
    Test.Signature;Engine:51-255,Target:0;(0|1)&2&3&4;41414141;42424242;43434343;45454545;46464646
    ====================

In this example the script combines terms to simplify the final signature.

    ====================
    Test.Signature;Engine:51-255,Target:0;0&(1|2)&((3&(5|6))|(4&(5|6)));41414141;42424242;43434343;45454545;46464646;47474747;48484848
    
    ORIG:  0&(1|2)&((3&(5|6))|(4&(5|6)))
    SIMP:  0&(1|2)&(3|4)&(5|6)
    Equivalence proven!
    Reduced size by 10 bytes.

    Test.Signature;Engine:51-255,Target:0;0&(1|2)&(3|4)&(5|6);41414141;42424242;43434343;45454545;46464646;47474747;48484848
    ====================

A simple situation would be eliminating redundant logic.

    ====================
    Test.Signature;Engine:51-255,Target:0;((0&1)|(1&0));41414141;42424242
    
    ORIG:  ((0&1)|(1&0))
    SIMP:  0&1
    Equivalence proven!
    Reduced size by 10 bytes.
    
    Test.Signature;Engine:51-255,Target:0;0&1;41414141;42424242
    ====================

As well, there are situations where the logic can make a subsignature unnecessary. When a subsignature is deleted, there is a little bookkeeping that needs to be done to make sure Z3 still verifies it. When building the Z3 equations it is important to note that subsignature 1 in the simplified equation is subsignature 2 in the original.

    ====================
    Test.Signature;Engine:51-255,Target:0;0&(1|0)&2;41414141;42424242;43434343
 
    Removing subsig 1 

    ORIG:  0&(1|0)&2
    SIMP:  0&1
    Equivalence proven!
    Reduced size by 15 bytes.
    Test.Signature;Engine:51-255,Target:0;0&1;41414141;43434343
    ====================

ClamAV Signature Set

The next question is what happens if we run this against the ClamAV LDB signature set. As of writing this, there are currently 615 LDB signatures in daily.cld. If this tool were to have been used on all the LDB signatures prior to publishing, it would have saved 712 bytes. A large portion of these savings comes from parentheses around the outside of the equation that are dropped. The tool would take something like (0&1) and produce 0&1.

If we were to update all the signatures with this tool there is an additional factor to consider. The first time a signature is updated a number is appended to the end of the signature name. For example, if you were to update the signature Win.Trojan.Agent it would be renamed Win.Trojan.Agent-1. That is a cost of 2 bytes. Any additional update increments the number on the end of the signature name. If the signature ended in -9, -99, etc., then there would be a cost of one byte when it is updated. Considering this, an update to the signature database would save 446 bytes. The fact that this is a relatively low number is reassuring to me - it means that we are writing pretty good signatures.

Conclusion

This problem was a lot of fun to solve since the solution was not available anywhere online. In future research, having this functionality will enable me to generate robust LDBs to cover clusters of malware with a single signature. On a functional level, this tool will save space and improve the overall quality of signatures published in the future. Thanks for reading.

Friday, February 21, 2014

Introducing OpenSSL as a dependency to ClamAV

In an upcoming release, we are planning on introducing OpenSSL as a dependency to ClamAV.  We wanted to get this out to the community for any feedback that could be provided in order for everyone to understand why we are doing it.  So first, I'll cover a few reasons we are planning to introduce it, then outline some Pros and Cons:

  1. Performance. OpenSSL has code optimized for many platforms. In several tests that we've performed, we've averaged a 70% increase in performance.
  2. OpenSSL’s code has had a lot of eyes on it. Cryptography is hard to get right.
  3. Planned future work depends on it.
Pros for OpenSSL:

  1. Industry-standard cryptography code
  2. Many, many eyes have looked over OpenSSL’s code.
  3. It’s used pretty much everywhere.
  4. We will be able to provide a better freshclam experience in a future release.
  5. PERFORMANCE
  6. Portability. OpenSSL works pretty much everywhere.
  7. Maintainability. With OpenSSL backing major infrastructure, operating systems provide quick patches/updates to OpenSSL.

Cons for OpenSSL:

  1. Possibly bigger memory footprint
  2. First required dependency for ClamAV’s engine
As always we are receptive to feedback from the community.  It is always welcome over on the ClamAV-Users list: http://www.clamav.net/lang/en/ml/

Wednesday, February 19, 2014

Open Source Community Meeting next week at RSA!

After a lot of hard work by our teams, and with RSA just a few days away, we are proud to announce that along with Cisco and Sourcefire's corporate teams being present at RSA, and for the first time we will also be holding an Open Source Community Meeting!

Matt Watchinski (Director of the Vulnerability Research Team) and myself, Joel Esler, (Open Source Manager) will be presenting on the state of our Open Source projects at Sourcefire, the state of Open Source now that we are Cisco,  some future developments and of course, open Q&A!

So here's some attendance details:

Open Source Community Meeting
Executive Conference Center
55 4th Street -- Level 2
San Francisco, CA 94103

Wednesday, February 26th, 2014
12:00pm - 2:00pm

Lunch will be provided on site.

We also have some exclusive Swag give-aways that not only no one else has, but aren't available anywhere else!  Available for the first 40 people that come through the door (if we have your size).

We'll have availability for about 50 people on site, so first come, first served, let's make this a repeating event!

We look forward to seeing you there!

Tuesday, February 18, 2014

Follow Up: Generating ClamAV Signatures with IDAPython and MySQL

I recently wrote a blog post Generating ClamAV Signatures with IDAPython and MySQL. In the comments, I was asked for more details on how the script generate_sigs.py groups binaries by functions.

The three tables used were shared in the previous post, but for convenience, here they are again.

    binaries - stores information about each sample seen
    +-------+-------------+------+-----+---------+----------------+
    | Field | Type        | Null | Key | Default | Extra          |
    +-------+-------------+------+-----+---------+----------------+
    | id    | int(11)     | NO   | PRI | NULL    | auto_increment |
    | md5   | varchar(32) | NO   |     | NULL    |                |
    | size  | int(11)     | NO   |     | NULL    |                |
    +-------+-------------+------+-----+---------+----------------+

    functions - stores information about each function seen
    +-------+-------------+------+-----+---------+----------------+
    | Field | Type        | Null | Key | Default | Extra          |
    +-------+-------------+------+-----+---------+----------------+
    | id    | int(11)     | NO   | PRI | NULL    | auto_increment |
    | md5   | varchar(32) | NO   |     | NULL    |                |
    | size  | int(11)     | NO   |     | NULL    |                |
    +-------+-------------+------+-----+---------+----------------+

    link_table - associates each binary with a set of functions
    +---------+---------+------+-----+---------+-------+
    | Field   | Type    | Null | Key | Default | Extra |
    +---------+---------+------+-----+---------+-------+
    | prog_id | int(11) | NO   | PRI | NULL    |       |
    | fn_id   | int(11) | NO   | PRI | NULL    |       |
    +---------+---------+------+-----+---------+-------+

To get the groups of binaries you find binaries that share a number of common functions. I'll build out the MySQL query so it is understandable. The inner query is here:

    SELECT fn_id,
        group_concat(prog_id
            ORDER BY prog_id) AS bn_list,     # get a list of binaries
        count(*) AS pcnt                      # count the binaries in that list
    FROM link_table
    GROUP BY fn_id HAVING pcnt > 2            # filter the results
    ORDER BY bn_list;


This gives a list of functions and their associated binaries if more than 2 binaries are associated with that function.

    +-------+----------------------------------------------------------+------+
    | fn_id | bn_list                                                  | pcnt |
    +-------+----------------------------------------------------------+------+
    |   993 | 10,16,63,74,76,87,92,93,124,126,129,135,145              |   13 |
    |   994 | 10,16,63,74,76,87,92,93,124,126,129,135,145              |   13 |
    |   995 | 10,16,63,74,76,87,92,93,124,126,129,135,145              |   13 |
    |  1021 | 11,15,28,77,86,91,136                                    |    7 |
    |  1116 | 11,15,28,86,136                                          |    5 |
    |  1258 | 12,20,22,127                                             |    4 |
    |  1118 | 12,22,127                                                |    3 |
    |  1364 | 14,24,140                                                |    3 |
    |  1434 | 18,59,68,71,73,83,84,110,119,120,137,138,148,150,154,157 |   16 |
    |  1425 | 18,68,71,83,84,110,119,120,138,148,150,154,157           |   13 |
    |  1426 | 18,68,71,83,84,110,119,120,138,148,150,154,157           |   13 |
    |  1427 | 18,68,71,83,84,110,119,120,138,148,150,154,157           |   13 |
    |  1428 | 18,68,71,83,84,110,119,120,138,148,150,154,157           |   13 |
    |  1429 | 18,68,71,83,84,110,119,120,138,148,150,154,157           |   13 |
    |  1430 | 18,68,71,83,84,110,119,120,138,148,150,154,157           |   13 |
    |  1436 | 18,68,71,83,84,110,119,120,138,148,150,154,157           |   13 |
    ...

This list is quite long so I've truncated it. An example result, function 1425 has 13 binaries associated with it, those binaries' ids are listed. That's great, but we really want a list of binaries and a list of functions that associate those binaries. So, we now embed the original query in a similar query that creates a list of functions grouped by the bn_list field.

    SELECT bn_list,
        group_concat(fn_id
            ORDER BY fn_id) AS fn_list      # get a list of functions for each bn_list
    FROM
        (SELECT fn_id,
            group_concat(prog_id
                ORDER BY prog_id) AS bn_list,
            count(*) AS pcnt
        FROM link_table
        GROUP BY fn_id HAVING pcnt > 1
        ORDER BY bn_list) AS t
    GROUP BY bn_list HAVING count(*) > 4;    # get groups connected by > 4 functions   

I also added count(*) < 23 to the last line of this query to get readable output. The resulting table is split and truncated below. Each row in bn_list corresponds to the same row in fn_list.

    +--------------------------------------------------------+
    | bn_list                                                |
    +--------------------------------------------------------+
    | 121,131                                                |
    | 18,68,71,83,84,110,119,120,138,148,150,154,157,167,182 |
    | 18,84,119,138,150,157,182                              |
    | 19,81,115,173                                          |
    | 26,95,142,146,165,183                                  |
    | 27,128                                                 |
    | 27,37,53,70,79,172                                     |
    | 30,64,100                                              |
    | 48,50,69,147,168                                       |
    | 59,73                                                  |
    | 96,105,181                                             |
    | 96,181                                                 |
    +--------------------------------------------------------+
    +--------------------------------------------------------+
    | fn_list                                                |
    +--------------------------------------------------------+
    | 37061,37062,37063,37064,37065,37066,37067,37068        |
    | 1425,1426,1427,1428,1429,1430,1436                     |
    | 1419,1420,1421,1422,1423,1424,1431,1432,1433,1435      |
    | 1437,1438,1439,1440,1441,1442,1443,1444,1445,1446,...  |
    | 4359,4360,4361,4362,4363                               |
    | 4572,4576,4577,4580,4634,4635,4644                     |
    | 4482,4483,4559,4560,4608,4622,4623,4624                |
    | 4779,4780,4781,4782,4783,4784,4785,4786,4787,4788,...  |
    | 12054,12086,12087,12102,12103,12105,12108,12109,...    |
    | 21291,21292,21293,21294,21295,21296,21297,21301,...    |
    | 31659,31661,31665,31671,31673                          |
    | 31651,31653,31655,31657,31663,31667,31669,31685,31687  |
    +--------------------------------------------------------+

This leaves us with a list of binaries grouped by 4 or more functions. This isn't perfect for creating signatures because some lines are contained completely in other lines. For example:

    +--------------------------------------------------------+
    | bn_list                                                |
    +--------------------------------------------------------+
    | 18,68,71,83,84,110,119,120,138,148,150,154,157,167,182 |
    | 18,84,119,138,150,157,182                              |
    +--------------------------------------------------------+
    | fn_list                                                |
    +--------------------------------------------------------+
    | 1425,1426,1427,1428,1429,1430,1436                     |
    | 1419,1420,1421,1422,1423,1424,1431,1432,1433,1435      |

Since one entry's binary list is a subset of the other entry's binary list, we can delete the shorter list and avoid having largely duplicate functionality. This is done programmatically once the query is returned to the Python script. I hope this fills in any blanks on grouping the functions. After this, another script is called to extract basic blocks from common functions and generate a signature with those bytes. The original post goes into a bit more detail on that, so I will end here.

Monday, February 17, 2014

Introducing ClamAV community signatures

I am pleased to announce the creation of a new ClamAV signatures contribution program. My name is Alain Zidouemba and I will be managing this program.

If you would like to submit a ClamAV signature, you may do so by emailing community-sigs [at] lists [dot]  clamav [dot] net. We will require that each signature:

- not be a hash-based signature
- be accompanied by a MD5/SHA1/SHA256 for a sample the signature is meant to detect.
- come with a brief description of the threat the signature is trying to detect and what the signature is looking for

Please DO NOT attach malware to your email. Instead, submit your sample here

Signatures submitted will be tweaked if necessary in order to conform to our standards. After the signature passes quality assurance testing, it will be released with proper attribution unless you prefer to remain anonymous.

You can subscribe to the mailing list here. More information about this program will be added in the FAQ in a few days.

We look forward to a fruitful collaboration on community-sigs [at] lists [dot] clamav [dot] net.

Wednesday, February 12, 2014

Generating ClamAV Signatures with IDAPython and MySQL

Covering malware is a constant fight and the more automation you can integrate, the easier life becomes. This post will go over a relatively easy setup for generating ClamAV signatures based on a set of samples.

I chose to work with OSX malware, specifically targeting Mach-O files. This would give me a relatively small sample set to work with. I downloaded the files from VirusTotal using the search type:macho positives:5+. At the time of download, this yielded 239 samples.

The first problem was grouping samples. Grouping the samples would allow to generate a single signature for multiple samples. One signature for each sample is costly and leads to a bloated signature set. For this, I set up three MySQL tables.

    binaries - stores information about each sample seen
    +-------+-------------+------+-----+---------+----------------+
    | Field | Type        | Null | Key | Default | Extra          |
    +-------+-------------+------+-----+---------+----------------+
    | id    | int(11)     | NO   | PRI | NULL    | auto_increment |
    | md5   | varchar(32) | NO   |     | NULL    |                |
    | size  | int(11)     | NO   |     | NULL    |                |
    +-------+-------------+------+-----+---------+----------------+

    functions - stores information about each function seen
    +-------+-------------+------+-----+---------+----------------+
    | Field | Type        | Null | Key | Default | Extra          |
    +-------+-------------+------+-----+---------+----------------+
    | id    | int(11)     | NO   | PRI | NULL    | auto_increment |
    | md5   | varchar(32) | NO   |     | NULL    |                |
    | size  | int(11)     | NO   |     | NULL    |                |
    +-------+-------------+------+-----+---------+----------------+

    link_table - associates each binary with a set of functions
    +---------+---------+------+-----+---------+-------+
    | Field   | Type    | Null | Key | Default | Extra |
    +---------+---------+------+-----+---------+-------+
    | prog_id | int(11) | NO   | PRI | NULL    |       |
    | fn_id   | int(11) | NO   | PRI | NULL    |       |
    +---------+---------+------+-----+---------+-------+


The table binaries stores a hash of each program, a unique id, and the program's size. The table functions stores the md5sum of the bytes comprising the function, a unique id, and the size of the function. The table link_table links each binary to the functions it contains. The grouping is done based on common functions between binaries.

In order to populate these tables I wrote an IDAPython script. It iterates through the functions of the program, calculates their md5sum, and then inserts that information into the functions table if its length is greater than 19. The value 19 was selected after some light analysis in order to filter out functions that only consisted of a few instructions. Here is the snippet that populates functions and link_table.

    # for all function offsets
    
for fn_ea in Functions():
        
if fn_ea == None:
           
continue

        # get function from offset
        f = idaapi.get_func(fn_ea)

        # get function bytes
        start = f.startEA
        size = f.endEA - start
        bytes = GetManyBytes(start, size)

        # if the function is sufficiently long
        if bytes != None and len(bytes) > 19:
            fn_hash = md5(bytes).hexdigest().upper()
            fn_size = str(len(bytes))
            fn_data = (fn_hash, fn_size)

            # get function id, or insert and get function id
            fn_id = get_fn_id(cursor, cnx, fn_data)

            # link binary to function
            link_query = 'REPLACE INTO link_table (prog_id, fn_id) VALUES (%s, %s)'
            link_data = (prog_id, fn_id)
            cursor.execute(link_query, link_data)
            cnx.commit().


IDA and this script are called by a batch script for every target binary. Once these tables are populated another script is ran, generate_sigs.py. This script uses the MySQL functionality group_concat to group binaries, based on their common functions, into a list. The problem with this approach is that if binaries A, B, and C share functions x, y, and z, and binaries A and C share functions w, x, y, and z, then we will have duplicates in the list returned. To remedy this problem the script simply loops through the rows returned and if any list of binaries is completely contained in another list, it is removed. Any binary not in these groupings is marked to get its own signature.

Next, the md5sums of the functions common to each group are added to the table communicate. This was the best way for me to pass this information between scripts. Once this table is populated, another IDAPython script is called on the first binary in a group. This script iterates through the functions in the binary and if the function's md5sum matches one in the list of shared functions, its basic blocks are loaded into a table basic_blocks. This table stores the parent function's md5sum, the bytes that comprise the basic block, the basic block's md5sum, the size of the basic block, and its entropy. The byte_ prefix is used to differentiate between attributes of the raw data and the hex encoded version used in the ClamAV signatures.

    communicate - used to pass the md5s of common functions
    +--------+-------------+------+-----+---------+-------+
    | Field  | Type        | Null | Key | Default | Extra |
    +--------+-------------+------+-----+---------+-------+
    | fn_md5 | varchar(32) | NO   | PRI | NULL    |       |
    +--------+-------------+------+-----+---------+-------+

    basic_blocks - stores basic block information from functions

    +--------------+-------------+------+-----+---------+-------+
    | Field        | Type        | Null | Key | Default | Extra |
    +--------------+-------------+------+-----+---------+-------+
    | fn_md5       | varchar(32) | NO   | PRI | NULL    |       |
    | hex_bytes    | mediumtext  | NO   |     | NULL    |       |
    | bb_md5       | varchar(32) | NO   | PRI | NULL    |       |
    | byte_size    | int(11)     | NO   |     | NULL    |       |
    | byte_entropy | double      | NO   |     | NULL    |       |
    +--------------+-------------+------+-----+---------+-------+


Once the basic blocks are stored, the IDAPython script completes and returns the the signature generation script. The basic blocks are queried for, sorted by their parent function and a metric entropy * size. The script then iterates through the functions and selects the best basic block based on the previously mentioned metric. It continues to do this until it has a sufficient amount of bytes. It then constructs an LDB signature.

With my newly created signatures, I ran a test on all the samples I had downloaded.

----------- SCAN SUMMARY -----------
Known viruses: 107
Engine version: 0.98.1
Scanned directories: 1
Scanned files: 239
Infected files: 190

Data scanned: 78.35 MB
Data read: 81.36 MB (ratio 0.96:1)
Time: 2.332 sec (0 m 2 s)


The interesting lines are highlighted. Since this script should give near total coverage, a detection rate of 190/239, while impressive, did not meet my expectations. Something was amiss. My colleague Shaun Hurley noticed that 64 bit Mach-O files were being neglected. Thinking about it, this made sense. IDA has different versions for 32 bit and 64 bit files. I modified the scripts to use idaw64.exe and reran them on the 64 bit binaries. The combined signature set was more impressive.

----------- SCAN SUMMARY -----------
Known viruses: 155
Engine version: 0.98.1
Scanned directories: 1
Scanned files: 239
Infected files: 232

Data scanned: 78.82 MB
Data read: 81.36 MB (ratio 0.97:1)
Time: 2.535 sec (0 m 2 s)


Great success!

This method does have some drawbacks. Since I was running it in a VM, concerns about hard disk space influenced the choice to group based on functions rather than grouping based on basic blocks. This will be fixed by offloading MySQL to a more dedicated machine whose hard drive I can fill up. As well, only common functions between the binaries are considered when selecting basic blocks. This was an oversight on my part since other functions may not be exact matches but could share a lot of common code. With the extra database space, I do not think grouping based on basic blocks is an unreasonable task for these relatively small sets of samples. Building in automatic identification of 32 bit and 64 bit files would remove some manual effort from the process.

A good example of a signature generated for multiple samples is this one for Flashback:

Osx.Trojan.Flashback-16;Engine:51-255,Target:9;0&1&2&3&4&5&6&7&8;5531C089E58B550885D2740B;5589E583EC28895DF48B5D088975F88B750C897DFC8B7D1085DB0F94C285F60F94C031C908C20F858A000000;5589E58B450885C0740B;5589E583EC18C744240801000000C744240414000000C7042400000000E83F020000C74008000000;5589E557565383EC2C8B45100FB7550C8B5D188945E08B451489542404895C24088945E48B451C89;8D4208894424088B450C89142489442404E818FFFFFF0FBEC0;5589E557565383EC1C8B7D0885FF7432;8B450C894210B801000000;C744240801000000C74424040C000000C7042400000000E877010000893089C28B073B03751E

While that signature is just extracted x86, it alerts on the following 15 samples:

B5942F202930DAFF45C79BDC7871C088
548981EF3FCB813FCD3ED2EBAB8102D7
C067B84DC59C93C1363FD9FC56CD2918
B0199B369A3FCC71653ED8A9F7990AFC
4E855DD770680F80A30B9805262BBEE6
EF2DB2EEB040BDF1D0A9A18F2775149B
9272778BB6FBC00131FFCECE51388ACB
BE1B0DB89A4798E6C11E4EBFB6B479AE
CED7C97304BFFD932822565E99460213
B94BF524A537C02DDA4CD047F61E00C4
14DE914B0101C0E7A2C7CF521557E747
657E5A48CEC24F0C6F516CA55581550F
647AF7013D0DA77B6E74D3C692B1B6C3
84352BF4A2FA95FC51AD0781000AA864
93734AEBC1670C22A79F08D1A0FCBD8F


Overall, I'm very happy with these results. Since IDAPro is used to extract everything, this work will translate well to the other binary types that IDA is capable of parsing - most importantly, portable executables.