Friday, February 21, 2014

In an upcoming release, we are planning to introduce OpenSSL as a dependency of ClamAV. We wanted to get this out to the community so that everyone understands why we are doing it and can provide feedback. So first, I'll cover a few reasons we are planning to introduce it, then outline some Pros and Cons:

  1. Performance. OpenSSL has code optimized for many platforms. In several tests that we've performed, we've averaged a 70% increase in performance.
  2. OpenSSL’s code has had a lot of eyes on it. Cryptography is hard to get right.
  3. Planned future work depends on it.

Pros for OpenSSL:

  1. Industry-standard cryptography code
  2. Many, many eyes have looked over OpenSSL’s code.
  3. It’s used pretty much everywhere.
  4. We will be able to provide a better freshclam experience in a future release.
  5. PERFORMANCE
  6. Portability. OpenSSL works pretty much everywhere.
  7. Maintainability. With OpenSSL backing major infrastructure, operating systems provide quick patches/updates to OpenSSL.

Cons for OpenSSL:

  1. Possibly bigger memory footprint
  2. First required dependency for ClamAV’s engine

As always, we are receptive to feedback from the community. It is always welcome over on the ClamAV-Users list: http://www.clamav.net/lang/en/ml/

Wednesday, February 19, 2014

After a lot of hard work by our teams, and with RSA just a few days away, we are proud to announce that, in addition to Cisco and Sourcefire's corporate teams being present at RSA, we will for the first time also be holding an Open Source Community Meeting!

Matt Watchinski (Director of the Vulnerability Research Team) and I, Joel Esler (Open Source Manager), will be presenting on the state of our Open Source projects at Sourcefire, the state of Open Source now that we are Cisco, some future developments and, of course, open Q&A!

Here are the attendance details:

Open Source Community Meeting
Executive Conference Center
55 4th Street -- Level 2
San Francisco, CA 94103

Wednesday, February 26th, 2014
12:00pm - 2:00pm

Lunch will be provided on site.

We also have some exclusive swag giveaways that aren't available anywhere else! They are available for the first 40 people who come through the door (if we have your size).

We'll have room for about 50 people on site, so it's first come, first served. Let's make this a repeating event!

We look forward to seeing you there!

Tuesday, February 18, 2014

I recently wrote a blog post Generating ClamAV Signatures with IDAPython and MySQL. In the comments, I was asked for more details on how the script generate_sigs.py groups binaries by functions.

The three tables used were shared in the previous post, but for convenience, here they are again.

    binaries - stores information about each sample seen
    +-------+-------------+------+-----+---------+----------------+
    | Field | Type        | Null | Key | Default | Extra          |
    +-------+-------------+------+-----+---------+----------------+
    | id    | int(11)     | NO   | PRI | NULL    | auto_increment |
    | md5   | varchar(32) | NO   |     | NULL    |                |
    | size  | int(11)     | NO   |     | NULL    |                |
    +-------+-------------+------+-----+---------+----------------+

    functions - stores information about each function seen
    +-------+-------------+------+-----+---------+----------------+
    | Field | Type        | Null | Key | Default | Extra          |
    +-------+-------------+------+-----+---------+----------------+
    | id    | int(11)     | NO   | PRI | NULL    | auto_increment |
    | md5   | varchar(32) | NO   |     | NULL    |                |
    | size  | int(11)     | NO   |     | NULL    |                |
    +-------+-------------+------+-----+---------+----------------+

    link_table - associates each binary with a set of functions
    +---------+---------+------+-----+---------+-------+
    | Field   | Type    | Null | Key | Default | Extra |
    +---------+---------+------+-----+---------+-------+
    | prog_id | int(11) | NO   | PRI | NULL    |       |
    | fn_id   | int(11) | NO   | PRI | NULL    |       |
    +---------+---------+------+-----+---------+-------+

To get the groups of binaries, you find binaries that share a number of common functions. I'll build up the MySQL query in steps so it is understandable. The inner query is here:

    SELECT fn_id,
        group_concat(prog_id
            ORDER BY prog_id) AS bn_list,     # get a list of binaries
        count(*) AS pcnt                      # count the binaries in that list
    FROM link_table
    GROUP BY fn_id HAVING pcnt > 2            # filter the results
    ORDER BY bn_list;


This gives a list of functions and their associated binaries if more than 2 binaries are associated with that function.

    +-------+----------------------------------------------------------+------+
    | fn_id | bn_list                                                  | pcnt |
    +-------+----------------------------------------------------------+------+
    |   993 | 10,16,63,74,76,87,92,93,124,126,129,135,145              |   13 |
    |   994 | 10,16,63,74,76,87,92,93,124,126,129,135,145              |   13 |
    |   995 | 10,16,63,74,76,87,92,93,124,126,129,135,145              |   13 |
    |  1021 | 11,15,28,77,86,91,136                                    |    7 |
    |  1116 | 11,15,28,86,136                                          |    5 |
    |  1258 | 12,20,22,127                                             |    4 |
    |  1118 | 12,22,127                                                |    3 |
    |  1364 | 14,24,140                                                |    3 |
    |  1434 | 18,59,68,71,73,83,84,110,119,120,137,138,148,150,154,157 |   16 |
    |  1425 | 18,68,71,83,84,110,119,120,138,148,150,154,157           |   13 |
    |  1426 | 18,68,71,83,84,110,119,120,138,148,150,154,157           |   13 |
    |  1427 | 18,68,71,83,84,110,119,120,138,148,150,154,157           |   13 |
    |  1428 | 18,68,71,83,84,110,119,120,138,148,150,154,157           |   13 |
    |  1429 | 18,68,71,83,84,110,119,120,138,148,150,154,157           |   13 |
    |  1430 | 18,68,71,83,84,110,119,120,138,148,150,154,157           |   13 |
    |  1436 | 18,68,71,83,84,110,119,120,138,148,150,154,157           |   13 |
    ...

This list is quite long, so I've truncated it. As an example result, function 1425 has 13 binaries associated with it, and those binaries' ids are listed. That's great, but what we really want is a list of binaries together with the list of functions that associates those binaries. So, we now embed the original query (with its filter lowered to pcnt > 1) in a similar query that creates a list of functions grouped by the bn_list field.

    SELECT bn_list,
        group_concat(fn_id
            ORDER BY fn_id) AS fn_list      # get a list of functions for each bn_list
    FROM
        (SELECT fn_id,
            group_concat(prog_id
                ORDER BY prog_id) AS bn_list,
            count(*) AS pcnt
        FROM link_table
        GROUP BY fn_id HAVING pcnt > 1
        ORDER BY bn_list) AS t
    GROUP BY bn_list HAVING count(*) > 4;    # get groups connected by > 4 functions   

I also added count(*) < 23 to the last line of this query to get readable output. The resulting table is split and truncated below. Each row in bn_list corresponds to the same row in fn_list.

    +--------------------------------------------------------+
    | bn_list                                                |
    +--------------------------------------------------------+
    | 121,131                                                |
    | 18,68,71,83,84,110,119,120,138,148,150,154,157,167,182 |
    | 18,84,119,138,150,157,182                              |
    | 19,81,115,173                                          |
    | 26,95,142,146,165,183                                  |
    | 27,128                                                 |
    | 27,37,53,70,79,172                                     |
    | 30,64,100                                              |
    | 48,50,69,147,168                                       |
    | 59,73                                                  |
    | 96,105,181                                             |
    | 96,181                                                 |
    +--------------------------------------------------------+
    +--------------------------------------------------------+
    | fn_list                                                |
    +--------------------------------------------------------+
    | 37061,37062,37063,37064,37065,37066,37067,37068        |
    | 1425,1426,1427,1428,1429,1430,1436                     |
    | 1419,1420,1421,1422,1423,1424,1431,1432,1433,1435      |
    | 1437,1438,1439,1440,1441,1442,1443,1444,1445,1446,...  |
    | 4359,4360,4361,4362,4363                               |
    | 4572,4576,4577,4580,4634,4635,4644                     |
    | 4482,4483,4559,4560,4608,4622,4623,4624                |
    | 4779,4780,4781,4782,4783,4784,4785,4786,4787,4788,...  |
    | 12054,12086,12087,12102,12103,12105,12108,12109,...    |
    | 21291,21292,21293,21294,21295,21296,21297,21301,...    |
    | 31659,31661,31665,31671,31673                          |
    | 31651,31653,31655,31657,31663,31667,31669,31685,31687  |
    +--------------------------------------------------------+
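For readers who do not use MySQL, the two-level grouping above can be sketched in plain Python. This is only an illustration of the same idea, not the code in generate_sigs.py; the function name `group_binaries` and its parameters are hypothetical, and the thresholds mirror `pcnt > 1` and `count(*) > 4` from the query.

```python
from collections import defaultdict


def group_binaries(links, min_binaries=2, min_functions=5):
    """Group binaries by the exact set of functions they share.

    links: iterable of (prog_id, fn_id) pairs, like rows of link_table.
    Returns {tuple_of_prog_ids: sorted_list_of_fn_ids}.
    """
    # Inner query: for each function, collect the binaries that contain it.
    binaries_by_fn = defaultdict(set)
    for prog_id, fn_id in links:
        binaries_by_fn[fn_id].add(prog_id)

    # Outer query: group functions that have an identical binary list.
    fns_by_bn_list = defaultdict(list)
    for fn_id, progs in binaries_by_fn.items():
        if len(progs) >= min_binaries:            # HAVING pcnt > 1
            fns_by_bn_list[tuple(sorted(progs))].append(fn_id)

    return {
        bn_list: sorted(fn_list)
        for bn_list, fn_list in fns_by_bn_list.items()
        if len(fn_list) >= min_functions          # HAVING count(*) > 4
    }


# Binaries 1 and 2 share functions 10-14; function 20 is shared by 1, 2 and 3.
links = [(p, f) for p in (1, 2) for f in range(10, 15)]
links += [(1, 20), (2, 20), (3, 20), (3, 30)]
print(group_binaries(links))
```

The group (1, 2, 3) is dropped here because it is connected by only one function, which is what the outer HAVING clause does in the query.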

This leaves us with a list of binaries grouped by more than 4 shared functions (that is, at least 5). This isn't perfect for creating signatures because some lines are contained completely in other lines. For example:

    +--------------------------------------------------------+
    | bn_list                                                |
    +--------------------------------------------------------+
    | 18,68,71,83,84,110,119,120,138,148,150,154,157,167,182 |
    | 18,84,119,138,150,157,182                              |
    +--------------------------------------------------------+
    +--------------------------------------------------------+
    | fn_list                                                |
    +--------------------------------------------------------+
    | 1425,1426,1427,1428,1429,1430,1436                     |
    | 1419,1420,1421,1422,1423,1424,1431,1432,1433,1435      |
    +--------------------------------------------------------+

Since one entry's binary list is a subset of the other entry's binary list, we can delete the shorter list and avoid having largely duplicate functionality. This is done programmatically once the query is returned to the Python script. I hope this fills in any blanks on grouping the functions. After this, another script is called to extract basic blocks from common functions and generate a signature with those bytes. The original post goes into a bit more detail on that, so I will end here.
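The subset removal described above can be sketched as follows. This is a minimal illustration, assuming the query rows have already been turned into a dict keyed by a tuple of binary ids; the name `drop_contained_groups` is hypothetical and this is not the actual code from generate_sigs.py.

```python
def drop_contained_groups(groups):
    """Remove every group whose binaries are a proper subset of another group's.

    groups: dict mapping a tuple of binary ids to a list of function ids.
    """
    sets = {bn_list: frozenset(bn_list) for bn_list in groups}
    return {
        bn_list: fn_list
        for bn_list, fn_list in groups.items()
        # "<" on frozensets tests for a proper subset
        if not any(sets[bn_list] < other for other in sets.values())
    }


groups = {
    (18, 68, 71, 83, 84, 110, 119, 120, 138, 148, 150, 154, 157, 167, 182):
        [1425, 1426, 1427, 1428, 1429, 1430, 1436],
    (18, 84, 119, 138, 150, 157, 182):
        [1419, 1420, 1421, 1422, 1423, 1424, 1431, 1432, 1433, 1435],
}
for bn_list in drop_contained_groups(groups):
    print(bn_list)
```

Comparing every group against every other group is quadratic in the number of groups, which should be acceptable for a few hundred rows like the ones shown here.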

Monday, February 17, 2014

I am pleased to announce the creation of a new ClamAV signatures contribution program. My name is Alain Zidouemba and I will be managing this program.

If you would like to submit a ClamAV signature, you may do so by emailing community-sigs [at] lists [dot] clamav [dot] net. We will require that each signature:

- not be a hash-based signature
- be accompanied by an MD5/SHA1/SHA256 for a sample the signature is meant to detect
- come with a brief description of the threat the signature is trying to detect and what the signature is looking for

Please DO NOT attach malware to your email. Instead, submit your sample here

Signatures submitted will be tweaked if necessary in order to conform to our standards. After the signature passes quality assurance testing, it will be released with proper attribution unless you prefer to remain anonymous.

You can subscribe to the mailing list here. More information about this program will be added in the FAQ in a few days.

We look forward to a fruitful collaboration on community-sigs [at] lists [dot] clamav [dot] net.

Wednesday, February 12, 2014

Covering malware is a constant fight, and the more automation you can integrate, the easier life becomes. This post will go over a relatively easy setup for generating ClamAV signatures based on a set of samples.

I chose to work with OSX malware, specifically targeting Mach-O files. This would give me a relatively small sample set to work with. I downloaded the files from VirusTotal using the search type:macho positives:5+. At the time of download, this yielded 239 samples.

The first problem was grouping samples. Grouping the samples would allow a single signature to be generated for multiple samples. One signature for each sample is costly and leads to a bloated signature set. For this, I set up three MySQL tables.

    binaries - stores information about each sample seen
    +-------+-------------+------+-----+---------+----------------+
    | Field | Type        | Null | Key | Default | Extra          |
    +-------+-------------+------+-----+---------+----------------+
    | id    | int(11)     | NO   | PRI | NULL    | auto_increment |
    | md5   | varchar(32) | NO   |     | NULL    |                |
    | size  | int(11)     | NO   |     | NULL    |                |
    +-------+-------------+------+-----+---------+----------------+

    functions - stores information about each function seen
    +-------+-------------+------+-----+---------+----------------+
    | Field | Type        | Null | Key | Default | Extra          |
    +-------+-------------+------+-----+---------+----------------+
    | id    | int(11)     | NO   | PRI | NULL    | auto_increment |
    | md5   | varchar(32) | NO   |     | NULL    |                |
    | size  | int(11)     | NO   |     | NULL    |                |
    +-------+-------------+------+-----+---------+----------------+

    link_table - associates each binary with a set of functions
    +---------+---------+------+-----+---------+-------+
    | Field   | Type    | Null | Key | Default | Extra |
    +---------+---------+------+-----+---------+-------+
    | prog_id | int(11) | NO   | PRI | NULL    |       |
    | fn_id   | int(11) | NO   | PRI | NULL    |       |
    +---------+---------+------+-----+---------+-------+


The table binaries stores a hash of each program, a unique id, and the program's size. The table functions stores the md5sum of the bytes comprising the function, a unique id, and the size of the function. The table link_table links each binary to the functions it contains. The grouping is done based on common functions between binaries.

In order to populate these tables I wrote an IDAPython script. It iterates through the functions of the program, calculates their md5sum, and then inserts that information into the functions table if its length is greater than 19. The value 19 was selected after some light analysis in order to filter out functions that only consisted of a few instructions. Here is the snippet that populates functions and link_table.

    from hashlib import md5

    import idaapi
    from idautils import Functions
    from idc import GetManyBytes

    # prog_id, cursor, cnx, and get_fn_id are assumed to be defined
    # earlier in the script (binary id, MySQL cursor/connection, helper)

    # for all function offsets
    for fn_ea in Functions():
        if fn_ea is None:
            continue

        # get function from offset
        f = idaapi.get_func(fn_ea)

        # get function bytes
        start = f.startEA
        size = f.endEA - start
        fn_bytes = GetManyBytes(start, size)

        # if the function is sufficiently long
        if fn_bytes is not None and len(fn_bytes) > 19:
            fn_hash = md5(fn_bytes).hexdigest().upper()
            fn_size = str(len(fn_bytes))
            fn_data = (fn_hash, fn_size)

            # get function id, or insert and get function id
            fn_id = get_fn_id(cursor, cnx, fn_data)

            # link binary to function
            link_query = 'REPLACE INTO link_table (prog_id, fn_id) VALUES (%s, %s)'
            link_data = (prog_id, fn_id)
            cursor.execute(link_query, link_data)
            cnx.commit()


IDA and this script are called by a batch script for every target binary. Once these tables are populated, another script is run: generate_sigs.py. This script uses the MySQL function group_concat to group binaries, based on their common functions, into a list. The problem with this approach is that if binaries A, B, and C share functions x, y, and z, and binaries A and C share functions w, x, y, and z, then we will have duplicates in the list returned. To remedy this problem, the script simply loops through the rows returned, and if any list of binaries is completely contained in another list, it is removed. Any binary not in these groupings is marked to get its own signature.

Next, the md5sums of the functions common to each group are added to the table communicate. This was the best way for me to pass this information between scripts. Once this table is populated, another IDAPython script is called on the first binary in a group. This script iterates through the functions in the binary and if the function's md5sum matches one in the list of shared functions, its basic blocks are loaded into a table basic_blocks. This table stores the parent function's md5sum, the bytes that comprise the basic block, the basic block's md5sum, the size of the basic block, and its entropy. The byte_ prefix is used to differentiate between attributes of the raw data and the hex encoded version used in the ClamAV signatures.

    communicate - used to pass the md5s of common functions
    +--------+-------------+------+-----+---------+-------+
    | Field  | Type        | Null | Key | Default | Extra |
    +--------+-------------+------+-----+---------+-------+
    | fn_md5 | varchar(32) | NO   | PRI | NULL    |       |
    +--------+-------------+------+-----+---------+-------+

    basic_blocks - stores basic block information from functions
    +--------------+-------------+------+-----+---------+-------+
    | Field        | Type        | Null | Key | Default | Extra |
    +--------------+-------------+------+-----+---------+-------+
    | fn_md5       | varchar(32) | NO   | PRI | NULL    |       |
    | hex_bytes    | mediumtext  | NO   |     | NULL    |       |
    | bb_md5       | varchar(32) | NO   | PRI | NULL    |       |
    | byte_size    | int(11)     | NO   |     | NULL    |       |
    | byte_entropy | double      | NO   |     | NULL    |       |
    +--------------+-------------+------+-----+---------+-------+


Once the basic blocks are stored, the IDAPython script completes and returns to the signature generation script. The basic blocks are queried for, sorted by their parent function and by the metric entropy * size. The script then iterates through the functions and selects the best basic block of each based on that metric. It continues to do this until it has a sufficient number of bytes. It then constructs an LDB signature.
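The selection step can be sketched roughly as below. This is an illustration only: the post does not say how entropy is computed or how many bytes count as sufficient, so Shannon entropy in bits per byte and the `min_total_bytes` threshold are assumptions, and the names `shannon_entropy` and `pick_blocks` are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(data):
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def pick_blocks(blocks_by_fn, min_total_bytes=64):
    """Pick the best-scoring block per function until enough bytes are collected.

    blocks_by_fn: dict mapping a function md5 to a list of basic blocks (bytes).
    min_total_bytes: assumed threshold; the real value is not given in the post.
    """
    chosen, total = [], 0
    for fn_md5 in sorted(blocks_by_fn):
        blocks = blocks_by_fn[fn_md5]
        if not blocks:
            continue
        # score each block by entropy * size and keep the highest
        best = max(blocks, key=lambda b: shannon_entropy(b) * len(b))
        chosen.append(best)
        total += len(best)
        if total >= min_total_bytes:
            break
    return chosen


blocks = {"A": [b"\x00" * 10, bytes(range(8))], "B": [bytes(range(16))]}
for block in pick_blocks(blocks, min_total_bytes=20):
    print(block.hex().upper())
```

Multiplying entropy by size favors long blocks with varied bytes over short blocks or padding, which is presumably why the metric was chosen.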

With my newly created signatures, I ran a test on all the samples I had downloaded.

----------- SCAN SUMMARY -----------
Known viruses: 107
Engine version: 0.98.1
Scanned directories: 1
Scanned files: 239
Infected files: 190

Data scanned: 78.35 MB
Data read: 81.36 MB (ratio 0.96:1)
Time: 2.332 sec (0 m 2 s)


The interesting lines are highlighted. Since this script should give near total coverage, a detection rate of 190/239, while impressive, did not meet my expectations. Something was amiss. My colleague Shaun Hurley noticed that 64 bit Mach-O files were being neglected. Thinking about it, this made sense. IDA has different versions for 32 bit and 64 bit files. I modified the scripts to use idaw64.exe and reran them on the 64 bit binaries. The combined signature set was more impressive.

----------- SCAN SUMMARY -----------
Known viruses: 155
Engine version: 0.98.1
Scanned directories: 1
Scanned files: 239
Infected files: 232

Data scanned: 78.82 MB
Data read: 81.36 MB (ratio 0.97:1)
Time: 2.535 sec (0 m 2 s)


Great success!

This method does have some drawbacks. Since I was running it in a VM, concerns about hard disk space influenced the choice to group based on functions rather than grouping based on basic blocks. This will be fixed by offloading MySQL to a more dedicated machine whose hard drive I can fill up. As well, only common functions between the binaries are considered when selecting basic blocks. This was an oversight on my part since other functions may not be exact matches but could share a lot of common code. With the extra database space, I do not think grouping based on basic blocks is an unreasonable task for these relatively small sets of samples. Building in automatic identification of 32 bit and 64 bit files would remove some manual effort from the process.

A good example of a signature generated for multiple samples is this one for Flashback:

Osx.Trojan.Flashback-16;Engine:51-255,Target:9;0&1&2&3&4&5&6&7&8;5531C089E58B550885D2740B;5589E583EC28895DF48B5D088975F88B750C897DFC8B7D1085DB0F94C285F60F94C031C908C20F858A000000;5589E58B450885C0740B;5589E583EC18C744240801000000C744240414000000C7042400000000E83F020000C74008000000;5589E557565383EC2C8B45100FB7550C8B5D188945E08B451489542404895C24088945E48B451C89;8D4208894424088B450C89142489442404E818FFFFFF0FBEC0;5589E557565383EC1C8B7D0885FF7432;8B450C894210B801000000;C744240801000000C74424040C000000C7042400000000E877010000893089C28B073B03751E

While that signature is just extracted x86, it alerts on the following 15 samples:

B5942F202930DAFF45C79BDC7871C088
548981EF3FCB813FCD3ED2EBAB8102D7
C067B84DC59C93C1363FD9FC56CD2918
B0199B369A3FCC71653ED8A9F7990AFC
4E855DD770680F80A30B9805262BBEE6
EF2DB2EEB040BDF1D0A9A18F2775149B
9272778BB6FBC00131FFCECE51388ACB
BE1B0DB89A4798E6C11E4EBFB6B479AE
CED7C97304BFFD932822565E99460213
B94BF524A537C02DDA4CD047F61E00C4
14DE914B0101C0E7A2C7CF521557E747
657E5A48CEC24F0C6F516CA55581550F
647AF7013D0DA77B6E74D3C692B1B6C3
84352BF4A2FA95FC51AD0781000AA864
93734AEBC1670C22A79F08D1A0FCBD8F
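The Flashback signature above follows the usual LDB layout: signature name, target description block, logical expression, then the subsignatures, all separated by semicolons. A minimal sketch for splitting such a line apart is shown here; `parse_ldb` is a hypothetical helper, the example line is made up, and the sketch only handles simple signatures like the one in this post.

```python
def parse_ldb(line):
    """Split a simple LDB signature into its semicolon-separated fields."""
    name, target_block, logic, *subsigs = line.strip().split(";")
    # the target description block is a comma-separated list of Key:Value pairs
    target = dict(item.split(":", 1) for item in target_block.split(","))
    return {"name": name, "target": target, "logic": logic, "subsigs": subsigs}


# Made-up example with two subsignatures, in the same shape as the one above
sig = parse_ldb("Example.Sig-1;Engine:51-255,Target:9;0&1;5531C089E5;5589E583EC28")
print(sig["name"], sig["target"], sig["logic"], len(sig["subsigs"]))
```

In the Flashback signature, the logical expression 0&1&2&3&4&5&6&7&8 requires all nine subsignatures to match.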


Overall, I'm very happy with these results. Since IDA Pro is used to extract everything, this work will translate well to the other binary types that IDA is capable of parsing, most importantly portable executables.

Tuesday, February 11, 2014

Kaspersky Lab released a report that covers in detail a piece of malware known as "Careto" or "The Mask". The report included several MD5 hashes of samples and related files, IP addresses and domain information. Typically with ClamAV, a hash signature targeting an entire file is formatted as follows:

MD5:FileSize:Name

The samples for Careto and therefore their sizes were unavailable to us at the time of this blog post, making it impossible to release hash-based coverage. However, as of ClamAV 0.98, a hash signature can be written with a wildcard for the file size. The format for such a signature is:

MD5:*:Name:73

The 73 on the end will prevent the signature from being loaded by an older ClamAV engine that doesn't support this signature format.
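When a report only provides MD5 hashes, lines in this wildcard format can be generated with a few lines of Python. The following is a small sketch; the helper name `wildcard_hash_sig`, the signature name, and the example hash (the MD5 of an empty file) are placeholders, not values from the report.

```python
import re


def wildcard_hash_sig(md5_hex, name, min_flevel=73):
    """Format one hash signature with an unknown file size: MD5:*:Name:73."""
    if not re.fullmatch(r"[0-9a-fA-F]{32}", md5_hex):
        raise ValueError("not an MD5 hex digest: %r" % md5_hex)
    if ":" in name:
        raise ValueError("signature name must not contain ':'")
    return "%s:*:%s:%d" % (md5_hex, name, min_flevel)


# Placeholder hash and name, for illustration only
print(wildcard_hash_sig("d41d8cd98f00b204e9800998ecf8427e", "Example.Sig-1"))
```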

The Mask is a combination of tools that cover 32-bit and 64-bit Windows, Mac OS X and Linux. Kaspersky also identified potential Android and Apple iOS variants. Their analysis indicates it can intercept many different forms of communication from the victim machine, exfiltrate data and provide remote access to the attacker.

This signatures file can be used to detect the samples discussed in the report. Just download it and put it in the same folder where you have your ClamAV signatures. If any alerts are generated from these signatures, please let us know by emailing research < at > sourcefire (dot) com.

Thursday, February 6, 2014

This notice is for the members of the ClamAV mailing lists found here:

http://lists.clamav.net/mailman/listinfo/clamav-users

On Monday, February 10th, 2014 starting at 10am EST, the ClamAV Mailing lists will be moving to new server hardware.  We anticipate this outage to last approximately four (4) hours.  We will be notifying everyone when the new server is up and operational.

Thank you for your patience.


Joel Esler
Threat Intelligence Team Lead
Open Source Manager
Vulnerability Research Team