I recently wrote a blog post
Generating ClamAV Signatures with IDAPython and MySQL. In the comments, I was asked for more details on how the script generate_sigs.py groups binaries by functions.
The three tables used were shared in the previous post, but for convenience, here they are again.
binaries - stores information about each sample seen
+-------+-------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------+-------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| md5 | varchar(32) | NO | | NULL | |
| size | int(11) | NO | | NULL | |
+-------+-------------+------+-----+---------+----------------+
functions - stores information about each function seen
+-------+-------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------+-------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| md5 | varchar(32) | NO | | NULL | |
| size | int(11) | NO | | NULL | |
+-------+-------------+------+-----+---------+----------------+
link_table - associates each binary with a set of functions
+---------+---------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+---------+---------+------+-----+---------+-------+
| prog_id | int(11) | NO | PRI | NULL | |
| fn_id | int(11) | NO | PRI | NULL | |
+---------+---------+------+-----+---------+-------+
To get the groups of binaries you find binaries that share a number of common functions. I'll build out the MySQL query so it is understandable. The inner query is here:
SELECT fn_id,
group_concat(prog_id
ORDER BY prog_id) AS bn_list, # get a list of binaries
count(*) AS pcnt # count the binaries in that list
FROM link_table
GROUP BY fn_id HAVING pcnt > 2 # filter the results
ORDER BY bn_list;
This gives a list of functions and their associated binaries if more than 2 binaries are associated with that function.
+-------+----------------------------------------------------------+------+
| fn_id | bn_list | pcnt |
+-------+----------------------------------------------------------+------+
| 993 | 10,16,63,74,76,87,92,93,124,126,129,135,145 | 13 |
| 994 | 10,16,63,74,76,87,92,93,124,126,129,135,145 | 13 |
| 995 | 10,16,63,74,76,87,92,93,124,126,129,135,145 | 13 |
| 1021 | 11,15,28,77,86,91,136 | 7 |
| 1116 | 11,15,28,86,136 | 5 |
| 1258 | 12,20,22,127 | 4 |
| 1118 | 12,22,127 | 3 |
| 1364 | 14,24,140 | 3 |
| 1434 | 18,59,68,71,73,83,84,110,119,120,137,138,148,150,154,157 | 16 |
| 1425 | 18,68,71,83,84,110,119,120,138,148,150,154,157 | 13 |
| 1426 | 18,68,71,83,84,110,119,120,138,148,150,154,157 | 13 |
| 1427 | 18,68,71,83,84,110,119,120,138,148,150,154,157 | 13 |
| 1428 | 18,68,71,83,84,110,119,120,138,148,150,154,157 | 13 |
| 1429 | 18,68,71,83,84,110,119,120,138,148,150,154,157 | 13 |
| 1430 | 18,68,71,83,84,110,119,120,138,148,150,154,157 | 13 |
| 1436 | 18,68,71,83,84,110,119,120,138,148,150,154,157 | 13 |
...
This list is quite long so I've truncated it. An example result, function 1425 has 13 binaries associated with it, those binaries' ids are listed. That's great, but we really want a list of binaries and a list of functions that associate those binaries. So, we now embed the original query in a similar query that creates a list of functions grouped by the bn_list field.
SELECT bn_list,
group_concat(fn_id
ORDER BY fn_id) AS fn_list # get a list of functions for each bn_list
FROM
(SELECT fn_id,
group_concat(prog_id
ORDER BY prog_id) AS bn_list,
count(*) AS pcnt
FROM link_table
GROUP BY fn_id HAVING pcnt > 1
ORDER BY bn_list) AS t
GROUP BY bn_list HAVING count(*) > 4; # get groups connected by > 4 functions
I also added count(*) < 23 to the last line of this query to get readable output. The resulting table is split and truncated below. Each row in bn_list corresponds to the same row in fn_list.
+--------------------------------------------------------+
| bn_list |
+--------------------------------------------------------+
| 121,131 |
| 18,68,71,83,84,110,119,120,138,148,150,154,157,167,182 |
| 18,84,119,138,150,157,182 |
| 19,81,115,173 |
| 26,95,142,146,165,183 |
| 27,128 |
| 27,37,53,70,79,172 |
| 30,64,100 |
| 48,50,69,147,168 |
| 59,73 |
| 96,105,181 |
| 96,181 |
+--------------------------------------------------------+
+--------------------------------------------------------+
| fn_list |
+--------------------------------------------------------+
| 37061,37062,37063,37064,37065,37066,37067,37068 |
| 1425,1426,1427,1428,1429,1430,1436 |
| 1419,1420,1421,1422,1423,1424,1431,1432,1433,1435 |
| 1437,1438,1439,1440,1441,1442,1443,1444,1445,1446,... |
| 4359,4360,4361,4362,4363 |
| 4572,4576,4577,4580,4634,4635,4644 |
| 4482,4483,4559,4560,4608,4622,4623,4624 |
| 4779,4780,4781,4782,4783,4784,4785,4786,4787,4788,... |
| 12054,12086,12087,12102,12103,12105,12108,12109,... |
| 21291,21292,21293,21294,21295,21296,21297,21301,... |
| 31659,31661,31665,31671,31673 |
| 31651,31653,31655,31657,31663,31667,31669,31685,31687 |
+--------------------------------------------------------+
This leaves us with a list of binaries grouped by 4 or more functions. This isn't perfect for creating signatures because some lines are contained completely in other lines. For example:
+--------------------------------------------------------+
| bn_list |
+--------------------------------------------------------+
| 18,68,71,83,84,110,119,120,138,148,150,154,157,167,182 |
| 18,84,119,138,150,157,182 |
+--------------------------------------------------------+
| fn_list |
+--------------------------------------------------------+
| 1425,1426,1427,1428,1429,1430,1436 |
| 1419,1420,1421,1422,1423,1424,1431,1432,1433,1435 |
Since one entry's binary list is a subset of the other entry's binary list, we can delete the shorter list and avoid having largely duplicate functionality. This is done programmatically once the query is returned to the Python script. I hope this fills in any blanks on grouping the functions. After this, another script is called to extract basic blocks from common functions and generate a signature with those bytes. The original post goes into a bit more detail on that, so I will end here.