Wednesday, March 5, 2014

Programmatic Boolean Simplification and ClamAV Signatures

ClamAV uses boolean logic in its LDB signatures. Each LDB signature has a set of subsignatures that, when present, evaluate to True in its logical statement. When the logical statement evaluates to True, the signature alerts. My goal was to create something to clean up the logical statement in these signatures.

Logical & (and) and logical | (or) are the primary operators in ClamAV LDB signatures. An example representative of most LDB logical blocks would look something like 0&(1|2). This would indicate that for this virus, subsignature 0 must be present, and one of subsignature 1 or subsignature 2 must also be present for an alert to be triggered. ClamAV does allow you to specify how many times a subsignature can alert with the operators =, <, and >. For example, 0=0&1 would indicate that subsignature 0 should occur zero times in a file and subsignature 1 must be present. These conditional statements (example, 0>3) are treated in the same manner as a single subsignature (example, 0) in that they evaluate to True or not. This simplifies our problem because negations (N=0) are treated as positive matches and can simply be substituted with a dummy variable. For a more detailed description of ClamAV LDB signatures, please refer to signatures.pdf.

My goal starting out was to take a boolean equation and represent it in a form with minimal bytes. This means that unnecessary logic would need to be eliminated, as well, any boolean identity to reduce the number of bytes used should be applied. Some examples:

    Original              Reduced
    0|(0&1)               0 
    (0&1)|(0&2)           0&(1|2)

Over some months, I kept seeing cases where this capability would be useful. For example, if you had some set of subsignatures and some set of files, automatically construct an LDB signature that alerts on all those files. Another situation would be when LDB signatures are submitted. With an algorithm like this, a tool to could be written to automatically verify that the logical block is in a minimal representation. Signature verification and improvement is what I will focus on in this post.

Algorithms

With the utility for this growing, I began. The biggest difficulty I had starting out was that I had two assumptions. First, that there would be a single algorithm able to achieve this. That was incorrect. As well, I thought that the method for doing this would be available online. That was also incorrect.

The first step was to parse the logical block into something that is more easily manipulated programmatically. For this I implemented the Shunting Yard Algorithm to parse the equation into polish notation. Instead of rebuilding a prefix string, the logical block is translated into an expression tree. This program was implemented in Python, so the expression tree consisted of dicts and lists. For example:

    Statement:
    0&1&(2|3)

    Python:
    {'&':[0, {'&': [1, {'|': [2, 3]}]}]}

Once the expression tree is built, it is converted to disjunctive normal form. This makes the statement an "or (|) of ands (&)". If that description is hard to digest, it can be thought of as the boolean equivalent of a sum of products. This form is reached by recursively applying the boolean identity for distribution in order to expand the equation. The algorithm for this is to identify a the situation where an and (&) sub-tree has an or (|) child. You then expand and distribute the children. The difficult thing to figure out with this part was that once a branch was changed, the algorithm must be run again on that branch until it stops changing. An example of the input and output:

    Statement:
    0&(1|0)&2

    Distributed:
    (0&1&2)|(0&0&2)

Any duplicate and (&) sub-statement is eliminated. This prepares the logical block for the Quine-McCluskey algorithm. This algorithm is the programmatic equivalent of a Karnaugh Map. That is, it will minimize a boolean expression. However, it will only do this minimization in a logical sense. In terms of representation, the equation will likely still not be optimized. In the Quine-McCluskey algorithm, a truth table is constructed. This was represented by a list of lists in Python. Using the above example, we would get:

    Truth table:
    | 0 1 2 |
    | ===== |
    | 1 1 1 |    // 0&1&2 
    | 1 0 1 |    // 0&2

    Python:
    [[1,1,1], [1,0,1]]

Already, the redundant 0&0 has been reduced to a single 0 in the second statement. The algorithm then sorts the truth table entries based on their count of 1s. Any two terms that differ only by one digit can have that digit marked with a dash (-) to indicate that the term does not matter. For instance:
    Truth table:
    | 0 1 2 |
    | ===== |
    | 1 - 1 |
    | 1 - 1 |
I used 2 in place of a dash, since this was a list of integers it would keep processing simple.
    Python:
    [[1,2,1], [1,2,1]]
This shows that we can actually eliminate subsignature 1. Since the full signature will fire with or without that subsignature, it is not necessary to keep. Since this simple example does not demonstrate it, I will skip over the selection portion of Quine-McCluskey. The full algorithm is detailed on Wikipedia (linked above). We now have deleted subsignature 1 and have changed subsignature 2 to 1, leaving us with:

    Result: 
    0&1 

Now that we have a logically efficient equation we must minimize it. For this, the script recursively applies boolean identity for distribution. However, this time, it applies it in a way that extracts common terms from statements. If the current key is or (|), the function finds the most common value in this branch at a depth of 1. This means that it will return if it reaches another or (|) after passing an and (&).

The previous case is completely reduced, so I will use a different example here.

    Statement:
    (0&1&2)|(0&1&3)

                 |
             /       \
            &         &
          /   \     /   \
         &     0   0     &
       /   \           /   \
      3     1         1     2

It would start at the root node (|), then find the most common value in its subtree. Both 0 and 1 occur twice in the subtree, so it selects whichever it saw first. It then extracts that value and rebuilds the tree.

              &
          /       \
         1         |
               /       \
              &         &
            /   \     /   \
           3     0   0     2 

It continues to do this until no more values can be extracted.

    Result:
    0&1&(2|3)


This algorithm was not producing the results I wanted on larger equations until I added the capability for it to select subtrees as the most common value in addition to integers (leaf nodes).

The program then rebuilds the logical statement in prefix notation from the expression tree. It checks the difference in length between the original and simplified signature. If there is an improvement, the original and simplified equations are converted to a form that Z3 Python can read and equivalence is verified. Z3 has a slightly different syntax, but the biggest consideration is that ClamAV's method of representing variables as number is incompatible, so each number must be converted to a letter.

    slvr.add(Not(simp_z3 == orig_z3))
    
    if slvr.check() == unsat:
        if opt_demo:
            print 'Equivalence proven!'
        return True
    else:
        if opt_demo:
            print 'Not equivalent!'
            print 'Counterexample:'
            print s.model()
        return False

You ask Z3 to prove that the simplified equation does not equal the original equation. If this request is unsatisfiable, then you have proven the two equations are equivalent. Once equivalence is verified, the LDB signature is reconstructed with the improved logic and printed out.

Examples


This first example demonstrates the scripts ability to extract common values and reduce a signature via the distribution identity.

    ====================
    Test.Signature;Engine:51-255,Target:0;(0&2&3&4)|(1&2&3&4);41414141;42424242;43434343;45454545;46464646
    
    ORIG:  (0&2&3&4)|(1&2&3&4)
    SIMP:  (0|1)&2&3&4
    Equivalence proven!
    Reduced size by 8 bytes.
    
    Test.Signature;Engine:51-255,Target:0;(0|1)&2&3&4;41414141;42424242;43434343;45454545;46464646
    ====================

In this example the script combines terms to simplify the final signature.

    ====================
    Test.Signature;Engine:51-255,Target:0;0&(1|2)&((3&(5|6))|(4&(5|6)));41414141;42424242;43434343;45454545;46464646;47474747;48484848
    
    ORIG:  0&(1|2)&((3&(5|6))|(4&(5|6)))
    SIMP:  0&(1|2)&(3|4)&(5|6)
    Equivalence proven!
    Reduced size by 10 bytes.

    Test.Signature;Engine:51-255,Target:0;0&(1|2)&(3|4)&(5|6);41414141;42424242;43434343;45454545;46464646;47474747;48484848
    ====================

A simple situation would be eliminating redundant logic.

    ====================
    Test.Signature;Engine:51-255,Target:0;((0&1)|(1&0));41414141;42424242
    
    ORIG:  ((0&1)|(1&0))
    SIMP:  0&1
    Equivalence proven!
    Reduced size by 10 bytes.
    
    Test.Signature;Engine:51-255,Target:0;0&1;41414141;42424242
    ====================

As well, there are situations where the logic can make a subsignature unnecessary. When a subsignature is deleted, there is a little bookkeeping that needs to be done to make sure Z3 still verifies it. When building the Z3 equations it is important to note that subsignature 1 in the simplified equation is subsignature 2 in the original.

    ====================
    Test.Signature;Engine:51-255,Target:0;0&(1|0)&2;41414141;42424242;43434343
 
    Removing subsig 1 

    ORIG:  0&(1|0)&2
    SIMP:  0&1
    Equivalence proven!
    Reduced size by 15 bytes.
    Test.Signature;Engine:51-255,Target:0;0&1;41414141;43434343
    ====================

ClamAV Signature Set

The next question is what happens if we run this against the ClamAV LDB signature set. As of writing this, there are currently 615 LDB signatures in daily.cld. If this tool were to have been used on all the LDB signatures prior to publishing, it would have saved 712 bytes. A large portion of these savings comes from parentheses around the outside of the equation that are dropped. The tool would take something like (0&1) and produce 0&1.

If we were to update all the signatures with this tool there is an additional factor to consider. The first time a signature is updated a number is appended to the end of the signature name. For example, if you were to update the signature Win.Trojan.Agent it would be renamed Win.Trojan.Agent-1. That is a cost of 2 bytes. Any additional update increments the number on the end of the signature name. If the signature ended in -9, -99, etc., then there would be a cost of one byte when it is updated. Considering this, an update to the signature database would save 446 bytes. The fact that this is a relatively low number is reassuring to me - it means that we are writing pretty good signatures.

Conclusion

This problem was a lot of fun to solve since the solution was not available anywhere online. In future research, having this functionality will enable me to generate robust LDBs to cover clusters of malware with a single signature. On a functional level, this tool will save space and improve the overall quality of signatures published in the future. Thanks for reading.