Friday, March 19, 2021

ClamAV, CVDs, CDIFFs and the magic behind the curtain

The number of malicious files that ClamAV can detect has increased immensely over the past few years, but this increase in efficacy brings challenges of scale.

Some of these challenges have required drastic measures to ensure the effective operation of the ClamAV infrastructure, including blocking certain methods of downloading the official ClamAV signature sets. To give the community more insight into these matters, we’d like to discuss some of these challenges in depth and preview the changes and optimizations coming to the product.

ClamAV signatures come in a variety of formats, one for each of the distinct detection methods that the ClamAV file scanning engine supports. ClamAV also uses the ClamAV Virus Database (CVD) file format, which serves as a container for the compressed and digitally-signed official signature sets that power ClamAV — daily.cvd, main.cvd, and bytecode.cvd. Each signature set serves a different purpose:

  • bytecode.cvd contains all compiled bytecode signatures evaluated by the bytecode interpreter engine
  • daily.cvd contains signatures for the latest threats (updated daily)
  • main.cvd contains signatures previously in daily.cvd that have been shown to have a low false-positive risk.

The FreshClam utility facilitates the downloading and updating of official signature sets. Here’s a full technical breakdown of how FreshClam works:

1. A DNS request is made to current.cvd.clamav.net for a TXT record containing information about the latest signature sets. An example TXT record can be seen in the output below:

$ dig +noall +answer current.cvd.clamav.net TXT
current.cvd.clamav.net. 1589 IN TXT "0.103.1:59:26104:1615408140:0:63:49191:333"

Several of the fields included in the TXT record contents are: 

  • The most recently released ClamAV version (0.103.1) 
  • The version of the most recently published main.cvd (59) 
  • The version of the most recently published daily.cvd (26104)
  • The version of the most recently published bytecode.cvd (333)
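As a rough illustration, the four fields listed above can be pulled out of the record with a few lines of Python. The function name parse_update_record is made up for this example, and the field positions are inferred from the sample record shown above; the remaining fields are left uninterpreted.

```python
def parse_update_record(txt):
    """Extract the fields described above from a current.cvd.clamav.net TXT record.

    Field positions are inferred from the sample record; this is a sketch,
    not FreshClam's actual parser.
    """
    fields = txt.strip().strip('"').split(":")
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 colon-separated fields, got {len(fields)}")
    return {
        "clamav_version": fields[0],  # most recently released ClamAV version
        "main": int(fields[1]),       # latest main.cvd version
        "daily": int(fields[2]),      # latest daily.cvd version
        "bytecode": int(fields[7]),   # latest bytecode.cvd version
    }


record = parse_update_record('"0.103.1:59:26104:1615408140:0:63:49191:333"')
print(record["clamav_version"], record["main"], record["daily"], record["bytecode"])
# prints: 0.103.1 59 26104 333
```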

2. FreshClam checks the ClamAV virus database directory (indicated by the DatabaseDirectory value in the freshclam.conf that FreshClam uses) for existing instances of main.cvd, daily.cvd, or bytecode.cvd. For main and daily, if the CVD can’t be found, it also looks for main.cld and daily.cld. These CLD files are uncompressed, unsigned versions of the CVD that have had CDIFFs applied.
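The lookup order in this step can be sketched as follows. This is only an approximation of the behavior described above, not FreshClam's code; find_local_database is a hypothetical helper name, and the CLD fallback is applied to main and daily only because that is what the step states.

```python
from pathlib import Path


def find_local_database(database_dir, name):
    """Return the on-disk file for a signature set, or None if it is missing.

    Looks for <name>.cvd first, then <name>.cld for main and daily only.
    """
    directory = Path(database_dir)
    candidates = [directory / f"{name}.cvd"]
    if name in ("main", "daily"):
        candidates.append(directory / f"{name}.cld")
    for path in candidates:
        if path.is_file():
            return path
    return None
```

A result of None for a given name corresponds to the case where the full CVD has to be downloaded.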

3. For any of the official signature sets that can’t be found, FreshClam will download the corresponding CVD from the server indicated by DatabaseMirror in freshclam.conf (the default is database.clamav.net). This is an expensive operation in terms of bandwidth because daily.cvd and main.cvd are currently 105 MB and 117 MB, respectively. For the official signature sets that do exist on disk, though, FreshClam copies them to a temporary directory and attempts to update the copies using signed CDIFF files.

A ClamAV CDIFF file is generated every time a new release of daily.cvd or main.cvd is made, and the files exist on the mirror server using the following format: <database name>-<database version>.cdiff. For example, the CDIFF for daily corresponding to the DNS record shown above would be daily-26104.cdiff. Each CDIFF contains the lines to be added to or removed from the various text-based ClamAV signature files in the CVD, and these CDIFFs are relatively small, even when many signatures have been added or removed. For example, for an update where 10,000 signatures were removed from daily, the corresponding CDIFF was only around 60 KB in size.

To update via CDIFF, FreshClam determines the version of the database on disk and requests every CDIFF between that version and the latest. Assuming each of those CDIFFs exists on the server (only the last 90 days’ worth are currently kept) and is downloaded successfully, FreshClam will apply them in order until the CVD has been successfully updated. If a CDIFF cannot be downloaded successfully, FreshClam will stop attempting to apply CDIFFs and will download the CVD directly. On rare occasions, the ClamAV team may intentionally publish a CDIFF that is empty. A zero-byte CDIFF indicates that FreshClam should download the CVD instead. This is sometimes preferred to patching when a significant portion of the CVD changes, such as when a large portion of daily is migrated to main in a single update.
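The decision between patching and a full download can be sketched roughly as below. This is a simplified model of the behavior described above, not FreshClam's implementation: cdiff_names and plan_update are hypothetical names, fetch is a caller-supplied stand-in for the HTTP download, and signature verification and the actual patching are left out.

```python
def cdiff_names(name, local_version, latest_version):
    """List the CDIFF files needed to go from local_version to latest_version."""
    return [
        f"{name}-{version}.cdiff"
        for version in range(local_version + 1, latest_version + 1)
    ]


def plan_update(name, local_version, latest_version, fetch):
    """Decide how to bring a database up to date.

    fetch(filename) is assumed to return the file's bytes, or None when the
    download fails or the file is no longer kept on the server.
    Returns ("up-to-date", []), ("patch", [cdiff_bytes, ...]) or ("full", []).
    """
    if local_version >= latest_version:
        return ("up-to-date", [])
    patches = []
    for filename in cdiff_names(name, local_version, latest_version):
        data = fetch(filename)
        # None means the CDIFF is missing or the download failed; b"" is an
        # intentionally empty CDIFF. Both fall back to the full CVD.
        if not data:
            return ("full", [])
        patches.append(data)
    return ("patch", patches)


print(cdiff_names("daily", 26101, 26104))
# prints: ['daily-26102.cdiff', 'daily-26103.cdiff', 'daily-26104.cdiff']
```

The same fallback explains why a client that has been out of date for longer than the CDIFF retention window ends up downloading the full CVD.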

4. Once the CVDs have been downloaded or updated from CDIFFs, FreshClam by default tests the new signature sets by loading them into memory the same way that ClamD or ClamScan would. Assuming this test is successful, FreshClam overwrites the CVD/CLD files in the ClamAV virus database directory and optionally notifies any running ClamD instances that new signatures are available.
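The "test, then overwrite" pattern in this last step can be illustrated with a small sketch. It is not how FreshClam is implemented: install_if_loadable is a hypothetical name, and load_test is a stand-in callable for loading the signatures into memory. The sketch also assumes the staged file sits on the same filesystem as the database directory, so that os.replace can swap it in with a single rename.

```python
import os


def install_if_loadable(staged_path, installed_path, load_test, test_databases=True):
    """Replace the installed database with the staged one if it passes the test.

    load_test(path) is assumed to return True when the signatures load
    successfully. Returns True if the staged file was installed.
    """
    if test_databases and not load_test(staged_path):
        # Leave the currently installed database untouched.
        return False
    os.replace(staged_path, installed_path)
    return True
```

With test_databases set to False, the staged file is installed without the load test, which is the trade-off discussed later in this post.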

It’s easy to see why using FreshClam to update via CDIFFs is the preferred method for updating ClamAV database files on user machines: the bandwidth savings are immense, especially considering that tens of millions of devices use ClamAV and rely on database.clamav.net as their mirror. Analysis of CVD download requests has shown, though, that a surprising number of users attempt to download the full signature set multiple times a second using tools like wget, curl or lynx. This activity is in large part responsible for the huge amount of bandwidth expended serving the ClamAV database files: more than 9 petabytes last month. The team has taken drastic measures to reduce this, such as blocking these types of download attempts and rate-limiting requests. These measures will be discussed in more detail in a separate blog post.

The ClamAV team also introduced a new downloader tool for users who want to run a private mirror. FreshClam does not currently preserve the CDIFF files it downloads, and it does not update the CVD files in a way that makes them suitable for serving via a third-party mirror. Several prior solutions for this leveraged tools like wget behind the scenes and were difficult to distinguish from abusive usage of wget. The new tool provides all of the benefits of FreshClam while also making it simpler for users to download the files needed on private mirrors.

The Talos Malware Research Team has also been working to reduce the size of daily.cvd. Daily has grown substantially over the past few years, as Talos has invested significant amounts of time and effort into improving the infrastructure used to automate ClamAV signature creation and testing. This has allowed signatures to be pushed out much faster than was possible previously. Whereas much of the automated coverage possible in the past was primarily hash-based, we increasingly create logical signatures that match on large clusters of malicious files. For comparison, a daily.cvd from October 2017 had around 4,000 NDB/LDB (content-based) signatures and over 1,600,000 hash-based signatures, while a daily from December 2020 had over 220,000 NDB/LDB signatures and more than 4,000,000 hash-based signatures. While this points to major improvements from a coverage perspective, the downside is that a lot of memory is required to hold the signatures — for example, around 850 MB to hold the signatures from the December daily, in addition to the 330 MB needed to hold the signatures from main. 

Using over 1 GB of memory means that devices with little free memory, or with less than 1 GB to begin with, will likely encounter issues loading the ClamAV signature set at all. FreshClam’s default behavior of testing signatures after they are downloaded can result in problems for devices with less than 2 GB of available memory. If ClamD is already running with signatures loaded when the FreshClam signature load test occurs, the amount of required memory doubles, which can cause FreshClam to fail. In instances where FreshClam runs as a cron job and is not actively monitored, updates might fail for months at a time without the ClamD instance going down. In addition to leaving users unprotected from the latest threats, an interesting side effect of this is that it can lead to a huge increase in bandwidth consumption. If updates fail for more than 90 days, FreshClam will be unable to find all of the needed CDIFFs on the server and will begin downloading the full daily.cvd every time it’s run.
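A quick back-of-the-envelope calculation shows where the 2 GB figure comes from. It uses the December 2020 numbers quoted above and assumes, for simplicity, that bytecode.cvd and the rest of the process contribute nothing; peak_memory_mb is just an illustrative helper.

```python
def peak_memory_mb(daily_mb, main_mb, clamd_running, freshclam_testing):
    """Estimate the memory needed to hold the loaded signature sets.

    Each process that has the signatures loaded (a running ClamD, and
    FreshClam during its load test) holds its own full copy.
    """
    one_copy = daily_mb + main_mb
    copies = int(clamd_running) + int(freshclam_testing)
    return one_copy * copies


steady_state = peak_memory_mb(850, 330, clamd_running=True, freshclam_testing=False)
during_update = peak_memory_mb(850, 330, clamd_running=True, freshclam_testing=True)
print(steady_state, during_update)
# prints: 1180 2360
```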

An immediate workaround for this issue is to disable the FreshClam database testing functionality via the TestDatabases option in freshclam.conf. Each CVD that is released goes through load testing on a wide range of supported ClamAV versions before being made available on the update servers, so having FreshClam test the database before loading likely yields little benefit for the majority of users running supported ClamAV versions. FreshClam also verifies the digital signature of each CVD and CDIFF that is downloaded before using it, so users will continue to be protected from CVDs that are tampered with or otherwise corrupted during the download process. This workaround is not recommended for those who have the available memory and would prefer that the FreshClam update process fail rather than have ClamD fail to load if a CVD issue happens to slip through our testing.
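For reference, the change amounts to a single line in freshclam.conf. The fragment below is a minimal sketch; the rest of the file, and where it lives, depends on the installation.

```
# freshclam.conf
# Skip the in-memory load test after each download or CDIFF update.
TestDatabases no
```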

To actually reduce the memory footprint of daily.cvd, the Malware Research Team has focused on retiring certain older hash-based signatures and certain NDB/LDB signatures that have not been shown to detect multiple samples in the wild. Today, daily.cvd contains 170,000 NDB/LDB signatures and 3,700,000 hash-based signatures, with a corresponding memory footprint of 740 MB. More work is needed in this area, especially to get below the 1 GB threshold, so investigations into potential areas of reduction will continue.

Daily.cvd and main.cvd memory usage over time.

The team is currently evaluating several scalable, longer-term solutions. These solutions aim to provide ClamAV users with more options for tailoring ClamAV to their environment, while also ensuring that the default behavior makes sense for most users. We encourage the community to bring forth ideas (or pull requests) for features that would enable ClamAV to operate more efficiently in their environment.

For any questions relating to FreshClam, CDIFFs, or any other topic discussed in this blog post, don’t hesitate to reach out via the ClamAV email lists, on IRC, or on Discord. We appreciate everyone’s patience and support as we work to ensure that ClamAV remains stable and reliable for everyone.