Thursday, November 20, 2014

Intro to Collection and Analysis of File Properties with ClamAV 0.98.5

In ClamAV 0.98.5, a new feature provides for file property collection and analysis.  The feature is intended for software developers and analysts who want to include the collection and analysis of file properties in their applications in addition to scanning file content. ClamAV 0.98.5 collects properties on the following file types:
  • Microsoft Word
  • Microsoft Excel
  • Microsoft Powerpoint
  • Office Open XML (OOXML) Document
  • Office Open XML Presentation
  • Office Open XML Workbook
  • Microsoft Portable Executable
  • Adobe Portable Document Format

How does it work? There are three main areas to understand about the file properties scanning features.

The first area is about how to initiate the file properties collection through the ClamAV API. A program using the ClamAV API may indicate property scanning by setting an option. The option is required to invoke the file property collection scan mode. Additionally, the API provides a new callback function to enable custom processing such as tailoring the result of the property scan, or to write the json properties string to a file. Use of the callback function is optional.

By way of example, the command line program clamscan uses the following API call to scan a file:
if((ret = cl_scandesc_callback(fd, virpp, &info.blocks, engine, options, &chain) ...
If you are interested in more detail, clamscan/manager.c contains this API call, and libclamav/clamav.h contains the function prototype of cl_scandesc_callback. In this case, to indicate file properties scanning to the ClamAV engine from clamscan, use the following sequence:
options |= CL_SCAN_FILE_PROPERTIES;
if((ret = cl_scandesc_callback(fd, virpp, &info.blocks, engine, options, &chain) ...
That's it! clamscan will now collect file properties for analysis. In fact, this is basically what clamscan does to support the new --gen-json option.

That brings us to the second area, which involves how file properties are collected and stored. We use json for this purpose. As ClamAV scans a file, it gathers properties about the file as well. The properties are maintained as json objects. At the end of the file scan, and prior to the file properties analysis scan, ClamAV serializes the json objects to the json files properties string.

The file properties json is a recursive structure of other file property objects. At the top level, the file property object that may be thought of as representing the file as a container, The generic schema of the json file property object contains the following objects:
  • FileSize
  • FileType
  • FileMD5
  • ContainedObjects (Array)
  • Viruses (Array)
The ContainedObjects is an array of other embedded file objects, such as a spreadsheet embedded within a Word document. The whole pattern is repeated, in this case, where the embedded spreadsheet object may have its own ContainedObjects array. There are additional file properties that are dependent upon the particular file type. A complete list of file properties and their parent file types can be found in ClamAV_Document_Properties.xlsx within the ClamAV docs/ directory.

You can see the generated json file property string from the command line by using:
clamscan --gen-json --debug <some office, pdf, or pe file>
The third and last area is the analysis of the generated json file properties string.  After the original file scan, ClamAV performs a second scan on the resulting file properties string.  Both pattern signatures and bytecode programs may by used for analysis the file properties string. ClamAV handles these signatures identically to those used on normal files, except that the target file is the generated json file properties string. Signatures should specify the json properties file target type of 13 to ensure the signature operates on the file properties string. Bytecode signatures will be the most useful for the analysis phase. Several upcoming blog posts in this series will discuss in detail about writing bytecode programs for property scanning and using the bytecode json API. You can also see sample byte code signature in ClamAV 0.98.5 directory examples/fileprop_analysis/.

One final note: To use file property scanning, json-c must be installed and configured into ClamAV. If it is not already installed, you can obtain json-c from https://github.com/json-c/json-c or install it using your package manager. We support json-c versions 0.9 through 0.12, but recommend version 0.12. After installing json-c, include it into your ClamAV configuration using './configure --with-libjson'. See ./configure --help for additional information about ClamAV configure options. Note that json-c is not required by any other ClamAV facility.