Wednesday, March 9, 2016

ClamAV 0.99.1: Hangul Word Processor (HWP) Document Support

ClamAV 0.99.1 adds support for a new (or perhaps old) family of documents: Hangul Word Processor (HWP) documents. HWP is a document format specialized for the Korean language and developed by Hancom Inc. That specialization makes it a highly popular format in South Korea, with the government being a notable user. As with any popular format, it is a likely carrier for malicious content.

For this release, we primarily targeted the word-processing documents: HWP 2.x, HWP 3.x, HWP 5.x, and HWPX. The other file formats developed by Hancom Inc., which cover spreadsheets and presentations, are already handled by pre-existing methods in ClamAV; the only exception is HPT (an old presentation format), which is not supported in this release.
  • HWP 2.x, also known as HWPML
    • XML-based document format similar to Microsoft’s older XML document format
    • Contents of the document are stored in the XML, including all embedded content
    • Embedded content is usually base64-encoded and normally uses zlib compression
    • General embedded content is stored in OLE2 containers
    • File property collection: document’s attributes and metadata fields
  • HWP 3.x, also known as HWP
    • Custom binary file format; documentation is available from Hancom’s website (note that it is in Korean)
    • Contents of HWP 3.x are stored in a file segment that uses optional password encryption and normally uses zlib compression
    • Embedded content is stored in the content stream, with general embedded content stored in HWP-style* OLE2 containers
    • File property collection: data from various file headers
  • HWP 5.x, also known as HWP
    • OLE2-based document format similar to Microsoft’s 97-2003 document formats
    • Contents of HWP 5.x are stored in individual streams, with zlib compression normally used on specific content streams, including embedded content
    • Embedded content is stored in individual streams under the BinData directory, with general embedded content stored as HWP-style* OLE2 containers
    • File property collection: data from the fileheader stream, which appears to be the HWP legacy header, and the /x005HwpSummaryInformation stream, which uses the same property method as Microsoft 97-2003 documents
  • HWPX
    • OOXML-compliant document format
    • Contents of the document are stored in XML documents within a ZIP archive
    • Embedded content is stored in the BinData directory with general embedded content stored as HWP-style* OLE2 containers
    • File property collection: data from the content.hpf document, which is an XML document with the legacy file header
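
To make the "base64-encoded and normally zlib-compressed" description of embedded content concrete, here is a minimal Python sketch of how such a payload might be unpacked. It is not ClamAV's implementation: the function name `decode_embedded` and the `compressed` flag are hypothetical, and the fallback to raw deflate is an assumption added for robustness, since the post does not say whether the compressed data carries a zlib header.

```python
import base64
import zlib


def decode_embedded(payload_b64: str, compressed: bool = True) -> bytes:
    """Decode a base64 payload and, if requested, inflate it.

    Hypothetical helper for illustration only; not part of ClamAV.
    """
    raw = base64.b64decode(payload_b64)
    if not compressed:
        return raw
    try:
        # First assume a regular zlib-wrapped stream (header + checksum).
        return zlib.decompress(raw)
    except zlib.error:
        # Assumption: fall back to a raw deflate stream with no zlib header.
        return zlib.decompress(raw, -zlib.MAX_WBITS)


if __name__ == "__main__":
    original = b"embedded object bytes " * 4
    encoded = base64.b64encode(zlib.compress(original)).decode("ascii")
    assert decode_embedded(encoded) == original
```

A scanner would then hand the decoded bytes back to its normal file-type detection, which is how an embedded OLE2 container or image ends up being inspected like any other file.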

*HWP-style OLE2 containers are identical to normal OLE2 containers, except that a 32-bit value is prepended to the file/stream/data segment.
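
As a rough illustration of the footnote, the following Python sketch skips the prepended 32-bit value and checks that a standard OLE2 signature follows. The function name `strip_hwp_ole2_prefix` is hypothetical, and the sketch deliberately does not interpret the 4-byte value, because the post does not say what it means or which byte order it uses.

```python
# Standard OLE2 (Compound File Binary) signature.
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")
PREFIX_LEN = 4  # the prepended 32-bit value


def strip_hwp_ole2_prefix(data: bytes) -> bytes:
    """Return the plain OLE2 container found after the 4-byte prefix.

    Hypothetical helper for illustration only; not part of ClamAV.
    """
    if len(data) < PREFIX_LEN + len(OLE2_MAGIC):
        raise ValueError("segment too short to hold a prefixed OLE2 container")
    body = data[PREFIX_LEN:]
    if not body.startswith(OLE2_MAGIC):
        raise ValueError("no OLE2 signature after the 32-bit prefix")
    return body


if __name__ == "__main__":
    sample = b"\x00\x10\x00\x00" + OLE2_MAGIC + b"rest of container"
    assert strip_hwp_ole2_prefix(sample).startswith(OLE2_MAGIC)
```

Once the prefix is removed, the remaining bytes can be treated as an ordinary OLE2 file by any existing OLE2 parser.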