ClamAV 0.99.1 added support for a new (or, depending on your perspective, rather old) family of documents: Hangul Word Processor (HWP) documents. HWP is a document format developed by Hancom Inc. and specialized for the Korean language. That specialization makes it a highly popular format in South Korea, with the government being a notable user. As a popular format, it is also an attractive vehicle for malicious content.
For this release, we primarily targeted the word-processing documents: HWP 2.x, HWP 3.x, HWP 5.x, and HWPX. The other file formats developed by Hancom Inc., which cover spreadsheets and presentations, are already handled by pre-existing methods in ClamAV; the only exception is HPT (an old presentation format), which is not supported in this release.
- HWP 2.x, also known as HWPML
  - XML-based document format similar to Microsoft’s older XML document format
  - Contents of the document are stored in the XML, including all embedded content
  - Embedded content is usually base64-encoded and normally uses zlib compression
  - General embedded content is stored in OLE2 containers
  - File property collection: the document’s attributes and metadata fields
- HWP 3.x, also known as HWP
  - Custom binary file format. For additional information on the format, the documentation can be retrieved from Hancom’s website (note that it’s in Korean)
  - Contents of HWP 3.x are stored in a file segment that uses optional password encryption and normally uses zlib compression
  - Embedded content is stored in the content stream, with general embedded content stored in HWP-styled* OLE2 containers
  - File property collection: data from various file headers
- HWP 5.x, also known as HWP
  - OLE2-based document format similar to Microsoft’s 97-2003 document formats
  - Contents of HWP 5.x are stored in individual streams, with zlib compression normally used on specific content streams, including embedded content
  - Embedded content is stored in individual streams under the BinData directory, with general embedded content stored as HWP-styled* OLE2 containers
  - File property collection: data from the fileheader stream, which appears to be the HWP legacy header, and the \005HwpSummaryInformation stream, which uses the same property method as 97-2003 Microsoft documents
- HWPX
  - OOXML-compliant document format
  - Contents of the document are stored in XML documents within a ZIP archive
  - Embedded content is stored in the BinData directory, with general embedded content stored as HWP-styled* OLE2 containers
  - File property collection: data from the content.hpf document, which is an XML document with the legacy file header
*HWP-styled OLE2 containers are identical to normal OLE2 containers, except that a 32-bit value is prepended to the file/stream/data segment.
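
To make the unwrapping described above more concrete, here is a minimal Python sketch of how a base64-encoded, zlib-compressed embedded object could be turned back into an ordinary OLE2 container. It is an illustration only, not ClamAV's implementation: the function names are hypothetical, the little-endian byte order of the 32-bit prefix is an assumption, and the fallback to raw deflate (no zlib header) is an assumption about how some streams may be compressed.

```python
import base64
import struct
import zlib
from typing import Tuple

# Standard OLE2 (Compound File Binary) signature.
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")


def inflate(data: bytes) -> bytes:
    """Try zlib-wrapped data first, then raw deflate (assumption).

    Compression is described as "normal" rather than mandatory, so the
    input is returned unchanged if neither attempt succeeds.
    """
    for wbits in (zlib.MAX_WBITS, -zlib.MAX_WBITS):
        try:
            return zlib.decompress(data, wbits)
        except zlib.error:
            continue
    return data


def unwrap_embedded(b64_text: str) -> bytes:
    """Decode base64 text and decompress the result if it is compressed."""
    return inflate(base64.b64decode(b64_text))


def strip_hwp_ole2_prefix(data: bytes) -> Tuple[int, bytes]:
    """Split off the prepended 32-bit value of an HWP-styled OLE2 container.

    Returns the prefix value and the remaining bytes, which should be a
    normal OLE2 container. Little-endian byte order is an assumption.
    """
    if len(data) < 4 + len(OLE2_MAGIC):
        raise ValueError("data too short for an HWP-styled OLE2 container")
    (prefix,) = struct.unpack_from("<I", data, 0)
    body = data[4:]
    if not body.startswith(OLE2_MAGIC):
        raise ValueError("no OLE2 signature after the 32-bit prefix")
    return prefix, body
```

Once the prefix is stripped, the remaining bytes can be handed to any ordinary OLE2 parser. Checking for the OLE2 signature before doing so is a cheap way to avoid misinterpreting embedded content that is not a container at all.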