MS Compound Document Files

(Includes documents, spreadsheets, templates and other MS Office files)

Next we will look at carving MS Compound Document (and spreadsheet) files, as specified in the document “OpenOffice.org’s Documentation of the Microsoft Compound Document File Format.” For complete details of the file format specification, please refer to the hyperlink to the document, listed on page 1 of this paper.

As quoted from the above referenced document, “Compound document files are used to structure the contents of a document in the file.  It is possible to divide the data into several streams, and to store these streams in different storages in the file.  This way compound document files support a complete file system inside the file, the streams are like files in a real file system, and the storages are like sub-directories.”

All streams of a compound document file are divided into sectors. Sectors may contain internal control data of the compound document or parts of the user data.  The entire file consists of a compound document header and a list of all sectors following the header.  The size of the sectors is set in the header and is then fixed for all sectors.

…and so on…

As we discussed in the section on Zip files, if you know what you are looking for, and where you expect to find it within the file, you can determine exactly what data belongs to the file in question and whether or not there is fragmented data within the file.

We start by searching for the Compound Document Header, “D0 CF 11 E0 A1 B1 1A E1,” to identify the beginning of each of the MS compound documents.  Next, at offset 0x1E from the beginning of the header we find a 2-byte value that identifies the sector size used in the document.  The value is stored as a power-of-two exponent, so the usual value of 9 means 2^9 = 512 bytes/sector.  Now, knowing the size of each sector that makes up the file, we can start looking for document structures and where within the file they should be located.  As noted in the Zip file process mentioned earlier in this paper, the difference between the EXPECTED location of a structure and its ACTUAL location is the size of the fragmented data that doesn’t belong to the file.
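The header search and sector-size lookup can be sketched in Python as follows.  This is a minimal illustration rather than a complete carver: the function name find_headers, the in-memory image buffer, and the assumption that headers begin on 512-byte disk-sector boundaries are illustrative choices, not part of the file format specification.

```python
import struct

CFB_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")
DISK_SECTOR = 512  # assumption: carved files start on a disk-sector boundary


def find_headers(image):
    """Yield (offset, sector_size) for each sector-aligned header signature."""
    pos = image.find(CFB_SIGNATURE)
    while pos != -1:
        if pos % DISK_SECTOR == 0 and pos + 0x20 <= len(image):
            # Offset 0x1E holds the sector size as a power-of-two exponent.
            (exponent,) = struct.unpack_from("<H", image, pos + 0x1E)
            yield pos, 1 << exponent
        pos = image.find(CFB_SIGNATURE, pos + 1)
```

For a real disk image you would likely read the data in chunks or memory-map it instead of loading it whole; the search logic stays the same.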

At file offset 0x2C, we find the number of sectors used by the Sector Allocation Table (SAT).  Next, at file offset 0x30 we find the starting sector number (within the file) of the file’s Directory.  Another important file structure is the Short-Sector Allocation Table (SSAT), whose starting sector # is located at file offset 0x3C, followed by the number of sectors making up the SSAT, located at file offset 0x40.  Not all compound documents use an SSAT; when one is absent, you can ignore these 8 bytes.  And lastly, we look at the Master Sector Allocation Table (MSAT), whose starting sector # is located at file offset 0x44, followed by the number of sectors making up the MSAT, located at file offset 0x48.  The following 436 bytes of data, which make up the rest of the first 512 bytes of the compound document file, start at file offset 0x4C and contain the first 109 sector IDs (SIDs) of the MSAT.  Note that sector numbering begins with the first sector after the 512-byte header, so sector 0 starts at file offset 512.
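The header fields listed above can be read in one pass.  The sketch below is only an illustration: the CfbHeader type, its field names, and the parse_header function are hypothetical names chosen here, while the offsets are the ones given in the text.  Sector numbers are read as signed values because the format uses negative numbers as special markers (for example, -2 for “end of chain”).

```python
import struct
from typing import NamedTuple, Tuple


class CfbHeader(NamedTuple):
    sector_size: int
    sat_sector_count: int        # offset 0x2C
    directory_start: int         # offset 0x30
    ssat_start: int              # offset 0x3C
    ssat_sector_count: int       # offset 0x40
    msat_start: int              # offset 0x44
    msat_sector_count: int       # offset 0x48
    msat_head: Tuple[int, ...]   # 109 sector IDs starting at offset 0x4C


def parse_header(header):
    """Decode the fields discussed in the text from a 512-byte header."""
    if len(header) < 512:
        raise ValueError("need the full 512-byte header")
    if header[:8] != bytes.fromhex("D0CF11E0A1B11AE1"):
        raise ValueError("not a compound document header")
    (exponent,) = struct.unpack_from("<H", header, 0x1E)
    sat_count, dir_start = struct.unpack_from("<Ii", header, 0x2C)
    ssat_start, ssat_count, msat_start, msat_count = struct.unpack_from(
        "<iIiI", header, 0x3C)
    msat_head = struct.unpack_from("<109i", header, 0x4C)
    return CfbHeader(1 << exponent, sat_count, dir_start, ssat_start,
                     ssat_count, msat_start, msat_count, msat_head)
```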

So, now that you know where certain items should be located, the next step is to locate them on the disk and find out whether they are located at the expected sector number in relation to the start of the document.

First, take the 4-byte value at offset 0x4C, which is the first entry of the MSAT and gives the sector number of the first SAT sector.  Search for “01 00 00 00 02 00 00 00 03 00 00 00 04 00 00 00” to find the beginning of that sector, and compare the sector number where you find it with the sector # of the start of the document plus the 4-byte value at offset 0x4C (plus one sector for the header, which precedes sector 0).  If there is a difference, then fragmentation occurs before that sector.

Secondly, search forward for the beginning of the Directory, starting from the document’s header. The signature for the start of the Directory is “52 00 6F 00 6F 00 74 00 20 00 45 00 6E 00 74 00 72 00 79 00” (“Root Entry” in case-sensitive, little-endian 16-bit Unicode).  There may be leftover instances of previous Directory Entries from previous file edits, so look for more than one instance of the “Root Entry”.  Once you find the sector # of the start of the Directory, subtract the sector # of the start of the document (and one more sector for the header, which precedes sector 0), and compare the result against the 4-byte value at file offset 0x30.  If the result matches your 4-byte value, then no fragmentation exists between the start of the file and the Directory.  If there is a difference, the difference is the amount of fragmented data that doesn’t belong to the document.
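The expected-versus-actual comparison for both structures can be sketched with one helper.  This is an illustration under stated assumptions: the names structure_gaps, SAT_PATTERN and ROOT_ENTRY are hypothetical, the image is held in memory, and candidate hits are only accepted on 512-byte boundaries relative to the header.  Per the format specification, sector numbering starts after the 512-byte header, so the expected offset of sector n is header offset + 512 + n * sector size.

```python
import struct

HEADER_SIZE = 512
SAT_PATTERN = struct.pack("<4i", 1, 2, 3, 4)   # 01 00 00 00 02 00 00 00 ...
ROOT_ENTRY = "Root Entry".encode("utf-16-le")  # 52 00 6F 00 6F 00 74 00 ...


def structure_gaps(image, header_offset, field_offset, signature):
    """Return (expected_offset, [(found_offset, gap_in_bytes), ...]).

    field_offset is the header field holding the structure's sector number
    (0x4C for the first SAT sector, 0x30 for the Directory).
    """
    (exponent,) = struct.unpack_from("<H", image, header_offset + 0x1E)
    sector_size = 1 << exponent
    (sector_id,) = struct.unpack_from("<i", image, header_offset + field_offset)
    expected = header_offset + HEADER_SIZE + sector_id * sector_size
    hits = []
    pos = image.find(signature, header_offset + HEADER_SIZE)
    while pos != -1:
        if (pos - header_offset) % 512 == 0:   # keep sector-aligned hits only
            hits.append((pos, pos - expected))
        pos = image.find(signature, pos + 1)
    return expected, hits
```

A gap of 0 means the structure sits where the header says it should.  A positive gap is the number of foreign bytes between the header and that structure.  Several hits, or hits with a negative gap, usually point to stale copies or to a different document, and need manual review.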

And lastly, reviewing the individual Directory Entries for the starting sector number and stream size of each object will assist in determining where, before or after each object, any file fragmentation occurs.

The largest object within the compound document is most likely the “WordDocument” object, or the “Workbook” object for spreadsheets.  This means that if fragmentation exists within a large compound document, it is likely that the fragmentation occurs within those streams.  As was mentioned earlier, you can then locate the fragment through a process of elimination and/or manual review of the carved block, looking for a block of data the size of your determined fragment that doesn’t belong to the document.

The directory is an array of directory entries.  Each directory entry is 128 bytes long, and the entries are listed in order of their appearance in the document.  Each entry identifies the starting sector # of its file object at directory entry offset 0x74, and the size of that object (in bytes) at offset 0x78.
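Reading the entries out of a carved Directory sector can be sketched as follows.  The function names parse_directory and sectors_needed are hypothetical.  The offsets 0x74 and 0x78 come from the text above; the name field (the first 64 bytes, as 16-bit Unicode) and its 2-byte length at entry offset 0x40 are taken from the format specification rather than from this paper.

```python
import struct

ENTRY_SIZE = 128


def parse_directory(data):
    """Return (name, start_sector, size_in_bytes) for each used entry."""
    entries = []
    for off in range(0, len(data) - ENTRY_SIZE + 1, ENTRY_SIZE):
        entry = data[off:off + ENTRY_SIZE]
        # Name length in bytes, including the 2-byte terminating zero.
        (name_len,) = struct.unpack_from("<H", entry, 0x40)
        if name_len < 2 or name_len > 64:
            continue  # empty or damaged entry
        name = entry[:name_len - 2].decode("utf-16-le", errors="replace")
        start_sector, size = struct.unpack_from("<iI", entry, 0x74)
        entries.append((name, start_sector, size))
    return entries


def sectors_needed(size, sector_size):
    """Number of whole sectors a stream of `size` bytes occupies."""
    return -(-size // sector_size)
```

Comparing sectors_needed for the “WordDocument” or “Workbook” stream against the space actually available between neighbouring structures gives a first estimate of where a fragment may sit.  Keep in mind that, according to the specification, streams below a minimum size (normally 4096 bytes) are stored as short streams, in which case the starting sector # refers to a short sector and not to a regular one.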