I’ve blogged recently comparing the two contenders for the standard office XML format crown: the Sun/IBM sponsored Open Document Format (ODF) and the Microsoft sponsored MS Office Open XML format (MSOOX). I’ve also blogged recently on various metrics for XML: magic numbers that provide objective evidence to help characterize things like complexity in documents, and so help with evaluation and estimation. A reader, unsurprisingly, asked if I could combine the two threads and provide some metrics on ODF and MSOOX.
Fair enough! Here are some XML metrics for a large document with almost 180,000 words, tables, lists, sidebars and some graphics. I chose a large document so that bootstrap effects would be minimized. I used the ODF v.1.0 specification, converting it from .SXW to .DOC and .ODT in Open Office 2.0, then converting the .DOC to .DOCX in Word 2007 beta. Then I used a COTS archiver to treat the ODT and DOCX files as ZIP archives, and extracted the XML files containing the basic text and markup: content.xml (ODF) and word/document.xml (MSOOX). I chose the .SXW format as the source because I didn’t want to have any MS-dependencies in the data, .DOC being proprietary.
I also resaved the document to .DOC, re-opened it and re-exported it to .DOCX and extracted the word/document.xml file. Resaving data is a good trick when doing data conversion, because it removes extraneous information or structures from the source: the first .DOC is what Open Office thinks .DOC looks like, while the second .DOC is how Microsoft does things.
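The extraction step described above was done by hand with an archiver, but it can also be scripted. The following is a minimal sketch using Python's standard `zipfile` module; the function name `extract_main_xml` and the lookup table are illustrative assumptions, and only the two member names (content.xml and word/document.xml) come from the text.

```python
import os
import zipfile

# Main text-and-markup part inside each package, as named in the text above.
MAIN_PART = {
    ".odt": "content.xml",
    ".docx": "word/document.xml",
}


def extract_main_xml(package_path):
    """Return the raw bytes of the main XML part of an .odt or .docx file.

    Both formats are ordinary ZIP archives, so no office software is needed.
    """
    suffix = os.path.splitext(package_path)[1].lower()
    if suffix not in MAIN_PART:
        raise ValueError("expected an .odt or .docx file, got %r" % suffix)
    with zipfile.ZipFile(package_path) as archive:
        return archive.read(MAIN_PART[suffix])
```

Calling `len(extract_main_xml(path))` on each package gives the uncompressed size of the XML part, which is the figure that matters when comparing the markup rather than the compressed container.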
I used the upcoming release of the Topologi Complexity Detective to create the metrics. The reports on the ODF document are here (Download file); the reports on the original MSOOX document are here (Download file); and the better reports on the resaved MSOOX documents are here (Download file). Comments below.
A few words of caution. First, neither ODF nor MSOOX is completely finished or stable; the numbers may be different in 2008. Second, this is only one document from one provenance; the numbers may be different when documents are entered natively or come from different sources. Third, the files are the products of software, so to some extent they test the applications rather than the formats per se; the numbers may be different for different applications. Fourth, the version of Word being used is a beta and some parts of Open Office are also probably immature: DOCBOOK export failed, for example. (So to some extent this is a test of how some beta software produces data in a beta format, done to beta-test a utility using some beta metrics… I will update the article if any errors are found.)
Opened in Open Office, the document is about 736 pages. In Office 2007, the document formats at 732 pages. It doesn’t seem a significant difference.
For load times, I logged off and logged on again to ensure a fresh session. I opened the applications and used the open menu rather than double clicking, so that application load time was not involved. For Open Office, the .SXW and .ODT files took about six seconds each to load (this was quite load-dependent: at another time I noticed the same document taking about 14 seconds to load; I believe this may be due to Open Office being paged back into memory). For the Word 2007 beta, the (resaved) .DOC and .DOCX returned their initial page display faster than that; however, the rest of the file loaded in the background, and full loading took about 35 and 45 seconds respectively.
Comment It seems that consideration of file loading needs to be slightly more nuanced than simple times, then: if you count to when you first see some text, Microsoft was much faster; however, if you count to when the document is fully loaded, Microsoft was significantly slower.
Here are the file sizes:
- .SXW (original): 434K
- .ODT (ODF 1.0): 438K
- content.xml (ODF): 4,383K
- .DOC (MS): 4,432K
- .DOCX (MSOOX): 764K
- word/document.xml (MSOOX): 7,775K
- .DOC (MS resave): 3,142K
- .DOCX (MSOOX resave): 733K
- word/document.xml (MSOOX resave): 7,472K
Comment For a large file, the .ODT file is much smaller than the equivalent .DOCX file. This can be almost entirely attributed to the relative sizes of the XML files: the ODF XML file is much smaller than the equivalent MSOOX XML file. However, the difference between those file sizes is dwarfed by the difference between either of them and the .DOC size, which is five to ten times larger. Resaving the .DOC file reduced its size by roughly 29% (from 4,432K to 3,142K).
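For readers who want to check the size comparisons themselves, here is a small sketch of the arithmetic. The sizes are copied from the list above; the dictionary keys and the helper names `ratio` and `percent_reduction` are just illustrative choices.

```python
# File sizes in KB, copied from the list above.
SIZES_KB = {
    "odt": 438,
    "docx": 764,
    "doc": 4432,
    "doc_resave": 3142,
    "docx_resave": 733,
}


def ratio(larger, smaller):
    """How many times bigger one file is than another."""
    return SIZES_KB[larger] / SIZES_KB[smaller]


def percent_reduction(before, after):
    """Size reduction from `before` to `after`, as a percentage of `before`."""
    return 100.0 * (SIZES_KB[before] - SIZES_KB[after]) / SIZES_KB[before]


if __name__ == "__main__":
    print(".DOC vs .ODT:   %.1fx" % ratio("doc", "odt"))
    print(".DOC vs .DOCX:  %.1fx" % ratio("doc", "docx"))
    print(".DOCX vs .ODT:  %.1fx" % ratio("docx", "odt"))
    print(".DOC resave reduction: %.1f%%" % percent_reduction("doc", "doc_resave"))
```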
So here are the XML metrics.
Element and Attribute Count
|Metric||ODF||MSOOX|
|Total Metrics Value||428||245|
Comments For the same document, MSOOX and ODF seem to require about the same number of unique elements. However, MSOOX requires substantially fewer attributes. (I will look further sometime, but I’d suspect that MSOOX is using richer data values rather than markup. It also seems that the ODF content.xml file contains more style information; both the ODF and MSOOX ZIP structures have other files for containing stylesheets.) At a minimum, we can say that processing ODF and MSOOX will involve different tasks: they are organized differently, and if the extra attributes in ODF are indeed due to a finer grain of markup then we can say that some kinds of document processing using XML APIs will be easier using ODF.
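A rough approximation of this kind of count can be made with Python's standard `xml.etree.ElementTree`. This is only a sketch: it assumes the metric counts distinct element names and distinct attribute names, and the Complexity Detective may well count differently. The function name `count_unique_names` is hypothetical.

```python
import xml.etree.ElementTree as ET


def count_unique_names(xml_source):
    """Count distinct element names and distinct attribute names.

    `xml_source` is a filename or a binary file object. Names are
    namespace-qualified ({uri}local), and namespace declarations are
    not reported as attributes by ElementTree, so they are not counted.
    """
    elements = set()
    attributes = set()
    # Attributes are guaranteed to be available on "start" events.
    for _event, elem in ET.iterparse(xml_source, events=("start",)):
        elements.add(elem.tag)
        attributes.update(elem.attrib.keys())
    return len(elements), len(attributes)
```

Running it over content.xml and word/document.xml gives two pairs of numbers that can be compared in the same spirit as the table above, though they should not be expected to match it exactly.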
Field Count Metric
The field count metric here is a version of the field count metric presented in the blog before. The original metric required knowledge about which attributes were IDs, xml:space or other metadata, which requires a schema, annotations and perhaps some hand-counting. The metric here shortcuts matters by saying that the first attribute in each element is not a field.
|Metric||ODF||MSOOX|
|Number of Elements with Data Content (excluding Whitespace)||44213||90743|
|Number of Attributes (excluding First Attribute and Namespace
|Total Metrics Value||57246||121543|
Comments The MSOOX numbers are about double those of the ODF. The reason for this seems to be that MSOOX uses an element value rather than an attribute value for style information, along with some mysterious Base64-encoded data called “fldData” (field data) which is used for almost every chunk of text. I included this metric because I was concerned, based on tests with tiny documents, that Microsoft’s highly nested style might inflate its document complexity metric, but it turns out not to be the case.
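The shortcut version of the field count metric described above (every attribute after the first counts as a field, and elements with non-whitespace data content count as fields) can be sketched in a few lines. This is an approximation under stated assumptions, not the Complexity Detective's actual implementation: it assumes the total is simply the sum of the two counts, and the function name `field_count` is hypothetical.

```python
import xml.etree.ElementTree as ET


def field_count(xml_source):
    """Return (elements_with_data, attribute_fields, total).

    `xml_source` is a filename or a binary file object.
    """
    root = ET.parse(xml_source).getroot()
    elements_with_data = 0
    attribute_fields = 0
    for elem in root.iter():
        # Text directly inside an element is its own .text plus the
        # .tail of each of its children (mixed content).
        pieces = [elem.text] + [child.tail for child in elem]
        if any(piece and piece.strip() for piece in pieces):
            elements_with_data += 1
        # Shortcut rule: the first attribute is treated as metadata,
        # not a field. Namespace declarations never appear in .attrib.
        attribute_fields += max(len(elem.attrib) - 1, 0)
    return elements_with_data, attribute_fields, elements_with_data + attribute_fields
```

The shortcut trades accuracy for convenience: it needs no schema, but it will miscount any element whose first attribute is genuine data or whose metadata attribute comes later.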
Document Complexity Metric
|Required as First Child||26||23|
|Total Metrics Value||582||360|
Comment According to these numbers, the ODF document is more complicated than the MSOOX document. This in part reflects the use of generic elements rather than specific elements, and as mentioned it may reflect a tendency in MSOOX to do more using rich data values rather than explicit markup.
Weighted Document Complexity Metric
The Topologi Complexity Detective allows you to weight various factors to reflect the experience in your organization, when deriving metrics for cost or time estimation in projects. The following weighting is one such set, based on a particular client’s experience for a certain kind of task.
|Metric||Weight||ODF||MSOOX|
|Required as First Child||1||26||23|
|Total Metrics Value||-||842||550|
Comment According to these numbers, the ODF document is more complicated than the MSOOX document.
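As an illustration of how a weighting of this kind might be applied, the sketch below multiplies each factor's count by its weight and sums the results. That the weighted total is a plain weighted sum is an assumption on my part, and `weighted_total` is a hypothetical helper; the only factor, weight and count taken from the table above are "Required as First Child", 1 and 26.

```python
def weighted_total(counts, weights, default_weight=1):
    """Sum of count * weight over all factors.

    Factors with no entry in `weights` use `default_weight`.
    """
    return sum(
        count * weights.get(name, default_weight)
        for name, count in counts.items()
    )


if __name__ == "__main__":
    # Only this one factor and weight are known from the table above.
    odf_counts = {"Required as First Child": 26}
    client_weights = {"Required as First Child": 1}
    print(weighted_total(odf_counts, client_weights))
```

Keeping the weights in a separate mapping means the same counts can be re-scored for a different client or task without re-running the analysis.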
What do these numbers mean?
The trouble with metrics, of course, is that people bring their own presuppositions to interpret them. The numbers are rarely univocal: if there are a large number of unique elements, for example, does this mean that they are really targeted and rich, or uncontrolled and sloppy with overlap, or confused and inelegant, or easy to read since the names are clearer, or more difficult to remember, or more difficult to use since they may all have different usages?
Where metrics are strong is that they do show us where things differ from our expectations. They provide a kind of objective evidence that lets us identify anomalies: in this case, the difference in field count led me to look at the fldData attributes: it seems possible that the Word 2007 beta saves a lot of information in Base64-encoded form that ODF exposes as attribute values. Now I wouldn’t get too alarmed about this (this means you, Pamela Jones! :-) ) because MSOOX is not finished and it would seem to me to be a very sensible implementation technique to progressively move over from binary to XML while the thing is under development. It would not be, of course, a good thing if this continued into the final standard (and the major implementations that exported to it). If there is this progressive implementation going on, then that pulls the rug out from under these metrics!
The numbers seem to support the interpretation that beta MSOOX may be quite a bit less complex than ODF 1.0 at this stage, at least in the sense of using fixed structures more, and simpler in the sense of using fewer elements and attributes. ODF is flatter and has a smaller file size but seems to include more style headers than MSOOX does. The metrics indicate that the use of attributes may be significantly different between the two formats, which matters, for example, for people looking at data conversion estimation. On the application level, Open Office finishes loading the ODT file much faster than the Word 2007 beta finishes loading the DOCX file, although Word displays its first page sooner.
Finally, the fact that we have two (and presumably more) word processors that can produce and consume XML for a decent sized book is such a great step forwards from a decade ago. A medal to both teams! Boiled down, based on these numbers (and I need to double check my thinking here, and this is completely blue sky!) I wouldn’t be surprised if MSOOX were easier to convert from (because of its regularity, scale and low complexity) while ODF were easier to convert into (because of its richness and flexibility), once the initial hurdle of converting anything to or from either of them has been cleared.