The MIXED software developers at DANS have started tackling some binary formats, including MS Access, dBase, and DataPerfect. Right now, MIXED is emerging from a fairly long period of sustained effort to develop pure Java code that reads and writes these binary formats. They found some old libraries to start with, but it was necessary to examine and rewrite most of that code. All MIXED software will be released as Open Source. We have already started publishing the libraries for reading and writing the dBase format, also known as DBF. Have a look at SourceForge.
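To give a feel for what reading such a binary format involves, here is a minimal sketch of parsing the fixed 32-byte header of a DBF file. The function name and the returned field names are invented for illustration; this is not the API of the actual MIXED library, and the byte layout shown is the commonly documented dBase III layout.

```python
import struct

def parse_dbf_header(data: bytes) -> dict:
    """Parse the fixed 32-byte header at the start of a DBF file.

    Commonly documented layout (dBase III and later):
      byte 0      file type / version flag
      bytes 1-3   date of last update (YY, MM, DD)
      bytes 4-7   number of records (little-endian uint32)
      bytes 8-9   length of the header in bytes (little-endian uint16)
      bytes 10-11 length of one record in bytes (little-endian uint16)
    """
    if len(data) < 32:
        raise ValueError("not a DBF file: header too short")
    version, yy, mm, dd, n_records, header_len, record_len = struct.unpack(
        "<4BIHH", data[:12])
    return {
        "version": version,
        "last_update": (1900 + yy, mm, dd),
        "record_count": n_records,
        "header_length": header_len,
        "record_length": record_len,
    }

# A synthetic header for demonstration: dBase III (0x03), last updated
# 2007-06-28, 1000 records, 97-byte header, 25-byte records.
demo = struct.pack("<4BIHH", 0x03, 107, 6, 28, 1000, 97, 25) + b"\x00" * 20
print(parse_dbf_header(demo))
```

The field descriptors that follow these 32 bytes (one per column) are where most of the rewriting effort goes in practice, since vendor dialects differ there.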
Is ODF optimal for preservation, or should we turn to ... CDF? What the h... is CDF? The parties backing ODF (an OASIS and subsequently ISO standard) are not keeping themselves united. ODF itself is tied too closely to a particular application, OpenOffice, which is not completely free from the influence of a vendor, SUN. CDF is an as yet obscure W3C format for XHTML documents with embedded portions in other formalisms, such as SVG. There are no applications associated with CDF. Well now, before we change the course of MIXED, let us think for a little while. Is this yet another hype? What new format will crop up next month? Is it a good thing that CDF is not (yet) polluted by a strong application? It might be ideal for preservation purposes. But will it help us to preserve spreadsheets? Anyway, for MIXED we will use a subset of ODF. This subset will be small enough to contain just the parts of a spreadsheet that are meaningful for preservation. If ODF should dwindle into near non-existence in the future, it will not be a big hassle to migrate from this subset to the new kid in town.
A new organization around ODF (Open Document Format) has emerged: the ODF Alliance. The first and foremost intention expressed on their website concerns document preservation. I consider this alliance a new move in the rivalry between Microsoft's new Office Open XML standard and the more mature ODF standard, used by, among others, OpenOffice. Could it be that document preservation is now considered a strategic issue by software vendors and their standard makers? For if your format supports preservation, you are assured of continued usage of your standard. Anyway, it is a good thing that awareness of preservation issues is on the rise, and that we are witnessing a gradual move from vendor-specific document formats, via standard interchange formats, towards preservation formats. By the way, the home page of the ODF Alliance points to a concise article by Google (local copy) comparing OOXML and ODF on their merits as good, honest standards. Whatever Google's bias might be, the arguments they provide here are substantial.
The Open Data Foundation is concerned with access to statistical data. This fits nicely with MIXED's task of ensuring access to tabular data. René van Horik and Kris Klykens will go on an expedition to see how we can join forces. Whereas MIXED confines itself to tabular data coming from spreadsheets and databases, there is another class of data, also tabular, that comes from statistical software packages such as SPSS. There is an ongoing effort, called DDI, to model this kind of data in XML. DDI is moving towards a major new version, DDI 3.0. In MIXED we define an umbrella format, under which several standards are selected. Ideally, for every kind of data we should have one standard that models the preservation format for that data. Currently, MIXED selects and creates standards for spreadsheet data and database data. It seems that DDI is the candidate for statistical data.
Because we build MIXED as a framework with plugins, the MIXED software is able to lend its framework capabilities to many different standards under the umbrella. We have already found a partner that can write plugins for conversions to and from DDI. So we can extend the MIXED umbrella with a significant contribution. It also fits very nicely with the scope of DANS: with DDI support we will have something substantial to offer to researchers in the social sciences who want to preserve and reuse their statistical data.
Look at that: the National Archives want to migrate old documents to a format suitable for preservation, and they chose Open XML! (Read more here or in a local copy.) Here at MIXED we discuss whether our M-XML should be an entirely new format, a variation on ODF or Open XML, or possibly identical to one of the two. We are opting for ODF, because, considering the way it came into being, it is a real standard. How far we shall deviate from it remains to be seen. Not all features are relevant to preservation, and inclusion of all features might increase the update frequency of M-XML itself. Yet the wealth of converters to and from ODF makes it very attractive simply to choose ODF.
The underlying question is this: clearly, the natural tendency for Office documents is towards a standardized representation of their contents. But these standards are fairly recent; are they already good enough to adopt for preservation purposes? Or will they never stabilize enough to become a worthwhile representation for preservation? My view is that it is better to choose a popular standard that is not completely tailored to preservation than to develop a new standard that is perfectly tailored to preservation but never used. My reasons are as follows:
- popular standards have a rich set of tools around them
- when one popular standard supersedes another in popularity, there will be tools around to migrate from one to the other
- digital preservation is a growing concern, so it is to be expected that preservation concerns will increasingly be addressed in those standards
- even if we try, we cannot exactly demarcate what is worth preserving and what is not; some prefer formatting to be preserved, others formulas, and then again some want the visibility and protection of cells to be preserved ...
So the fact that Open XML, even if not our preferred standard, is going to play a role in digital preservation, is very good news for MIXED and for the practice of Digital Preservation.
Shall we choose short element names for the markup of rows and cells? If we do, we save considerable space, because the bulk of the material of our databases and spreadsheets is in cells. Moreover, quite often the cell content is tiny: a floating point number, or a short string. There are signs that Microsoft Office Open XML struggled with performance issues because of the change from a binary format to XML (see here). On the other hand, we lose some of the self-explanation of the M-XML format if we do so. Most XML standards that we are aware of do not use any form of tag economy. You might say: by all means, explain your elements in a separate document, but not in the data. But then you could compress your whole XML file by choosing element and attribute names like a, b, c, a1, b1, c1, etc. This goes against the grain of XML: making structure explicit in ways that help humans and machines to read it. Moreover, it remains to be seen how acute the size problem will be. For the moment, we go with the XML flow: we choose descriptive names, even for rows and cells. If the performance penalty hits us hard, we can always selectively reduce the length of some element names.
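The trade-off is easy to measure. Here is a small sketch that serializes the same 1000-row, 3-column table twice, once with descriptive element names and once with one-letter "tag economy" names, and compares the byte counts. The element names (table/row/cell versus t/r/c) are invented for the comparison and are not the real M-XML vocabulary.

```python
import xml.etree.ElementTree as ET

def build_table(table_tag, row_tag, cell_tag, n_rows=1000, n_cols=3):
    """Build a table with the given tag names and serialize it to bytes."""
    root = ET.Element(table_tag)
    for i in range(n_rows):
        row = ET.SubElement(root, row_tag)
        for j in range(n_cols):
            # Tiny cell contents, as in a typical spreadsheet dump.
            ET.SubElement(row, cell_tag).text = str(i * n_cols + j)
    return ET.tostring(root)

verbose = build_table("table", "row", "cell")
terse = build_table("t", "r", "c")
print(len(verbose), len(terse), round(len(verbose) / len(terse), 2))
```

With content this small, the tag names dominate the file size, so the descriptive version comes out roughly twice as large before compression; gzip narrows the gap considerably, which is part of why we are not rushing to abbreviate.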
In MIXED we have an architecture for the software we are going to build, and it is a service-oriented architecture (SOA). Quite the opposite would be a single stand-alone application. Why? Let me tell you about the tensions it evoked in the MIXED team before we got our picture straight.
The basic idea of MIXED is simple: it is a conversion engine that can handle a set of file formats. A main application with plugins for the file types would do the job nicely. The code remains simple, and the efficiency of adding and calling plugins will be high. Highly appealing to the software team. But not to our architect... He came up with a design consisting of services only, in which not only the conversions were moved to plugins, but also application-internal tasks such as high-level control, logging, and configuration. That seemed rather heavy, inefficient, over the top.
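The "main application with plugins" idea can be sketched in a few lines. All names here (Plugin, MixedEngine, the toy plugins) are hypothetical; the real MIXED code is Java, and the point is only the shape: every conversion passes through the intermediate format, so n file types need n plugins rather than n×n pairwise converters.

```python
class Plugin:
    """A converter between one file type and the intermediate format."""
    filetype = None
    def to_intermediate(self, data): raise NotImplementedError
    def from_intermediate(self, doc): raise NotImplementedError

class MixedEngine:
    """Toy conversion engine: looks up a plugin per file type."""
    def __init__(self):
        self._plugins = {}
    def register(self, plugin):
        self._plugins[plugin.filetype] = plugin
    def convert(self, data, src, dst):
        # Every conversion goes through the intermediate format.
        doc = self._plugins[src].to_intermediate(data)
        return self._plugins[dst].from_intermediate(doc)

# Two trivial stand-in plugins, just to exercise the engine.
class UpperPlugin(Plugin):
    filetype = "upper"
    def to_intermediate(self, data): return data.lower()
    def from_intermediate(self, doc): return doc.upper()

class PlainPlugin(Plugin):
    filetype = "plain"
    def to_intermediate(self, data): return data
    def from_intermediate(self, doc): return doc

engine = MixedEngine()
engine.register(UpperPlugin())
engine.register(PlainPlugin())
print(engine.convert("HELLO", "upper", "plain"))  # prints hello
```

The architect's objection was that even the engine's own control, logging, and configuration should be services like the plugins, not baked into a class like MixedEngine above.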
The way to get all noses pointing in the same direction was: back to the requirements. When will MIXED be a success? When we have built a conversion tool? We need to ensure that the conversion tool is also a preservation tool. For that, several things must fall into place: our intermediate format (called M-XML) must be adopted by others, the conversions to and from it must be reliable, and there must be conversions for many file types. This we cannot do alone.
Who is going to develop all those conversion plugins, now and in the future? In order to get good quality conversions we must be able to use the work of others. If we want M-XML to become a success, we must encourage others to use our work. In the end there will be many MIXED software instances running, using many internal and external plugins.
Seen from this perspective, MIXED needs that Service-Oriented Architecture. It should be a service talking to other services. It will fit nicely on a computing grid, offering a preservation service that can be applied to data that is already on the grid.
While we are thinking about SPSS, there is a pilot project, DExT, in Colchester, England, that also wants to preserve data by converting it to an intermediate format. They have SPSS and other statistical packages in scope, but not tabular data in general. Currently, DExT is not aiming at a full-scale software implementation, only a proof of concept. Wouldn't it be nice if they provided plugins for SPSS and other favourite formats of theirs, and we provided the general framework for those plugins? Here is the beginning of synergy ...
In MIXED we restrict ourselves to tabular data, because we want to preserve research materials. The best-known carriers of tabular data are spreadsheets and databases. So, let them fill the scope of MIXED. At the same time, our organisation, DANS, works for the social sciences. A lot of research data there is not in spreadsheets but in statistical software, notably SPSS. This is also basically tabular data. If we leave SPSS out of our scope, we lose opportunities to serve the social sciences. If we allow SPSS, we have to cater for it in our intermediate format (M-XML), which might not be easy.
We could reduce M-XML to a mere umbrella format, with separate, unrelated formats for spreadsheets, databases, and SPSS under it. In that way, we could define those individual formats so that we can reuse many conversions that already exist. On the other hand, it would be nice to have a generic format for all tabular data, irrespective of whether it originated in a spreadsheet, a database, or SPSS. That would facilitate aggregation and searching.
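The umbrella option is structurally very simple: one wrapper element that records the kind of source, with the source-specific payload nested under it. The sketch below shows the idea; the element names (mixed, spreadsheet, row, cell) are invented for illustration and are not the actual M-XML vocabulary.

```python
import xml.etree.ElementTree as ET

def wrap(kind, payload_root):
    """Put a source-specific payload under a common umbrella element."""
    assert kind in ("spreadsheet", "database", "statistics")
    umbrella = ET.Element("mixed", {"kind": kind})
    umbrella.append(payload_root)
    return umbrella

# A minimal spreadsheet payload: one row with one cell.
sheet = ET.Element("spreadsheet")
row = ET.SubElement(sheet, "row")
ET.SubElement(row, "cell").text = "42"

doc = wrap("spreadsheet", sheet)
print(ET.tostring(doc).decode())
```

The cost of this option is visible even here: a tool that searches across deposits must understand three payload vocabularies instead of one, which is exactly the aggregation problem a single generic format would avoid.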
In our team we have oscillated between the various options: yes or no to SPSS? Yes or no to deep integration of the formats? So far, we have not decided.
Submitted by dirk on Thu, 2007-06-28 15:32.