Position paper

MIXED: Migration to Intermediate XML for Electronic Data

Dirk Roorda, DANS

February 2007

The MIXED project addresses the problem of digital preservation of spreadsheets and databases. It does not tackle the problem in full generality. By imposing some restrictions we hope to make progress in actually building software tools that enhance the durability of electronic data. These restrictions stem from the fact that our effort is aimed at data coming from the social sciences, the arts and the humanities. This entails quite a few things:

  • we are not concerned with legal authenticity of the data
  • we will not preserve the insert/update/delete behaviour of databases; in fact, very little of a database application is in our scope; we only need to preserve the logical relationships expressed in the data model
  • we are interested in the continuous, online availability of archived data during the time it is stored (searchability, referenceability)

The method chosen is: migration to an intermediate format. This puts our method at the migration end of the spectrum between the two broad archival approaches, emulation and migration. We expect to overcome the basic disadvantage of migration by means of a suitable choice of the intermediate format. That basic disadvantage is the enormous amount of work required to keep all kinds of data up to date with recent versions of the software that created it. The MIXED method decouples the data from any application needed for authoring that data. This yields a number of reductions:

  • instead of migrating a huge number of vendor formats, we have only one format to migrate to (for ingest) and to migrate from (for dissemination)
  • instead of a rapid sequence of versions of vendor formats, we expect a much slower rate of versioning of the intermediate format
  • instead of working under dependencies on vendors and the extent to which they disclose their formats, we work with open formats in which no detail is hidden
  • instead of working with formats that incorporate the temporary designs of applications, the intermediate format represents the semantic core of the data and nothing else

We expect this reduction of migration work to be so large that the MIXED method hardly counts as migration, compared to a naive migration strategy. As to the initial conversions from vendor formats to the intermediate format at ingest, and the reverse conversions upon dissemination, a few mitigating conditions hold:

  • these conversions do not have to be more durable than the formats to which and from which they convert
  • so, for these conversions we may pick existing software tools, if any
  • because of the development of open interchange formats for office documents, there are already many converters from vendor office formats to and from these interchange formats (Office Open XML and Open Document Format)
  • by choosing a format related to an existing interchange format, the MIXED method profits from the current (healthy) developments in Office-land

Things to be determined

One of the abstractions made in this project is that we only want to archive the semantic core of the data. But what is this semantic core? Loosely speaking, it is the meaning of the data, independent of presentation and action. For example: a table of outcomes of censuses in the Netherlands during the 19th century is mainly interesting for its values, and the meaning of those values is not dependent on the font family (presentation), or on the logic (action) with which you enter them in a database. But this is too loose. Often, presentation is an aid to interpreting values, and cannot be omitted. And in a database, business rules are often expressed in action (constraints, triggers), so you might lose essential interpretations if you ignore them. But then: not all presentation and action details contribute to meaning.

In our MIXED project, we have to fix an operational semantics for the concepts of presentation and action in the context of databases and spreadsheets. This operational semantics has to guide how we deal with the presentation and action features present in spreadsheets and databases. We found a promising approach for databases in the description of the SIARD project, see the references below. Most of our thinking in this area still has to take shape.

Application

We have two years and €2,000,000 to develop a software framework that is able to:

  • identify spreadsheet and database files, inspect live databases
  • extract data from those files (with or without user intervention)
  • wrap the data (and datamodel) into an XML dialect, let us call it M-XML, consisting of archival wrappers, plus wrappers for the typical database and/or spreadsheet structure, borrowing as much as possible from existing open standards
  • convert data back from M-XML to present-day spreadsheets and databases (a sketch of such a module contract follows this list)
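
To make this concrete, here is a minimal sketch, in Java, of what the contract of such a conversion module might look like. This is an illustration only: the interface and all names in it (MixedConverter, recognizes, toMxml, fromMxml) are hypothetical, not part of any published MIXED design.

    import java.io.InputStream;
    import java.io.OutputStream;

    // Hypothetical contract for a MIXED conversion module (illustrative only).
    public interface MixedConverter {

        // Identification: does this module handle the given format,
        // e.g. a particular vendor's spreadsheet format in a particular version?
        boolean recognizes(String formatId);

        // Ingest: extract the data and its data model from the vendor file
        // and wrap them in M-XML.
        void toMxml(InputStream vendorData, OutputStream mxml) throws Exception;

        // Dissemination: convert M-XML back to a present-day format.
        void fromMxml(InputStream mxml, OutputStream vendorData) throws Exception;
    }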

Moreover, at DANS we have an OAIS-compliant archival repository, called EASY. At the moment, EASY has no facilities to fight the obsolescence of aging data formats. MIXED has to deliver its software as a module on top of EASY in the first place, and possibly on top of other archive systems as well. The software will be open source.

As to the name M-XML: read Middle-XML, or Mixed XML, since it will be an umbrella under which several XML dialects are subsumed: XML dialects that represent generic structures like those of databases, spreadsheets and documents, for which widely spread, much-used, long-lived applications exist that act on those structures.

The MIXED process

The rest of the paper is devoted to making the MIXED scenario explicit. It contains a description of the context in which the MIXED method is meant to work. The intermediate format to be defined and used by the MIXED project is abbreviated as M-XML. We expect that even this format will change over time, so sometimes we are more specific and say things like M-XML 2007 or M-XML 2010, meaning the version of M-XML that is current in that year.

The main question is: what requirements must be imposed on M-XML, to which current spreadsheets and databases are converted for ingest, and from which they are converted back for dissemination? Must all the conversions be preservable? Can we use existing programs to do the conversions? To answer these questions, a finer-grained view of the ingest and dissemination processes is needed.

The left column of the diagram lists data formats that must be archived. (This is short for: data formats in which there is data that is to be archived.) Read data.vi.j as the data format of vendor i in version j. Vendors come and go, new versions come, old versions become obsolete.

The right column lists data formats that must be disseminated. (This is short for: data formats for which there is archived data in M-XML that must be disseminated in that format.) Read data.ui as data format i. These data formats may intersect with the data formats in the left-hand column. Probably the set of formats at the right is a subset of the set of formats at the left.

In the middle you see the M-XML formats, in various (hypothetical) future versions.
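
Since the diagram itself is not reproduced here, a rough sketch of the layout it describes:

    archived formats              intermediate              disseminated formats
    data.v1.1 ---\                                         /---> data.u1
    data.v1.2 ----+---> M-XML 2007 / 2010 / 2020 ... ---->+----> data.u2
    data.v2.3 ---/                                         \---> ...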

Let us reason about the development of M-XML in the coming years.


Situation 2007

Suppose a nice XML formalism has been devised for archival records: M-XML 2007. The data formats of the time, defined by applications of vendors v1 and v2, in versions 1 and 3 respectively, can be converted for ingest by means of MIXED modules. The required dissemination formats, say u1 and u2 (which are chosen from the currently available applications and versions), can be reached as well, by means of MIXED modules that convert from M-XML 2007 for dissemination.
So far so good.

The simple view is that MIXED has reached its objective, that an eternal archival format has been defined, and that, as time flows, only new conversion modules have to be added by MIXED.
So, in the simple view, what is the work that MIXED must do?

  • define the format M-XML 2007
  • write conversion modules from v1.1 and v2.3 to M-XML 2007
  • write conversion modules from M-XML 2007 to u1 and u2

The realistic view is that M-XML 2007 itself will undergo evolution. This is due to:

  • initial weaknesses of the format
  • the coming of new applications and versions with new features not supported by M-XML 2007.

Situation 2010

We start thinking in the realistic view. New applications, versions and features have arrived, for ingest as well as dissemination. The format M-XML 2007 has to be upgraded to cater for the new features: M-XML 2010 is born. The older applications and versions for ingest and dissemination are not yet obsolete. The things to do are:

  • write an upgrade conversion from M-XML 2007 to M-XML 2010; this is needed to ingest new material in older formats: the existing modules convert the material to M-XML 2007, and the new upgrade conversion brings it a step further (a sketch of this chaining follows the list below). (Obviously, we could also have written new modules that convert from the older formats directly to M-XML 2010, but that looks less efficient.)
  • write a downgrade conversion from M-XML 2010 to M-XML 2007; this is needed to disseminate new material in older formats: the new modules convert new material in new formats to M-XML 2010, and the new downgrade conversion brings it back to M-XML 2007, from which the older dissemination modules produce the older formats.

It could very well be the case that M-XML 2010 only adds features, so that any document in M-XML 2007 is also a document in M-XML 2010. In that case, these steps amount to nothing.

  • write new conversion modules for ingest from v1.2 and v2.5, and for dissemination in u3 and u4.
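
The chaining idea can be sketched with the hypothetical MixedConverter interface introduced above; the class and method names here are again invented for illustration:

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.InputStream;
    import java.io.OutputStream;

    // Illustrative only: ingest material in an old vendor format into
    // M-XML 2010 by chaining the existing vendor-to-M-XML-2007 module
    // with the M-XML 2007 to M-XML 2010 upgrade conversion.
    public class ChainedIngest {

        public static void ingest(InputStream oldVendorData,
                                  OutputStream mxml2010,
                                  MixedConverter vendorToMxml2007,
                                  MixedConverter upgrade2007To2010)
                throws Exception {
            ByteArrayOutputStream mxml2007 = new ByteArrayOutputStream();
            // Step 1: the old module converts the vendor format to M-XML 2007.
            vendorToMxml2007.toMxml(oldVendorData, mxml2007);
            // Step 2: the upgrade conversion brings it to M-XML 2010.
            upgrade2007To2010.toMxml(
                new ByteArrayInputStream(mxml2007.toByteArray()), mxml2010);
        }
    }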

Situation 2020

Still in the realistic view, we assume that the intermediate M-XML format evolves quite slowly. This is because M-XML expresses the semantic core of data, and not all the frills needed by authoring applications. So here is the next upgrade, and we do all the work we needed to do in 2010. We add a slice to the diagram, consisting of a new set of ingest and dissemination modules, and an upgrade and downgrade conversion between M-XML 2010 and M-XML 2020.
But in a realistic view, an extra complication might occur: suppose that in 2020 the upgrade/downgrade conversions between M-XML 2007 and M-XML 2010 become obsolete. What must we do? It depends.

  • Case 1: the software tools that the conversions rely on are no longer supported. Remedy: reimplement the conversions by means of current software tools.
  • Case 2: M-XML 2007 can no longer be supported, because of one of the following reasons: XML has undergone incompatible changes, Unicode has undergone incompatible changes, or new software tools stumble over aspects of M-XML 2007. In this case we may assume that the old ingest formats (v1.1 and v2.3) and the old dissemination formats (u1 and u2) are obsolete as well. The only problem then is: we have legacy material in M-XML 2007 that we cannot disseminate anymore. Solution: migrate all the legacy material in M-XML 2007 to M-XML 2010, or higher (the first version that is not obsolete); a sketch of such a batch migration follows below.

The need to migrate legacy material is a pity. But:

  • the up-conversions are likely to be easy and straightforward
  • the conversion is between open, generic, well-documented formats
  • the conversion is between a single source format and a single target format.
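
A minimal sketch of such a batch migration, again under the hypothetical MixedConverter interface from above (the file-naming convention is invented for illustration):

    import java.io.InputStream;
    import java.io.OutputStream;
    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    // Illustrative only: migrate every stored M-XML 2007 record to
    // M-XML 2010 by running the upgrade conversion over the archive.
    public class LegacyMigration {

        public static void migrateAll(Path archiveDir,
                                      MixedConverter upgrade2007To2010)
                throws Exception {
            try (DirectoryStream<Path> records =
                     Files.newDirectoryStream(archiveDir, "*.mxml2007")) {
                for (Path record : records) {
                    Path upgraded = Paths.get(
                        record.toString().replace(".mxml2007", ".mxml2010"));
                    try (InputStream in = Files.newInputStream(record);
                         OutputStream out = Files.newOutputStream(upgraded)) {
                        // A single source format, a single target format.
                        upgrade2007To2010.toMxml(in, out);
                    }
                }
            }
        }
    }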

So, in 2020 a new layer is added to the diagram, but at the same time, the oldest layer is stripped from the diagram.

Situation 2050

The same exercise has to be repeated: a new layer is added, one or more old layers are deleted. Deletion of a layer necessarily involves migration of the legacy content in that layer.

Evaluation of this scenario

The realistic view involves a lot more work than anticipated in the simple view. So it is time for the economic view. We try to show that the combination of the realistic view with the economic view involves a manageable amount of work.

The economic view introduces new considerations:

  • there are already M-XML-like formats for spreadsheets, and possibly databases (Open Document Format (ODF) by OpenOffice and OpenXML by Microsoft)
  • there are already many implemented conversions between these Open formats and the current vendor-specific formats
  • it is likely that these existent conversions will be maintained
  • it is likely that when the open formats evolve, upgrade and downgrade conversions will be provided by the vendors/user communities

These considerations must be combined with the following principle:

In order to ensure digital durability, only the up-down conversions between the different versions of M-XML have to be durable themselves, and the degree of their durability may vary.

It is a matter of common sense and careful evaluation to determine where we have to ensure durability. Here are some preliminary observations:

  1. the ingest conversion modules do not have to be more durable than the period during which datasets in the corresponding formats are to be ingested
  2. the dissemination modules do not have to be more durable than the period during which datasets in the corresponding formats are to be disseminated
  3. something analogous holds for the upgrade and downgrade conversions

It follows that it is safe to use the existing Open formats plus their related conversion software, without compromising the essential durability of the stored archival records, provided the Open formats have enough quality in terms of genericity, self-documentation, and support from the computing community of the time. So what, in the economic view, is the work to be done by MIXED? The following parts can be identified:

  1. carefully select the most suitable Open format of the present time, and make the criteria explicit. If there is no suitable Open format, develop or modify one. The latter case might hold for databases at present.
  2. carefully select the best converters between the selected Open format and all relevant existing formats
  3. monitor developments in the Open format and in the current formats
  4. manage all the needed conversions by means of wizard software: given a data file, the wizard must be able to determine its format and call the needed conversion (see the sketch after this list)
  5. integrate the needed converters and the wizard: if the converters are implemented in different programming languages, run on different platforms, or have incongruous APIs, try to re-engineer them into one programming language (the newest version of Java), with a consistent API, as platform-independent as possible
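
A minimal sketch, assuming the hypothetical MixedConverter interface from above, of how such a wizard might dispatch a data file to the right converter. The format detection is stubbed out; a real implementation could consult a format registry such as PRONOM/DROID (see the references):

    import java.io.InputStream;
    import java.io.OutputStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;

    // Illustrative only: the wizard determines a file's format and
    // hands it to the first registered converter that recognizes it.
    public class ConversionWizard {

        private final List<MixedConverter> converters;

        public ConversionWizard(List<MixedConverter> converters) {
            this.converters = converters;
        }

        // Stub: a real wizard would identify the format by inspecting
        // the file's content, not just its extension.
        private String identifyFormat(Path dataFile) {
            String name = dataFile.getFileName().toString();
            return name.substring(name.lastIndexOf('.') + 1);
        }

        public void ingest(Path dataFile, OutputStream mxml) throws Exception {
            String formatId = identifyFormat(dataFile);
            for (MixedConverter converter : converters) {
                if (converter.recognizes(formatId)) {
                    try (InputStream in = Files.newInputStream(dataFile)) {
                        converter.toMxml(in, mxml);
                    }
                    return;
                }
            }
            throw new IllegalArgumentException(
                "No converter available for format: " + formatId);
        }
    }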

The simple and the realistic views require that we do things that are already being done. Given the work done to define ODF and OpenXML, and given the work done on the conversion filters used by OpenOffice, Microsoft Office and others, it is unclear how we could ever compete with that. We could add a nice and clean Open format, with a few nicely engineered software modules, but we think this effort would not have an appreciable influence on the flow of software engineering in Office-land and database-land.

The economic view harvests the power that is already there. It tries to live with the imperfections and lack of harmonization of the different existing modules. It fits more closely with current document creation and processing. It envisages a practical solution for the digital durability of archived records. It does not create a small flow alongside a big flow, but tries to enhance the existing flow in quality and direction.

This approach of wizarding might have another consequence. Currently, the SPSS format is out of MIXED's scope. Yet it is a very frequent and natural carrier of research data in the social sciences. There is already an initiative, DDI, working on transparent, self-documenting and durable formats for SPSS files. The conversion tools between this format and current SPSS versions are not yet mature. It might be worthwhile for MIXED to bring SPSS and this DDI format under its wizarding umbrella. Moreover, MIXED could contribute to the development of the conversion tools to and from DDI.

Characterization

If we follow the economic view, MIXED will not spend its money on creating a brand new jewel in the field, with the apparent risk that it will only reach the museum. Rather, it will contribute to current practices, which are already strong, but not yet up to user-friendly preservation of archival records. We expect that the latter approach leads to a situation where DANS and other archives will have a practical way to add preservation to their records as a matter of routine.

Positioning

We saw that the MIXED scenario involves migration at some points in time. So, are we implementing a migration scenario? Did we not intend to implement a solution based on an object interchange format, making migration superfluous?
I think the MIXED scenario as proposed here takes into account that the ideal of a persistent archive is not yet fully achievable. We simply have insufficient insight into how standards, interchange formats and the like evolve over very long time spans. Moreover, given the fact that hardware and software are developing fast all the time, it is risky to extrapolate patterns seen in the previous half century into the next half century. Cyberspace itself might have changed in unforeseen directions.
So we opt for a scenario that is firmly rooted in the present, that takes on the direction of the ideal of a persistent archive, and that calculates what has to be done as long as the ideal has not been reached. In doing so, it tries to minimize that work as well.

References

Archiving in general

DANS

DANS, MIXED project

OAIS

Standard about best archiving practices

Auxiliary standards

ODF

Open Document Format (native format of OpenOffice, also an interchange format) (OASIS, Cover Pages, Wikipedia). Contains as a sublanguage: Open Formula Language (OFL).

OpenXML

Microsoft Office 2007 native format and interchange format (ECMA, OpenXML Developer, Wikipedia; overview online; Open Packaging Convention (OPC) online).

Comparison of ODF and OpenXML

Wikipedia

Auxiliary resources

PRONOM

Format repository. Lists information about formats of digital information: applications, vendors, versions (main site). Contains DROID (Digital Record Object Identification), with software tools.

National Library of New Zealand Metadata Extraction Tool

Extraction of metadata from digital files (main site), with software tools.

JHOVE

JSTOR/Harvard Object Validation Environment. A framework for determining the formats of digital objects. online

Related DANS projects

EASY

system, project documentation

Related projects out there

Testbed Digitale Duurzaamheid (Digital Durability Testbed)

SIARD

Fedora

Rich-object archiving through Web services. online

TOM

A networked service to document information types; moreover, a system of conversions between information types. Clients can look for conversion services through type brokers. A completely different approach to digital durability from the one MIXED tries to implement. Promising, though. TOM

Planets

A fairly recent initiative to bundle archives and software companies around the theme of digital durability. It looks like Planets has inherited SIARD. Planets

KB

Koninklijke Bibliotheek (Royal Library of the Netherlands). Several reports, jointly with IBM. Specific interest in emulation techniques. online

Digital Preservation Coalition (DPC)

Website; paper Mind the Gap. See also the list of allied organisations on the site.

Articles, journals, reports

Digital Curation (journal): DC

The State of Digital Preservation: An International Perspective. online

Digital Preservation and Permanent Access to Scientific Information: The State of the Practice. CENDI 2004-3. online


Possible migration risk

A possible risk that apparently does not belong to the current MIXED project, but possibly to a future MIXED-2020 migration project: migration from M-XML 2007 data to M-XML 2010 data, due to obsolescence of the upgrade/downgrade conversions between M-XML 2007 and 2010, can be considerably time-consuming. And of course extra disk space is necessary (I presume the archives of MIXED will be in the order of terabytes or more).

definitions

In the 2020 scenario you state the possibility that "XML has undergone incompatible changes". Will we still be talking about XML at that time, or will XML have been overruled by another format? If I ignore the title of the project: is this project about XML, or about an intermediate format (that currently happens to be an XML format)?

Another point is about the definition of "data". In contrast to "documents" this definition is clear, but in contrast to "information" or even "knowledge" this definition might raise questions.

(Although I understand what you are trying to say, I'd like to share these considerations.)

still xml, more about data

Even if XML undergoes incompatible changes (say from 1.1 to 2.0), it remains XML. It is not in the scope of the project to invent something more general.

The word data is ambiguous indeed. What we want to preserve is information, in the form of datasets. The designated community (OAIS term) knows how to interpret the data, provided we have marked up the data in such a way that the knowledge of the designated community is sufficient to decode the marked-up data. So, do you call this marked-up data information or still data? I have no strong feelings about that.

Your question gives rise to a slightly more general point: suppose we want to archive datasets that are already in XML, but a more special-purpose XML, such as a well-marked-up text corpus, or law texts marked up in Kluwer XML or SDU XML or another kind of legal XML. There are hundreds of such domains, each with their developing and rivalling XML interchange formats. What does MIXED have to say about this?

I think the approach of MIXED is as follows. Where there are data(sets) with a generic structure, for which there are strong applications, it is the task of MIXED to decouple such applications from the data, by wrapping up the data in XML in such a way that the wrapping XML reflects the generic structure. Examples are: databases (structure = table - column - value - constraint; applications = database management systems such as Oracle and SQL Server), spreadsheets (structure = sheet - row - cell - value/formula; applications = Excel, OpenOffice and others), documents (structure = section - paragraph - formatting etc.; applications = Word, OpenOffice and others), and SPSS reports (structure = variable - value - relationship; application = SPSS).

In all cases with a generic structure and a strong application acting on that structure, MIXED tries to find the best (most stable, usable, cost-effective) interchange format in XML. It will document this format, find and manage converters for it, and maybe even develop simple viewers for it (in order to deliver search results). So M-XML can be read as Middle-XML or Mixed-XML. It acts as an umbrella over existing or new XML dialects that mark up generic structures.
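
As a purely hypothetical illustration of such a wrapping (there is no published schema; every element name below is invented), an M-XML-style fragment for a small spreadsheet might look like this, shown as a Java text block:

    // Illustrative only: a hypothetical M-XML wrapper reflecting the generic
    // sheet - row - cell structure of a spreadsheet. All names are invented.
    public class MxmlSample {
        public static final String SPREADSHEET_SAMPLE = """
            <dataset kind="spreadsheet" mxmlVersion="2007">
              <sheet name="census">
                <row n="1">
                  <cell col="A" type="number">12345</cell>
                  <cell col="B" type="formula">=A1*2</cell>
                </row>
              </sheet>
            </dataset>
            """;
    }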

In all other cases, MIXED does not do anything special; those cases are out of scope. When we have datasets in XML tailored to a specific domain, an archive had better store the data as is, because the designated community will maintain the knowledge to interpret that XML. Often, the applications for such domains are not strong, in the sense of being widely spread, long-lived and much-used. The semantic structure of the data, marked up by the XML dialect in question, is the prime thing to be preserved. Because the applications that act on this structure are weak, it is not so important to preserve these applications or their behaviour, and there is no strong coupling between application and data.