On emulation and migration to intermediate format

by Gabriel David, in Data Warehouses in the Path from Databases to Archives, paper submitted for the database preservation workshop at Edinburgh, 2007-03-23


Organizations increasingly rely on databases as the main component of their record keeping systems. However, as the amount and detail of information contained in such systems grows, so does the concern that in a few years most of it may be lost, when the current hardware, operating systems, database management systems (DBMS) and applications become obsolete and render the data repositories unreadable. The paperless office increases the risk of losing significant chunks of organizational memory and thus harming the cultural heritage.
Significant research addressing this concern has already been conducted. Its conclusions discard approaches now considered naive, such as trying to preserve specimens of the machines, system software and applications, in all their main versions, so that the backups of every significant system could be used whenever needed. A variant of this approach suggests simulating the older hardware on newer machines instead of preserving it. More promising research suggests converting database contents into an open, neutral format with a significant amount of semantics attached (XML dialects), so that the contents become independent of the details of any particular DBMS.
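As a rough illustration of the conversion idea, the sketch below dumps one table from a SQLite database into XML, keeping column names and declared types alongside the data so the result can be read without the original DBMS. It is only a sketch under stated assumptions: the element names (`table`, `schema`, `column`, `rows`, `row`, `value`), the function `export_table_to_xml` and the sample `staff` table are invented for this example and do not follow the paper's format or any published XML dialect.

```python
import sqlite3
import xml.etree.ElementTree as ET


def export_table_to_xml(conn, table):
    """Dump one table, with column names and declared types, to an XML string."""
    # Table names cannot be bound as SQL parameters, so accept plain identifiers only.
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")

    # Hypothetical element names; not taken from any standard dialect.
    root = ET.Element("table", name=table)

    # Record the schema so the dump stays interpretable without the DBMS.
    schema = ET.SubElement(root, "schema")
    columns = []
    for _cid, name, decl_type, _notnull, _default, _pk in conn.execute(
        f"PRAGMA table_info({table})"
    ):
        ET.SubElement(schema, "column", name=name, type=decl_type)
        columns.append(name)

    rows = ET.SubElement(root, "rows")
    for record in conn.execute(f"SELECT * FROM {table}"):
        row = ET.SubElement(rows, "row")
        for name, value in zip(columns, record):
            cell = ET.SubElement(row, "value", column=name)
            if value is None:
                cell.set("null", "true")  # keep NULL distinct from an empty string
            else:
                cell.text = str(value)

    return ET.tostring(root, encoding="unicode")


# Sample data, invented for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO staff (name) VALUES ('Ada'), ('Grace')")
xml_dump = export_table_to_xml(conn, "staff")
print(xml_dump)
```

A real archival format would also need to carry keys, constraints, binary values and character-set information, which this sketch leaves out.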

On citation of data and persistent identifiers

by John A. Kunze, in Practical Citation in a World of Evolving Data, paper submitted for the database preservation workshop at Edinburgh, 2007-03-23

The conduct of scholarship has long required that authors be able to cite works that support, refute, extend, credit, or otherwise relate to the citing work. Traditional citation guidelines have been sufficient to create stable references for printed materials held in the world's libraries and archives, but common citation practices have not yet appeared for digital objects that change frequently, come in a variety of formats, and themselves consist of a hierarchy of citable objects. The absence of such practices is keenly felt in scientific research that relies on long-term access to large databases.

Investigation quickly shows that stable database citation is not merely a matter of superficial notational convention; it presupposes an underlying usage model that must be acceptable to data users, producers, publishers, and archivists. The model should encompass a variety of citation needs and synthesize prior discipline-specific efforts at data description in biomedicine, political and social science, astronomy, geography, etc. The model should also accommodate the modern, internet-age expectation that citations include persistent identifiers usable with widely available software to gain access to cited works.
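
To make the requirements above more concrete, here is a minimal sketch of a citation record that names a database state, an optional sub-object within it, and a persistent identifier that can be turned into a link. Everything in it is an assumption made for illustration: the `DataCitation` class, its field names, the `id:example/12345` identifier and the `resolver.example.org` address are placeholders, not the model proposed in the paper or any real identifier scheme.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataCitation:
    creator: str
    title: str
    snapshot: str               # date or version of the cited database state
    identifier: str             # persistent identifier (placeholder scheme)
    part: Optional[str] = None  # optional sub-object within the database

    def url(self, resolver="https://resolver.example.org/"):
        """Build an actionable link; the resolver address is a placeholder."""
        if self.part is None:
            target = self.identifier
        else:
            target = f"{self.identifier}/{self.part}"
        return resolver + target

    def render(self):
        """Format a human-readable reference that ends with the actionable link."""
        return f"{self.creator}. {self.title} (as of {self.snapshot}). {self.url()}"


# Sample values, invented for the example.
citation = DataCitation(
    creator="Example Observatory",
    title="Star Catalogue",
    snapshot="2007-03-23",
    identifier="id:example/12345",
    part="table/stars",
)
print(citation.render())
```

Keeping the snapshot and the part as separate fields reflects the two difficulties named earlier: the data change over time, and a database contains a hierarchy of citable objects.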