The participants have made a lot of comments, in the speakers' corner, through the subgroup presentations and during the breaks. On the last day we had a talk-back and wrap-up session, in which Dirk formulated the lessons learned, with the proviso that this was only a first reaction.
Things that clarify matters
When we explained the vision behind MIXED, we got some useful comments back:
- the three manifestations of a file in an archive: original format, facsimile format (such as PDF), reuse format (such as XML)
- the intermediate MIXED format (SDFP) aims to be a reuse format. It will probably not cover all aspects of digital preservation. But in a research context where you preserve in order to reuse, the two functions nearly come together
- re-use versus dissemination
- the classical way of providing archival content is dissemination, where the emphasis is on either the original format or a facsimile. But archives that facilitate reuse, e.g. for data mining, need a different emphasis, and this can be facilitated by a separate reuse manifestation in XML
- smart migration = normalisation + migration
- this is a very slick way to state the MIXED approach: the normalisation step transforms all relevant current vendor formats into one XML format; after that only this XML format will be migrated to newer formats as time passes. The important thing is that the normalisation step does not have to bridge time, and the migration step does not have to worry about many different formats.
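The split between normalisation and migration can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the two toy "vendor formats" (comma- and tab-separated text), the dictionary that stands in for the intermediate XML format, and all function names. It is not the MIXED code, only a picture of the idea that normalisers know vendor formats but not time, while the migrator knows time but only one format.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for the intermediate format: a plain dictionary.
Table = Dict[str, object]


def _parse(raw: str, separator: str) -> List[List[str]]:
    return [line.split(separator) for line in raw.strip().splitlines()]


def normalise_csv(raw: str) -> Table:
    """Normaliser for one toy vendor format (comma-separated text)."""
    lines = _parse(raw, ",")
    return {"version": 1, "columns": lines[0], "rows": lines[1:]}


def normalise_tsv(raw: str) -> Table:
    """Normaliser for another toy vendor format (tab-separated text)."""
    lines = _parse(raw, "\t")
    return {"version": 1, "columns": lines[0], "rows": lines[1:]}


# One normaliser per current vendor format; none of them has to bridge time.
NORMALISERS: Dict[str, Callable[[str], Table]] = {
    "csv": normalise_csv,
    "tsv": normalise_tsv,
}


def migrate_v1_to_v2(table: Table) -> Table:
    """Migration step: only knows the intermediate format, not vendor formats."""
    if table["version"] != 1:
        raise ValueError("expected intermediate format version 1")
    return {
        "version": 2,
        "fields": [{"name": name} for name in table["columns"]],
        "records": table["rows"],
    }


def smart_migrate(vendor_format: str, raw: str) -> Table:
    """Smart migration = normalisation followed by migration."""
    return migrate_v1_to_v2(NORMALISERS[vendor_format](raw))
```

With n vendor formats, this arrangement needs n normalisers plus one migrator per new generation of the intermediate format, instead of a fresh converter for every vendor format at every generation.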
Things that puzzle us
There was some feedback that we could not immediately make sense of. In one case we did not see how it fell in the scope of MIXED, in another we could not see how to follow it up, and in yet another the feedback itself did not point in one consistent direction.
How to connect with the world of metadata and RDF?
Several issues here.
- The preservation of metadata is in itself a significant problem, in some cases even more difficult than the preservation of the data itself.
- Here MIXED should be firm in its course and scope: it deals with the problem of non-transparent file formats as barriers for preservation and reuse of data. It surely is not the only problem, it might not be the most fundamental problem, but it has to be taken care of.
- Another question is whether the MIXED approach itself can be used to deal with various metadata formats and schemes.
- If data have been made transparent by means of MIXED, it becomes feasible to go one step further: assign descriptive metadata to the data at the level of rows, columns and fields (in the case of tabular data).
- That is one of the benefits of MIXED-like conversions indeed: it is possible to connect data values in fields, cells or columns to semantic characterisations that have been established in registries and/or standards.
- For example, if a column contains temperature values in Kelvins, this information could be coded by means of a persistent identifier of a data category "temperature", which specifies the units and the constraints on the values and the meaning of the values. How this will integrate with the SDFP format remains to be seen.
- Often, heterogeneous datasets are meaningfully linked by RDF metadata. The MIXED approach could make this linking easier and stronger, e.g. by admitting links to fine-grained parts of the data.
- What metadata is generated and used by MIXED itself?
- MIXED needs information on file types, so it is a user of the MIME type registry.
- MIXED implicitly uses a notion of data kind (spreadsheets, databases, statistical files, images, editable texts). It would be nice if we could register those kinds somewhere.
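To make the temperature column example above more concrete, here is a minimal sketch of what linking a column to a registered data category might look like. The registry, the identifier `example:datcat/temperature`, and the classes `DataCategory` and `Column` are all hypothetical; how such a link would actually be expressed in SDFP is, as noted, still open.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class DataCategory:
    """A semantic characterisation, as it might be held in a registry."""
    pid: str                         # persistent identifier (made up here)
    label: str
    unit: str
    minimum: Optional[float] = None  # a simple constraint on the values


# Hypothetical registry with a single entry.
REGISTRY: Dict[str, DataCategory] = {
    "example:datcat/temperature": DataCategory(
        pid="example:datcat/temperature",
        label="temperature",
        unit="K",
        minimum=0.0,  # kelvin values cannot be negative
    ),
}


@dataclass
class Column:
    """A column of tabular data, optionally linked to a data category."""
    name: str
    values: List[float]
    category_pid: Optional[str] = None


def violations(column: Column) -> List[int]:
    """Return the row indexes whose values break the category's constraint."""
    if column.category_pid is None:
        return []
    category = REGISTRY[column.category_pid]
    if category.minimum is None:
        return []
    return [i for i, v in enumerate(column.values) if v < category.minimum]
```

Once the data are transparent, a check of this kind can run over every annotated column without knowing anything about the original vendor format.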
Can we use existing tools for quality management?
The question is: if MIXED has performed a conversion, how do we assess the correctness of that conversion? We received two suggestions: the Universal Numeric Fingerprint (UNF, developed at Harvard) and the Extensible Characterisation Language formalism (XCDL, developed by PLANETS).
The UNF only works for plain representations of data (text, XML). So it cannot directly be used to assess the equivalence between data in its binary format and in SDFP. But it can be handy if somehow there is an independent export from the binary format to text available, and it could also help checking chains of conversions from XML to binary to XML.
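The kind of check described here can be sketched as follows. Note that this is not the UNF algorithm: it is a plain SHA-256 digest over a crudely canonicalised text form, used only to show the shape of the comparison. The XML layout (`table`, `row`, `cell`) and all function names are assumptions for illustration, not SDFP.

```python
import hashlib
import xml.etree.ElementTree as ET
from typing import List

Rows = List[List[str]]


def canonical(value: str) -> str:
    """Crude canonical form: numbers at 7 significant digits, text trimmed."""
    text = value.strip()
    try:
        return format(float(text), ".7g")
    except ValueError:
        return text


def fingerprint(rows: Rows) -> str:
    """Digest over the canonical cell values (a stand-in for a real UNF)."""
    digest = hashlib.sha256()
    for row in rows:
        for value in row:
            digest.update(canonical(value).encode("utf-8"))
            digest.update(b"\x1f")  # cell separator
        digest.update(b"\x1e")      # row separator
    return digest.hexdigest()


def rows_from_xml(xml_text: str) -> Rows:
    """Read rows from a toy XML table (hypothetical layout, not SDFP)."""
    root = ET.fromstring(xml_text)
    return [[cell.text or "" for cell in row.findall("cell")]
            for row in root.findall("row")]


def rows_from_text_export(text: str) -> Rows:
    """Read rows from an independent comma-separated text export."""
    return [line.split(",") for line in text.strip().splitlines()]


def same_data(xml_text: str, text_export: str) -> bool:
    """Compare the converted XML with an independent export of the same file."""
    return (fingerprint(rows_from_xml(xml_text))
            == fingerprint(rows_from_text_export(text_export)))
```

The same `fingerprint` function could be applied before and after a chain of conversions from XML to binary and back to XML, to see whether the data survived the round trip.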
XCDL has to understand the binary formats it works with. The question arises: should MIXED and the people behind XCDL work together, since they are dealing with the same binary formats? The disadvantage of too much cooperation is that the XCDL test will not be independent enough.
Should SDFP be given a high or a low profile?
Some say: my organization is only interested in adopting SDFP and supporting converters to and from SDFP if it has a high profile as an important preservation format.
Others say: it is very hard to reach a well-defined, authoritative set of unique preservation formats under one umbrella.
Still others say: exert a force in the direction of standardization and see how far you can get; there will be a niche, and maybe a growing one.
We at MIXED think as follows: so far we are only dealing with tabular data. There is a fair chance that we reach standardisation there, especially when we use an existing standard, ODF, for spreadsheets. But in the end the most important thing is that MIXED results in a set of good converters from binary formats to well-defined XML formats and back, and that these converters are clearly visible, easily testable and deployable by archives. A strong SDFP would help, but is not the only means to that end.
Things we most likely will not follow up
At this stage nothing is excluded ...
Things that we would like to act on
There were a few clear messages of the form: do this! Here they are, in the order of the priority we would like to give them.
Interoperate with PLANETS
PLANETS is building a testbed and registry for preservation tools. MIXED should be registered there, and wrapped up so that it can live on the testbed. We must consider whether the MIXED application as a whole or the individual converters will be added to the PLANETS testbed.
A good way to start is to participate in a relevant PLANETS workshop, where developers from both parties can meet each other. And we should have a more in-depth exchange with the Swiss Federal Archives. They are preserving databases in a way that is very congenial to MIXED, and we can learn a lot from their method.
Act locally, think globally
The local level is where you can act: with a local team of developers, for a locally needed use case. The global level is where you can optimise, interoperate, reduce work. That means for MIXED:
- we will implement MIXED for DANS, so that we have at least one functioning incarnation of the whole idea
- we try to get MIXED mentioned in the Data Seal of Approval assessment procedures in relation to preferred formats, so other archives get an incentive to use the MIXED approach
- we will interoperate with PLANETS
In order to interoperate on the software development level, you have to document the code, the capabilities of the plugins (in a machine readable way), and SDFP itself. It would be good to have a permanent SDFP evangelist on the project.
As to the documentation of the code: we have made some libraries available via SourceForge. These libraries are the pieces of software that "understand" the respective binary formats. The way of publicizing this software is still experimental. We are open to suggestions.
As to SDFP: we need to do a significant amount of work here, which will be undertaken soon. We do not aim at the level of documentation of TEI, but we will look at it for inspiration. As to the evangelist: it remains to be seen whether René or Dirk will take up that role.
Cooperate with format registries, in particular PRONOM and UDFR
Format registries are repositories of knowledge about file formats and their use and their recognition. Currently we have PRONOM and UDFR, in a state of increased interoperation, if not merging. It is logical that MIXED interacts with these registries:
- by submitting file types, such as the versions of DBF that MIXED has tackled
- by submitting requirements, use cases, specifications and wishes concerning the file type recognition process
- by setting up information harvesting from such registries, and maybe also vice versa: that the registries can inform themselves about the capabilities of MIXED
MIXED itself is a result of gap analysis: we felt a digital preservation need for our own archive, and could not find a satisfactory solution yet, so we started to develop a solution, in such a way that we expect it to be usable by others as well. But we should continue to look for gaps, maybe smaller gaps as well. Here are some tips we have got:
- find an interesting research project where re-use is facilitated by MIXED
- find a case where the use of MIXED provably reduces cost (either of preservation or of re-use)
- find a repository that needs to develop a format converter, negotiate a way of working together
Some participants reported contacts with software vendors about the concern for preservation of data created by means of their applications. Vendors are getting more open to this need. So it is worthwhile to engage in discussion with them.
It remains to be seen whether MIXED as a project, or DANS as a repository, or PLANETS as a community is best suited to undertake this.
The question "what would you do if ..."
We asked the participants the question: what would you do if you had some resources dedicated to MIXED, or if you were project leader of MIXED? Here is a collection of (partial) answers.
- Amir Bernstein (SIARD): we are interested in the spreadsheet part; we want stand-alone systems for production, not webservices; we have solutions for BLOBs in databases; we want to exchange XML schemas; also: provide an online service where people can test the MIXED converters
- Ellen Kraffmiller (Harvard, DataVerse): we are interested in the spreadsheet converters
- Rob Grim (University Tilburg): the forming of standards in data formats is most relevant for me, as witnessed by the SDMX approach
- Marc Kemps-Snijders (CLARIN): if in CLARIN there is a need for spreadsheet preservation (through its outreach to the humanities), MIXED could be positioned on the CLARIN infrastructure; it could then enhance the preservation discussion inside CLARIN
- Alison Heatherington (National Archives UK): the preservation of spreadsheets and databases is on the agenda of the NA; also engage in collaboration on the future requirements of UDFR
- Rainer Schmidt (PLANETS): put MIXED in touch with relevant parts of PLANETS
- Sebastian Rahtz (Oxford Text Archive): interested in conversions between OOXML and TEI; in the text world there are a lot of "evil" legacy formats; pin down what it means to build a converter (requirements); think about the precise open source licence under which you want to distribute your software
- Vladislav Makarenko (eSciDoc): MIXED could be "bundled" with eSciDoc as a standard conversion utility
- John Doove (SURF): SURF is an ideal platform for use cases to come up (gap analysis), see the Knowledge Exchange event in 2009 September 23-24
- Jeroen Rombouts (3TU): interested in building one converter or two; find out if there are viable intermediate formats in the AutoCAD world
About the launching conference
Near the end of MIXED we plan a launching event. Any ideas?
- do not let it be a funeral! (Amir Bernstein)
- position MIXED in the lifecycle of information (Rob Grim)
- present concrete cases (Jeroen Rombouts)
- possibly elements from Kees Mandemakers's project Life Courses in Context, which is DBF data
Submitted by dirk on Mon, 2009-09-28 09:10.