
Primary Data Types and Data Normalisation

The Problem

DataPerfect databases have a non-standard way of typing fields. Instead of types such as integer, float, varchar and datetime, it has masks that act as patterns for the character strings that make up the field contents. There is no easy mapping between these masks and SQL database types. Even worse: even if you could define such a mapping, data loss might ensue. For example, say a DataPerfect mask states that a column named telephone is a field consisting of 10 digits. By any logical mapping from masks to types, this mask should convert to a numerical SQL database type. But leading zeroes are not significant for integers, may even be forbidden in other databases, or may lead to octal interpretations in other systems, so leading zeroes must be stripped from numbers. If the field telephone is used for telephone numbers, stripping zeroes leads to incorrect information.
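
To make the risk concrete, here is a minimal Python sketch, with a made-up telephone value, of what a naive mask-to-integer mapping would do:

    # A hypothetical 10-digit DataPerfect mask, mapped naively to an SQL integer.
    telephone = "0612345678"            # value as stored in the source database

    as_integer = int(telephone)         # the "logical" type mapping strips the leading zero
    print(as_integer)                   # 612345678 -- no longer a valid telephone number

    # Round-tripping does not restore the original representation:
    print(str(as_integer) == telephone) # False: the semantics of the value are lost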

Stated more generally, we face a twofold problem:

  1. what do we do when the datatypes of a certain database implementation do not map straightforwardly onto the SQL datatypes?
  2. regardless of whether we find a good type mapping, is it wise to normalise the values found in the fields according to the target types?

Approaches

Two approaches to this problem will be discussed: the normalisation approach and the conservation approach.

The normalisation approach

This approach tries to normalise data whenever possible. The principle is: it is good to change the syntax of values as long as the semantics stay the same. If a database has a very idiosyncratic way of dealing with numbers, we map those numbers to integers in a normal representation compatible with SQL. Another example: if a text field contains the string "3<4", we translate it into SDFP as "3&lt;4" or <![CDATA[3<4]]>, representations that yield "3<4" under every XML parse.
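
A small Python illustration of this principle, using the standard library escape functions: the escaping changes only the syntax, and the original string is recovered on parsing.

    from xml.sax.saxutils import escape, unescape

    value = "3<4"
    escaped = escape(value)             # "3&lt;4" -- a syntactic change only
    print(escaped)

    # Any conforming XML processor recovers the original string:
    print(unescape(escaped) == value)   # True: the semantics are preserved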

When the datatypes in the source database do not fit the datatypes in SQL exactly, you take the closest match and convert the values found into canonical representations for the new datatype.

The conservation approach

This approach does not normalise data at all. The sole exception is that if data has to be packaged into an XML file (e.g. an SDFP file), the necessary escapes are applied in order to obtain well-formed XML. This is not really a normalisation, since every XML parse yields the original string back.

So, in this approach, the data itself does not change. What does this approach do with the differences between the source datatypes and the SQL datatypes? There is a spectrum of options. A trivial option is to convert every source datatype to varchar of unspecified length; this type accepts all values from the source unchanged. At the other end of the spectrum we can map each source datatype to its closest SQL counterpart, as long as all source values are valid values of the new type without changing their representations.
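
The following Python sketch illustrates both ends of this spectrum; the helper function and the patterns it checks are assumptions for the sake of the example, not MIXED code.

    import re

    def conservative_sql_type(values):
        """Pick the closest SQL type that accepts every value unchanged;
        fall back to varchar otherwise. (Hypothetical helper, not MIXED code.)"""
        if all(re.fullmatch(r"0|[1-9][0-9]*", v) for v in values):
            return "integer"            # canonical integer literals, no leading zeroes
        if all(re.fullmatch(r"\d{4}-\d{2}-\d{2}", v) for v in values):
            return "date"               # already in ISO 8601 form
        return "varchar"                # trivial option: accepts every value as-is

    print(conservative_sql_type(["7", "42"]))           # integer
    print(conservative_sql_type(["0612345678", "42"]))  # varchar: the leading zero must survive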

Pros and Cons of the approaches

One of the expectations around MIXED has always been that if you put your datasets through MIXED, you can aggregate them easily. In order to make this expectation come true, data normalisation is the least you can do. If you want to aggregate two series of events coming from two different databases, each having a field that indicates the time of occurrence, and if you want to sort the aggregated table on this field, it helps that all values use the same date-time representation.
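
A small Python sketch with made-up event dates shows why a shared representation matters for sorting an aggregated table:

    # Two event series, each with its own (made-up) date representation.
    series_a = ["03-01-2007", "15-11-2006"]   # day-month-year strings
    series_b = ["2006-12-01", "2007-02-20"]   # ISO 8601 strings

    # Sorting the raw aggregate compares strings, not moments in time:
    print(sorted(series_a + series_b))
    # ['03-01-2007', '15-11-2006', '2006-12-01', '2007-02-20'] -- chronologically wrong

    # After normalising everything to ISO 8601, lexicographic order is chronological:
    print(sorted(["2007-01-03", "2006-11-15"] + series_b))
    # ['2006-11-15', '2006-12-01', '2007-01-03', '2007-02-20']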

Here we deal a plus to the normalisation approach and a minus to the conservation approach.

Another expectation around MIXED has always been that if you put your datasets through MIXED, you can interpret the data forever, regardless of the software applications of the moment, and that this interpretation is fixed: interpreting the data now yields the same values as interpreting it five centuries later. Here the normalisation approach carries a definite risk of misinterpretation, as shown above. The conservation approach realises this objective of MIXED, provided the documentation of how the values are to be interpreted in the source database is preserved as well.

Here we deal a minus to the normalisation approach and a plus to the conservation approach.

Weighing the pros and cons

Correct interpretation of values is more basic than the potential for aggregation. We cannot accept a solution in which the preservation of semantics is not guaranteed. This points to the conservation approach, and indeed this is what we decide: we convert according to the conservation approach. But wait a moment: is this the best we can do? Is it not possible to get the best of both worlds? Yes, it is, and the method is conservation plus controlled normalisation.

The solution

Conservation first

Each MIXED conversion plugin from a custom database format to SDFP will basically follow the conservation approach. That means that the representations of the values are left untouched. As a first step towards normalisation, a plugin is encouraged to map the custom datatypes of the source to close SQL analogues as much as possible. The limiting factor is that the values of the source type must fit into the target type without syntactical modification. Several additional measures will be taken in order to make the semantics of the values explicit:

  1. the name of the original datatype will be copied to the datamodel in SDFP; it will be stored in the validationRule element (see the sketch after this list)
  2. the documentation of the original datatypes will be made accessible from the MIXED website, and we recommend that any archive using MIXED stores that documentation in its own repository
  3. the provenance data will indicate that the values have been conserved by the conversion; this will be done in a special provenance field, to be specified
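
As an illustration of the first measure, the following Python sketch attaches a hypothetical DataPerfect mask name to a conserved column. Apart from validationRule, the element and attribute names are assumptions, not the actual SDFP schema, and the provenance marking is omitted here because that field is still to be specified.

    import xml.etree.ElementTree as ET

    # Hypothetical sketch: a conserved column typed as varchar, carrying the name
    # of its original DataPerfect mask in a validationRule element.
    column = ET.Element("column", name="telephone", type="varchar")
    ET.SubElement(column, "validationRule").text = "N10"   # assumed mask name

    print(ET.tostring(column, encoding="unicode"))
    # <column name="telephone" type="varchar"><validationRule>N10</validationRule></column>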

And then controlled normalisation

One level of normalisation may already have been achieved by a plugin through careful mapping of source types to target types, but the mileage varies depending on the nature of the source types. In the DataPerfect case, we cannot do much more than map every type onto varchar.

Now the conservation approach has produced something valuable: an XML (SDFP) file with the original values readily at hand, plus the information about what the original datatypes are. This suggests that we make normalisation conversions, starting from SDFP and generating SDFP, which map source types more closely to destination types, adapt values to their new types, and run the risk of misinterpretation. These normalisation conversions will modify the part of the provenance data that says that the values have been conserved; instead, the provenance data will now state that the data has been normalised.

This normalisation step is not a conversion plugin itself, because conversion plugins go from source databases to SDFP, or the other way round. Moreover, normalisation is not really bound to specific source databases. It is likely that a normalisation of numbers that works for DataPerfect also works for MSAccess. It is even better to perform all possible number normalisations in one step, number conversion, which accumulates all intelligence in dealing with numbers; and likewise for datetimes and whatever other type is relevant.
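
A sketch of what such per-datatype normalisation steps could look like; the function names, type names and assumed source formats are illustrative, not the MIXED API:

    # Normalisation organised per datatype rather than per source database.
    def normalise_number(value):
        """Strip leading zeroes, yielding a canonical integer representation."""
        return str(int(value))

    def normalise_datetime(value):
        """Rewrite day-month-year strings as ISO 8601 (assumed source format)."""
        day, month, year = value.split("-")
        return f"{year}-{month}-{day}"

    # One registry of normalisers, reusable for DataPerfect, MSAccess and others:
    NORMALISERS = {"number": normalise_number, "datetime": normalise_datetime}

    def normalise_field(datatype, value):
        return NORMALISERS.get(datatype, lambda v: v)(value)

    print(normalise_field("number", "007"))           # 7
    print(normalise_field("datetime", "15-11-2006"))  # 2006-11-15
    print(normalise_field("text", "3<4"))             # unchanged: no normaliser registered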

Using normalisation as a building block, it is possible to create several versions of conversion plugins: a bare version without normalisation, and dressed-up versions with a number of normalisation steps added. Every user of MIXED may decide whether to use the bare version, a dressed-up version, or both. The safest practice for an archive is to store the original, plus the result of a bare conversion, plus the result of a dressed-up conversion.
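
A minimal sketch of this composition, with assumed function names and a made-up record:

    def bare_conversion(record):
        """Conservation only: representations are left untouched."""
        return dict(record)

    def dressed_up_conversion(record, steps):
        """Bare conversion followed by a chosen list of normalisation steps."""
        result = bare_conversion(record)
        for step in steps:
            result = step(result)
        return result

    def strip_leading_zeroes(record):
        return {**record, "amount": str(int(record["amount"]))}

    record = {"telephone": "0612345678", "amount": "007"}
    print(bare_conversion(record))                               # representations conserved
    print(dressed_up_conversion(record, [strip_leading_zeroes])) # amount normalised, telephone conserved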

Additional remarks

It is a good thing to separate the concerns of conservation and those of normalisation. Conservation is about freeing data from the vendor formats it is encapsulated in; this effort divides naturally into conversion plugins per database format. Normalisation is about changing forms while preserving semantics, which divides naturally into conversions per datatype. There are more reasons why the separation is beneficial.

MIXED intends to assure the quality of its conversion plugins. This is not trivial, but it is a lot easier to check the correctness of a plugin that confines itself to conservation than to argue for the correctness of a "smart", normalising plugin. We can now build strict tests to convince suspicious users that the plugin conserves information. Once convinced of this property, the user might accept the weaker tests concerning the normalisation steps. And if he is not convinced of the correctness of a normalisation, he is in a good position to come up with a better normalisation, or not to use normalisation at all.

The project initiation document does not foresee this separation of concerns. It focuses on freeing data from vendor formats in order to make it reusable. It does not prescribe normalisations on the datatypes. Yet, for the sake of reusability, we must normalise to a degree that is reasonable, considering the source data and the amount of effort required to program these normalisations.