Social science data: Harry Ganzeboom

The project

name: Social science data

personal website: http://home.fsw.vu.nl/HBG.Ganzeboom/

contact: Harry Ganzeboom, Department of Social Research Methodology

website institute: http://www.fsw.vu.nl/Organisatie/index.cfm/home_subsection.cfm/subsectionid/CC38F83E-3949-431C-9A37C8AACF1788A6

The interview

Dirk Roorda & Marjan Balkestein on 2007-04-04 at the Vrije Universiteit Amsterdam with Harry Ganzeboom.

The questions

- Tell us about your data

- - what is the purpose of the data?
- - who are the producers?
- - who are the users?
- - what is the format and overall size of your datasets?
- - what is the rate of addition/mutation/deletion of your data?

- Tell us about your organisation

- - how are you organised?
- - - is there an internal IT department in charge of hardware/software?

- - - do you have FT employees specifically in charge of maintaining an archive?
- - what are your parent-, child- and sibling institutes?
- - how are your datasets managed?

- Let us talk about digital preservation!

- - what is your perception of digital preservation?
- - which measures has your organisation taken to preserve data?
- - how can DANS>MIXED help to preserve your data?
- - what peculiarities does your data exhibit, either in content, form or handling?

 


Observed problems in data

From Harry

Observed problems in data preservation: data on punch cards that was still worthwhile was never migrated to newer systems and is now lost. Catchwords: column binary, Osiris.

Harry's principal question: what is the added value of XML?

Main wish: the ability to query multiple (SPSS) datasets for the names of variables. More generally: he wants to find the datasets that interest him, and whether a dataset is interesting can be deduced from its metadata (the model, the code book). It is cumbersome to have to open each dataset in order to look for its variables.
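A minimal sketch of such a cross-dataset variable search, in Python. It assumes that variable names and labels have already been extracted from each dataset's code book into plain dictionaries; the dataset names, the variables and the helper `find_variables` are hypothetical and only illustrate the idea.

```python
from fnmatch import fnmatch

# Hypothetical code books: dataset name -> {variable name: variable label}.
# In practice these would be extracted once from the SPSS files or code books.
CODEBOOKS = {
    "family_survey_1992": {
        "educ": "Highest education completed",
        "isco88": "Occupation (ISCO-88)",
    },
    "family_survey_2000": {
        "educyrs": "Years of education",
        "income": "Net household income",
    },
    "panel_wave_3": {
        "occ": "Current occupation",
        "age": "Age of respondent",
    },
}


def find_variables(codebooks, pattern):
    """Return sorted (dataset, variable, label) triples that match `pattern`.

    `pattern` is a shell-style wildcard such as "*educ*"; it is compared
    case-insensitively against both the variable name and its label.
    """
    pattern = pattern.lower()
    hits = []
    for dataset, variables in codebooks.items():
        for name, label in variables.items():
            if fnmatch(name.lower(), pattern) or fnmatch(label.lower(), pattern):
                hits.append((dataset, name, label))
    return sorted(hits)


if __name__ == "__main__":
    # List every dataset that has a variable about education,
    # without opening the data files themselves.
    for dataset, name, label in find_variables(CODEBOOKS, "*educ*"):
        print(f"{dataset}: {name} ({label})")
```

The point of the sketch is that the search runs over metadata only, so none of the datasets has to be opened to find out what it contains.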

Suggestion: look at the harmonisation project (NWO middelgroot) Tilburg - Nijmegen - Twente (Paul de Groot, Ruud Luycks, Jack Hagenaars), where a harmonised code book is being distilled for five family questionnaires. Would this work profit from the MIXED approach of migrating data into DDI?

Question: what do big repositories of social science data do? What strategy do they have to ensure digital preservation?

For Harry, data means SPSS-like data.

SPSS is a de facto standard. When asked how strong the usage of SPSS is at the moment, he observed that it might not have eternal life. First of all, the focus of SPSS itself no longer seems to be the social sciences, but commercial enterprises. There is competition, such as SAS. There are software packages that do better and more advanced statistical analyses. Decoupling data from SPSS is certainly a worthwhile enterprise.

Privacy issue: he tries to separate numbers and short strings from long strings, because the long strings can betray the identity of the participants in questionnaires. This separation is cumbersome to do. After half a century the identifying information could prove a valuable asset. The same holds for the address details of panel members. Yet when data is not used, it is hard to stay motivated to keep it up to date, correct and tidy.
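The separation described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the records, the field names, the helper `split_long_strings` and the threshold of 8 characters for a "short" string are all hypothetical and not taken from Harry's actual procedure.

```python
# Hypothetical questionnaire records; names and addresses are made up.
RECORDS = [
    {"respondent_id": 1, "age": 34, "sex": "f",
     "name": "A. Example-Jansen", "address": "Hypothetical Street 1"},
    {"respondent_id": 2, "age": 51, "sex": "m",
     "name": "B. Doe", "address": "Sample Lane 22"},
]


def split_long_strings(records, key, max_short=8):
    """Split records into an analysis part and an identifying part.

    A field counts as identifying if ANY record holds a string longer than
    `max_short` in it (an assumed threshold). Both parts keep `key`, so they
    can be linked again later by whoever is allowed to do so.
    """
    long_fields = {
        field
        for record in records
        for field, value in record.items()
        if field != key and isinstance(value, str) and len(value) > max_short
    }
    analysis = [
        {f: v for f, v in record.items() if f not in long_fields}
        for record in records
    ]
    identifying = [
        {f: v for f, v in record.items() if f == key or f in long_fields}
        for record in records
    ]
    return analysis, identifying


if __name__ == "__main__":
    analysis, identifying = split_long_strings(RECORDS, key="respondent_id")
    print(analysis)      # numbers and short strings only
    print(identifying)   # key plus the long, potentially identifying strings
```

Deciding per field rather than per value keeps the two resulting tables rectangular: the short name "B. Doe" still ends up in the identifying part, because the same field holds a long string elsewhere.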

We are allowed to use Harry's data in Steinmetz for experimentation in the MIXED project.

From Dirk

XML is a format that sits between SGML and HTML: simpler than SGML, more general than HTML. It is ideally suited as wrapping paper for data. It makes explicit the granularity of the data and the relationships between chunks of data. It is not only document-oriented but also database-oriented: document type definition versus schema.

XML is better than plain ASCII, partly because it supports Unicode.

XML is better than comma separated files, because there you always have problems with the field delimiters and the quotation characters.
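The delimiter problem can be illustrated with a small Python sketch. The record, its values and the element names `record`, `id` and `answer` are made up for the illustration; the comparison is with a naively written comma separated line, not with a full CSV library.

```python
import xml.etree.ElementTree as ET

# A free-text answer that contains a comma, quotation marks and an ampersand.
answer = 'He said "no, thanks" & left'
row = ["17", answer]

# Naive comma separated line: join on commas and split again.
# The comma inside the answer cannot be told apart from a field delimiter,
# so two fields go in and three come out.
naive_line = ",".join(row)
naive_fields = naive_line.split(",")

# XML: every value sits in its own element, and the serializer escapes
# the characters that have a special meaning in markup (here the "&").
record = ET.Element("record")
ET.SubElement(record, "id").text = row[0]
ET.SubElement(record, "answer").text = row[1]
xml_text = ET.tostring(record, encoding="unicode")

# Parsing the XML again returns exactly the original values.
xml_fields = [child.text for child in ET.fromstring(xml_text)]

if __name__ == "__main__":
    print(naive_fields)
    print(xml_text)
    print(xml_fields == row)
```

A proper CSV library can handle such values by quoting them, but then writer and reader have to agree on the delimiter and quoting conventions, which the file itself does not record. In XML the field boundaries are part of the format.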

Migrating files to XML makes their structure plain and decouples it from hideous application details. This XML structure enables better searching.

But: MIXED only lays the foundations and does the groundwork. It does not work on canonical semantic interpretations of specific domains. It picks the best XML standards for the domains in question and uses them as archival formats. So your mileage may vary, depending on the inherent quality of the XML domain standard in question.