Data Migration

Search this site with Google

The process of moving data from its current location, typically requiring transformation to allow it to be used in the target system. This is a very common requirement during Information Architecture projects, for example defining Classification schemes often requires that an existing list is adjusted so it can be utilised.

Frequency

The majority of data migrations are "one-off". That is the data is transformed and used. This allows the implementor to take a number of short-cuts that would not be acceptable in a more systematic migration. Even in this situation, however, it is often valuable to be able to relate the final values to the initial ones. This is because of the two different sets of corrections that may arise in future, corrections to the migrated data help to suggest ways that the source can be upgraded and subsequent corrections to the source data may help improve the migrated information. Keeping track of "audit" values so the two data sources can be related to each other will often make this process simpler.

The other extreme occurs when two sets of data are to be synchronised. For example if a PDA and laptop both contain lists of contacts, and both potentially can read and correct the entries. Implementing such systems can be complex and time consuming.

There are many situations which fall between these two extremes. The crucial element is to be clear about which repositories can be updated and by whom. Once this has been clearly defined the data flows necessary to enable the system to work can be quickly identified.

Transformations

Six layers of data format

All data is held in its own data form. Any data migration must transform the data at six different levels:

  • Data Convention: The conventions applied to data within a given location
  • Reference Integrity: Ensuring that certain attribute of the data can only have a restricted set of values (for example storing the identifier of another item)
  • Data Structure: A data model formally defines the data elements and the relationships between them for a domain of interest A data model describes how data is represented and accessed.
  • Format: A format defines how entities and attributes are to be described.
  • Encoding: A code that pairs a sequence of characters from a given character set with something else, such as a sequence of numbers
  • Transport: A description of how data is to be moved or held. The lowest level of a data form. Since standard approaches to file structure and communication protocols were adopted in the 1980s and 1990s there has been little need to discuss transport except in relation to physical media

To pick an example let us suppose we have a set of weekly music charts on a web site and we need a list of the artists mentioned and a count of the number of weeks that each had in the charts.

The source data is accessed via HTTP as a set of pages one for each week. We need to convert it so that it can be inserted into a relational data model. So we have to convert at each of the six levels, it is usual to start at the lowest level one:

  • Transport: The source is available from HTTP requests, so we have to find some way to spider the data
  • Encoding: Most web pages use an English based encoding, however we may need to map characters in some artist's names such as Björk
  • Format: The source data is formated as HTML, which is not an easy format to parse. A mechanism must be devised to identify where on the page the required data is to be found and extract it
  • Data Structure: Once the data is extracted the structure must be changed, entries for the same artist must be consolidated to define the number of weeks that each song is present
  • Reference Integrity: It is usually a good practice to check that the artists identified match a defined list, this not only identifies spelling errors (such as "The Beetles") but also ensures consistency in naming (for example how is the artist "Prince" best dealt with)
  • Data Convention: Even once all the above issues have been dealt with there are conventions that need to be addressed. For example how are collaberations to be counted, what about songs that have live versions which were also hits and so on

These transformations are complicated enough even if we are dealing with a "one-off" migration. If there are elements of synchronisation they have to be performed in a systematic way.

Tools

There are a wide variety of tools that can help with data migrations, for example:

  • Excel: Many simple data transformations can be performed by a good spreadsheet. Unfortunately they tend to modify values, for example "1-2-3" may be changed to "01/02/2003"
  • Perl: A knowledge of a good scripting language such as Perl, Python or Ruby is a major help
  • PL-SQL: Many databases have associated languages that can perform transformations on data
  • EMACS: A good program editor such as Emacs or Vi can perform a variety of transformation tasks

Links to this page

There are no links to this page


Comment on the contents of the 'Data Migration' page
Subject: Email to Reply To (optional):