| The process of moving data from its current location, 
typically requiring transformation to allow it to be used in 
the target system.  This is a very common requirement during 
Information Architecture projects, for example defining 
Classification schemes often requires that an existing list 
is adjusted so it can be utilised. FrequencyThe majority of data migrations are "one-off".  That is the data is 
transformed and used.  This allows the implementor to take a 
number of short-cuts that would not be acceptable in a more systematic 
migration.  Even in this situation, however, it is often valuable to 
be able to relate the final values to the initial ones.  This is 
because of the two different sets of corrections that may arise in 
future, corrections to the migrated data help to suggest ways that 
the source can be upgraded and subsequent corrections to the source 
data may help improve the migrated information.  Keeping track of 
"audit" values so the two data sources can be related to each other 
will often make this process simpler. The other extreme occurs when two sets of data are to be synchronised.
For example if a PDA and laptop both contain lists of contacts, 
and both potentially can read and correct the entries.  Implementing 
such systems can be complex and time consuming. There are many situations which fall between these two extremes.  The 
crucial element is to be clear about which repositories
can be updated and by whom.  Once this has been clearly defined the 
data flows necessary to enable the system to work can be quickly identified. Transformations
 All data is held in its own data form.  Any data migration must 
transform the data at six different levels: 
    Data Convention: The conventions applied to data within a given locationReference Integrity: Ensuring that certain attribute of the data can only 
have a restricted set of values (for example storing the identifier 
of another item) Data Structure: A data model formally defines the data elements and the 
relationships between them for a domain of interest
A data model describes how data is represented and accessed.  Format: A format defines how entities and 
attributes are to be described.  Encoding: A code that pairs a sequence of
characters from a given character set with something else, 
such as a sequence of numbersTransport: A description of how data is to be moved or held.  The 
lowest level of a data form. Since standard approaches to file 
structure and communication protocols were adopted in the 1980s and 
1990s there has been little need to discuss transport except in 
relation to physical media To pick an example let us suppose we have a set of weekly music 
charts on a web site and we need a list of the artists mentioned 
and a count of the number of weeks that each had in the charts. The source data is accessed via HTTP as a set of pages 
one for each week.  We need to convert it so that it can be inserted 
into a relational data model.  So we have to convert at each of the 
six levels, it is usual to start at the lowest level one: 
    Transport: The source is available from HTTP requests, so 
we have to find some way to spider the dataEncoding: Most web pages use an English based encoding, 
however we may need to map characters in some artist's names such 
as BjörkFormat: The source data is formated as HTML, which is not 
an easy format to parse.  A mechanism must be devised to identify
where on the page the required data is to be found and extract itData Structure: Once the data is extracted the 
structure must be changed, entries for the same artist must be 
consolidated to define the number of weeks that each song is 
presentReference Integrity: It is usually a good practice to 
check that the artists identified match a defined list, this 
not only identifies spelling errors (such as "The Beetles") but 
also ensures consistency in naming (for example how is the 
artist "Prince" best dealt with)Data Convention: Even once all the above issues have been 
dealt with there are conventions that need to be addressed.  For 
example how are collaberations to be counted, what about songs that 
have live versions which were also hits and so on These transformations are complicated enough even if we are dealing with 
a "one-off" migration.  If there are elements of synchronisation they 
have to be performed in a systematic way. ToolsThere are a wide variety of tools that can help with data migrations, 
for example: 
    Excel: Many simple data transformations can be performed 
by a good spreadsheet.  Unfortunately they tend to modify values, 
for example "1-2-3" may be changed to "01/02/2003"Perl: A knowledge of a good scripting language such as Perl,
Python or Ruby is a major helpPL-SQL: Many databases have associated languages that can perform 
transformations on dataEMACS: A good program editor such as Emacs or Vi can perform 
a variety of transformation tasks Links to this pageThere are no links to this page 
 |