|
The process of moving data from its current location,
typically requiring transformation to allow it to be used in
the target system. This is a very common requirement during
Information Architecture projects, for example defining
Classification schemes often requires that an existing list
is adjusted so it can be utilised.
Frequency
The majority of data migrations are "one-off". That is the data is
transformed and used. This allows the implementor to take a
number of short-cuts that would not be acceptable in a more systematic
migration. Even in this situation, however, it is often valuable to
be able to relate the final values to the initial ones. This is
because of the two different sets of corrections that may arise in
future, corrections to the migrated data help to suggest ways that
the source can be upgraded and subsequent corrections to the source
data may help improve the migrated information. Keeping track of
"audit" values so the two data sources can be related to each other
will often make this process simpler.
The other extreme occurs when two sets of data are to be synchronised.
For example if a PDA and laptop both contain lists of contacts,
and both potentially can read and correct the entries. Implementing
such systems can be complex and time consuming.
There are many situations which fall between these two extremes. The
crucial element is to be clear about which repositories
can be updated and by whom. Once this has been clearly defined the
data flows necessary to enable the system to work can be quickly identified.
Transformations

All data is held in its own data form. Any data migration must
transform the data at six different levels:
- Data Convention: The conventions applied to data within a given location
- Reference Integrity: Ensuring that certain attribute of the data can only
have a restricted set of values (for example storing the identifier
of another item)
- Data Structure: A data model formally defines the data elements and the
relationships between them for a domain of interest
A data model describes how data is represented and accessed.
- Format: A format defines how entities and
attributes are to be described.
- Encoding: A code that pairs a sequence of
characters from a given character set with something else,
such as a sequence of numbers
- Transport: A description of how data is to be moved or held. The
lowest level of a data form. Since standard approaches to file
structure and communication protocols were adopted in the 1980s and
1990s there has been little need to discuss transport except in
relation to physical media
To pick an example let us suppose we have a set of weekly music
charts on a web site and we need a list of the artists mentioned
and a count of the number of weeks that each had in the charts.
The source data is accessed via HTTP as a set of pages
one for each week. We need to convert it so that it can be inserted
into a relational data model. So we have to convert at each of the
six levels, it is usual to start at the lowest level one:
- Transport: The source is available from HTTP requests, so
we have to find some way to spider the data
- Encoding: Most web pages use an English based encoding,
however we may need to map characters in some artist's names such
as Björk
- Format: The source data is formated as HTML, which is not
an easy format to parse. A mechanism must be devised to identify
where on the page the required data is to be found and extract it
- Data Structure: Once the data is extracted the
structure must be changed, entries for the same artist must be
consolidated to define the number of weeks that each song is
present
- Reference Integrity: It is usually a good practice to
check that the artists identified match a defined list, this
not only identifies spelling errors (such as "The Beetles") but
also ensures consistency in naming (for example how is the
artist "Prince" best dealt with)
- Data Convention: Even once all the above issues have been
dealt with there are conventions that need to be addressed. For
example how are collaberations to be counted, what about songs that
have live versions which were also hits and so on
These transformations are complicated enough even if we are dealing with
a "one-off" migration. If there are elements of synchronisation they
have to be performed in a systematic way.
Tools
There are a wide variety of tools that can help with data migrations,
for example:
- Excel: Many simple data transformations can be performed
by a good spreadsheet. Unfortunately they tend to modify values,
for example "1-2-3" may be changed to "01/02/2003"
- Perl: A knowledge of a good scripting language such as Perl,
Python or Ruby is a major help
- PL-SQL: Many databases have associated languages that can perform
transformations on data
- EMACS: A good program editor such as Emacs or Vi can perform
a variety of transformation tasks
Links to this page
There are no links to this page
|