Web Crawler


It is often useful to process a number of web pages, for example if a site has pages that list the information you need for a classification scheme. Access to this data is usually automated by writing a "web crawler" (or "spider") that reads the pages for you.

The first thing to check is that the site does not already have the list you need in a more convenient form. For example, the music data at this site could be obtained by reading every page and extracting the entries, but it is far easier to download the CSV that is already available.

If there is no nicely formatted source then the next thing to check is that the site does not object to your using its data. This is usually signalled in the "robots.txt" file to be found at the site's top level.
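The rules in robots.txt can be checked programmatically; Python's standard library includes a parser for them. The policy and URLs below are made-up examples:

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt policy: everything except /private/ may be crawled.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Against a live site you would instead point the parser at the file itself:
#   rp.set_url("https://example.com/robots.txt"); rp.read()
print(rp.can_fetch("MyCrawler", "https://example.com/music/index.html"))   # True
print(rp.can_fetch("MyCrawler", "https://example.com/private/data.html"))  # False
```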

Assuming there is no easier source and the site does not object to crawlers, there are three distinct processes that need to be implemented:

  • Obtain the pages containing the data
  • Extract the key information from the pages
  • Transform the data into the shape you require
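The first of these might be sketched as follows (the author works in Perl; this is an illustrative Python version, and the URL list and filenames are hypothetical). Passing the fetch routine in as a parameter keeps the download logic testable without a network connection:

```python
import os
import urllib.request

def download_pages(urls, directory, fetch=None):
    """Step 1: fetch each URL and save the raw HTML into a local directory."""
    if fetch is None:
        # Default fetcher; be polite and identify your crawler.
        def fetch(url):
            req = urllib.request.Request(url, headers={"User-Agent": "MyCrawler"})
            with urllib.request.urlopen(req) as response:
                return response.read()
    os.makedirs(directory, exist_ok=True)
    saved = []
    for i, url in enumerate(urls):
        path = os.path.join(directory, "page%04d.html" % i)
        with open(path, "wb") as f:
            f.write(fetch(url))
        saved.append(path)
    return saved
```

Saving the raw pages first means the slow, failure-prone network step never has to be repeated while you debug the extraction.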

It is usually best to perform these three as distinct steps, rather than attempting to build them into a single process. For example, first download the HTML pages and save them into a local directory, then have a second step that processes each page in turn, extracting the required information into a local CSV file.
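The extraction step can then work entirely from the saved files. A minimal Python sketch using the standard-library HTML parser follows; the `<li class="entry">` markup is an assumption, and real pages will need their own matching rules:

```python
import csv
from html.parser import HTMLParser

class EntryExtractor(HTMLParser):
    """Collect the text of every <li class="entry"> element (assumed markup)."""
    def __init__(self):
        super().__init__()
        self.entries = []
        self._in_entry = False

    def handle_starttag(self, tag, attrs):
        if tag == "li" and dict(attrs).get("class") == "entry":
            self._in_entry = True

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_entry = False

    def handle_data(self, data):
        if self._in_entry and data.strip():
            self.entries.append(data.strip())

def pages_to_csv(html_files, csv_path):
    """Step 2: read each saved page and write the extracted entries to a CSV."""
    with open(csv_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["source", "entry"])
        for path in html_files:
            parser = EntryExtractor()
            with open(path, encoding="utf-8") as f:
                parser.feed(f.read())
            for entry in parser.entries:
                writer.writerow([path, entry])
```

Writing one CSV row per extracted entry, tagged with its source file, leaves the final reshaping of the data as a simple third step over the CSV.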


The book "Spidering Hacks" by Kevin Hemenway et al provides a description of the main tricks and techniques for extracting information automatically from web pages.

Personally I use the Perl language to perform all three of the tasks required.
