|
How the site is generated
Each of the 87 source charts is held separately. Every chart must have entries for "artist" and "title", usually has entries for "position" and "date", and may also contain extra information such as "duration", "written by", "web page" or "film". The 270,032 entries in these charts are consolidated to provide a complete set of attributes for each of the 143,418 items (112,554 songs and 30,864 albums).
The most difficult aspect of this task is matching names: they are often misspelt in the source charts, punctuation is usually inconsistent, and the list of "featured" artists is seldom in the same order. Programming a system to recognise that "Uncle Albert" by "Paul McCartney" and "Admiral Halsey" by "Wings" are actually the same song is not trivial. In general, the approach taken is to consolidate entries if they appear similar, on the basis that wrongly merging a few entries is usually better than missing genuine matches. For this reason there are quite a few places where strict accuracy has been sacrificed to bring things together; for example, all of Prince's songs are listed under the name Prince.
The next step is to generate a consolidated score from the chart entry information. There are a variety of ways this could have been done; in this case it was decided that the simplest approach was to generate a score from each entry and sum them. The individual entry scores clearly should depend on the position within the source chart, with a number one song getting more "points" than the second placed one, and so on. One reasonable way to generate a score is to use a power law. The score is set to XXX + YYY^position, and then each chart is weighted by a factor that depends on which chart was the source. The weighting can emphasise charts that are solidly based on sales, attempt to match music revenue in each country, or follow any number of other reasonable strategies.
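As a rough illustration of the scoring step described above, the sketch below sums weighted per-entry scores for each item. It assumes the per-entry score has the form xxx + yyy ** position with yyy between 0 and 1. The function names, parameter values, chart names and weights are all invented for the example and are not those used by the site.

```python
from collections import defaultdict


def entry_score(position, xxx=0.0, yyy=0.9):
    """Score for one chart entry; the default parameter values are hypothetical."""
    return xxx + yyy ** position


def consolidated_scores(entries, chart_weights, xxx=0.0, yyy=0.9):
    """Sum the weighted entry scores for each (artist, title) item.

    entries: iterable of (artist, title, chart, position) tuples.
    chart_weights: dict mapping a chart name to its weighting factor;
    charts missing from the dict are given a weight of 1.0 (an assumption).
    """
    totals = defaultdict(float)
    for artist, title, chart, position in entries:
        weight = chart_weights.get(chart, 1.0)
        totals[(artist, title)] += weight * entry_score(position, xxx, yyy)
    return dict(totals)


# Made-up entries and weights, purely for illustration.
entries = [
    ("Artist A", "Song One", "chart_x", 1),
    ("Artist A", "Song One", "chart_y", 3),
    ("Artist B", "Song Two", "chart_x", 2),
]
weights = {"chart_x": 2.0, "chart_y": 1.0}
scores = consolidated_scores(entries, weights, xxx=0.5, yyy=0.5)
```

Changing xxx, yyy or the weights reorders the totals, which is why the choice of parameters matters so much for the final chart.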
Using a set of apparently reasonable variations of the weighting and of the XXX and YYY values generates the following top 10 lists. As the wide range of different results shows, the parameter values have a big influence on the resulting chart. With this approach, suitable parameters have to be picked to generate a "medium" chart.
Some reviewers of music charts have claimed that Zipf's distribution is a better fit to music charts than a power law; that is, the second placed song has half the sales of the first, the third has a third, and so on. This suggests a different, simpler scoring algorithm: if each entry's score is 1 + 1/position, then each song gets credit for having an entry and the number one song gets the most points. The simplest chart weighting is to give equal value to every chart. When these simple parameters are used, the result is towards the middle of the range demonstrated above. This is the scoring algorithm that has been used here.
Song Years
Working out which year to assign an entry to is also surprisingly hard. The year of each song is deduced directly from the chart entries, rather than relying on an unreliable external source. The year is extracted from the date in each of the song's chart entries, and the song's year is set to the median of these values. This usually generates a reasonable estimate of the year.
Once the individual song scores have been calculated, they are processed to generate the various web pages and the links between them. These are all static pages, to reduce both the load on the underpowered web server and the security risks. |
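The scoring rule and the year estimate described above can be sketched as follows. This is a minimal example rather than the site's actual code: the function names and sample entries are invented, every chart is given equal weight, and the lower of the two middle years is used when a song has an even number of entries (the text does not say how such ties are resolved).

```python
from collections import defaultdict
from statistics import median_low


def zipf_entry_score(position):
    """Score for one chart entry: 1 for appearing at all, plus 1/position."""
    return 1 + 1 / position


def score_and_year(entries):
    """Return {song: (total_score, year)} from (song, position, year) tuples.

    Every chart carries equal weight, so the total is a plain sum.
    The year is the median of the years of the song's entries;
    median_low is an assumed tie-breaking choice for even counts.
    """
    totals = defaultdict(float)
    years = defaultdict(list)
    for song, position, year in entries:
        totals[song] += zipf_entry_score(position)
        years[song].append(year)
    return {song: (totals[song], median_low(years[song])) for song in totals}


# Made-up entries, purely for illustration.
entries = [
    ("song_a", 1, 1971),
    ("song_a", 2, 1971),
    ("song_a", 4, 1972),
    ("song_b", 2, 1985),
]
results = score_and_year(entries)
```

Taking the median rather than the earliest or latest date means a single stray entry, such as a later re-release or an all-time list with an odd date, has little effect on the year assigned.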