How is the site generated?

Each of the 141 source charts is held separately. Every chart has entries for "artist" and "title"; most also have entries for "position" and "date", and some contain extra information such as "duration", "written by", "web page" or "film". The 359,583 entries in these charts are consolidated to provide a complete set of attributes for each of the 180,199 items (130,506 songs and 49,693 albums).

The most difficult aspect of this task is matching names. They are often misspelt in the source charts, punctuation is usually inconsistent and the list of "featured" artists is always in a different order. Programming a system to recognise that "Uncle Albert" by "Paul McCartney" and "Admiral Halsey" by "Wings" are actually the same song is not trivial. In general the approach taken is to consolidate entries if they appear similar, since making too many false connections is usually better than missing genuine ones. For this reason there are quite a few places where entries have been changed to bring things together. For example, all of Prince's songs are listed under the name Prince rather than being split into Prince & the New Power Generation, Prince (symbol), Love Symbol, The Artist Formerly Known As... and so on.

Assigning a Score

When we originally started gathering chart information we didn't want to allocate an arbitrary score to each song. Our goal was to provide a list of the achievements of each record without imposing any kind of artificial order. We quickly found, however, that a rough indication of the importance of each item was an essential guide to presentation. So we discussed what approach we should use to assign a "score" to each item. We considered this question in three parts:
Score for each appearance

One obvious way to allocate a score would be to just count the number of chart appearances, but we felt that an entry at the number one slot is more notable than one lower down the chart. An alternative that we have seen used is to give 1 point for a 99th position, 2 for 98th and so on, up to 98 points for a 2nd and 99 for a number one. Again that doesn't feel right: surely a number one record is significantly more notable than a number 2, and what about charts with 200 entries?

So an interesting question is: what is the basis of the score, and what are we trying to estimate? If the score reflects a rough estimate of the "notability" of the song (a combination of sales, airplay and mindspace), then we should adopt an approach that models, for example, sales figures.
Two different curves have been suggested as good descriptions of sales. One is an "exponential decay", which suggests that the Nth best selling item sells Y**N times as many as the top one, where Y is a number between 0 and 1 and ** is the "power of" operator. The other is Zipf's law, which says that the 2nd best seller sells half as many as the top one, the 3rd one third, and so on. This suggests two different ways to score a chart entry:
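A minimal sketch of the two curves just described, showing the relative sales each one predicts for a given chart position. The function names and the sample value Y = 0.9 are assumptions made for the example. The exponential curve is written relative to position 1, so its exponent is N - 1; that normalisation is also an assumption.

```python
def exponential_share(position: int, y: float) -> float:
    """Sales at `position` relative to position 1 under exponential decay."""
    if position < 1 or not 0 < y < 1:
        raise ValueError("position must be >= 1 and y must lie between 0 and 1")
    # Exponent N - 1 so that position 1 maps to exactly 1 (assumed normalisation).
    return y ** (position - 1)


def zipf_share(position: int) -> float:
    """Sales at `position` relative to position 1 under Zipf's law."""
    if position < 1:
        raise ValueError("position must be >= 1")
    return 1 / position


if __name__ == "__main__":
    # Y = 0.9 is an arbitrary sample value, not one taken from the site.
    for n in (1, 2, 3, 10, 100):
        print(n, round(exponential_share(n, 0.9), 4), round(zipf_share(n), 4))
```

The exponential curve falls away much faster at low chart positions than the Zipf curve, which is one reason the two emphasise different records.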
In fact, for reasonable selections of the parameters, both approaches deliver roughly similar results, but different values of the parameters X and Y do emphasise different records.
Here is a "phase diagram" showing which song comes out top as different values of X and Y are selected. The black circles indicate six combinations of parameters that could each be considered reasonable. Here is a comparison of the top 10s that result for these values. As you can see, the results are roughly similar; however, their differences show just how arbitrary any scoring mechanism is. So if we select the simpler algorithm, the one based on Zipf's Law, and middling parameters, we have:
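A minimal sketch of a Zipf-based score, assuming it takes the form of a constant plus 1/N. The constant of 1 is an inference, not a value stated in the text: it is the value for which the equivalences quoted in the following paragraph hold. The names `zipf_score` and `offset` are hypothetical.

```python
from fractions import Fraction


def zipf_score(position: int, offset: int = 1) -> Fraction:
    """Score for one chart appearance: a constant plus the Zipf term 1/N.

    offset = 1 is an assumption inferred from the quoted equivalences,
    not a parameter value taken from the site.
    """
    if position < 1:
        raise ValueError("position must be >= 1")
    return offset + Fraction(1, position)


if __name__ == "__main__":
    # Total score for several entries at the same chart position.
    for count, position in ((3, 1), (4, 2), (5, 5), (6, 100)):
        total = count * zipf_score(position)
        print(f"{count} entries at number {position}: {float(total):.2f}")
```

Under this assumed form the first three totals are identical and the fourth is about 1% higher. Exact fractions are used only so that the comparison is not blurred by floating-point rounding.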
This means that 3 number one hits are equivalent to 4 at number two or 5 at number five, and roughly equivalent to 6 at number 100. That feels fair, it is fast to calculate, and it gives results that are as good as any other scoring approach.

Weighting each chart

The next question is how to assign a "notability" factor to each chart. For example, we might decide that each chart with a small number of entries should get a proportionally higher score to reflect the overall importance of the total chart. However, that approach just emphasises charts with fewer entries. Alternatively we could sum the total scores of all the charts from a particular country and rescale them so they reflect the global sales of music in that country. Or we could put higher weight on more "official" charts.

In fact what we are trying to reflect is a "global perspective on music impact". So in many ways the very fact that a chart is accessible on the internet is an indicator of the level of interest in that region and period. The charts we have roughly reflect the market size. We need to make a continued, conscious effort to incorporate as wide a range of inputs as we can, and keep checking that we don't keep expanding already over-represented markets (such as the UK). As long as we have a balanced set of charts we can exploit "The Wisdom of Crowds" and not attempt to impose any artificial weighting. We can use the simplest possible chart weighting factor: all chart entries are multiplied by 1.

Combining Scores

There are all sorts of ways that we could combine the scores from different charts. The simplest approach is just to add them up. This is what we do.

Special Cases

The preceding approach works well for charts that have a position attribute. But what do we do for those that don't? What about:
Artists, Years, Decades and hits in Europe

In order to combine scores to rate, for example, artists, we take the obvious approach of summing the scores of the items they produced. This is the algorithm used for all the "normal" web pages. If we are doing a special calculation (that is, for one of the FAQ pages) we often adjust the scores to take into account the large number of recent charts. So, for example, when working out the most successful song or greatest song act, we employ an adjustment to normalise the scores. This is normally calculated by averaging the score of entries in the fifth to tenth positions and using the result to rescale the scores. Some experimentation has shown that this produces reasonable results.

Song Years

Working out which year to assign an entry to is also surprisingly hard. The year of each song is deduced directly from the chart entries, rather than relying on an unreliable external source. The year is extracted from the date in each of the song's chart entries and the song's year is set to the median of these values. This usually generates a reasonable estimate of the year.

Putting it all together
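As a rough sketch of how the steps described so far fit together: each chart entry gets a score from its position, every chart has a weight of 1, the scores of matching entries are added up, and the year is the median of the years in those entries. The `Entry` record, the `match_key` function and the `1 + 1/N` score are assumptions made for the example. In particular, `match_key` is a naive stand-in for the much harder name matching described earlier, and the sample entries are invented.

```python
import re
from collections import defaultdict
from dataclasses import dataclass
from datetime import date
from statistics import median


@dataclass(frozen=True)
class Entry:
    """One row from a source chart (hypothetical layout)."""
    artist: str
    title: str
    position: int
    when: date


def match_key(entry: Entry) -> tuple[str, str]:
    """Naive consolidation key: lower-case, then drop punctuation and spaces."""
    def clean(text: str) -> str:
        return re.sub(r"[^a-z0-9]", "", text.lower())
    return clean(entry.artist), clean(entry.title)


def appearance_score(position: int) -> float:
    """Assumed Zipf-based score for a single chart appearance."""
    return 1 + 1 / position


def consolidate(entries: list[Entry]) -> dict[tuple[str, str], dict]:
    """Sum the scores and take the median year for each matched item."""
    grouped = defaultdict(list)
    for entry in entries:
        grouped[match_key(entry)].append(entry)
    return {
        key: {
            # Every chart has weight 1, so scores are simply added up.
            "score": sum(appearance_score(e.position) for e in group),
            # With an even number of entries the midpoint is truncated (assumption).
            "year": int(median(e.when.year for e in group)),
        }
        for key, group in grouped.items()
    }


if __name__ == "__main__":
    # Invented entries: the same song spelt three different ways.
    sample = [
        Entry("The Example", "A Song!", 1, date(1980, 5, 1)),
        Entry("the example", "A song", 2, date(1981, 1, 10)),
        Entry("The Example", "A Song", 5, date(1980, 11, 3)),
    ]
    print(consolidate(sample))
```

Keeping the matching key separate from the scoring means a stricter or looser matching rule can be swapped in without touching the arithmetic.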
Once the individual song scores have been calculated they are processed to generate the various web pages and the links between them. These are all static pages, to reduce both the load on the underpowered web server and the security risks. As the diagram shows, the process also generates some summary statistics and other test data. These are used both to spot when new data has introduced issues and to simplify the task of identifying entries that need to be reviewed.

Different Approach?

As with all the calculations described on the site, you can decide to try a different approach; the available CSV File gives you the data you need. If your algorithm illustrates something interesting we would like to hear about it.

Previous Comments

2 Dec 2009
Genesis biggest seller

Hi I read with a friend of mine on a website that Genesis highest selling album is not We Can't Dance it is actually Invisible Touch. It looks like to me that your numbers and maybe statistics are outdated. We Can't Dance sold 22,000,000 copies worldwide and went 4 times platinum in the U.S. and Invisible Touch went 6 times platinum in the U.S. and has sold 24,000,000 copies globally. I even read on the VH1 website that Invisible Touch is Genesis highest selling pop album to date. Thanks Aaron Wake

First of all, thank you for your input. Of course most of the lists on this site measure success in the charts round the world and not sales. So, for example, the 1983 album "Genesis" lists higher than "Invisible Touch" because it was a hit in more countries, but it is fairly certain that its sales were lower. Claims of worldwide sales on web sites are generally not trustworthy: on Wikipedia, for example, look at the number of page modifications and the length of discussions on all pages that mention sales numbers. The issue is that there are no validated worldwide sales numbers, so every group of fans uses its own approach to estimate numbers. We do have a page that lists our estimates of sales numbers.
We use a combination of the certification levels from the US, UK, France and Germany (the four biggest markets). That page describes how we combine those numbers to estimate worldwide sales. It also explains why we don't trust even our own listing of sales numbers. The two albums you mention are listed on that page. Your numbers for the US sales are correct; however, you don't mention that "We Can't Dance" was also 5xPlatinum in the UK and 5xPlatinum in Germany, while "Invisible Touch" was 4xPlatinum in the UK and 1xPlatinum in Germany. "Invisible Touch" was more successful in the US, but the certifications and chart positions show that "We Can't Dance" was clearly a bigger hit in Europe. Our estimate is that "We Can't Dance" sold 15-24 million worldwide and "Invisible Touch" sold 12-17 million.

We don't know where VH1 got these numbers from; however, we have seen other occasions when VH1 has over-emphasised acts that were successful in the US and ignored acts that were hits anywhere else in the world. This is one of the reasons why we don't use any of their charts on this site. So we believe that the balance of evidence suggests that "We Can't Dance" sold more, although the wide margin of error leaves the question open. We have also learnt that you should not trust any website that lists worldwide album sales numbers (even our own).