Music has been a meaningful part of all cultures for thousands of years, but the concept of rating music is relatively new. Since the late 1950s, Billboard has been tracking and ranking the popularity of music, producing a large, quantifiable dataset that modern data science tools make easy to analyze for trends. With some modern songs spending a surprisingly long time on the charts, we began to wonder whether there was a correlation between when top songs were released and how long they stayed popular, as quantified by music charts like the Billboard Hot 100. We hypothesized that, as time progressed, newer songs would in general spend more time on the charts and remain in higher positions, due to factors such as increasing access to music through online distribution and the greater use of popular songs in media such as sports broadcasts and advertising. By examining trends in the Hot 100, such as the number of weeks popular songs and artists spent at the number one position and their positions on the chart over time, we hope to show that these measures trend upwards toward the present day: newer hit songs and their artists stay at number one and on the chart for longer, and hold higher positions overall, than their older counterparts before dropping off the chart completely.
In terms of privacy, our data is taken from a publicly available source here, so there were few considerations for us in this respect. One thing to consider is that our data might not be entirely objective. Billboard has its own internal methods of classifying and ranking songs on the Hot 100 chart, which have not been made known to the public. Because of this secrecy, it is difficult to assess the validity of the rankings. However, given widespread media acknowledgement and the overall reputation of Billboard, we assume that each Hot 100 chart is reflective of the general music trends for its respective week.
Our data was taken from the Billboard Hot 100 website using the billboard.py API which can be found here. Using the API, we were able to get every Hot 100 chart from mid-1958 until February 2018. Once all of these charts were retrieved, we saved them as .json files and combined them using the merge_json.py script, which can be found in our GitHub repository here. Using this data, we began to look for statistics regarding the Hot 100, such as which artists had the most appearances on the chart and how the positions of popular songs from each decade, from the 1950s until today, changed over their entire Billboard lifespans.
Most of our data has been saved to be displayed statically, but if anyone is interested in playing around with our data set, feel free to check out our Interactive Graph Plotting section after running all prior cells in order.
As mentioned earlier, our data was grabbed from the Billboard Hot 100. There were two methods of doing this. One method was to download full HTML pages and use BeautifulSoup to parse and extract the Hot 100, and the other was to use the billboard.py API. Since dealing with raw HTML is quite rough, we opted to go with the API.
Though the API was a lot more convenient, it brought up the issue of not giving us a file to use. To combat this, we wrote the script save_chart_as_json.py, which downloads all of the charts, adds them to a huge dict, then serializes that dict into a .json file.
We ran into an issue with this script: for some unknown reason, the program wouldn’t download every single chart and would cut off at a specific date. To work around this, the script was modified to save that first section into a .json file, then to start again from where it left off and save the rest of the data into a second .json file.
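The save-and-resume idea described above can be sketched roughly as follows. This is not the project's save_chart_as_json.py: `fetch_chart` is a hypothetical callable standing in for whatever call retrieves one week's chart, and the output file names are placeholders chosen for illustration.

```python
import json
from datetime import date, timedelta


def download_charts(fetch_chart, newest, oldest):
    """Walk backwards one week at a time from `newest` to `oldest`.

    `fetch_chart` is a hypothetical callable (an assumption of this sketch):
    it takes a date and returns that week's chart as a list of entries, or
    raises an exception if the download fails.
    Returns (charts, resume_date); resume_date is None when nothing is left.
    """
    charts = {}
    current = newest
    while current >= oldest:
        try:
            charts[current.isoformat()] = fetch_chart(current)
        except Exception:
            # Stop here and report the date to pick up from
            return charts, current
        current -= timedelta(weeks=1)
    return charts, None


def save_in_parts(fetch_chart, newest, oldest, prefix):
    """Save each successfully downloaded stretch to its own .json file."""
    paths = []
    resume = newest
    while resume is not None:
        charts, next_resume = download_charts(fetch_chart, resume, oldest)
        if not charts:
            # No progress at all: give up rather than loop forever
            break
        path = f"{prefix}_{len(paths) + 1}.json"
        with open(path, "w") as f:
            json.dump(charts, f)
        paths.append(path)
        resume = next_resume
    return paths
```

Passing the fetch function in as an argument keeps the sketch independent of any particular chart library, and makes the cut-off behaviour easy to simulate without network access.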
Now, we had to merge and verify the .json files. For this, two scripts were written: merge_json.py and print_json.py. The former took the two .json files and merged them into a file called full_data.json. The latter could be used to print out the .json files in a more legible format than one would get from simply using a command like cat, more, or less.
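As a rough illustration of the merge and pretty-print steps, here is a minimal sketch using only the standard json module. It is not the contents of merge_json.py or print_json.py; it assumes each part file holds a single top-level dict, and the function names are made up for this example.

```python
import json


def merge_json_files(path_a, path_b, out_path):
    """Merge two JSON files whose top-level values are dicts."""
    with open(path_a) as f:
        merged = json.load(f)
    with open(path_b) as f:
        second = json.load(f)
    # Refuse to silently overwrite charts that appear in both files
    overlap = merged.keys() & second.keys()
    if overlap:
        raise ValueError(f"{len(overlap)} chart keys appear in both files")
    merged.update(second)
    with open(out_path, "w") as f:
        json.dump(merged, f)
    return merged


def pretty_json(path):
    """Return the file's contents as indented, key-sorted JSON text."""
    with open(path) as f:
        return json.dumps(json.load(f), indent=2, sort_keys=True)
```

Calling `print(pretty_json("full_data.json"))` would then give an indented view of the merged file.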
Finally, we could start to clean our data. Luckily, since Billboard keeps their records essentially immaculate, there wasn’t too much to do here. First, we read the data into a DataFrame, but found that the layout that was easiest to serialize was the least convenient to work with. Each song entry was a list with 5 elements and each full chart took up a row, giving us 3,108 rows and 100 columns. A more useful DataFrame would have one row per song entry and one column per attribute (7 columns in total). Before reshaping, we had to deal with NaN values; only 126 of the 310,800 entries were NaN, so we simply replaced those with -1 (a value that won’t occur naturally in the charts).
Next up was extraction. To accomplish this, a code block populates seven lists: one containing the numbers 1-100 repeated for every chart (the current position), one containing the chart week index 0-3,107 repeated for every position, and one each for the song names, artists, peak positions, last positions, and number of weeks on the chart. This reshaped our DataFrame from 3,108 rows by 100 columns to 310,800 rows by 7 columns, and with this DataFrame we were able to do all of our analysis.
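The wide-to-long reshaping can be illustrated on a toy example. The sketch below is an alternative to the list-based extraction used in this notebook, not a copy of it: it starts from a plain dict of charts rather than the wide DataFrame, and it assumes each entry is ordered as [title, artist, peak, last, weeks], which is inferred from the order in which the extraction loop pops values. The song and artist names are invented.

```python
import pandas as pd

# Assumed order of the five values stored for each song entry
FIELDS = ["Song Name", "Artist", "Peak Position", "Last Position", "Number of Weeks"]


def charts_to_long(charts):
    """Turn {week_index: [entry, ...]} into one row per (week, position)."""
    rows = []
    for week, chart in charts.items():
        for position, entry in enumerate(chart, start=1):
            # Missing entries are stored as -1; spread that across every field
            values = entry if isinstance(entry, list) else [-1] * len(FIELDS)
            rows.append([week, position, *values])
    columns = ["Week of Chart", "Current Position", *FIELDS]
    return pd.DataFrame(rows, columns=columns)


# Toy data: two chart weeks with three positions each, one entry missing
toy_charts = {
    0: [["Song A", "Artist X", 1, 1, 11], -1, ["Song D", "Artist X", 3, 2, 5]],
    1: [["Song B", "Artist Y", 1, 2, 4],
        ["Song A", "Artist X", 1, 1, 10],
        ["Song C", "Artist Z", 2, 3, 7]],
}
long_df = charts_to_long(toy_charts)
print(long_df.shape)
```

On the full dataset the same shape change takes 3,108 charts of 100 positions to 310,800 rows of 7 columns.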
# Imports for the project
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sys import stdout as std
First up: Reading in the data and examining the first sections of it to make sure all is well.
# Read in the data into DataFrames to start our analysis
df2 = pd.read_json('./full_data.json')
# Now, let's sort this data to get it formatted where each row is a full chart and each entry is a
# list of attributes relating to the song at that position
# (reindex_axis was removed from pandas; reindex with axis=1 does the same column sort)
sort_mod_df = df2.reindex(sorted(df2.columns), axis=1)
# Let's examine the dataframe
sort_mod_df.head()
We have our song data! Now to extract and clean it up a bit. Though the above format makes it pretty easy to track a particular song over the chart, we also want to separate each attribute so that we can grab particular portions of our data more easily.
To do this, we'll want to change the columns so that each represents a specific attribute of the chart. Currently, each chart takes up a full row where each column is a particular rank in that chart, but we're going to change the shape so that there will be 7 columns instead of 100, with 310,800 rows instead of 3,108.
# Save the current chart week numbers in a list for restoration later
SongWeeks = list(range(len(sort_mod_df)))
# Chart positions 1-100, reused for every chart week
curr_pos = list(range(1, 101))
# Let's now see how many entries are blank in our data
n_nan = sort_mod_df.isnull().sum()
n_nan = sum(n_nan)
print(n_nan)
# Replace NaN entries with a default value
sort_mod_df.fillna(-1, inplace=True)
# Verify that blank entries are gone
n_nan = sort_mod_df.isnull().sum()
n_nan = sum(n_nan)
print(n_nan)
The next section separates out each entry of the DataFrame, organizing it into lists. Then, those lists are turned into Pandas Series objects and combined into a cleaned out DataFrame with the desired shape.
# Next, put all of the song attributes into separate lists to use as columns
# Create the lists
SongName = []
Artist = []
PeakPos = []
LastPos = []
NumWeeks = []
# Iterate through the entire dataframe
for index, series in sort_mod_df.iterrows():
    # Grab particular entries from the current row
    for entry in series:
        # Handle case of missing data
        if entry == -1:
            nm = entry
            NumWeeks.append(nm)
            LastPos.append(nm)
            PeakPos.append(nm)
            Artist.append(nm)
            SongName.append(nm)
        else:
            # Pop the five attributes off the end of the entry, last one first
            for i in range(0, 5):
                nm = entry.pop()
                if i == 0:
                    NumWeeks.append(nm)
                elif i == 1:
                    LastPos.append(nm)
                elif i == 2:
                    PeakPos.append(nm)
                elif i == 3:
                    Artist.append(nm)
                elif i == 4:
                    SongName.append(nm)
# Delete the copy dataframe. It's all empty now, anyway
try:
    del sort_mod_df
except NameError:
    pass
# Now, let's build a nicely cleaned dataframe
# Use a list comprehension to extend out our SongWeeks list
new_SongWeeks = [item for item in SongWeeks for i in range(100)]
# Use a list comprehension to extend the 1-100 list
new_currPos = [num for i in range(len(SongWeeks)) for num in curr_pos]
# Make our empty DataFrame
sep_df = pd.DataFrame()
# Make each list into a series
col0 = pd.Series(data=new_SongWeeks)
col1 = pd.Series(data=SongName)
col2 = pd.Series(data=Artist)
col3 = pd.Series(data=PeakPos)
col4 = pd.Series(data=LastPos)
col5 = pd.Series(data=NumWeeks)
col6 = pd.Series(data=new_currPos)
# Add each series to the DataFrame
sep_df['Week of Chart'] = col0.values
sep_df['Current Position'] = col6.values
sep_df['Song Name'] = col1.values
sep_df['Artist'] = col2.values
sep_df['Peak Position'] = col3.values
sep_df['Last Position'] = col4.values
sep_df['Number of Weeks'] = col5.values
# Examine our new DataFrame
print(sep_df.shape)
sep_df.head(10)
In this new DataFrame, the "Week of Chart" column refers to how many weeks prior to 24 February 2018 that particular chart is, with 24 February 2018 being the date of the first chart we downloaded. Therefore, the older the chart, the larger the number.
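To make the week numbering easier to interpret, the index can be converted back to a calendar date. The helper below is a small sketch, not part of the original notebook; it assumes the charts are spaced exactly seven days apart with no gaps, so results for real data should be treated as approximate. The function names are made up for this example.

```python
from datetime import date, timedelta

# Week 0 in the "Week of Chart" column
NEWEST_CHART = date(2018, 2, 24)


def week_index_to_date(week_index, newest=NEWEST_CHART):
    """Date of the chart that lies `week_index` weeks before the newest one."""
    return newest - timedelta(weeks=week_index)


def date_to_week_index(chart_date, newest=NEWEST_CHART):
    """Inverse mapping: how many whole weeks before the newest chart."""
    return (newest - chart_date).days // 7


print(week_index_to_date(1))
```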
Now that we have our nicely sorted and cleaned data, let's start analyzing it to see what trends and patterns we can find.
First, let's check out staying power. Often, people use the amount of time on the charts as a measurement of how good a song is: the better the song, the more weeks it is able to hold a chart spot. However, this is only part of the story; a song that holds spot No. 76 is unlikely to garner as much attention as the top 10 or top 5 songs. So, let's compare all of the songs that have reached the top 5 and see which has been there the longest.
# Separate out all the songs that reached the top 5 (all songs with a peak position of 5 or better)
top_df = sep_df.loc[(sep_df['Peak Position'] <= 5) & (sep_df['Peak Position'] != -1)]
top_df.head(10)
So, ignoring the very first chart (3107) and the most recent one we downloaded (0), which song has spent the most time on the Hot 100?
top_df.loc[top_df['Number of Weeks'].idxmax()]
# First, let's grab the relevant data
rdact = sep_df.loc[(sep_df['Song Name'] == "Radioactive") & (sep_df['Artist'] == 'Imagine Dragons')]
# Now, graph the data with the Current Position on the Y axis and the Week of Chart on the X axis
x = rdact['Week of Chart']
y = rdact['Current Position']
plt.plot(x, y, '-o')
# Label the axes
plt.xlabel('Chart Week')
plt.ylabel('Current Position')
# Add a title
plt.title("Lifespan of 'Radioactive' by Imagine Dragons")
# We're inverting the axes for a more intuitive feel. This way, the passage of time
# goes left to right, and the more popular the song, the higher it is on the graph
# Each dot represents a different week's chart
plt.gca().invert_xaxis()
plt.gca().invert_yaxis()
plt.show()
Looks like our winner is "Radioactive" by Imagine Dragons, with a whopping 87 straight weeks spent on the Billboard Hot 100. This record was reached 199 weeks before 24 February 2018, or approximately 9 May 2014. The song debuted on 18 August 2012 and stayed on the chart until that date. As shown in the graph, “Radioactive” climbs steadily from about position 90 to position 3. From there, its ranking declines more slowly than it rose, finally tapering off at position 48 (but not before a resurgence into the top 10). Overall, “Radioactive” appears to stay within the top 20 for about a third of its time in the Hot 100.
Time spent on the charts by a single song is one way to measure that song's popularity, but for an artist in general it is often better to look at the number of times they have a song on the charts. Let's see the top 10 artists with the most Billboard chart appearances.
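A claim such as "about a third of its time in the top 20" can be checked with a short calculation. The helper below is a sketch with invented positions, not the actual "Radioactive" data; in the notebook the list could be taken from the song's 'Current Position' column.

```python
def share_at_or_above(positions, cutoff):
    """Fraction of chart weeks spent at `cutoff` or better (a lower number)."""
    if not positions:
        return 0.0
    return sum(1 for p in positions if p <= cutoff) / len(positions)


# Invented weekly positions for illustration only
toy_positions = [90, 60, 35, 18, 9, 3, 5, 12, 25, 40, 48, 55]
top20_share = share_at_or_above(toy_positions, 20)
print(round(top20_share, 2))
```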
%%time
# Create a dict of values where key = artist name, value = number of appearances
most = {}
# Iterate through our dataframe
for index, series in sep_df.iterrows():
    # Grab name of artist for the current row
    curr = series['Artist']
    # If the artist isn't already in the dict, add them
    if curr not in most:
        most[curr] = 1
    # They are there, so increment their number
    else:
        most[curr] += 1
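The same tally can be expressed with collections.Counter from the standard library. This is offered as an alternative sketch rather than a replacement for the loop above; unlike that loop, it skips the -1 placeholder used for missing entries, and the artist list here is toy data.

```python
from collections import Counter


def count_appearances(artists, placeholder=-1):
    """Count chart appearances per artist, ignoring missing-data placeholders."""
    return Counter(a for a in artists if a != placeholder)


# Toy stand-in for the 'Artist' column
toy_artists = ["Elton John", "Madonna", "Elton John", -1,
               "Chicago", "Madonna", "Elton John"]
counts = count_appearances(toy_artists)

# most_common returns (artist, count) pairs in descending order of count
top2 = counts.most_common(2)
print(top2)
```

Because `most_common` does not modify the counter, the full tally remains available after the top entries have been read off.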
Now, let's display the results in a bar graph.
# Create lists to store relevant top 10 data
top10_songs = []
top10_artists = []
# Grab the top 10, in order from 1-10
for num in range(10):
    # Grab the key with the highest corresponding value
    max_value = max(most.keys(), key=lambda k: most[k])
    # Add the key to the artists list
    top10_artists.append(max_value)
    # Add the number of appearances to the other list
    top10_songs.append(most.get(max_value))
    # Delete the entry to find the next highest
    del most[max_value]
# Plot the top 10
plt.bar(top10_artists, top10_songs, width=0.6)
# Add x axis label
plt.xlabel('Artist')
# Rotate the names to be more readable
plt.xticks(rotation=-45)
# Add y axis label
plt.ylabel('Number of Appearances')
# Add graph title
plt.title("Top 10 Most Appearances")
plt.show()
This graph displays the ten artists with the most appearances on the Billboard Hot 100, sorted in alphabetical order. Note that the number of appearances ranges from around 600 to around 950. Within this range, the artist with the lowest number of appearances is Chicago, a rock band formed in 1967. The artist with the highest number is Elton John, who has been an active musician since around 1963. Interestingly, the artist with the third highest number of appearances (roughly 800) is Taylor Swift, the only artist in this list who began her active career in the 21st century. She is also one of only two women in this list, the other being Madonna (who holds the second highest number of appearances).
To get a better intuition as to how time has changed the life spans of songs on the Billboard charts, we can try to graph multiple songs. Rather than hard code which values appear, we've made an interactive section which allows you, the user, to find songs and display a life cycle graph for the song chosen. This next block defines the functions that get called by the interactive script section.
"""
Takes in the user inputted artist name and returns a DataFrame of all of the
number 1 hits for that artist
Params: artist: The artist name to search for
option: boolean value for determining if searching for No. 1's exclusively
Return: A DataFrame of number 1 hits for an artist, cleaned up. If artist
has no number 1 hits/no songs on any charts, an empty DataFrame is returned
"""
def find_no_1s(artist, option):
    # Determine if user wants to grab from just the number 1 songs or all of the songs
    if option:
        number1s = sep_df.loc[(sep_df['Peak Position'] == 1) & (sep_df['Artist'] == artist)]
    else:
        number1s = sep_df.loc[sep_df['Artist'] == artist]
    # Build new objects instead of modifying a slice of sep_df in place
    number1s = number1s.drop_duplicates(['Song Name'], keep='last')
    number1s = number1s.reset_index()
    return number1s
"""
Takes in a name of a song and an artist name, creates a DataFrame for that artist/song pair,
builds a graph of that relationship, then returns that graph for plotting.
Params: songname: The song title to search for
artist: The artist name to search for
Return: A graph of the lifespan of the given song
"""
def display_results(songname, artist):
    # First, let's grab the relevant data
    rdact = sep_df.loc[(sep_df['Song Name'] == songname) & (sep_df['Artist'] == artist)]
    # Now, graph the data with the Current Position on the Y axis and the Week of Chart on the X axis
    x = rdact['Week of Chart']
    y = rdact['Current Position']
    plt.plot(x, y, '-o')
    # Add labels and a title
    plt.xlabel('Chart Week')
    title_string = "Lifespan of '" + songname + "' by " + artist
    plt.title(title_string)
    plt.ylabel('Current Position')
    # Invert axes for same reason as above
    plt.gca().invert_xaxis()
    plt.gca().invert_yaxis()
    return plt
Now, let's set up the environment for the user to interact with the above functions.
# Create an interactive graph plotting section where users can see the lifespan of different songs
# using the above functions.
try:
    while True:
        option = False
        # Let user decide if they want to search just number 1 options, or all options
        usr_input = input("Select if you want to search only number 1 hits (T), or all hits (F): ")
        # Parse entered option
        if usr_input.upper() == 'T':
            option = True
        elif usr_input.upper() == 'F':
            option = False
        # Invalid option; treat as a quit signal
        else:
            raise KeyboardInterrupt
        # Output the prompt and grab the user's input
        std.write("Please enter an artist to search for (including 'Featuring' if you want features). ")
        std.write('Type \'Q\' to quit: ')
        to_search = input()
        print()
        # Verify that they don't want to end the program
        if to_search.upper() == 'Q':
            raise KeyboardInterrupt
        # Call function to get the dataframe
        result = find_no_1s(to_search, option)
        # Make sure the dataframe has entries
        if len(result) == 0:
            print("ERROR: artist entered with no hits")
            continue
        # Report matches to the user
        if option:
            print("Your artist's No. 1 hits are: \n%s" % repr(result['Song Name']))
        else:
            print("Your artist's chart hits are: \n%s" % repr(result['Song Name']))
        # Grab the song name
        item = input("Please type the name of a song from the list above to search for: ")
        # Make sure the user's entered song actually is a valid entry
        exists = item in result['Song Name'].values
        if not exists:
            print("Song name '%s' not valid, try again" % item)
            continue
        # Call function to get the graph
        graph = display_results(item, to_search)
        # Display the graph to the user
        graph.show()
# Handle case of user typing 'q' to quit
except KeyboardInterrupt:
    pass
These graphs depend on interactive input, so results will vary by artist and song. Users can choose between any song that has appeared on the Hot 100 and only songs that have peaked at the number one position. Each graph tracks the trajectory of a song over its time on the chart. The scaling of a graph varies with the amount of time the song spent on the chart, so a song with a shorter run has fewer plot points than one with a longer run.
Here, we have printed 18 songs, 3 representing each decade since the 60s, for visual purposes. With that in mind, we notice that the more recent number one hits are likelier to reach their peak more quickly and hold it for longer, and they generally decline more slowly than the older number one hits. These two traits are especially noticeable when making drastic comparisons, such as between the oldest and newest songs. For example, take “Georgia On My Mind” (1960) by Ray Charles and compare it with “Blurred Lines” (2013) by Robin Thicke ft. T.I. and Pharrell (Williams). “Georgia On My Mind” follows a smooth curve resembling an inverted parabola with a single peak. “Blurred Lines”, on the other hand, follows a rougher, more irregular curve: it springs to the peak relatively quickly and remains there until it slowly declines, and even then brief spikes appear along the way.
Also notice that popular songs from earlier decades tended to spend less time on the Hot 100 than recent popular songs. For example, older hits like “Hey Jude” (1968) and “Back in My Arms Again” (1965) spent less than 20 weeks on the chart, while contemporary hits like “Blurred Lines” (2013) and “Despacito” (2017) spent more than 50 weeks on it.
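One possible way to put a number on the decade comparison is to take each song's longest recorded chart run and summarize it by the decade of its debut. The sketch below is not an analysis that was run for this report. It assumes a DataFrame with the same column names as sep_df, treats the largest 'Number of Weeks' value for a song as its total chart run, treats the largest 'Week of Chart' value as its debut, and assumes charts are exactly one week apart. The toy rows are invented.

```python
import pandas as pd

# Date of week 0 in the "Week of Chart" column
NEWEST_CHART = pd.Timestamp("2018-02-24")


def longevity_by_decade(long_df):
    """Median of each song's longest chart run, grouped by debut decade."""
    # Drop the -1 placeholders used for missing entries
    songs = long_df[long_df["Number of Weeks"] != -1]
    per_song = songs.groupby(["Song Name", "Artist"]).agg(
        total_weeks=("Number of Weeks", "max"),
        debut_week=("Week of Chart", "max"),
    )
    # Larger week indices are older charts, so the max index is the debut
    debut_date = NEWEST_CHART - pd.to_timedelta(per_song["debut_week"] * 7, unit="D")
    per_song["decade"] = debut_date.dt.year // 10 * 10
    return per_song.groupby("decade")["total_weeks"].median()


# Invented rows: two songs debuting in the 1960s, one in the 2010s
toy = pd.DataFrame({
    "Week of Chart": [2600, 2599, 2900, 2899, 250, 249],
    "Song Name": ["Old Hit", "Old Hit", "Older Hit", "Older Hit", "New Hit", "New Hit"],
    "Artist": ["Artist A", "Artist A", "Artist B", "Artist B", "Artist C", "Artist C"],
    "Number of Weeks": [1, 2, 3, 4, 30, 31],
})
by_decade = longevity_by_decade(toy)
print(by_decade)
```

A median is used here so that a handful of unusually long runs does not dominate a decade; other summaries would be equally defensible.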
We acknowledge that although these graphs reflect intuitive changes in peak growth and longevity over the decades, the relationship we found may be harder to apply to a much larger sample. We chose to analyze a specific population of songs: those that appeared on the Hot 100. Although this amounts to potentially thousands of songs, it still might not be representative of every song released since 1958. As a result, if we applied our tests to this wider pool, we might find no correlation between decade and peak longevity. This particularly applies to songs that do not experience as drastic a peak (if any) as those on the Hot 100. And though we have found some trends in song rankings on the Hot 100 over time, it is hard for us to pinpoint their cause. While one could infer that factors such as wider access to music through newer technology, the role of popular songs in media, and increasing cultural influence from countries outside the western world have influenced song popularity over time, it is hard to prove that these factors are the cause of our trends. Overall, however, we believe our findings are interesting and merit further investigation.