Step-by-step: Scraping Epsiode IMDb Ratings

May 9, 2020 8 min read 6 Comments python, tutorial, web scraping

repository dataset

There are tons of tutorials out there that teach you how to scrape movie ratings from IMDb, but I haven’t seen any about scraping TV series episode ratings. So for anyone wanting to do that, I’ve created this tutorial specifically for it. It’s catered mostly to beginners to web scraping since the steps are broken down. If you want the code without the breakdown you can find it here.

There is a wonderful DataQuest tutorial by Alex Olteanu that explains in-depth how to scrape over 2000 movies from IMDb, and it was my reference as I learned how to scrape these episodes.

Since their tutorial already does a great job at explaining the basics of identifying the URL structure and understanding the HTML structure of a single page, I’ve linked those parts and recommend you read them if you aren’t already familiar because I won’t be explaining them here.

In this tutorial I will not be redundant in explaining what they already did¹; instead, I’ll be doing many similar steps, but they will be specifically for taking episode ratings (same for any TV series) instead of movie ratings.

Breaking it down

First, you’ll need to navigate to the series of your choice’s season 1 page that lists all of that season’s episodes. The series I will be using is Community. It should look like this:

Get the url of that page. It should be structured like this:

http://www.imdb.com/title/tt1439629/episodes?season=1

Highlighted is the part that is the show’s ID and will be different for you if you’re not using Community.

First, we will request from the server the content of the web page by using get(), and store the server’s response in the variable response and look at the first few lines. We can see that inside response is the html code of the webpage.

from requests import get
url = 'https://www.imdb.com/title/tt1439629/episodes?season=1'
response = get(url)
print(response.text[:250])

<!DOCTYPE html>
<html
    xmlns:og="http://ogp.me/ns#"
    xmlns:fb="http://www.facebook.com/2008/fbml">
    <head>
         
        <meta charset="utf-8">
        <meta http-equiv="X-UA-Compatible" content="IE=edge">

    <meta name="apple-itunes-app" content="app-id=342792525, app-argument=imdb:///title/tt1439629?src=mdot">

Use BeautifulSoup to parse the HTML content

Next, we’ll parse response.text by creating a BeautifulSoup object, and assign this object to html_soup. The html.parser argument indicates that we want to do the parsing using Python’s built-in HTML parser.

from bs4 import BeautifulSoup

html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)

bs4.BeautifulSoup

This part onwards is where the code will differ from the movie example.

The variables we will be getting are:

Episode title
Episode number
Airdate
IMDb rating
Total votes
Episode description

Let’s look at the container we’re interested in. As you can see below, all of the info we need is in <div class="info" ...> </div>:

In yellow are the tags/parts of the code that we will be calling to get to the data we are trying to extract, which are in green.

We will grab all of the instances of <div class="info" ...> </div> from the page; there is one for each episode.

episode_containers = html_soup.find_all('div', class_='info')

find_all() returned a ResultSet object –episode_containers– which is a list containing all of the 25 <div class="info" ...> </div>s.

Extracting each variable that we need

Read this part of the DataQuest article to understand how calling the tags works.

Here we’ll see how we can extract the data from the episode_containters for each episode.

episode_containters[0] calls the first instance of <div class="info" ...> </div>, i.e. the first episode. After the first couple of variables, you will understand the structure of calling the contents of the html containers.

Episode title

For the title we will need to call title attribute from the <a> tag.

episode_containers[0].a['title']

'Pilot'

Episode number

The episode number in the <meta> tag, under the content attribute.

episode_containers[0].meta['content']

'1'

Airdate

Airdate is in the <div> tag with the class airdate, and we can get its contents the text attribute, afterwhich we strip() it to remove whitespace.

episode_containers[0].find('div', class_='airdate').text.strip()

'17 Sep. 2009'

IMDb rating

The rating is is in the <div> tag with the class ipl-rating-star__rating, which also use the text attribute to get the contents of.

episode_containers[0].find('div', class_='ipl-rating-star__rating').text

'7.8'

Total votes

It is the same thing for the total votes, except it’s under a different class.

episode_containers[0].find('span', class_='ipl-rating-star__total-votes').text

'(3,187)'

Episode description

For the description, we do the same thing we did for the airdate and just change the class.

episode_containers[0].find('div', class_='item_description').text.strip()

'An ex-lawyer is forced to return to community college to get a degree. However, he tries to use the skills he learned as a lawyer to get the answers to all his tests and pick up on a sexy woman in his Spanish class.'

Final code– Putting it all together

Now that we know how to get each variable, we need to iterate for each episode and each season. This will require two for loops. For the per season loop, you’ll have to adjust the range() depending on how many seasons are in the show you’re scraping.

The output will be a list that we will make into a pandas DataFrame. The comments in the code explain each step.

# Initializing the series that the loop will populate
community_episodes = []

# For every season in the series-- range depends on the show
for sn in range(1,7):
    # Request from the server the content of the web page by using get(), and store the server’s response in the variable response
    response = get('https://www.imdb.com/title/tt1439629/episodes?season=' + str(sn))

    # Parse the content of the request with BeautifulSoup
    page_html = BeautifulSoup(response.text, 'html.parser')

    # Select all the episode containers from the season's page
    episode_containers = page_html.find_all('div', class_ = 'info')

    # For each episode in each season
    for episodes in episode_containers:
            # Get the info of each episode on the page
            season = sn
            episode_number = episodes.meta['content']
            title = episodes.a['title']
            airdate = episodes.find('div', class_='airdate').text.strip()
            rating = episodes.find('span', class_='ipl-rating-star__rating').text
            total_votes = episodes.find('span', class_='ipl-rating-star__total-votes').text
            desc = episodes.find('div', class_='item_description').text.strip()
            # Compiling the episode info
            episode_data = [season, episode_number, title, airdate, rating, total_votes, desc]

            # Append the episode info to the complete dataset
            community_episodes.append(episode_data)

Making the dataframe

import pandas as pd 
community_episodes = pd.DataFrame(community_episodes, columns = ['season', 'episode_number', 'title', 'airdate', 'rating', 'total_votes', 'desc'])

community_episodes.head()

	season	episode_number	title	airdate	rating	total_votes	desc
0	1	1	Pilot	17 Sep. 2009	7.8	(3,187)	An ex-lawyer is forced to return to community ...
1	1	2	Spanish 101	24 Sep. 2009	7.9	(2,760)	Jeff takes steps to ensure that Brita will be ...
2	1	3	Introduction to Film	1 Oct. 2009	8.3	(2,696)	Brita comes between Abed and his father when s...
3	1	4	Social Psychology	8 Oct. 2009	8.2	(2,473)	Jeff and Shirley bond by making fun of Britta'...
4	1	5	Advanced Criminal Law	15 Oct. 2009	7.9	(2,375)	Señor Chang is on the hunt for a cheater and t...

Looks good! Now we just need to clean up the data a bit.

Data Cleaning

Converting the total votes count to numeric

First, we create a function that uses replace() to remove the ‘,’ , ‘(', and ‘)’ strings from total_votes so that we can make it numeric.

def remove_str(votes):
    for r in ((',',''), ('(',''),(')','')):
        votes = votes.replace(*r)
        
    return votes

Now we apply the function, taking out the strings, then change the type to int using astype()

community_episodes['total_votes'] = community_episodes.total_votes.apply(remove_str).astype(int)

community_episodes.head()

	season	episode_number	title	airdate	rating	total_votes	desc
0	1	1	Pilot	17 Sep. 2009	7.8	3187	An ex-lawyer is forced to return to community ...
1	1	2	Spanish 101	24 Sep. 2009	7.9	2760	Jeff takes steps to ensure that Brita will be ...
2	1	3	Introduction to Film	1 Oct. 2009	8.3	2696	Brita comes between Abed and his father when s...
3	1	4	Social Psychology	8 Oct. 2009	8.2	2473	Jeff and Shirley bond by making fun of Britta'...
4	1	5	Advanced Criminal Law	15 Oct. 2009	7.9	2375	Señor Chang is on the hunt for a cheater and t...

Making rating numeric instead of a string

community_episodes['rating'] = community_episodes.rating.astype(float)

Converting the airdate from string to datetime

community_episodes['airdate'] = pd.to_datetime(community_episodes.airdate)

community_episodes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110 entries, 0 to 109
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   season          110 non-null    int64         
 1   episode_number  110 non-null    object        
 2   title           110 non-null    object        
 3   airdate         110 non-null    datetime64[ns]
 4   rating          110 non-null    float64       
 5   total_votes     110 non-null    int32         
 6   desc            110 non-null    object        
dtypes: datetime64[ns](1), float64(1), int32(1), int64(1), object(3)
memory usage: 5.7+ KB

community_episodes.head()

	season	episode_number	title	airdate	rating	total_votes	desc
0	1	1	Pilot	2009-09-17	7.8	3187	An ex-lawyer is forced to return to community ...
1	1	2	Spanish 101	2009-09-24	7.9	2760	Jeff takes steps to ensure that Brita will be ...
2	1	3	Introduction to Film	2009-10-01	8.3	2696	Brita comes between Abed and his father when s...
3	1	4	Social Psychology	2009-10-08	8.2	2473	Jeff and Shirley bond by making fun of Britta'...
4	1	5	Advanced Criminal Law	2009-10-15	7.9	2375	Señor Chang is on the hunt for a cheater and t...

Great! Now the data is tidy and ready for analysis/visualization.

Let’s make sure we save it:

community_episodes.to_csv('Community_Episodes_IMDb_Ratings.csv',index=False)

And that’s it, I hope this was helpful! Feel free to comment or DM me for any edits or questions.

The links for my github repository for this project and the final Community ratings dataset can be found in the links at the top of this page.

In the tutorial, they have extra sections where they control the crawl-rate & monitor the loop as it’s still going, but we won’t do that here because series’ have much less episodes (they scraped 2000+ movies) so that’s not necessary. ↩︎

python beautifulsoup community jupyter notebook tutorial web scraping