Using Web Scraping to Create Movie Recommendations

Author

Robert Tran

Published

February 8, 2023

Web scraping is a useful tool to gather data that is publicly available on the internet.

In this post, we will build a movie recommendation system by scraping the TMDB website. The system goes through all the actors of your favorite show or movie and lists the other shows and movies they have acted in. The more actors a title shares with your favorite, the higher it ranks.

Here’s a link to my project repository for the full code: https://github.com/roberttran1/TMDB_scraper.git

The first step is to create the project with the command scrapy startproject [NAME OF PROJECT]. This creates a project directory that contains, among other files, a spiders folder. The spider itself lives in a Python file inside that folder, and this is the file we add our functions to. Let’s take a look at the functions that we will use.

Our scraper chains three parsing methods to find the movies to recommend, each one handing its links to the next. The first one is parse():

def parse(self, response):
    """
    Parses the web page containing the overview of a TV show/movie,
    finds the cast and crew link,
    calls parse_full_credits() for that link.
    """
    # get the link to the Full Cast & Crew page
    next_page = response.css("p.new_button a::attr(href)").get()

    if next_page:
        # call the next function on that page
        yield response.follow(next_page, callback=self.parse_full_credits)

parse() finds the link attached to the “Full Cast & Crew” text on the overview page and hands it to parse_full_credits(). If no such link is found, nothing is yielded and the crawl stops there.
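The href scraped from the page does not need to be an absolute URL, because response.follow accepts relative URLs and resolves them against the address of the current page. The sketch below shows the same kind of resolution using only the standard library. The page URL and href are hypothetical placeholders, not real TMDB addresses.

```python
from urllib.parse import urljoin

# Hypothetical values for illustration only: the page being parsed
# and the relative href found on it.
page_url = "https://www.themoviedb.org/tv/0000-example-show"
href = "/tv/0000-example-show/cast"

# Resolve the relative link against the page it was found on.
absolute = urljoin(page_url, href)
print(absolute)
```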

def parse_full_credits(self, response):
    """
    Parses the Full Cast & Crew page,
    finds links to cast members' pages,
    directs to parse_actor_page() for each link.
    """
    # get links to actor pages
    for cast in response.css("ol.people.credits a::attr(href)").getall():
        if "/person/" in cast:
            # call the next function on the actor's page
            yield response.follow(cast, callback=self.parse_actor_page)

The parse_full_credits() function collects the links in the credits list of the TV show or movie we started from and follows only those that contain /person/, which are the links to actor pages. Each actor page is then handled by parse_actor_page(), which finds the movies and TV shows that actor has been in.
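The filtering step can be tried out on its own, without running a crawl. The sketch below applies the same "/person/" test to a list of made-up hrefs and also drops repeats. The helper name actor_links and the sample links are hypothetical. In a real crawl the explicit de-duplication is likely unnecessary, since Scrapy filters duplicate requests by default.

```python
def actor_links(hrefs):
    """Return person-page links in first-seen order, without duplicates."""
    seen = set()
    links = []
    for href in hrefs:
        # Keep only links to person pages, and only the first copy of each.
        if "/person/" in href and href not in seen:
            seen.add(href)
            links.append(href)
    return links


# Made-up hrefs for illustration only.
sample_hrefs = [
    "/person/1-example-actor",
    "/tv/0000-example-show",
    "/person/1-example-actor",
    "/person/2-another-actor",
]
print(actor_links(sample_hrefs))
```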

def parse_actor_page(self, response):
    """
    Parses the actor overview page,
    finds the movies/TV shows that the actor has played in,
    yields the actor name and movie/TV show name in a dictionary.
    """
    # find the actor's name
    actor_name = response.css("h2.title a::text").get()

    # get the movies/TV shows the actor has played in
    for movie_name in response.css("a.tooltip bdi::text").getall():
        yield {"actor": actor_name, "movie_or_TV_name": movie_name}

This is the final function. It yields one dictionary per credit on the actor’s page, pairing the actor’s name with the name of a movie or TV show they appeared in.

To run the spider, we use the command scrapy crawl tmdb_spider -o results.csv from inside the project directory. This command writes every yielded dictionary as a row of a file called results.csv. With this file, we can count how often each title appears and see which movies and shows are the most recommended.
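Before loading the file into pandas, it can help to see what the counting amounts to. The sketch below reads a few made-up rows in the same two-column layout (actor, movie_or_TV_name) and tallies how many rows mention each title, using only the standard library. The actor and show names are placeholders, not real results.

```python
import csv
import io
from collections import Counter

# Made-up rows in the same layout as results.csv, for illustration only.
sample_csv = """actor,movie_or_TV_name
Actor A,Show X
Actor A,Show Y
Actor B,Show X
Actor B,Show Z
Actor C,Show X
"""

# Each row becomes a dict keyed by the header names.
rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Count how many rows mention each title.
counts = Counter(row["movie_or_TV_name"] for row in rows)
print(counts.most_common(1))
```

The title with the highest count is the one that the most rows point to, which is the same ranking the groupby below produces.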

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("results.csv")

# sort by shared actors
grouped_df = df.groupby(["movie_or_TV_name"]).count().\
sort_values(by=['actor'], ascending=False)

grouped_df
movie_or_TV_name                actor
Modern Family                    1009
Bones                             114
Frasier                            89
CSI: Crime Scene Investigation     85
NCIS                               83
...                               ...
Imagination Movers                  1
Imagine That                        1
Immediately Afterlife               1
Immortal                            1
Ӕon Flux                            1

13562 rows × 1 columns
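Two details of this ranking are worth a closer look. The title the crawl started from will always come out on top, and count() tallies rows, so a title listed more than once on the same actor’s page would be counted more than once. The sketch below is one possible variation, not the approach used in this post: it counts distinct actors with nunique() and removes the starting title by name. The rows are made up, and "Source Show" is a placeholder for the starting title.

```python
import pandas as pd

# Made-up rows for illustration; "Source Show" stands in for the title
# the crawl started from, and actor A lists it twice.
sample_df = pd.DataFrame({
    "actor": ["A", "A", "A", "B", "B", "B", "C", "C"],
    "movie_or_TV_name": [
        "Source Show", "Source Show", "Show X",
        "Source Show", "Show X", "Show Y",
        "Source Show", "Show X",
    ],
})

shared = (
    sample_df.groupby("movie_or_TV_name")["actor"]
    .nunique()                      # distinct actors per title
    .drop("Source Show")            # remove the starting title by name
    .sort_values(ascending=False)   # most shared actors first
)
print(shared)
```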

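Modern Family tops the list because it is the show we started from, so every cast member appears in it. Setting it aside, the top 3 shows this system would recommend to someone who likes Modern Family are Bones, Frasier, and CSI: Crime Scene Investigation. Let’s look at how the top shows compare in their number of shared actors.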

# create figure and axes
fig, ax = plt.subplots(1, 1, figsize=(20, 10))

# skip the first row (Modern Family itself) and plot the next 10 titles
top10 = grouped_df[1:11]
g = sns.barplot(data=top10, x=top10.index, y="actor", ax=ax)

# set axis labels and title
g.set_title("Top 10 Movie/TV Show Recommendations")
g.set_xlabel("Movie or TV Show name")
g.set_ylabel("Number of shared actors")