Web scraping is a useful tool to gather data that is publicly available on the internet. In this post, we will be creating a movie recommendation system using the TMDB website. The recommendation system will work by going through all the actors of your favorite show or movie and displaying the other shows and movies that they have acted in.

Here’s a link to my project repository for the full code: https://github.com/roberttran1/TMDB_scraper.git

The first step to create the scraper is to create the project directory using the command scrapy startproject [NAME OF PROJECT]. This will create a directory which will contain a Python file with the name of the project. Open this Python file; this is the file that we will add our functions to. Let’s take a look at the functions that we will use.

Our scraper will apply three different functions to find the movies to recommend. The first one is parse():

def parse(self, response):
    """
    parses web page containing overview of TV Show/Movie,
    finds cast and crew link,
    calls parse_full_credits() for that link
    """
    next_page = response.css("p.new_button a::attr(href)").get()  # get link to the Full Cast & Crew page
    if next_page:
        yield response.follow(next_page, callback=self.parse_full_credits)  # call next function

This method works by identifying the link that is connected with the “Full Cast & Crew” text and will use this link in the parse_full_credits() function next.
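All three functions live as methods of a Scrapy spider class inside that project file. As a rough sketch of the surrounding scaffolding (the class name and starting URL below are illustrative assumptions, not taken from the repository; only the spider name matches the scrapy crawl tmdb_spider command used later in the post):

import scrapy

class TmdbSpider(scrapy.Spider):
    # "tmdb_spider" is the name referenced by the scrapy crawl command later in the post
    name = "tmdb_spider"
    # illustrative starting point (an assumption): the TMDB overview page of your favorite show or movie
    start_urls = ["https://www.themoviedb.org/tv/1421-modern-family"]

    # parse(), parse_full_credits(), and parse_actor_page() are defined as methods of this class

With that scaffolding in mind, the second function is parse_full_credits():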
def parse_full_credits(self, response):
    """
    parses Full Cast & Crew page,
    finds links to cast members' pages,
    directs to parse_actor_page() for that link
    """
    # get links to actor pages
    for cast in response.css("ol.people.credits a::attr(href)").getall():
        if "/person/" in cast:
            # call next function
            yield response.follow(cast, callback=self.parse_actor_page)
The parse_full_credits() function scrapes the Full Cast & Crew page of the previously selected TV show or movie and yields a request for the page of every actor listed in its cast. Each of those actor pages is then handled by the parse_actor_page() function to find the movies and TV shows that the actor has been in.
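For intuition, the hrefs pulled out of the cast list are relative paths that point at person pages, roughly of the form in the hypothetical snippet below (the id and name are made up); the "/person/" check discards any other links the selector picks up, and response.follow() resolves the relative path against the current page's URL.

# hypothetical example of one href scraped from the cast list
cast = "/person/12345-jane-doe"
print("/person/" in cast)  # True, so a request to this actor's page is yielded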
def parse_actor_page(self, response):
    """
    parses actor overview page,
    finds movies that the actor has played in,
    yields actor name and movie/TV show name in a dictionary
    """
    # find actor name
    actor_name = response.css("h2.title a::text").get()

    # get movies the actor has played in
    for movie_name in response.css("a.tooltip bdi::text").getall():
        yield {"actor": actor_name, "movie_or_TV_name": movie_name}
This is the final function; for every credit on the actor’s page it yields a dictionary containing the actor’s name and the name of the movie or TV show they acted in.
In order to run the spider, we will use the command scrapy crawl tmdb_spider -o results.csv. This command will save the yielded dictionaries into a file called results.csv. With this file, we can look at our recommendations and see which movies and TV shows are recommended most often.
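The resulting results.csv holds one row per actor/title pair. Purely as an illustration (these rows and the column order are assumptions, not actual output from the crawl), the start of the file might look like:

actor,movie_or_TV_name
Ed O'Neill,Modern Family
Ed O'Neill,Married... with Children
Sofía Vergara,Modern Family

Counting how many times each title appears then tells us how many cast members it shares with our starting show, which is exactly what the analysis below does.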
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df = pd.read_csv("results.csv")

# count shared actors per movie/TV show and sort in descending order
grouped_df = df.groupby(["movie_or_TV_name"]).count().\
    sort_values(by=['actor'], ascending=False)

grouped_df
movie_or_TV_name | actor
---|---
Modern Family | 1009
Bones | 114
Frasier | 89
CSI: Crime Scene Investigation | 85
NCIS | 83
... | ...
Imagination Movers | 1
Imagine That | 1
Immediately Afterlife | 1
Immortal | 1
Ӕon Flux | 1

13562 rows × 1 columns
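Note that the top row is Modern Family itself: the show we started from trivially shares its entire cast, so we skip that first row when reading off recommendations (and when plotting below). A minimal sketch, assuming the grouped_df from the previous cell:

# row 0 is the show we started from, so skip it and take the next three titles
top_3 = grouped_df[1:4]
print(top_3)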
Looks like for someone who likes Modern Family, the top 3 shows that this system would recommend are Bones, Frasier, and CSI: Crime Scene Investigation. Let’s look at how some of the top shows compare in terms of shared actors.
# create figure
fig, ax = plt.subplots(1, 1, figsize=(20, 10))

# bar plot of the top 10 recommendations, skipping row 0 (the show itself)
g = sns.barplot(
    data=grouped_df[1:11],
    x=grouped_df[1:11].index, y="actor")

# set axis labels and title
g.set_title("Top 10 Movie/TV Show Recommendations")
g.set_xlabel("Movie or TV Show name")
g.set_ylabel("Number of shared actors")
[Bar plot: "Top 10 Movie/TV Show Recommendations", with movie/TV show names on the x-axis and number of shared actors on the y-axis.]