Building a content-based movie recommendation system using the MovieLens dataset

Introduction

In this project, we will build a movie recommendation system using content-based filtering. We will use the MovieLens dataset, provided in Kaggle, which contains movie and user data, to recommend movies to an user based on the genres of movies he has liked the most.

Resources

GitHub repository containing this project and a list of dependencies
Open this page in Google Colab
Read the contents in my website

The video below contains an overview of the method we will implement. However, unlike the example above, they use a method which gives more weight to genres a user has rated more movies in. We will fix that mistake and show the differences in both methods.

Content-based filtering overview

The idea of content-based filtering is simple.

represent items with a set of features
discover a user’s preferences to each of those features, based on items they have already rated/consumed/liked that contain the features
infer the user’s preference to unseen items based on those item features.

Unlike more sophisticated recommendation systems, such as collaborative filtering, content-based filtering only uses data provided by a single user to model their preferences. Thus it is much simpler to implement.

A concrete example

In our project, we will build a movie recommendation system, so let’s build an example using this theme. I will number the steps below and you can look for references to these tables in the Python implementation that comes right after.

We begin with a list of ratings a user has given to the movies they have watched.

Movie ID Movie Title User Rating

1 Inception 5.0

2 The Matrix 4.5

3 Interstellar 4.0
Each of the movies belongs to a list of genres.

Movie ID Movie Title Genres

1 Inception Sci-Fi, Thriller

2 The Matrix Sci-Fi, Action

3 Interstellar Sci-Fi, Drama
If we average the ratings the user has given for each movie by genre, we get the user’s ratings for each genre.
- Sci-Fi appears in all three movies, so its average rating is: \(\frac{(5.0 + 4.5 + 4.0)}{3} = 4.5\)
- Thriller appears only in Inception, so its average rating is: 5.0
- Action appears only in The Matrix, so its average rating is: 4.5
- Drama appears only in Interstellar, so its average rating is: 4.0
This gives us the ratings by genre for that particular user.

Genre Average Rating

Sci-Fi 4.5

Thriller 5.0

Action 4.5

Drama 4.0

In technical jargon, this is called the user profile and the component in your system that generates this is the user profiler.

Movie ID	Movie Title	User Rating
1	Inception	5.0
2	The Matrix	4.5
3	Interstellar	4.0

Movie ID	Movie Title	Genres
1	Inception	Sci-Fi, Thriller
2	The Matrix	Sci-Fi, Action
3	Interstellar	Sci-Fi, Drama

Genre	Average Rating
Sci-Fi	4.5
Thriller	5.0
Action	4.5
Drama	4.0

Now, let’s consider a new list of movies this user has not yet watched.

Movie ID	Movie Title	Genres
4	Blade Runner 2049	Sci-Fi, Thriller
5	Mad Max: Fury Road	Action, Sci-Fi
6	The Social Network	Drama
7	The Crow	Action, Horror
8	La La Land	Musical, Romantic Comedy

When we select and preprocess item features, keeping them ready to match against user profiles, we performing content analysis. In our example, we use movie genres as their features, transforming them into what are called tags, and we are not considering other information such as the movie description, its release date or its director. Some of these could also be transformed into tags, while others need to be processed using NLP techniques.

To estimate the current user’s ratings for these unwatched movies, we take the average of the user’s ratings for the movie’s genres:
- Blade Runner 2049 (Sci-Fi, Thriller)
  \(\frac{(4.5 + 5.0)}{2} = 4.75\)
- Mad Max: Fury Road (Action, Sci-Fi)
  \(\frac{(4.5 + 4.5)}{2} = 4.5\)
- The Social Network (Drama)
  - Drama’s average rating is 4.0, so the estimated rating is 4.0.
- The Crow (Action, Horror)
  - Because the user has not ranked any horror movies yet, we can average over the ratings of the genres we do have. In this case, only Action, whose estimated rating is 4.5.
- La La Land (Musical, Romantic Comedy)
  - The user has not rated any romantic comedies or musicals yet, so we are unable to estimate a rating for La La Land and thus we are unable to recommend it to this user.
Movie ID Movie Title Inferred Rating

4 Blade Runner 2049 4.75

5 Mad Max: Fury Road 4.5

6 The Social Network 4.0

7 The Crow 4.5

8 La La Land —

Movie ID	Movie Title	Inferred Rating
4	Blade Runner 2049	4.75
5	Mad Max: Fury Road	4.5
6	The Social Network	4.0
7	The Crow	4.5
8	La La Land	—

Now we know that, if we were to recommend new movies for that user to watch, we would recommend “Blade Runner 2049” first, then “Mad Max: Fury Road” or “The Crow” and “The Social Network” last.

The process of selecting ranked items for a given users is called content retrieval.

Maybe we would like to prioritize “Mad Max: Fury Road” over “The Crow” since we have more user preference information to estimate the rating of the first movie, but that is not strictly necessary and we can consider them tied until the user rates more movies.

Downloading the dataset

As you have seen above, we need a dataset that contains:

movies with their genres
user ratings for movies

Fortunately, the MovieLens dataset has both of these pieces of data. We will download it using the kagglehub package.

import pathlib as pl

import kagglehub
import pandas as pd

(mvls_links_path, mvls_genometags_path, mvls_movies_path, mvls_genomescores_path, mvls_tags_path, mvls_ratings_path) = (
  list(pl.Path(kagglehub.dataset_download("grouplens/movielens-20m-dataset")).iterdir())
)
print(
  mvls_links_path, mvls_genometags_path, mvls_movies_path, mvls_genomescores_path, mvls_tags_path, mvls_ratings_path
)

/home/dodo/.cache/kagglehub/datasets/grouplens/movielens-20m-dataset/versions/1/link.csv /home/dodo/.cache/kagglehub/datasets/grouplens/movielens-20m-dataset/versions/1/genome_tags.csv /home/dodo/.cache/kagglehub/datasets/grouplens/movielens-20m-dataset/versions/1/movie.csv /home/dodo/.cache/kagglehub/datasets/grouplens/movielens-20m-dataset/versions/1/genome_scores.csv /home/dodo/.cache/kagglehub/datasets/grouplens/movielens-20m-dataset/versions/1/tag.csv /home/dodo/.cache/kagglehub/datasets/grouplens/movielens-20m-dataset/versions/1/rating.csv

MovieLens contains a collection of CSV files. You can see below the movie-related data, which contains the movie and year of release in one column, and all of its genres on another. This equivalent to table 2 in the example.

movies = pd.read_csv(mvls_movies_path).set_index("movieId").sort_index()
movies.head()

	title	genres
movieId
1	Toy Story (1995)	Adventure\|Animation\|Children\|Comedy\|Fantasy
2	Jumanji (1995)	Adventure\|Children\|Fantasy
3	Grumpier Old Men (1995)	Comedy\|Romance
4	Waiting to Exhale (1995)	Comedy\|Drama\|Romance
5	Father of the Bride Part II (1995)	Comedy

The first thing we will do is represent movie genres by one-hot encoded columns. We will also keep a record of the names of these columns for later.

# get a single list of all unique genres for all movies
genres = list(set(movies["genres"].str.split("|").sum()))
genres = sorted(genres)
movies = movies.join(movies["genres"].str.get_dummies(sep="|")).drop("genres", axis=1)
print(f"Unique genres: {genres}")
print(f"Number of genres: {len(genres)}")
movies.head()

Unique genres: ['(no genres listed)', 'Action', 'Adventure', 'Animation', 'Children', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'IMAX', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']
Number of genres: 20

	title	(no genres listed)	Action	Adventure	Animation	Children	Comedy	Crime	Documentary	Drama	...	Film-Noir	Horror	IMAX	Musical	Mystery	Romance	Sci-Fi	Thriller	War	Western
movieId
1	Toy Story (1995)	0	0	1	1	1	1	0	0	0	...	0	0	0	0	0	0	0	0	0	0
2	Jumanji (1995)	0	0	1	0	1	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
3	Grumpier Old Men (1995)	0	0	0	0	0	1	0	0	0	...	0	0	0	0	0	1	0	0	0	0
4	Waiting to Exhale (1995)	0	0	0	0	0	1	0	0	1	...	0	0	0	0	0	1	0	0	0	0
5	Father of the Bride Part II (1995)	0	0	0	0	0	1	0	0	0	...	0	0	0	0	0	0	0	0	0	0

5 rows × 21 columns

Regarding user ratings to movies, MovieLens provides us with 20 million ratings. In the cell below, you can see a few basic characteristics of user ratings. Most importantly, they range from 0.5 to 5 in intervals of 0.5

ratings = pd.read_csv(mvls_ratings_path)
ratings["rating"].value_counts()

rating
0    5561926
0    4291193
0    2898660
5    2200156
5    1534824
0    1430997
5     883398
0     680732
5     279252
5     239125
Name: count, dtype: int64

A user rating to a movie ties a user ID to a movie ID and contains the 0.5–5 rating the user has given the movie. In this project, we will work only with the user whose ID is 1. Let’s peek at some movies he has seen and their ratings.

We can see that user 1 has watched and rated 175 movies. That is a lot of movies. Also, the minimum rating user 1 has given to a movie is 3, far from the 0.5 it is allowed to give.

user_ratings = ratings[ratings["userId"] == 1]
ax = user_ratings["rating"].hist(bins=10, range=(0, 5))
ax.bar_label(ax.containers[0]);

png

This is what the user-movie rating data actually looks like. This is equivalent to table 1 in the example.

user_ratings.head()

	userId	movieId	rating	timestamp
0	1	2	3.5	2005-04-02 23:53:47
1	1	29	3.5	2005-04-02 23:31:16
2	1	32	3.5	2005-04-02 23:33:39
3	1	47	3.5	2005-04-02 23:32:07
4	1	50	3.5	2005-04-02 23:29:40

Computing user-genre preferences

Our goal now will be to generate the user’s features, which will be derived from the genres of the movies they have watched and their ratings.

To make things easier for us, we will join user 1’s movie ratings with the genres of those movies.

user_ratings = user_ratings[["userId", "movieId", "rating"]].merge(movies, on="movieId")
user_ratings.head()

	userId	movieId	rating	title	Adventure	Children	...	Mystery	Sci-Fi	Thriller
0	1	2	3.5	Jumanji (1995)	1	1	...	0	0	0
1	1	29	3.5	City of Lost Children, The (Cité des enfants p...	1	0	...	1	1	0
2	1	32	3.5	Twelve Monkeys (a.k.a. 12 Monkeys) (1995)	0	0	...	1	1	1
3	1	47	3.5	Seven (a.k.a. Se7en) (1995)	0	0	...	1	0	1
4	1	50	3.5	Usual Suspects, The (1995)	0	0	...	1	0	1

5 rows × 24 columns

Now, we will propagate the user’s ratings to the genres of each movie. by multiplying the ratings user 1 has given to a movie with the genre columns of each movie. You will see that the one-hot encoding of that movie will become the actual ratings user 1 has given to the movie.

# multiply the genre columns by the rating
user_ratings[genres] = user_ratings[genres].multiply(user_ratings["rating"], axis="index")
user_ratings.head()

	userId	movieId	rating	title	Adventure	Children	...	Mystery	Sci-Fi	Thriller
0	1	2	3.5	Jumanji (1995)	3.5	3.5	...	0.0	0.0	0.0
1	1	29	3.5	City of Lost Children, The (Cité des enfants p...	3.5	0.0	...	3.5	3.5	0.0
2	1	32	3.5	Twelve Monkeys (a.k.a. 12 Monkeys) (1995)	0.0	0.0	...	3.5	3.5	3.5
3	1	47	3.5	Seven (a.k.a. Se7en) (1995)	0.0	0.0	...	3.5	0.0	3.5
4	1	50	3.5	Usual Suspects, The (1995)	0.0	0.0	...	3.5	0.0	3.5

5 rows × 24 columns

Now, if we sum all the ratings user 1 has given to the movies they have watched by genre, we get a proxy value that denotes how much user 1 enjoys each of the genres.

user_genre_preferences = pd.DataFrame(
  data={"Movies": (user_ratings[genres] != 0).sum(), "Sum of ratings": user_ratings[genres].sum()}
)
user_genre_preferences.sort_values("Sum of ratings", ascending=False)

	Movies	Sum of ratings
Adventure	73	276.5
Fantasy	69	261.5
Action	66	246.0
Horror	45	168.5
Drama	43	162.0
Thriller	42	158.0
Comedy	41	153.0
Sci-Fi	40	148.5
Crime	21	80.0
Children	19	68.5
Mystery	18	65.0
Romance	11	43.5
Animation	10	36.5
War	9	33.0
Western	4	13.5
Musical	3	11.0
IMAX	2	8.5
(no genres listed)	0	0.0
Documentary	0	0.0
Film-Noir	0	0.0

Because the magnitudes of these sums of ratings can get out of hand, we can divide them by the sum of all ratings and get neat values that all add up to 1.

user_genre_preferences["Normalized sum of ratings"] = (
  user_genre_preferences["Sum of ratings"] / user_genre_preferences["Sum of ratings"].max()
)
user_genre_preferences.sort_values("Normalized sum of ratings", ascending=False)

	Movies	Sum of ratings	Normalized sum of ratings
Adventure	73	276.5	1.000000
Fantasy	69	261.5	0.945750
Action	66	246.0	0.889693
Horror	45	168.5	0.609403
Drama	43	162.0	0.585895
Thriller	42	158.0	0.571429
Comedy	41	153.0	0.553345
Sci-Fi	40	148.5	0.537071
Crime	21	80.0	0.289331
Children	19	68.5	0.247740
Mystery	18	65.0	0.235081
Romance	11	43.5	0.157324
Animation	10	36.5	0.132007
War	9	33.0	0.119349
Western	4	13.5	0.048825
Musical	3	11.0	0.039783
IMAX	2	8.5	0.030741
(no genres listed)	0	0.0	0.000000
Documentary	0	0.0	0.000000
Film-Noir	0	0.0	0.000000

These values are interesting and they may be used as an indication of how much a user prefers certain movie genres. This is the way that is taught in the video at the beginning of the page.

However, because this method just sums all ratings the user has given to a collection of movies, it favors those genres in which the user has given more ratings, regardless of how low the ratings are.

For example, if a user watches 9 action movies and gives a rating of 1 to all of them, the Action genre will have a weight of 9 in our preference vector:

1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 = 9

If that same user watches 2 musicals (I know, who likes musicals) but gives a rating of 4 to both of them, the Musical genre will have a weight of 8 in our fetaure vector, which is less than the weight of the Action genre.

4 + 4 = 8

But we can clearly see that this user hates action movies but loves musicals!

Our hypothesis is corroborated in the chart below, in which our sum of ratings for a genre grows as the user watches more movies.

user_genre_preferences[["Movies", "Sum of ratings"]].sort_values("Movies", ascending=False).plot.barh()

<Axes: >

png

By looking at the correlations between number of movies watched, sum of ratings and the normalized sum of ratings, we can see a pretty hgh positive correlation between all of them.

user_genre_preferences.corr()

	Movies	Sum of ratings	Normalized sum of ratings
Movies	1.000000	0.999871	0.999871
Sum of ratings	0.999871	1.000000	1.000000
Normalized sum of ratings	0.999871	1.000000	1.000000

One way we can mitigate this is by taking the average of the ratings a user gives to the movies of a particular genre as their preference for that genre. The table below is the one presented in step 3 of the example.

user_genre_preferences["Normalized by movies in genre"] = (
  user_genre_preferences["Sum of ratings"] / user_genre_preferences["Movies"]
).fillna(0)
user_genre_preferences = user_genre_preferences.sort_values("Normalized by movies in genre", ascending=False)
user_genre_preferences

	Movies	Sum of ratings	Normalized sum of ratings	Normalized by movies in genre
IMAX	2	8.5	0.030741	4.250000
Romance	11	43.5	0.157324	3.954545
Crime	21	80.0	0.289331	3.809524
Fantasy	69	261.5	0.945750	3.789855
Adventure	73	276.5	1.000000	3.787671
Drama	43	162.0	0.585895	3.767442
Thriller	42	158.0	0.571429	3.761905
Horror	45	168.5	0.609403	3.744444
Comedy	41	153.0	0.553345	3.731707
Action	66	246.0	0.889693	3.727273
Sci-Fi	40	148.5	0.537071	3.712500
War	9	33.0	0.119349	3.666667
Musical	3	11.0	0.039783	3.666667
Animation	10	36.5	0.132007	3.650000
Mystery	18	65.0	0.235081	3.611111
Children	19	68.5	0.247740	3.605263
Western	4	13.5	0.048825	3.375000
(no genres listed)	0	0.0	0.000000	0.000000
Documentary	0	0.0	0.000000	0.000000
Film-Noir	0	0.0	0.000000	0.000000

If we now look at the correlation of the two feature sets implemented, we can see they are mostly uncorrelated, which means they are two completely different ways of expressing user preference.

user_genre_preferences[["Normalized sum of ratings", "Normalized by movies in genre"]].corr()

	Normalized sum of ratings	Normalized by movies in genre
Normalized sum of ratings	1.000000	0.451363
Normalized by movies in genre	0.451363	1.000000

As you can see above, the second choice of features is very interesting. While before, we thought that the user enjoyed the Adventure, Fantasy Action and Horror genres, now we belive they enjoy IMAX movies, romances and crimes, even though they have watched less movies in that genre.

One thing to note is that not all genres have inferred ratings for the given user. In cases where the user has not rated any movie of a particular genre, that genre will not have a rating. In our case, since MovieLens ratings start at 0.5, we use 0 to denote the absence of ratings.

genres_with_ratings = (user_genre_preferences["Normalized by movies in genre"] > 0).sum()
genres_without_ratings = len(user_genre_preferences) - genres_with_ratings

print(f"Total genres: {len(user_genre_preferences)}\nNumber of genres with ratings for user: {genres_with_ratings}")

Total genres: 20
Number of genres with ratings for user: 17

Inferring preferences for unwatched movies

Now that we know how much the user prefers each genre, we will compute the ratings for movies that user has not watched yet.

First, let’s get all the movies they have not watched from MovieLens. This would be the table from step 4 in the example.

# get the movies that the user has not seen
unwatched_movies = movies.loc[~movies.index.isin(user_ratings["movieId"])].copy()
unwatched_movies

	title	(no genres listed)	Action	Adventure	Animation	Children	Comedy	Crime	Documentary	Drama	...	Film-Noir	Horror	IMAX	Musical	Mystery	Romance	Sci-Fi	Thriller	War	Western
movieId
1	Toy Story (1995)	0	0	1	1	1	1	0	0	0	...	0	0	0	0	0	0	0	0	0	0
3	Grumpier Old Men (1995)	0	0	0	0	0	1	0	0	0	...	0	0	0	0	0	1	0	0	0	0
4	Waiting to Exhale (1995)	0	0	0	0	0	1	0	0	1	...	0	0	0	0	0	1	0	0	0	0
5	Father of the Bride Part II (1995)	0	0	0	0	0	1	0	0	0	...	0	0	0	0	0	0	0	0	0	0
6	Heat (1995)	0	1	0	0	0	0	1	0	0	...	0	0	0	0	0	0	0	1	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
131254	Kein Bund für's Leben (2007)	0	0	0	0	0	1	0	0	0	...	0	0	0	0	0	0	0	0	0	0
131256	Feuer, Eis & Dosenbier (2002)	0	0	0	0	0	1	0	0	0	...	0	0	0	0	0	0	0	0	0	0
131258	The Pirates (2014)	0	0	1	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
131260	Rentun Ruusu (2001)	1	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
131262	Innocence (2014)	0	0	1	0	0	0	0	0	0	...	0	1	0	0	0	0	0	0	0	0

27103 rows × 21 columns

As an intermediate step, we will to propagate the user’s genre preferences to the preferences of these unwatched movies.

movie_genre_ratings = unwatched_movies[genres].multiply(user_genre_preferences["Normalized by movies in genre"])
movie_genre_ratings

	(no genres listed)	Action	Adventure	Animation	Children	Comedy	Crime	Documentary	Drama	Fantasy	Film-Noir	Horror	IMAX	Musical	Mystery	Romance	Sci-Fi	Thriller	War	Western
movieId
1	0.0	0.000000	3.787671	3.65	3.605263	3.731707	0.000000	0.0	0.000000	3.789855	0.0	0.000000	0.0	0.0	0.0	0.000000	0.0	0.000000	0.0	0.0
3	0.0	0.000000	0.000000	0.00	0.000000	3.731707	0.000000	0.0	0.000000	0.000000	0.0	0.000000	0.0	0.0	0.0	3.954545	0.0	0.000000	0.0	0.0
4	0.0	0.000000	0.000000	0.00	0.000000	3.731707	0.000000	0.0	3.767442	0.000000	0.0	0.000000	0.0	0.0	0.0	3.954545	0.0	0.000000	0.0	0.0
5	0.0	0.000000	0.000000	0.00	0.000000	3.731707	0.000000	0.0	0.000000	0.000000	0.0	0.000000	0.0	0.0	0.0	0.000000	0.0	0.000000	0.0	0.0
6	0.0	3.727273	0.000000	0.00	0.000000	0.000000	3.809524	0.0	0.000000	0.000000	0.0	0.000000	0.0	0.0	0.0	0.000000	0.0	3.761905	0.0	0.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
131254	0.0	0.000000	0.000000	0.00	0.000000	3.731707	0.000000	0.0	0.000000	0.000000	0.0	0.000000	0.0	0.0	0.0	0.000000	0.0	0.000000	0.0	0.0
131256	0.0	0.000000	0.000000	0.00	0.000000	3.731707	0.000000	0.0	0.000000	0.000000	0.0	0.000000	0.0	0.0	0.0	0.000000	0.0	0.000000	0.0	0.0
131258	0.0	0.000000	3.787671	0.00	0.000000	0.000000	0.000000	0.0	0.000000	0.000000	0.0	0.000000	0.0	0.0	0.0	0.000000	0.0	0.000000	0.0	0.0
131260	0.0	0.000000	0.000000	0.00	0.000000	0.000000	0.000000	0.0	0.000000	0.000000	0.0	0.000000	0.0	0.0	0.0	0.000000	0.0	0.000000	0.0	0.0
131262	0.0	0.000000	3.787671	0.00	0.000000	0.000000	0.000000	0.0	0.000000	3.789855	0.0	3.744444	0.0	0.0	0.0	0.000000	0.0	0.000000	0.0	0.0

27103 rows × 20 columns

We then sum these preferences by movie and divide by the number of genres each movie has, to get the user preferences for each unwatched movie.

If we sort the movies by these inferred preferences, we get a list in descending order of the movies user 1 has not seen but might enjoy, based on their genres and the ratings user 1 has given to the movies they have seen.

denominator = movie_genre_ratings != 0
numerator = movie_genre_ratings.sum(axis=1)
content_based_ratings = numerator / denominator.sum(axis=1)
unwatched_movies["content_based_rating"] = content_based_ratings
unwatched_movies[["title", "content_based_rating"]].sort_values("content_based_rating", ascending=False)

	title	content_based_rating
movieId
4858	Hail Columbia! (1982)	4.25
4861	Mission to Mir (1997)	4.25
4459	Alaska: Spirit of the Wild (1997)	4.25
4460	Encounter in the Third Dimension (1999)	4.25
4461	Siegfried & Roy: The Magic Box (1999)	4.25
...	...	...
131108	The Fearless Four (1997)	NaN
131110	A House of Secrets: Exploring 'Dragonwyck' (2008)	NaN
131166	WWII IN HD (2009)	NaN
131172	Closed Curtain (2013)	NaN
131260	Rentun Ruusu (2001)	NaN

27103 rows × 2 columns

Notice that this step already takes care of situations in which a movie belongs to a certain genre but the user has no rating for that genre. For example, this user has a rating for the Action gente, but not for Western. If movie belongs to both genres, the average will be taken only over one genre, since content_based_ratings.loc[movie_id, "<genre without user rating>"] will be equal to 0 and ignored when accounting for our denominator.

We can also compute the alternate user-genre preferences, using normalized sums of ratings, which are highly correlated with the number of movies watched in each genre. You can see we get a different list of movies, although some movies remain in the list and the first movie is the same.

content_based_ratings = unwatched_movies[genres].multiply(user_genre_preferences["Normalized sum of ratings"])
content_based_ratings = content_based_ratings.sum(axis=1)
unwatched_movies["content_based_rating2"] = content_based_ratings
unwatched_movies[["title", "content_based_rating2"]].sort_values("content_based_rating2", ascending=False)

	title	content_based_rating2
movieId
81132	Rubber (2010)	4.783002
49593	She (1965)	4.725136
5018	Motorama (1991)	4.717902
71999	Aelita: The Queen of Mars (Aelita) (1924)	4.687161
72165	Cirque du Freak: The Vampire's Assistant (2009)	4.569620
...	...	...
100246	Pretty Sweet (2012)	0.000000
99	Heidi Fleiss: Hollywood Madam (1995)	0.000000
100266	Day Is Done (2011)	0.000000
100287	Head Games (2012)	0.000000
96842	Behind the Burly Q: The Story of Burlesque in ...	0.000000

27103 rows × 2 columns

Both of the tables above are equivalent to the table presented in step 5 of the example.

Discussion

One disadvantage of content-based filtering is that we can only recommend an item for a user if that user has rated at least one other item that contains at least one feature in common with the item in question.

In our case, we can only recommend movies to a user if that movie contains at least one genre in common with another movie that the user has already rated.

The fewer features an unrated item has with rated ones, the less granular its inferred rating will be.

The diagrams below illustrate a situation in which a user has ratings for 17 genres, computed from the ratings of movies that contain those genres. In the first example, a movie with 5 genres has 2 genres in common with rated genres for the user, so we can compute an inferred rating for the movie.

In the second example, a movie with 5 genres has 0 genres in common with the rated genres for the user, so we cannot directly compute an inferred rating for it.

from matplotlib_venn import venn2
import matplotlib.pyplot as plt

movie_genres_count = 5
intersection = 2
set1 = genres_with_ratings - intersection
set2 = movie_genres_count - intersection
venn2(subsets=(set1, set2, intersection), set_labels=("Features with\nuser ratings", "Movie features"), alpha=0.5)
plt.title(f"Movie that can be rated based on {intersection} features")
plt.show()
venn2(
  subsets=(genres_with_ratings, movie_genres_count, 0),
  set_labels=("Features with\nuser ratings", "Movie features"),
  alpha=0.5,
)
plt.title("Movie that cannot be rated")
plt.show()

png

Because in our example, this user has watched over 100 movies, we are able to recommend over 90% of unwatched movies to them.

movies_with_inferred_ratings = (~unwatched_movies["content_based_rating"].isna()).sum()
print(f"Number of unwatched movies: {len(unwatched_movies)}")
print(f"Number of unwatched movies with inferred ratings: {movies_with_inferred_ratings}")
print(
  f"Percentage of unwatched movies with inferred ratings: {movies_with_inferred_ratings / len(unwatched_movies):.2%}"
)

Number of unwatched movies: 27103
Number of unwatched movies with inferred ratings: 24901
Percentage of unwatched movies with inferred ratings: 91.88%

Another disadvantage is that, depending on the number of features and technique employed in computing inferred ratings, the granularity of ratings can be very poor.

In our case, although we inferred ratings for over 24,000 movies, over 50% of them received one of 10 ratings.

inferred_ratings_count = unwatched_movies["content_based_rating"].value_counts()
topk = 10

print(f"Unique inferred ratings: {len(inferred_ratings_count)}")
print(
  f"Percentage of unwatched movies in the top {topk} inferred ratings: {(inferred_ratings_count.iloc[:topk].sum() / movies_with_inferred_ratings):.2%}"
)

Unique inferred ratings: 1182
Percentage of unwatched movies in the top 10 inferred ratings: 51.42%

This happens because, while we have \(n\) total features to represent user taste and items (the movie genres), the current user only has ratings for a subset \(m \leq n\) of the features.

Using a simple average over genre ratings, a movie’s inferred rating can only come from one of \(2^m-1\) possible values.

If the number of genres of the movie is known, say \(k\), then this value is restricted to \(\binom{m}{k}\), assuming that each possible rating is different.

The table below shows the number of possible ratings a movie may have, given the number of genres it has.

from scipy.special import comb

genres_count = unwatched_movies[genres].sum(axis=1).value_counts()
possible_ratings = pd.DataFrame(
  index=genres_count.index,
  data={
    "Number of Movies": genres_count,
    "Possible ratings": [comb(genres_with_ratings, i) for i in genres_count.index],
  },
)
possible_ratings["Movies per possible rating"] = (
  possible_ratings["Number of Movies"] / possible_ratings["Possible ratings"]
)
possible_ratings.index.name = "Number of genres"
possible_ratings

	Number of Movies	Possible ratings	Movies per possible rating
Number of genres
1	10816	17.0	636.235294
2	8766	136.0	64.455882
3	5260	680.0	7.735294
4	1685	2380.0	0.707983
5	468	6188.0	0.075630
6	82	12376.0	0.006626
7	20	19448.0	0.001028
8	5	24310.0	0.000206
10	1	19448.0	0.000051

Conclusion

In this project, we have built a movie recommendation system based on content-based filtering.

We went through the theory of content-based filtering as well as an example, then used a publicly available dataset to implement the system. We compared two ways of building user features and saw one of the downsides of content-based filtering, which is the inability of recommending new items to a user if those items don’t have intersecting features with the items the user has already rated.

See you guys next time!