Building a content-based movie recommendation system using the MovieLens dataset

Introduction

In this project, we will build a movie recommendation system using content-based filtering. We will use the MovieLens dataset, provided in Kaggle, which contains movie and user data, to recommend movies to an user based on the genres of movies he has liked the most.

Resources

The video below contains an overview of the method we will implement. However, unlike the example above, they use a method which gives more weight to genres a user has rated more movies in. We will fix that mistake and show the differences in both methods.

Content-based filtering overview

The idea of content-based filtering is simple.

  1. represent items with a set of features
  2. discover a user’s preferences to each of those features, based on items they have already rated/consumed/liked that contain the features
  3. infer the user’s preference to unseen items based on those item features.

Unlike more sophisticated recommendation systems, such as collaborative filtering, content-based filtering only uses data provided by a single user to model their preferences. Thus it is much simpler to implement.

A concrete example

In our project, we will build a movie recommendation system, so let’s build an example using this theme. I will number the steps below and you can look for references to these tables in the Python implementation that comes right after.

  1. We begin with a list of ratings a user has given to the movies they have watched.

    Movie ID Movie Title User Rating
    1 Inception 5.0
    2 The Matrix 4.5
    3 Interstellar 4.0
  2. Each of the movies belongs to a list of genres.

    Movie ID Movie Title Genres
    1 Inception Sci-Fi, Thriller
    2 The Matrix Sci-Fi, Action
    3 Interstellar Sci-Fi, Drama
  3. If we average the ratings the user has given for each movie by genre, we get the user’s ratings for each genre.

    • Sci-Fi appears in all three movies, so its average rating is: \(\frac{(5.0 + 4.5 + 4.0)}{3} = 4.5\)
    • Thriller appears only in Inception, so its average rating is: 5.0
    • Action appears only in The Matrix, so its average rating is: 4.5
    • Drama appears only in Interstellar, so its average rating is: 4.0

    This gives us the ratings by genre for that particular user.

    Genre Average Rating
    Sci-Fi 4.5
    Thriller 5.0
    Action 4.5
    Drama 4.0

    In technical jargon, this is called the user profile and the component in your system that generates this is the user profiler.

  4. Now, let’s consider a new list of movies this user has not yet watched.

    Movie ID Movie Title Genres
    4 Blade Runner 2049 Sci-Fi, Thriller
    5 Mad Max: Fury Road Action, Sci-Fi
    6 The Social Network Drama
    7 The Crow Action, Horror
    8 La La Land Musical, Romantic Comedy

When we select and preprocess item features, keeping them ready to match against user profiles, we performing content analysis. In our example, we use movie genres as their features, transforming them into what are called tags, and we are not considering other information such as the movie description, its release date or its director. Some of these could also be transformed into tags, while others need to be processed using NLP techniques.

  1. To estimate the current user’s ratings for these unwatched movies, we take the average of the user’s ratings for the movie’s genres:

    • Blade Runner 2049 (Sci-Fi, Thriller)
      \(\frac{(4.5 + 5.0)}{2} = 4.75\)
    • Mad Max: Fury Road (Action, Sci-Fi)
      \(\frac{(4.5 + 4.5)}{2} = 4.5\)
    • The Social Network (Drama)
      • Drama’s average rating is 4.0, so the estimated rating is 4.0.
    • The Crow (Action, Horror)
      • Because the user has not ranked any horror movies yet, we can average over the ratings of the genres we do have. In this case, only Action, whose estimated rating is 4.5.
    • La La Land (Musical, Romantic Comedy)
      • The user has not rated any romantic comedies or musicals yet, so we are unable to estimate a rating for La La Land and thus we are unable to recommend it to this user.
    Movie ID Movie Title Inferred Rating
    4 Blade Runner 2049 4.75
    5 Mad Max: Fury Road 4.5
    6 The Social Network 4.0
    7 The Crow 4.5
    8 La La Land

Now we know that, if we were to recommend new movies for that user to watch, we would recommend “Blade Runner 2049” first, then “Mad Max: Fury Road” or “The Crow” and “The Social Network” last.

The process of selecting ranked items for a given users is called content retrieval.

Maybe we would like to prioritize “Mad Max: Fury Road” over “The Crow” since we have more user preference information to estimate the rating of the first movie, but that is not strictly necessary and we can consider them tied until the user rates more movies.

Downloading the dataset

As you have seen above, we need a dataset that contains:

  1. movies with their genres
  2. user ratings for movies

Fortunately, the MovieLens dataset has both of these pieces of data. We will download it using the kagglehub package.

import pathlib as pl

import kagglehub
import pandas as pd

(mvls_links_path, mvls_genometags_path, mvls_movies_path, mvls_genomescores_path, mvls_tags_path, mvls_ratings_path) = (
  list(pl.Path(kagglehub.dataset_download("grouplens/movielens-20m-dataset")).iterdir())
)
print(
  mvls_links_path, mvls_genometags_path, mvls_movies_path, mvls_genomescores_path, mvls_tags_path, mvls_ratings_path
)
/home/dodo/.cache/kagglehub/datasets/grouplens/movielens-20m-dataset/versions/1/link.csv /home/dodo/.cache/kagglehub/datasets/grouplens/movielens-20m-dataset/versions/1/genome_tags.csv /home/dodo/.cache/kagglehub/datasets/grouplens/movielens-20m-dataset/versions/1/movie.csv /home/dodo/.cache/kagglehub/datasets/grouplens/movielens-20m-dataset/versions/1/genome_scores.csv /home/dodo/.cache/kagglehub/datasets/grouplens/movielens-20m-dataset/versions/1/tag.csv /home/dodo/.cache/kagglehub/datasets/grouplens/movielens-20m-dataset/versions/1/rating.csv

MovieLens contains a collection of CSV files. You can see below the movie-related data, which contains the movie and year of release in one column, and all of its genres on another. This equivalent to table 2 in the example.

movies = pd.read_csv(mvls_movies_path).set_index("movieId").sort_index()
movies.head()
title genres
movieId
1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
2 Jumanji (1995) Adventure|Children|Fantasy
3 Grumpier Old Men (1995) Comedy|Romance
4 Waiting to Exhale (1995) Comedy|Drama|Romance
5 Father of the Bride Part II (1995) Comedy

The first thing we will do is represent movie genres by one-hot encoded columns. We will also keep a record of the names of these columns for later.

# get a single list of all unique genres for all movies
genres = list(set(movies["genres"].str.split("|").sum()))
genres = sorted(genres)
movies = movies.join(movies["genres"].str.get_dummies(sep="|")).drop("genres", axis=1)
print(f"Unique genres: {genres}")
print(f"Number of genres: {len(genres)}")
movies.head()
Unique genres: ['(no genres listed)', 'Action', 'Adventure', 'Animation', 'Children', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'IMAX', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western']
Number of genres: 20
title (no genres listed) Action Adventure Animation Children Comedy Crime Documentary Drama ... Film-Noir Horror IMAX Musical Mystery Romance Sci-Fi Thriller War Western
movieId
1 Toy Story (1995) 0 0 1 1 1 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 Jumanji (1995) 0 0 1 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 Grumpier Old Men (1995) 0 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 1 0 0 0 0
4 Waiting to Exhale (1995) 0 0 0 0 0 1 0 0 1 ... 0 0 0 0 0 1 0 0 0 0
5 Father of the Bride Part II (1995) 0 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 21 columns

Regarding user ratings to movies, MovieLens provides us with 20 million ratings. In the cell below, you can see a few basic characteristics of user ratings. Most importantly, they range from 0.5 to 5 in intervals of 0.5

ratings = pd.read_csv(mvls_ratings_path)
ratings["rating"].value_counts()
rating
4.0    5561926
3.0    4291193
5.0    2898660
3.5    2200156
4.5    1534824
2.0    1430997
2.5     883398
1.0     680732
1.5     279252
0.5     239125
Name: count, dtype: int64

A user rating to a movie ties a user ID to a movie ID and contains the 0.5–5 rating the user has given the movie. In this project, we will work only with the user whose ID is 1. Let’s peek at some movies he has seen and their ratings.

We can see that user 1 has watched and rated 175 movies. That is a lot of movies. Also, the minimum rating user 1 has given to a movie is 3, far from the 0.5 it is allowed to give.

user_ratings = ratings[ratings["userId"] == 1]
ax = user_ratings["rating"].hist(bins=10, range=(0, 5))
ax.bar_label(ax.containers[0]);

png

This is what the user-movie rating data actually looks like. This is equivalent to table 1 in the example.

user_ratings.head()
userId movieId rating timestamp
0 1 2 3.5 2005-04-02 23:53:47
1 1 29 3.5 2005-04-02 23:31:16
2 1 32 3.5 2005-04-02 23:33:39
3 1 47 3.5 2005-04-02 23:32:07
4 1 50 3.5 2005-04-02 23:29:40

Computing user-genre preferences

Our goal now will be to generate the user’s features, which will be derived from the genres of the movies they have watched and their ratings.

To make things easier for us, we will join user 1’s movie ratings with the genres of those movies.

user_ratings = user_ratings[["userId", "movieId", "rating"]].merge(movies, on="movieId")
user_ratings.head()
userId movieId rating title (no genres listed) Action Adventure Animation Children Comedy ... Film-Noir Horror IMAX Musical Mystery Romance Sci-Fi Thriller War Western
0 1 2 3.5 Jumanji (1995) 0 0 1 0 1 0 ... 0 0 0 0 0 0 0 0 0 0
1 1 29 3.5 City of Lost Children, The (Cité des enfants p... 0 0 1 0 0 0 ... 0 0 0 0 1 0 1 0 0 0
2 1 32 3.5 Twelve Monkeys (a.k.a. 12 Monkeys) (1995) 0 0 0 0 0 0 ... 0 0 0 0 1 0 1 1 0 0
3 1 47 3.5 Seven (a.k.a. Se7en) (1995) 0 0 0 0 0 0 ... 0 0 0 0 1 0 0 1 0 0
4 1 50 3.5 Usual Suspects, The (1995) 0 0 0 0 0 0 ... 0 0 0 0 1 0 0 1 0 0

5 rows × 24 columns

Now, we will propagate the user’s ratings to the genres of each movie. by multiplying the ratings user 1 has given to a movie with the genre columns of each movie. You will see that the one-hot encoding of that movie will become the actual ratings user 1 has given to the movie.

# multiply the genre columns by the rating
user_ratings[genres] = user_ratings[genres].multiply(user_ratings["rating"], axis="index")
user_ratings.head()
userId movieId rating title (no genres listed) Action Adventure Animation Children Comedy ... Film-Noir Horror IMAX Musical Mystery Romance Sci-Fi Thriller War Western
0 1 2 3.5 Jumanji (1995) 0.0 0.0 3.5 0.0 3.5 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 1 29 3.5 City of Lost Children, The (Cité des enfants p... 0.0 0.0 3.5 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 3.5 0.0 3.5 0.0 0.0 0.0
2 1 32 3.5 Twelve Monkeys (a.k.a. 12 Monkeys) (1995) 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 3.5 0.0 3.5 3.5 0.0 0.0
3 1 47 3.5 Seven (a.k.a. Se7en) (1995) 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 3.5 0.0 0.0 3.5 0.0 0.0
4 1 50 3.5 Usual Suspects, The (1995) 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 3.5 0.0 0.0 3.5 0.0 0.0

5 rows × 24 columns

Now, if we sum all the ratings user 1 has given to the movies they have watched by genre, we get a proxy value that denotes how much user 1 enjoys each of the genres.

user_genre_preferences = pd.DataFrame(
  data={"Movies": (user_ratings[genres] != 0).sum(), "Sum of ratings": user_ratings[genres].sum()}
)
user_genre_preferences.sort_values("Sum of ratings", ascending=False)
Movies Sum of ratings
Adventure 73 276.5
Fantasy 69 261.5
Action 66 246.0
Horror 45 168.5
Drama 43 162.0
Thriller 42 158.0
Comedy 41 153.0
Sci-Fi 40 148.5
Crime 21 80.0
Children 19 68.5
Mystery 18 65.0
Romance 11 43.5
Animation 10 36.5
War 9 33.0
Western 4 13.5
Musical 3 11.0
IMAX 2 8.5
(no genres listed) 0 0.0
Documentary 0 0.0
Film-Noir 0 0.0

Because the magnitudes of these sums of ratings can get out of hand, we can divide them by the sum of all ratings and get neat values that all add up to 1.

user_genre_preferences["Normalized sum of ratings"] = (
  user_genre_preferences["Sum of ratings"] / user_genre_preferences["Sum of ratings"].max()
)
user_genre_preferences.sort_values("Normalized sum of ratings", ascending=False)
Movies Sum of ratings Normalized sum of ratings
Adventure 73 276.5 1.000000
Fantasy 69 261.5 0.945750
Action 66 246.0 0.889693
Horror 45 168.5 0.609403
Drama 43 162.0 0.585895
Thriller 42 158.0 0.571429
Comedy 41 153.0 0.553345
Sci-Fi 40 148.5 0.537071
Crime 21 80.0 0.289331
Children 19 68.5 0.247740
Mystery 18 65.0 0.235081
Romance 11 43.5 0.157324
Animation 10 36.5 0.132007
War 9 33.0 0.119349
Western 4 13.5 0.048825
Musical 3 11.0 0.039783
IMAX 2 8.5 0.030741
(no genres listed) 0 0.0 0.000000
Documentary 0 0.0 0.000000
Film-Noir 0 0.0 0.000000

These values are interesting and they may be used as an indication of how much a user prefers certain movie genres. This is the way that is taught in the video at the beginning of the page.

However, because this method just sums all ratings the user has given to a collection of movies, it favors those genres in which the user has given more ratings, regardless of how low the ratings are.

For example, if a user watches 9 action movies and gives a rating of 1 to all of them, the Action genre will have a weight of 9 in our preference vector:

1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 + 1 = 9

If that same user watches 2 musicals (I know, who likes musicals) but gives a rating of 4 to both of them, the Musical genre will have a weight of 8 in our fetaure vector, which is less than the weight of the Action genre.

4 + 4 = 8

But we can clearly see that this user hates action movies but loves musicals!

Our hypothesis is corroborated in the chart below, in which our sum of ratings for a genre grows as the user watches more movies.

user_genre_preferences[["Movies", "Sum of ratings"]].sort_values("Movies", ascending=False).plot.barh()
<Axes: >

png

By looking at the correlations between number of movies watched, sum of ratings and the normalized sum of ratings, we can see a pretty hgh positive correlation between all of them.

user_genre_preferences.corr()
Movies Sum of ratings Normalized sum of ratings
Movies 1.000000 0.999871 0.999871
Sum of ratings 0.999871 1.000000 1.000000
Normalized sum of ratings 0.999871 1.000000 1.000000

One way we can mitigate this is by taking the average of the ratings a user gives to the movies of a particular genre as their preference for that genre. The table below is the one presented in step 3 of the example.

user_genre_preferences["Normalized by movies in genre"] = (
  user_genre_preferences["Sum of ratings"] / user_genre_preferences["Movies"]
).fillna(0)
user_genre_preferences = user_genre_preferences.sort_values("Normalized by movies in genre", ascending=False)
user_genre_preferences
Movies Sum of ratings Normalized sum of ratings Normalized by movies in genre
IMAX 2 8.5 0.030741 4.250000
Romance 11 43.5 0.157324 3.954545
Crime 21 80.0 0.289331 3.809524
Fantasy 69 261.5 0.945750 3.789855
Adventure 73 276.5 1.000000 3.787671
Drama 43 162.0 0.585895 3.767442
Thriller 42 158.0 0.571429 3.761905
Horror 45 168.5 0.609403 3.744444
Comedy 41 153.0 0.553345 3.731707
Action 66 246.0 0.889693 3.727273
Sci-Fi 40 148.5 0.537071 3.712500
War 9 33.0 0.119349 3.666667
Musical 3 11.0 0.039783 3.666667
Animation 10 36.5 0.132007 3.650000
Mystery 18 65.0 0.235081 3.611111
Children 19 68.5 0.247740 3.605263
Western 4 13.5 0.048825 3.375000
(no genres listed) 0 0.0 0.000000 0.000000
Documentary 0 0.0 0.000000 0.000000
Film-Noir 0 0.0 0.000000 0.000000

If we now look at the correlation of the two feature sets implemented, we can see they are mostly uncorrelated, which means they are two completely different ways of expressing user preference.

user_genre_preferences[["Normalized sum of ratings", "Normalized by movies in genre"]].corr()
Normalized sum of ratings Normalized by movies in genre
Normalized sum of ratings 1.000000 0.451363
Normalized by movies in genre 0.451363 1.000000

As you can see above, the second choice of features is very interesting. While before, we thought that the user enjoyed the Adventure, Fantasy Action and Horror genres, now we belive they enjoy IMAX movies, romances and crimes, even though they have watched less movies in that genre.

One thing to note is that not all genres have inferred ratings for the given user. In cases where the user has not rated any movie of a particular genre, that genre will not have a rating. In our case, since MovieLens ratings start at 0.5, we use 0 to denote the absence of ratings.

genres_with_ratings = (user_genre_preferences["Normalized by movies in genre"] > 0).sum()
genres_without_ratings = len(user_genre_preferences) - genres_with_ratings

print(f"Total genres: {len(user_genre_preferences)}\nNumber of genres with ratings for user: {genres_with_ratings}")
Total genres: 20
Number of genres with ratings for user: 17

Inferring preferences for unwatched movies

Now that we know how much the user prefers each genre, we will compute the ratings for movies that user has not watched yet.

First, let’s get all the movies they have not watched from MovieLens. This would be the table from step 4 in the example.

# get the movies that the user has not seen
unwatched_movies = movies.loc[~movies.index.isin(user_ratings["movieId"])].copy()
unwatched_movies
title (no genres listed) Action Adventure Animation Children Comedy Crime Documentary Drama ... Film-Noir Horror IMAX Musical Mystery Romance Sci-Fi Thriller War Western
movieId
1 Toy Story (1995) 0 0 1 1 1 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 Grumpier Old Men (1995) 0 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 1 0 0 0 0
4 Waiting to Exhale (1995) 0 0 0 0 0 1 0 0 1 ... 0 0 0 0 0 1 0 0 0 0
5 Father of the Bride Part II (1995) 0 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
6 Heat (1995) 0 1 0 0 0 0 1 0 0 ... 0 0 0 0 0 0 0 1 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
131254 Kein Bund für's Leben (2007) 0 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
131256 Feuer, Eis & Dosenbier (2002) 0 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
131258 The Pirates (2014) 0 0 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
131260 Rentun Ruusu (2001) 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
131262 Innocence (2014) 0 0 1 0 0 0 0 0 0 ... 0 1 0 0 0 0 0 0 0 0

27103 rows × 21 columns

As an intermediate step, we will to propagate the user’s genre preferences to the preferences of these unwatched movies.

movie_genre_ratings = unwatched_movies[genres].multiply(user_genre_preferences["Normalized by movies in genre"])
movie_genre_ratings
(no genres listed) Action Adventure Animation Children Comedy Crime Documentary Drama Fantasy Film-Noir Horror IMAX Musical Mystery Romance Sci-Fi Thriller War Western
movieId
1 0.0 0.000000 3.787671 3.65 3.605263 3.731707 0.000000 0.0 0.000000 3.789855 0.0 0.000000 0.0 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0
3 0.0 0.000000 0.000000 0.00 0.000000 3.731707 0.000000 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.0 0.0 3.954545 0.0 0.000000 0.0 0.0
4 0.0 0.000000 0.000000 0.00 0.000000 3.731707 0.000000 0.0 3.767442 0.000000 0.0 0.000000 0.0 0.0 0.0 3.954545 0.0 0.000000 0.0 0.0
5 0.0 0.000000 0.000000 0.00 0.000000 3.731707 0.000000 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0
6 0.0 3.727273 0.000000 0.00 0.000000 0.000000 3.809524 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.0 0.0 0.000000 0.0 3.761905 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
131254 0.0 0.000000 0.000000 0.00 0.000000 3.731707 0.000000 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0
131256 0.0 0.000000 0.000000 0.00 0.000000 3.731707 0.000000 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0
131258 0.0 0.000000 3.787671 0.00 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0
131260 0.0 0.000000 0.000000 0.00 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0 0.000000 0.0 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0
131262 0.0 0.000000 3.787671 0.00 0.000000 0.000000 0.000000 0.0 0.000000 3.789855 0.0 3.744444 0.0 0.0 0.0 0.000000 0.0 0.000000 0.0 0.0

27103 rows × 20 columns

We then sum these preferences by movie and divide by the number of genres each movie has, to get the user preferences for each unwatched movie.

If we sort the movies by these inferred preferences, we get a list in descending order of the movies user 1 has not seen but might enjoy, based on their genres and the ratings user 1 has given to the movies they have seen.

denominator = movie_genre_ratings != 0
numerator = movie_genre_ratings.sum(axis=1)
content_based_ratings = numerator / denominator.sum(axis=1)
unwatched_movies["content_based_rating"] = content_based_ratings
unwatched_movies[["title", "content_based_rating"]].sort_values("content_based_rating", ascending=False)
title content_based_rating
movieId
4858 Hail Columbia! (1982) 4.25
4861 Mission to Mir (1997) 4.25
4459 Alaska: Spirit of the Wild (1997) 4.25
4460 Encounter in the Third Dimension (1999) 4.25
4461 Siegfried & Roy: The Magic Box (1999) 4.25
... ... ...
131108 The Fearless Four (1997) NaN
131110 A House of Secrets: Exploring 'Dragonwyck' (2008) NaN
131166 WWII IN HD (2009) NaN
131172 Closed Curtain (2013) NaN
131260 Rentun Ruusu (2001) NaN

27103 rows × 2 columns

Notice that this step already takes care of situations in which a movie belongs to a certain genre but the user has no rating for that genre. For example, this user has a rating for the Action gente, but not for Western. If movie belongs to both genres, the average will be taken only over one genre, since content_based_ratings.loc[movie_id, "<genre without user rating>"] will be equal to 0 and ignored when accounting for our denominator.

We can also compute the alternate user-genre preferences, using normalized sums of ratings, which are highly correlated with the number of movies watched in each genre. You can see we get a different list of movies, although some movies remain in the list and the first movie is the same.

content_based_ratings = unwatched_movies[genres].multiply(user_genre_preferences["Normalized sum of ratings"])
content_based_ratings = content_based_ratings.sum(axis=1)
unwatched_movies["content_based_rating2"] = content_based_ratings
unwatched_movies[["title", "content_based_rating2"]].sort_values("content_based_rating2", ascending=False)
title content_based_rating2
movieId
81132 Rubber (2010) 4.783002
49593 She (1965) 4.725136
5018 Motorama (1991) 4.717902
71999 Aelita: The Queen of Mars (Aelita) (1924) 4.687161
72165 Cirque du Freak: The Vampire's Assistant (2009) 4.569620
... ... ...
100246 Pretty Sweet (2012) 0.000000
99 Heidi Fleiss: Hollywood Madam (1995) 0.000000
100266 Day Is Done (2011) 0.000000
100287 Head Games (2012) 0.000000
96842 Behind the Burly Q: The Story of Burlesque in ... 0.000000

27103 rows × 2 columns

Both of the tables above are equivalent to the table presented in step 5 of the example.

Discussion

One disadvantage of content-based filtering is that we can only recommend an item for a user if that user has rated at least one other item that contains at least one feature in common with the item in question.

In our case, we can only recommend movies to a user if that movie contains at least one genre in common with another movie that the user has already rated.

The fewer features an unrated item has with rated ones, the less granular its inferred rating will be.

The diagrams below illustrate a situation in which a user has ratings for 17 genres, computed from the ratings of movies that contain those genres. In the first example, a movie with 5 genres has 2 genres in common with rated genres for the user, so we can compute an inferred rating for the movie.

In the second example, a movie with 5 genres has 0 genres in common with the rated genres for the user, so we cannot directly compute an inferred rating for it.

from matplotlib_venn import venn2
import matplotlib.pyplot as plt

movie_genres_count = 5
intersection = 2
set1 = genres_with_ratings - intersection
set2 = movie_genres_count - intersection
venn2(subsets=(set1, set2, intersection), set_labels=("Features with\nuser ratings", "Movie features"), alpha=0.5)
plt.title(f"Movie that can be rated based on {intersection} features")
plt.show()
venn2(
  subsets=(genres_with_ratings, movie_genres_count, 0),
  set_labels=("Features with\nuser ratings", "Movie features"),
  alpha=0.5,
)
plt.title("Movie that cannot be rated")
plt.show()

png

png

Because in our example, this user has watched over 100 movies, we are able to recommend over 90% of unwatched movies to them.

movies_with_inferred_ratings = (~unwatched_movies["content_based_rating"].isna()).sum()
print(f"Number of unwatched movies: {len(unwatched_movies)}")
print(f"Number of unwatched movies with inferred ratings: {movies_with_inferred_ratings}")
print(
  f"Percentage of unwatched movies with inferred ratings: {movies_with_inferred_ratings / len(unwatched_movies):.2%}"
)
Number of unwatched movies: 27103
Number of unwatched movies with inferred ratings: 24901
Percentage of unwatched movies with inferred ratings: 91.88%

Another disadvantage is that, depending on the number of features and technique employed in computing inferred ratings, the granularity of ratings can be very poor.

In our case, although we inferred ratings for over 24,000 movies, over 50% of them received one of 10 ratings.

inferred_ratings_count = unwatched_movies["content_based_rating"].value_counts()
topk = 10

print(f"Unique inferred ratings: {len(inferred_ratings_count)}")
print(
  f"Percentage of unwatched movies in the top {topk} inferred ratings: {(inferred_ratings_count.iloc[:topk].sum() / movies_with_inferred_ratings):.2%}"
)
Unique inferred ratings: 1182
Percentage of unwatched movies in the top 10 inferred ratings: 51.42%

This happens because, while we have \(n\) total features to represent user taste and items (the movie genres), the current user only has ratings for a subset \(m \leq n\) of the features.

Using a simple average over genre ratings, a movie’s inferred rating can only come from one of \(2^m-1\) possible values.

If the number of genres of the movie is known, say \(k\), then this value is restricted to \(\binom{m}{k}\), assuming that each possible rating is different.

The table below shows the number of possible ratings a movie may have, given the number of genres it has.

from scipy.special import comb

genres_count = unwatched_movies[genres].sum(axis=1).value_counts()
possible_ratings = pd.DataFrame(
  index=genres_count.index,
  data={
    "Number of Movies": genres_count,
    "Possible ratings": [comb(genres_with_ratings, i) for i in genres_count.index],
  },
)
possible_ratings["Movies per possible rating"] = (
  possible_ratings["Number of Movies"] / possible_ratings["Possible ratings"]
)
possible_ratings.index.name = "Number of genres"
possible_ratings
Number of Movies Possible ratings Movies per possible rating
Number of genres
1 10816 17.0 636.235294
2 8766 136.0 64.455882
3 5260 680.0 7.735294
4 1685 2380.0 0.707983
5 468 6188.0 0.075630
6 82 12376.0 0.006626
7 20 19448.0 0.001028
8 5 24310.0 0.000206
10 1 19448.0 0.000051

Conclusion

In this project, we have built a movie recommendation system based on content-based filtering.

We went through the theory of content-based filtering as well as an example, then used a publicly available dataset to implement the system. We compared two ways of building user features and saw one of the downsides of content-based filtering, which is the inability of recommending new items to a user if those items don’t have intersecting features with the items the user has already rated.

See you guys next time!




    Enjoy Reading This Article?

    Here are some more articles you might like to read next:

  • Building text-based movie recommendation systems using the TMDB dataset
  • Running a Gemma-powered question-answering chatbot locally with LangChain + Ollama
  • Answering questions from an Obsidian database with LLMs + RAG
  • Using task-spooler to queue experiments on Linux
  • Resources to self-study mathematics for machine learning