Movie Recommendation System (Content Based Recommendation)

Manpreet Singh
Apr 29, 2022

Updated: Apr 29, 2022

Dataset: https://www.kaggle.com/datasets/tmdb/tmdb-movie-metadata?resource=download

What is a Movie Recommendation System?

A movie recommendation system is a program that returns movie titles based on different filters and parameters.

Types of Recommendation Systems:

Content Based Recommendation
Popularity Based Recommendation
Collaborative System
User Based Recommendation

Using these approaches, we can build a curated list of movie titles from the input given by the user. In this blog we use content based recommendation, filtering our data with respect to the metadata of the movies.

What is Content Based Recommendation?

A content based algorithm uses the metadata of a movie, such as the director's name, cast, crew members, tagline and overview. From this metadata we construct a similarity matrix that measures how close the movie given by the user is to every other movie in the data. This tells us how similar the movies are.

Importing Data:

We fetch the data and replace all the empty cells with an empty string.
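The loading step could look roughly like the sketch below. The helper name `load_dataset` and the tiny in-memory CSV are assumptions made so the example runs on its own; with the real data you would pass the path of the downloaded Kaggle file instead.

```python
import io

import pandas as pd


def load_dataset(source):
    """Read a CSV (path or file-like object) and replace missing cells with ''."""
    frame = pd.read_csv(source)
    return frame.fillna("")


# Hypothetical stand-in for the real CSV file, so the sketch is self-contained.
sample_csv = io.StringIO(
    "id,title,tagline\n"
    "1,Movie A,A tagline\n"
    "2,Movie B,\n"
)

movies = load_dataset(sample_csv)
print(movies)
```

Filling missing cells with empty strings up front means the text columns can later be concatenated without special-casing NaN values.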

Data Overview:

We have two datasets:

Credits with columns:

id title cast crew

Movies with columns:

budget genres homepage id keywords original_language original_title overview popularity production_companies production_countries release_date revenue runtime spoken_languages status tagline title vote_average vote_count

We merge the two datasets to get the final dataset for manipulation.
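A minimal sketch of the merge, assuming the two frames are joined on the shared `id` column (the post does not state the key, so this is a guess based on the column lists above). The two small frames are made-up stand-ins for the real data.

```python
import pandas as pd

# Made-up miniature versions of the two datasets.
credits = pd.DataFrame({
    "id": [1, 2],
    "title": ["Movie A", "Movie B"],
    "cast": ["[]", "[]"],
    "crew": ["[]", "[]"],
})
movies = pd.DataFrame({
    "id": [1, 2],
    "title": ["Movie A", "Movie B"],
    "overview": ["First plot.", "Second plot."],
})

# Drop the duplicate title column first so the result keeps a single "title".
merged = movies.merge(credits.drop(columns=["title"]), on="id", how="inner")
print(merged.columns.tolist())
```

Dropping `title` from one side avoids the `title_x` / `title_y` pair that pandas would otherwise create for the overlapping column.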

Data Visualization:

Top 10 Movies by Budget

Top 10 Most Profitable Movies

Top 10 Most Popular Movies

Top 10 Movies with the Most Votes

Constructing the Metadata:

For our algorithm to map a movie to its most similar movies, we need metadata that captures what makes every movie unique. To construct this metadata we take several columns and combine them into a single column.

Columns for Metadata:

Genres
Keywords
Overview
Movie Cast

With these 4 columns, we build a column keywordsPool, which contains the concatenated values of all of them. The next step is to convert this text into numbers from which we can compute a distance matrix and calculate the similarity.
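The sketch below shows one way to build such a pool, assuming that genres, keywords and cast are stored as JSON strings holding lists of objects with a "name" field. The helper names `extract_names` and `build_pool` and the sample row are illustrative, not the notebook's actual code.

```python
import json

import pandas as pd


def extract_names(json_text):
    """Parse a JSON list of objects and return their "name" values."""
    if not json_text:  # empty string left behind by the NaN clean-up
        return []
    return [item["name"] for item in json.loads(json_text)]


def build_pool(row):
    """Concatenate genres, keywords, overview and cast names into one string."""
    parts = (
        extract_names(row["genres"])
        + extract_names(row["keywords"])
        + [row["overview"]]
        + extract_names(row["cast"])
    )
    return " ".join(parts)


# Made-up sample row in the assumed format.
movies = pd.DataFrame({
    "title": ["Movie A"],
    "genres": ['[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'],
    "keywords": ['[{"id": 1, "name": "space war"}]'],
    "overview": ["A pilot defends a distant colony."],
    "cast": ['[{"name": "Actor One"}, {"name": "Actor Two"}]'],
})

movies["keywordsPool"] = movies.apply(build_pool, axis=1)
print(movies.loc[0, "keywordsPool"])
```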

To convert keywordsPool into a matrix we use CountVectorizer from the sklearn library. The count vectorizer converts the text into a matrix of token counts, with one row per movie. After getting this matrix, the next step is to find the distance between every pair of rows. For that we use Cosine Similarity.

Cosine Similarity:

Cosine Similarity is a metric used to calculate the similarity between documents or texts regardless of their size, because it depends on the angle between their vectors and not on their length. This metric tells us how similar two movies are; the corresponding distance is 1 minus the similarity. Cosine similarity is the core part of the program, as it tells us how close one movie is to another.
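The size independence can be checked by hand. The sketch below computes the cosine of the angle between two count vectors with NumPy; the function name `cosine` and the vectors are made up for illustration.

```python
import numpy as np


def cosine(u, v):
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


short_doc = np.array([1, 1, 0])  # token counts of a short text
long_doc = short_doc * 10        # same proportions, ten times longer
other_doc = np.array([0, 0, 3])  # shares no tokens with the other two

print(cosine(short_doc, long_doc))   # close to 1.0: same direction
print(cosine(short_doc, other_doc))  # 0.0: no tokens in common
```

The long document scores as practically identical to the short one because only the direction of the vector matters, not its magnitude.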

Syntax of using Cosine Similarity in Python:

Here count_matrix is the matrix we got after applying the Count Vectorizer, that is, the matrix of token counts of the keywordsPool strings.

from sklearn.metrics.pairwise import cosine_similarity
# count_matrix is the output of CountVectorizer().fit_transform(...)
similarity = cosine_similarity(count_matrix)

Recommendation System Algorithm:

Our recommendation system runs on the cosine similarity matrix. Once we have calculated the count matrix and its cosine similarity, the next part is to take the movie name and return that movie's row of the similarity matrix. This row consists of the similarity scores of that movie with respect to all the movies present in the data.

Once we have the similarity list for that movie, we pick the 5 movies with the highest scores.
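The selection step might be sketched as below. The function name `top_k_similar` and the small similarity matrix are made up. Skipping the query movie itself, which always scores 1.0 against itself, is an assumption about the intended behaviour.

```python
import numpy as np


def top_k_similar(similarity, index, k=5):
    """Return the indices of the k rows most similar to row `index`."""
    scores = similarity[index]
    order = np.argsort(scores)[::-1]  # highest score first
    order = order[order != index]     # skip the movie itself
    return order[:k].tolist()


# Made-up 4x4 similarity matrix (symmetric, ones on the diagonal).
similarity = np.array([
    [1.0, 0.8, 0.1, 0.5],
    [0.8, 1.0, 0.2, 0.4],
    [0.1, 0.2, 1.0, 0.3],
    [0.5, 0.4, 0.3, 1.0],
])

print(top_k_similar(similarity, 0, k=2))  # [1, 3]
```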

Final Performance:

In conclusion, our recommender is mostly biased towards the genre of the input movie. The above two results confirm this interpretation, as it recommends top action movies.

Contribution:

I wrote most of the code myself. I took the idea of using cosine similarity from the lectures our professor gave, and to see a practical implementation I referred to Reference [1] on how to build a recommendation system using cosine similarity. That reference structures a recommendation system around the similarity matrix, but its author used only the overview column to calculate it. I believe there is more data we can use to make our cosine similarity stronger, so I used 4 columns: overview, cast, keywords and genres. I took the idea of using cosine similarity and the count vectorizer from this reference and created my own keyword pool by deconstructing the dataframe with the small functions in the notebook. I built a list for each of the 4 features by deconstructing the JSON objects with plain Python code, keeping only the names and dropping all the redundant values. Lastly, I cleaned the dataset and filled all the NaN values with empty strings, which made it possible to concatenate the final keywordsPool.

All of the references mentioned below use their own pool of metadata, which shows that the basic architecture of a content based recommendation system rests on the metadata. All of these content based recommenders use cosine similarity, both for simplicity and to recommend a more accurate list of movies.

My major contribution was deconstructing the data into the correct format and then appending it to the pool of keywords. I chose those 4 columns for the similarity because they were the intuitive choice.

I experimented with various data columns to create the keywordsPool, tried different combinations of columns to see which fit best, and settled on the 4 columns mentioned above.

I also spent a good amount of time researching how to find the similarity between two documents and how the distance between two text documents is calculated. I came across various metrics such as Jaccard, Euclidean and cosine, and decided to use cosine, which is widely used in content based recommendation systems.

Challenges:

The main challenge in this assessment was choosing the correct metric for calculating the similarity. I read about the different types of metrics and finally chose Cosine Similarity. Another difficulty was deconstructing the JSON format into lists; for that I searched the internet and found the resources mentioned below, then applied them in my project. The last part was building the count vector matrix, for which I checked the official sklearn documentation.

Working of the System:

We take the features that are used to create the keywordsPool.
We convert them into count vectors using Count Vectorizer.
From the count vector matrix, we calculate the cosine similarity using the sklearn library.
The cosine similarity turns the count vectors into a similarity list for each movie, containing its similarity to all the other rows.
From this list we pick the top 5 values, i.e. the items with the highest similarity.
This gives us the ids of the top 5 movies, which are fed into our get movie title function; it returns the movie titles most similar to the input as recommendations.
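The steps above can be sketched end to end as follows. The movie titles, keyword pools and the function names `get_title` and `recommend` are made up for illustration, and the real notebook may organise this differently.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up miniature dataset with a ready-made keywordsPool column.
movies = pd.DataFrame({
    "title": ["Star Patrol", "Galaxy Rescue", "Paris Letters", "Cadet Academy"],
    "keywordsPool": [
        "action space war pilot",
        "action space rescue pilot",
        "romance paris letters",
        "action space academy cadet",
    ],
})

# Steps 1-3: keyword pool -> count vectors -> cosine similarity matrix.
count_matrix = CountVectorizer().fit_transform(movies["keywordsPool"])
similarity = cosine_similarity(count_matrix)


def get_title(index):
    """Map a row index back to its movie title."""
    return movies.loc[index, "title"]


def recommend(title, k=5):
    """Return the titles of the k movies most similar to `title`."""
    index = movies.index[movies["title"] == title][0]
    # Steps 4-5: rank every movie by its similarity to the query movie.
    scored = sorted(enumerate(similarity[index]), key=lambda pair: pair[1], reverse=True)
    # Step 6: turn the best indices (excluding the query itself) into titles.
    return [get_title(i) for i, _ in scored if i != index][:k]


print(recommend("Star Patrol", k=2))
```

In this toy data, "Galaxy Rescue" shares three tokens with the query and "Cadet Academy" shares two, so they come out in that order.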

Limitations:

The limitation of this model is that if it is given a keyword or movie title that is not present in the dataset, it raises a KeyError, because the key does not exist in the dataset. To generalise it, we could implement a more complex model that works on any keyword and returns a list of movies that align with that text. This can be implemented in the future and applied to my TMDB OTT platform, where it would help viewers by suggesting different movies to check out.
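As a smaller first step before a full keyword model, the title lookup could be guarded so that unknown input does not raise an error. The sketch below is not part of the original system: it swaps in fuzzy matching from Python's standard difflib module, and the name `find_index`, the sample titles and the 0.6 cutoff are all assumptions.

```python
import difflib

# Illustrative sample of titles; the real list would come from the dataset.
titles = ["Avatar", "Spectre", "The Dark Knight Rises", "John Carter"]
title_to_index = {title: i for i, title in enumerate(titles)}


def find_index(query):
    """Return the row index for `query`, or None when nothing close is found."""
    if query in title_to_index:
        return title_to_index[query]
    # Fall back to the closest known title, if any is similar enough.
    matches = difflib.get_close_matches(query, titles, n=1, cutoff=0.6)
    return title_to_index[matches[0]] if matches else None


print(find_index("Avatar"))  # exact match
print(find_index("Avatr"))   # typo, resolved to the closest title
print(find_index("zzzzzz"))  # no match: None instead of a KeyError
```

The caller can then check for None and show a "movie not found" message instead of crashing.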