This project uses item-based collaborative filtering to make movie recommendations. Because the project is unusual, I implemented cross-validation and hyperparameter tuning from scratch and defined project-specific metrics.

I’ll use a fairly simple correlation function to model item similarity based on similar ratings in a movie database. The basic implementation isn’t too complicated, but I’ll also tune the models against three metrics:

  1. A cross-validated score for the model’s ability to predict how a user will rate movies
  2. The number of movies recommended to a user that they haven’t previously seen/rated
  3. An eye test of the recommendations made for a single user

These are the hyperparameters to be tuned:

  • type of correlation (the method argument, e.g. pearson or spearman)
  • min_periods
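
To make the two hyperparameters concrete, here is a minimal sketch of how they enter pandas' DataFrame.corr. The toy table and its values are invented for illustration; only the method and min_periods arguments come from the notebook.

```python
import numpy as np
import pandas as pd

# Toy user-by-movie table (rows are users, columns are movies); values are made up.
toy = pd.DataFrame(
    {
        "A": [5.0, 4.0, 1.0, 2.0, 5.0],
        "B": [4.0, 5.0, 2.0, 1.0, np.nan],
        "C": [np.nan, np.nan, 5.0, 1.0, np.nan],
    }
)

# A and B share 4 raters; A and C share only 2, as do B and C.
loose = toy.corr(method="pearson", min_periods=2)
strict = toy.corr(method="pearson", min_periods=3)
ranked = toy.corr(method="spearman", min_periods=3)

print(loose.round(2))
print(strict.round(2))   # pairs with fewer than 3 shared raters become NaN
print(ranked.round(2))   # same threshold, rank-based correlation
```

Raising min_periods trades coverage for reliability: pairs of movies with too few shared raters get no similarity at all, so they can never contribute to a recommendation.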

Dataset citation: F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=http://dx.doi.org/10.1145/2827872

Also thanks to Sun Dog Education for guidance in how to implement this.

Import Libraries

import numpy as np
import pandas as pd

from matplotlib import pyplot as plt
%matplotlib inline

import random

Importing and Preparing Data

# '::' is a multi-character separator, which only the python parser engine supports
ratings = pd.read_csv('ml-1m/ratings.dat', sep='::', engine='python', names=['user_id', 'movie_id', 'rating'], usecols=range(3), encoding='ISO-8859-1')
ratings.head()
user_id movie_id rating
0 1 1193 5
1 1 661 3
2 1 914 3
3 1 3408 4
4 1 2355 5
# changing the rating system so that the mean is 0
ratings['rating'] = ratings['rating'] - ratings.rating.mean()
ratings.rating.head()
0    1.418436
1   -0.581564
2   -0.581564
3    0.418436
4    1.418436
Name: rating, dtype: float64
by_user = ratings.pivot_table(index=['user_id'], columns=['movie_id'], values='rating')
by_user.head()
movie_id 1 2 3 4 5 6 7 8 9 10 ... 3943 3944 3945 3946 3947 3948 3949 3950 3951 3952
user_id
1 1.418436 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN -1.581564 NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 3706 columns

Define score function

The score function below looks at the test fold and gives +1 for each accurate recommendation. I also want to penalize incorrect recommendations; -2 seems like a reasonable weight, since incorrect predictions probably hurt trust more than accurate predictions build it. If a user likes a movie that doesn’t get recommended, I’d like to punish this slightly, and if a user doesn’t like a movie and it isn’t recommended, I’d like to reward this slightly. One way of doing this is to make the score a product of two numbers: an actual score and a predicted score. The actual score is +1 for liked movies and -2 for movies that were not liked. The predicted score is +1 for recommended movies and -0.2 for movies that were not recommended. This means a true negative is rewarded 0.4 and a false negative is punished 0.2.
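
The four possible outcomes can be enumerated directly. This is only a sketch of the weighting described above; the payoff helper and the dictionary names are hypothetical and do not appear in the notebook.

```python
# Weights taken from the paragraph above.
ACTUAL = {"liked": 1.0, "not liked": -2.0}
PREDICTED = {"recommended": 1.0, "not recommended": -0.2}

def payoff(liked, recommended):
    """Score contribution of a single held-out movie (hypothetical helper)."""
    actual = ACTUAL["liked" if liked else "not liked"]
    predicted = PREDICTED["recommended" if recommended else "not recommended"]
    return actual * predicted

for liked in (True, False):
    for recommended in (True, False):
        print("liked:", liked, "recommended:", recommended, "score:", payoff(liked, recommended))
```

The loop prints 1.0 for a true positive, -0.2 for a false negative, -2.0 for a false positive and 0.4 for a true negative.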

# this will get the cross validation scores for an algorithm
# the score will increase for correctly predicted favorable movies, it will decrease if predicting unfavorable movies
def cross_validate_corr(num_folds, user_df, corr, rand_seed):
    random.seed(rand_seed)
    # as mentioned before we're tracking two quantitative metrics
    user_scores1 = []
    user_scores2 = []
    # this will generate folds that we will reuse for each user
    fold_size = int(len(user_df.T)/num_folds)
    folds = []
    movies = list(user_df.T.index)
    for i in range(num_folds):
        fold = []
        while len(fold) < fold_size:
            rand_index = random.randrange(len(movies))
            fold.append(movies.pop(rand_index))
        folds.append(fold)
    #get a score for each user before averaging over all users
    for user_index, user in user_df.iterrows():
        # for each user evaluate metric 1 (error score) against the generated folds
        fold_scores1 = []
        fold_scores2 = []
        for fold in folds:
            score = []
            test = user.loc[fold]
            train = user.drop(fold)
            returns = recommendations_from_corr(train, corr)
            test_clean = test.dropna()
            for movie in test_clean.index:
                if test_clean[movie] > 0:
                    actual = 1
                else:
                    actual = -2
                if returns.get(movie, default=0)==0:
                    expected = -0.2
                else:
                    expected = 1
                score.append(actual * expected)
            score_sum = np.sum(score)
            fold_scores1.append(score_sum)
            fold_scores2.append(len(returns))
        user_scores1.append(np.mean(fold_scores1))
        user_scores2.append(np.mean(fold_scores2))
        # progress report: running averages after every 100 users
        if len(user_scores1) % 100 == 0:
            print(len(user_scores1), np.mean(user_scores1), np.mean(user_scores2))
    return np.mean(user_scores1), np.mean(user_scores2)
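
The fold-building loop at the top of cross_validate_corr draws movies without replacement until each fold is full. The same idea can be sketched with a shuffle and slices. make_folds is a hypothetical helper, not part of the notebook, and it will not reproduce the exact folds the loop above generates for a given seed.

```python
import random

def make_folds(items, num_folds, rand_seed):
    """Split items into num_folds disjoint, equally sized random folds.

    Hypothetical helper. As in cross_validate_corr, leftover items that do not
    fill a whole fold are never held out, so they always stay in training.
    """
    rng = random.Random(rand_seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    fold_size = len(shuffled) // num_folds
    return [shuffled[i * fold_size:(i + 1) * fold_size] for i in range(num_folds)]

folds = make_folds(range(10), num_folds=3, rand_seed=1)
print(folds)
```

With 10 items and 3 folds, each fold holds 3 items and one item is left out of every test fold.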

Define recommendations function

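def recommendations_from_corr(user_series, corr):
    # keep only the movies this user has actually rated
    user_ratings = user_series.dropna()
    # weight each rated movie's similarity column by the user's mean-centered rating
    weighted_sims = [corr[movie].dropna() * user_ratings[movie] for movie in user_ratings.index]
    if not weighted_sims:
        return pd.Series(dtype=float)
    # Series.append was removed in pandas 2.0, so collect the pieces and concatenate once
    sim_candidates = pd.concat(weighted_sims)
    # total weighted similarity per candidate movie
    sim_candidates = sim_candidates.groupby(sim_candidates.index).sum()
    # drop movies the user has already rated
    overlap = user_ratings.index.intersection(sim_candidates.index)
    sim_candidates = sim_candidates.drop(labels=overlap)
    # keep only positively scored candidates, best first
    sim_candidates = sim_candidates[sim_candidates > 0].sort_values(ascending=False)
    return sim_candidates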
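
To see what the function computes, here is a self-contained toy version of the same weighted sum. The similarity matrix and the user's ratings are invented for illustration.

```python
import pandas as pd

# Hand-made, symmetric item-item similarity matrix (values are invented).
movies = ["A", "B", "C", "D"]
sim = pd.DataFrame(
    [[1.0, 0.5, 0.8, -0.3],
     [0.5, 1.0, 0.1, 0.6],
     [0.8, 0.1, 1.0, 0.2],
     [-0.3, 0.6, 0.2, 1.0]],
    index=movies, columns=movies,
)

# Mean-centered ratings for one user: liked A, disliked B, has not rated C or D.
user = pd.Series({"A": 1.5, "B": -1.0})

# Weighted sum of similarity columns, one column per rated movie.
scores = sum(sim[movie] * rating for movie, rating in user.items())
# Drop already-rated movies, then keep positive scores, best first.
scores = scores.drop(user.index)
scores = scores[scores > 0].sort_values(ascending=False)
print(scores)
```

Only C survives, with a score of 1.5 * 0.8 - 1.0 * 0.1 = 1.1. D is filtered out because it resembles the disliked movie B and is negatively correlated with the liked movie A, which makes its total negative.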

Attempt recommendation and cross validation functions

Cross-validation takes too long on a dataset this large, so I’m going to use a smaller dataset for now.

small_ratings = pd.read_csv('ml-latest-small/ratings.csv', usecols=range(3))
small_ratings.head()
userId movieId rating
0 1 1 4.0
1 1 3 4.0
2 1 6 4.0
3 1 47 5.0
4 1 50 5.0
small_ratings['rating'] = small_ratings.rating - small_ratings.rating.mean()
small_ratings.head()
userId movieId rating
0 1 1 0.498443
1 1 3 0.498443
2 1 6 0.498443
3 1 47 1.498443
4 1 50 1.498443
small_by_user = small_ratings.pivot_table(index=['userId'], columns=['movieId'], values='rating')
corr = small_by_user.corr(method='pearson', min_periods=50)
corr.head()
movieId 1 2 3 4 5 6 7 8 9 10 ... 193565 193567 193571 193573 193579 193581 193583 193585 193587 193609
movieId
1 1.000000 0.330978 NaN NaN NaN 0.106465 NaN NaN NaN -0.021409 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 0.330978 1.000000 NaN NaN NaN NaN NaN NaN NaN 0.016626 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN 1.0 NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 9724 columns

user_3 = small_by_user.loc[3]
print(recommendations_from_corr(user_3, corr).head(10))
1527    0.940285
208     0.815594
316     0.694791
344     0.524894
10      0.420379
111     0.408094
733     0.402525
3793    0.400005
736     0.374564
5816    0.258474
dtype: float64
print(cross_validate_corr(num_folds=3, user_df=small_by_user, corr=corr, rand_seed=1))
100 6.383333333333333 188.83666666666667
200 5.571666666666666 183.69833333333332
300 6.122444444444445 185.75888888888892
400 6.144500000000001 184.79666666666665
500 6.648533333333333 186.24333333333334
600 6.695555555555555 188.48777777777775
(6.872021857923498, 188.32950819672132)

Testing our hyperparameters

# the running scores in the first cross-validation run stayed relatively stable from one batch of 100 users to the next,
# so I'll compare the hyperparameters on a random sample of 100 users
small_by_user_index = list(small_by_user.index)
small_by_user_sample = []
while len(small_by_user_sample) < 100:
    rand_index = random.randrange(len(small_by_user_index))
    small_by_user_sample.append(small_by_user_index.pop(rand_index))
small_by_user_sample = small_by_user.loc[small_by_user_sample]
small_by_user_sample.shape
(100, 9724)
method_list = ['pearson', 'spearman']
min_periods_list = [20, 100]
# generate the correlation matrices and store them into a dictionary
corr_matrices = {}
for method in method_list:
    for min_periods in min_periods_list:
        corr_matrices[(method, min_periods)] = small_by_user.corr(method=method, min_periods=min_periods)
        print(method, min_periods, 'done.')
pearson 20 done.
pearson 100 done.
spearman 20 done.
spearman 100 done.

The Kendall correlation calculation, corr_matrices[('kendall',)] = small_by_user.corr(method='kendall'), took too long, so I’m skipping it.

# score each correlation matrix with the various hyperparameters
scores = {}
for index, corr in corr_matrices.items():
    scores[index] = cross_validate_corr(num_folds=3, user_df=small_by_user_sample, corr=corr, rand_seed=1)
print(scores)
100 6.953333333333334 638.9866666666667
100 5.86533333333333 35.42333333333333
100 6.725333333333332 642.68
100 5.837333333333331 35.419999999999995
{('pearson', 20): (6.953333333333334, 638.9866666666667), ('pearson', 100): (5.86533333333333, 35.42333333333333), ('spearman', 20): (6.725333333333332, 642.68), ('spearman', 100): (5.837333333333331, 35.419999999999995)}

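The two correlation methods score very similarly: Pearson and Spearman barely differ on the accuracy metric from my score function or on the number of recommendations. The min_periods setting matters much more, especially for the second metric: a minimum of 20 produces roughly 640 recommendations per user, while a minimum of 100 produces roughly 35. Let’s see how the eye test differentiates between a Pearson correlation with min_periods of 20 and of 100.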
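
The printed dictionary is easier to compare as a table. The sketch below copies the four results from the output above; the column names are my own labels for the two metrics.

```python
import pandas as pd

# Results copied from the printed dictionary above:
# (correlation method, min_periods) -> (accuracy score, recommendations per user)
scores = {
    ("pearson", 20): (6.953333333333334, 638.9866666666667),
    ("pearson", 100): (5.86533333333333, 35.42333333333333),
    ("spearman", 20): (6.725333333333332, 642.68),
    ("spearman", 100): (5.837333333333331, 35.419999999999995),
}

index = pd.MultiIndex.from_tuples(list(scores.keys()), names=["method", "min_periods"])
table = pd.DataFrame(
    list(scores.values()), index=index, columns=["accuracy_score", "num_recommendations"]
)
print(table.round(2))

# Averaging over the method shows the effect of min_periods on its own.
by_min_periods = table.groupby(level="min_periods").mean()
print(by_min_periods.round(2))
```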

Eye test

small_movies = pd.read_csv('ml-latest-small/movies.csv', usecols=range(2))
small_ratings = pd.read_csv('ml-latest-small/ratings.csv', usecols=range(3))
small_ratings = pd.merge(small_movies, small_ratings)
small_ratings.head()
movieId title userId rating
0 1 Toy Story (1995) 1 4.0
1 1 Toy Story (1995) 5 4.0
2 1 Toy Story (1995) 7 4.5
3 1 Toy Story (1995) 15 2.5
4 1 Toy Story (1995) 17 4.5
small_ratings['rating'] = small_ratings.rating - small_ratings.rating.mean()
small_ratings.head()
movieId title userId rating
0 1 Toy Story (1995) 1 0.498443
1 1 Toy Story (1995) 5 0.498443
2 1 Toy Story (1995) 7 0.998443
3 1 Toy Story (1995) 15 -1.001557
4 1 Toy Story (1995) 17 0.998443
small_by_user = small_ratings.pivot_table(index=['userId'], columns=['title'], values='rating')
def test_single_user(user_series, corr, test_perc=.33, rand_seed=1):
    random.seed(rand_seed)
    test_size = int(len(user_series)*test_perc)
    user_copy = user_series.copy()
    test = {}
    while len(test) < test_size:
        rand_index = random.choice(user_copy.index)
        test[rand_index] = user_copy.pop(rand_index)
    score = []
    test = pd.Series(test)
    train = user_series.drop(test.index)
    returns = recommendations_from_corr(train, corr)
    test_clean = test.dropna()
    for movie in test_clean.index:
        if test_clean[movie] > 0:
            actual = 1
        else:
            actual = -2
        if returns.get(movie, default=0)==0:
            expected = -0.2
        else:
            expected = 1
        print(movie, actual * expected)
        score.append(actual * expected)
    score = np.sum(score)
    print('score:', score)
    return score

Pearson 20 eye test.

pear_20 = small_by_user.corr(method='pearson', min_periods=20)
pear_20.head()
title '71 (2014) 'Hellboy': The Seeds of Creation (2004) 'Round Midnight (1986) 'Salem's Lot (2004) 'Til There Was You (1997) 'Tis the Season for Love (2015) 'burbs, The (1989) 'night Mother (1986) (500) Days of Summer (2009) *batteries not included (1987) ... Zulu (2013) [REC] (2007) [REC]² (2009) [REC]³ 3 Génesis (2012) anohana: The Flower We Saw That Day - The Movie (2013) eXistenZ (1999) xXx (2002) xXx: State of the Union (2005) ¡Three Amigos! (1986) À nous la liberté (Freedom for Us) (1931)
title
'71 (2014) NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
'Hellboy': The Seeds of Creation (2004) NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
'Round Midnight (1986) NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
'Salem's Lot (2004) NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
'Til There Was You (1997) NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 9719 columns

user_10 = small_by_user.loc[10]
returns_pear_20 = recommendations_from_corr(user_10, pear_20)
returns_pear_20.head(10)
Fugitive, The (1993)                                4.984729
Star Trek IV: The Voyage Home (1986)                4.222668
City Slickers (1991)                                4.019699
Sneakers (1992)                                     3.984689
Harry Potter and the Goblet of Fire (2005)          3.956484
Final Fantasy: The Spirits Within (2001)            3.901192
X2: X-Men United (2003)                             3.784831
Thank You for Smoking (2006)                        3.657521
Harry Potter and the Order of the Phoenix (2007)    3.459060
Mission: Impossible (1996)                          3.230248
dtype: float64
test_single_user(user_10, pear_20)
Wedding Date, The (2005) 0.4
Skyfall (2012) 1
Terminal, The (2004) -2
Avatar (2009) 0.4
Mona Lisa Smile (2003) -0.2
Bourne Ultimatum, The (2007) -2
First Daughter (2004) -0.2
Help, The (2011) 0.4
Notting Hill (1999) -2
Something's Gotta Give (2003) -0.2
Shrek (2001) 1
Love Actually (2003) 1
Mulan (1998) 1
Pulp Fiction (1994) 0.4
Match Point (2005) 0.4
Twilight Saga: Eclipse, The (2010) 0.4
Enough Said (2013) 0.4
Prince & Me, The (2004) -0.2
Mary Poppins (1964) 0.4
Magic Mike (2012) -0.2
Dark Knight Rises, The (2012) 1
Best Exotic Marigold Hotel, The (2011) -0.2
Grand Budapest Hotel, The (2014) 0.4
Tangled Ever After (2012) -0.2
St Trinian's 2: The Legend of Fritton's Gold (2009) 0.4
American Beauty (1999) 0.4
Twilight Saga: Breaking Dawn - Part 2, The (2012) 0.4
Graduate, The (1967) -2
27 Dresses (2008) 0.4
Matrix, The (1999) 0.4
Twilight (2008) -0.2
Sixth Sense, The (1999) 0.4
Morning Glory (2010) 0.4
When Harry Met Sally... (1989) -2
Rust and Bone (De rouille et d'os) (2012) 0.4
Despicable Me 2 (2013) -0.2
Amazing Spider-Man, The (2012) -2
The Hundred-Foot Journey (2014) -0.2
Made of Honor (2008) 0.4
Chasing Liberty (2004) -0.2
Quantum of Solace (2008) -2
Dark Knight, The (2008) 1
Hitch (2005) 1
Frozen (2013) 1
Interstellar (2014) 0.4
Pretty One, The (2013) -0.2
How Do You Know (2010) 0.4
Love and Other Drugs (2010) 0.4
Valentine's Day (2010) 0.4
Fight Club (1999) 0.4
Maid in Manhattan (2002) 0.4
score: 1.2000000000000006

1.2000000000000006

Pearson 100 eye test.

pear_100 = small_by_user.corr(method='pearson', min_periods=100)
pear_100.head()
title '71 (2014) 'Hellboy': The Seeds of Creation (2004) 'Round Midnight (1986) 'Salem's Lot (2004) 'Til There Was You (1997) 'Tis the Season for Love (2015) 'burbs, The (1989) 'night Mother (1986) (500) Days of Summer (2009) *batteries not included (1987) ... Zulu (2013) [REC] (2007) [REC]² (2009) [REC]³ 3 Génesis (2012) anohana: The Flower We Saw That Day - The Movie (2013) eXistenZ (1999) xXx (2002) xXx: State of the Union (2005) ¡Three Amigos! (1986) À nous la liberté (Freedom for Us) (1931)
title
'71 (2014) NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
'Hellboy': The Seeds of Creation (2004) NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
'Round Midnight (1986) NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
'Salem's Lot (2004) NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
'Til There Was You (1997) NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

5 rows × 9719 columns

user_10 = small_by_user.loc[10]
returns_pear_100 = recommendations_from_corr(user_10, pear_100)
returns_pear_100.head(15)
Fugitive, The (1993)           0.542148
Mission: Impossible (1996)     0.449540
Lion King, The (1994)          0.423012
Beauty and the Beast (1991)    0.394659
Apollo 13 (1995)               0.273224
Mask, The (1994)               0.242050
Batman Forever (1995)          0.238954
Mrs. Doubtfire (1993)          0.238462
Speed (1994)                   0.233273
True Lies (1994)               0.200027
dtype: float64
test_single_user(user_10, pear_100)
Wedding Date, The (2005) 0.4
Skyfall (2012) -0.2
Terminal, The (2004) 0.4
Avatar (2009) 0.4
Mona Lisa Smile (2003) -0.2
Bourne Ultimatum, The (2007) 0.4
First Daughter (2004) -0.2
Help, The (2011) 0.4
Notting Hill (1999) 0.4
Something's Gotta Give (2003) -0.2
Shrek (2001) -0.2
Love Actually (2003) -0.2
Mulan (1998) -0.2
Pulp Fiction (1994) -2
Match Point (2005) 0.4
Twilight Saga: Eclipse, The (2010) 0.4
Enough Said (2013) 0.4
Prince & Me, The (2004) -0.2
Mary Poppins (1964) 0.4
Magic Mike (2012) -0.2
Dark Knight Rises, The (2012) -0.2
Best Exotic Marigold Hotel, The (2011) -0.2
Grand Budapest Hotel, The (2014) 0.4
Tangled Ever After (2012) -0.2
St Trinian's 2: The Legend of Fritton's Gold (2009) 0.4
American Beauty (1999) -2
Twilight Saga: Breaking Dawn - Part 2, The (2012) 0.4
Graduate, The (1967) 0.4
27 Dresses (2008) 0.4
Matrix, The (1999) -2
Twilight (2008) -0.2
Sixth Sense, The (1999) -2
Morning Glory (2010) 0.4
When Harry Met Sally... (1989) 0.4
Rust and Bone (De rouille et d'os) (2012) 0.4
Despicable Me 2 (2013) -0.2
Amazing Spider-Man, The (2012) 0.4
The Hundred-Foot Journey (2014) -0.2
Made of Honor (2008) 0.4
Chasing Liberty (2004) -0.2
Quantum of Solace (2008) 0.4
Dark Knight, The (2008) 1
Hitch (2005) -0.2
Frozen (2013) -0.2
Interstellar (2014) 0.4
Pretty One, The (2013) -0.2
How Do You Know (2010) 0.4
Love and Other Drugs (2010) 0.4
Valentine's Day (2010) 0.4
Fight Club (1999) -2
Maid in Manhattan (2002) 0.4
score: -2.4000000000000004

-2.4000000000000004
print(len(returns_pear_20), len(returns_pear_100))
371 10

The eye test seems to agree with the cross-validation scores: both settings have broadly similar accuracy, but a min_periods of 100 yields only 10 recommendations for this user, too few to be very interesting. On the other hand, 371 might be too many. We might be best served by picking a min_periods value in between, perhaps 50; the first cross-validation run showed that this gives an average of about 188 recommendations per user, which seems roughly reasonable. If we scaled up the data (number of reviews, movies, users), min_periods would need to be tuned again, and it would be interesting to think about how to retune it regularly. In another project I compared the min_periods limit to a p-value restriction on this same movie recommendation task. min_periods seemed to work better than the p-value restriction, which was susceptible to problems when data was scarce.
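
One cheap signal for retuning is how many movie pairs still have a defined correlation at each threshold. The sketch below runs on synthetic data; the table size, the sparsity and the usable_pairs helper are assumptions for illustration, not results from MovieLens.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic, sparse user-by-movie table; sizes and sparsity are arbitrary assumptions.
n_users, n_movies = 200, 30
ratings = pd.DataFrame(rng.integers(1, 6, size=(n_users, n_movies)).astype(float))
ratings = ratings.mask(rng.random((n_users, n_movies)) < 0.7)  # hide roughly 70% of entries

def usable_pairs(user_df, min_periods):
    """Count distinct movie pairs whose correlation is defined at this threshold."""
    corr = user_df.corr(method="pearson", min_periods=min_periods)
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)  # skip diagonal and mirrored pairs
    return int(corr.where(upper).notna().sum().sum())

thresholds = (5, 10, 20, 40)
counts = [usable_pairs(ratings, m) for m in thresholds]
for m, count in zip(thresholds, counts):
    print(m, count)
```

The count can only fall as min_periods rises, so tracking it as the dataset grows gives a rough idea of when a threshold such as 50 has become too loose or too strict.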