Machine Learning Online Class

Exercise 8 | Anomaly Detection and Collaborative Filtering
Instructions
------------
This file contains code that helps you get started on the
exercise. You will need to complete the following functions:
estimateGaussian.m
selectThreshold.m
cofiCostFunc.m
For this exercise, you will not need to change any code in this file,
or any other files other than those mentioned above.

=============== Part 1: Loading movie ratings dataset ================

You will start by loading the movie ratings dataset to understand the
structure of the data.
fprintf('Loading movie ratings dataset.\n\n');
Loading movie ratings dataset.
% Load data
load ('ex8_movies.mat');
% Y is a 1682x943 matrix, containing ratings (1-5) of 1682 movies by
% 943 users
%
% R is a 1682x943 matrix, where R(i,j) = 1 if and only if user j gave a
% rating to movie i
% From the matrix, we can compute statistics like average rating.
fprintf('Average rating for movie 1 (Toy Story): %f / 5\n\n', ...
mean(Y(1, R(1, :))));
Average rating for movie 1 (Toy Story): 3.878319 / 5
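The masked average above (mean over only the entries a user actually rated) can be sketched in NumPy with toy stand-in matrices, since the real ex8_movies.mat data is not reproduced here:

```python
import numpy as np

# Toy stand-ins for the ratings matrix Y and indicator matrix R
# (the real dataset is 1682 movies x 943 users).
Y = np.array([[5, 4, 0],
              [3, 0, 1]])
R = np.array([[1, 1, 0],
              [1, 0, 1]])

# Average rating for movie 1, counting only entries that were rated
avg = Y[0, R[0, :] == 1].mean()
print(avg)  # (5 + 4) / 2 = 4.5
```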
% We can "visualize" the ratings matrix by plotting it with imagesc
imagesc(Y);
ylabel('Movies');
xlabel('Users');

============ Part 2: Collaborative Filtering Cost Function ===========

You will now implement the cost function for collaborative filtering.
To help you debug your cost function, we have included a set of
pre-trained weights. Specifically, you should complete the code in
cofiCostFunc.m to return J.
% Load pre-trained weights (X, Theta, num_users, num_movies, num_features)
load ('ex8_movieParams.mat');
% Reduce the data set size so that this runs faster
num_users = 4;
num_movies = 5;
num_features = 3;
% X is the movie feature matrix; each row is one movie's feature vector
% Theta is the user parameter matrix; each row is one user's parameter vector
X = X(1 : num_movies, 1 : num_features);
Theta = Theta(1:num_users, 1:num_features);
Y = Y(1 : num_movies, 1 : num_users);
R = R(1 : num_movies, 1 : num_users);
% Evaluate cost function
J = cofiCostFunc([X(:) ; Theta(:)], Y, R, num_users, num_movies, num_features, 0);
fprintf(['Cost at loaded parameters: %f '...
'\n(this value should be about 22.22)\n'], J);
Cost at loaded parameters: 22.224604
(this value should be about 22.22)
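The unregularized cost sums squared prediction errors over rated entries only. A minimal NumPy sketch of that formula, using hypothetical toy matrices rather than the loaded parameters:

```python
import numpy as np

def cofi_cost(X, Theta, Y, R):
    """Unregularized collaborative-filtering cost:
    J = 1/2 * sum over rated (i,j) of (x_i . theta_j - y_ij)^2."""
    err = (X @ Theta.T - Y) * R   # zero out unrated entries
    return 0.5 * np.sum(err ** 2)

# Tiny example: 2 movies, 2 users, 1 feature
X = np.array([[1.0], [2.0]])
Theta = np.array([[1.0], [0.5]])
Y = np.array([[2.0, 0.0], [1.0, 1.0]])
R = np.array([[1, 0], [1, 1]])
print(cofi_cost(X, Theta, Y, R))  # 0.5 * ((-1)^2 + 1^2) = 1.0
```

Multiplying element-wise by R before squaring is what lets the whole sum stay vectorized instead of looping over rated pairs.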

============== Part 3: Collaborative Filtering Gradient ==============

Once your cost function matches up with ours, you should now implement
the collaborative filtering gradient function. Specifically, you should
complete the code in cofiCostFunc.m to return the grad argument.
fprintf('\nChecking Gradients (without regularization) ... \n');
Checking Gradients (without regularization) ...
% Check gradients by running checkCostFunction
checkCostFunction;
2.4734 2.4734
-3.3642 -3.3642
3.0220 3.0220
-0.0045 -0.0045
1.3398 1.3398
-3.0521 -3.0521
2.0962 2.0962
0.0222 0.0222
-5.1495 -5.1495
2.6312 2.6312
-1.3667 -1.3667
0.3324 0.3324
1.4416 1.4416
-0.5971 -0.5971
-1.8176 -1.8176
5.4534 5.4534
-4.0567 -4.0567
-0.4537 -0.4537
-0.2140 -0.2140
0.5173 0.5173
3.0944 3.0944
1.1219 1.1219
0.7806 0.7806
-0.2328 -0.2328
-1.3480 -1.3480
0.7513 0.7513
-2.6773 -2.6773
The above two columns you get should be very similar.
(Left-Your Numerical Gradient, Right-Analytical Gradient)

If your cost function implementation is correct, then
the relative difference will be small (less than 1e-9).

Relative Difference: 7.05038e-13
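The analytical gradients that the check above compares against can be written as two matrix products. A NumPy sketch under the same masking convention, verified here against a central finite difference (random toy data, not the exercise's):

```python
import numpy as np

def cofi_grad(X, Theta, Y, R):
    """Unregularized gradients of the collaborative-filtering cost:
    X_grad(i,k)     = sum over j with r(i,j)=1 of err(i,j) * Theta(j,k)
    Theta_grad(j,k) = sum over i with r(i,j)=1 of err(i,j) * X(i,k)."""
    err = (X @ Theta.T - Y) * R
    return err @ Theta, err.T @ X

# Compare one partial derivative against a numerical gradient
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 3))
Theta = rng.standard_normal((5, 3))
Y = (X @ Theta.T).round(1)
R = (rng.random((4, 5)) > 0.5).astype(float)

def cost(X, Theta):
    return 0.5 * np.sum((((X @ Theta.T) - Y) * R) ** 2)

X_grad, _ = cofi_grad(X, Theta, Y, R)
eps = 1e-6
Xp = X.copy(); Xp[0, 0] += eps
Xm = X.copy(); Xm[0, 0] -= eps
numeric = (cost(Xp, Theta) - cost(Xm, Theta)) / (2 * eps)
print(abs(numeric - X_grad[0, 0]) < 1e-6)  # True
```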

========= Part 4: Collaborative Filtering Cost Regularization ========

Now, you should implement regularization for the cost function for
collaborative filtering. You can implement it by adding the cost of
regularization to the original cost computation.
% Evaluate cost function
J = cofiCostFunc([X(:) ; Theta(:)], Y, R, num_users, num_movies, ...
num_features, 1.5);
fprintf(['Cost at loaded parameters (lambda = 1.5): %f '...
'\n(this value should be about 31.34)\n'], J);
Cost at loaded parameters (lambda = 1.5): 31.344056
(this value should be about 31.34)
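Regularization only appends a penalty term to the cost: lambda/2 times the sum of all squared parameters in Theta and X. A NumPy sketch with the same toy matrices as before (hypothetical data, not the loaded weights):

```python
import numpy as np

def cofi_cost_reg(X, Theta, Y, R, lam):
    """Cost with L2 regularization over every entry of X and Theta:
    J = J_unreg + lam/2 * (sum(Theta^2) + sum(X^2))."""
    err = (X @ Theta.T - Y) * R
    penalty = 0.5 * lam * (np.sum(Theta ** 2) + np.sum(X ** 2))
    return 0.5 * np.sum(err ** 2) + penalty

X = np.array([[1.0], [2.0]])
Theta = np.array([[1.0], [0.5]])
Y = np.array([[2.0, 0.0], [1.0, 1.0]])
R = np.array([[1, 0], [1, 1]])
# Unregularized part is 1.0; the penalty adds 0.75 * (1.25 + 5.0)
print(cofi_cost_reg(X, Theta, Y, R, 1.5))  # 1.0 + 4.6875 = 5.6875
```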

======= Part 5: Collaborative Filtering Gradient Regularization ======

Once your cost matches up with ours, you should proceed to implement
regularization for the gradient.
fprintf('\nChecking Gradients (with regularization) ... \n');
Checking Gradients (with regularization) ...
% Check gradients by running checkCostFunction
checkCostFunction(1.5);
-0.5757 -0.5757
1.7256 1.7256
1.8087 1.8087
5.5917 5.5917
0.1996 0.1996
-7.6902 -7.6902
-4.3051 -4.3051
-10.4999 -10.4999
-0.0717 -0.0717
-3.6842 -3.6842
0.5828 0.5828
-2.0054 -2.0054
-1.7368 -1.7368
-1.9312 -1.9312
-1.4579 -1.4579
-2.4870 -2.4870
2.1670 2.1670
3.8779 3.8779
1.6977 1.6977
7.5505 7.5505
14.9200 14.9200
-1.5953 -1.5953
-0.7142 -0.7142
2.0173 2.0173
1.9999 1.9999
7.5847 7.5847
0.2629 0.2629
The above two columns you get should be very similar.
(Left-Your Numerical Gradient, Right-Analytical Gradient)

If your cost function implementation is correct, then
the relative difference will be small (less than 1e-9).

Relative Difference: 2.01165e-12
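On the gradient side, regularization simply adds lam times the parameter itself to each partial derivative. A NumPy sketch, again checked against a central finite difference on random toy data:

```python
import numpy as np

def cofi_grad_reg(X, Theta, Y, R, lam):
    """Regularized gradients: each parameter gains a lam * param term."""
    err = (X @ Theta.T - Y) * R
    X_grad = err @ Theta + lam * X
    Theta_grad = err.T @ X + lam * Theta
    return X_grad, Theta_grad

rng = np.random.default_rng(1)
X = rng.standard_normal((3, 2))
Theta = rng.standard_normal((4, 2))
Y = rng.standard_normal((3, 4))
R = np.ones((3, 4))
lam = 1.5

def cost(X, Theta):
    err = (X @ Theta.T - Y) * R
    return 0.5 * np.sum(err ** 2) \
        + 0.5 * lam * (np.sum(Theta ** 2) + np.sum(X ** 2))

_, T_grad = cofi_grad_reg(X, Theta, Y, R, lam)
eps = 1e-6
Tp = Theta.copy(); Tp[0, 0] += eps
Tm = Theta.copy(); Tm[0, 0] -= eps
numeric = (cost(X, Tp) - cost(X, Tm)) / (2 * eps)
print(abs(numeric - T_grad[0, 0]) < 1e-6)  # True
```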

============== Part 6: Entering ratings for a new user ===============

Before training the collaborative filtering model, we will first add
ratings for a new user we just observed. This part of the code also
lets you enter your own ratings for the movies in our dataset!
movieList = loadMovieList();
% Initialize my ratings
my_ratings = zeros(1682, 1);
% Check the file movie_idx.txt for id of each movie in our dataset
% For example, Toy Story (1995) has ID 1, so to rate it "4", you can set
my_ratings(1) = 4;
% Or suppose you did not enjoy Silence of the Lambs (1991), you can set
my_ratings(98) = 2;
% We have selected a few movies we liked / did not like and the ratings we
% gave are as follows:
my_ratings(7) = 3;
my_ratings(12) = 5;
my_ratings(54) = 4;
my_ratings(64) = 5;
my_ratings(66) = 3;
my_ratings(69) = 5;
my_ratings(183) = 4;
my_ratings(226) = 5;
my_ratings(355) = 5;
fprintf('\n\nNew user ratings:\n');
New user ratings:
for i = 1 : length(my_ratings)
if my_ratings(i) > 0
fprintf('Rated %d for %s\n', my_ratings(i), ...
movieList{i});
end
end
Rated 4 for Toy Story (1995)
Rated 3 for Twelve Monkeys (1995)
Rated 5 for Usual Suspects, The (1995)
Rated 4 for Outbreak (1995)
Rated 5 for Shawshank Redemption, The (1994)
Rated 3 for While You Were Sleeping (1995)
Rated 5 for Forrest Gump (1994)
Rated 2 for Silence of the Lambs, The (1991)
Rated 4 for Alien (1979)
Rated 5 for Die Hard 2 (1990)
Rated 5 for Sphere (1998)

================== Part 7: Learning Movie Ratings ====================

Now, you will train the collaborative filtering model on a movie rating
dataset of 1682 movies and 943 users.
fprintf('\nTraining collaborative filtering...\n');
Training collaborative filtering...
% Load data
load('ex8_movies.mat');
% Y is a 1682x943 matrix, containing ratings (1-5) of 1682 movies by
% 943 users
%
% R is a 1682x943 matrix, where R(i,j) = 1 if and only if user j gave a
% rating to movie i
% Add our own ratings to the data matrix
% This is equivalent to adding one new user's data
Y = [my_ratings, Y];
R = [(my_ratings ~= 0), R];
% Normalize Ratings
[Ynorm, Ymean] = normalizeRatings(Y, R);
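Mean normalization subtracts each movie's average rating (computed over rated entries only) so that a user with no ratings is predicted the movie's mean rather than zero. A NumPy sketch of what normalizeRatings computes, on hypothetical toy matrices:

```python
import numpy as np

def normalize_ratings(Y, R):
    """Subtract each movie's mean rating, taken over rated entries only;
    unrated entries stay zero in Ynorm."""
    Ymean = np.zeros(Y.shape[0])
    Ynorm = np.zeros_like(Y, dtype=float)
    for i in range(Y.shape[0]):
        idx = R[i, :] == 1
        if idx.any():
            Ymean[i] = Y[i, idx].mean()
            Ynorm[i, idx] = Y[i, idx] - Ymean[i]
    return Ynorm, Ymean

Y = np.array([[5, 3, 0],
              [0, 2, 4]])
R = np.array([[1, 1, 0],
              [0, 1, 1]])
Ynorm, Ymean = normalize_ratings(Y, R)
print(Ymean)   # [4. 3.]
print(Ynorm)   # [[ 1. -1.  0.]
               #  [ 0. -1.  1.]]
```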
% Useful Values
num_users = size(Y, 2);
num_movies = size(Y, 1);
num_features = 10;
% Set Initial Parameters (Theta, X)
% Random initialization drawn from a standard normal distribution
X = randn(num_movies, num_features);
Theta = randn(num_users, num_features);
initial_parameters = [X(:); Theta(:)];
% Set options for fmincg
options = optimset('GradObj', 'on', 'MaxIter', 100);
% Set Regularization
lambda = 10;
% Train X and Theta
theta = fmincg (@(t)(cofiCostFunc(t, Ynorm, R, num_users, num_movies, ...
num_features, lambda)), ...
initial_parameters, options);
Iteration 1 | Cost: 7.493336e+05
Iteration 2 | Cost: 4.880861e+05
Iteration 3 | Cost: 2.987959e+05
Iteration 4 | Cost: 2.322157e+05
Iteration 5 | Cost: 1.770114e+05
Iteration 6 | Cost: 1.465684e+05
Iteration 7 | Cost: 1.268106e+05
Iteration 8 | Cost: 1.162170e+05
Iteration 9 | Cost: 1.080068e+05
Iteration 10 | Cost: 1.027622e+05
Iteration 11 | Cost: 9.724652e+04
Iteration 12 | Cost: 9.367597e+04
Iteration 13 | Cost: 9.234529e+04
Iteration 14 | Cost: 8.890725e+04
Iteration 15 | Cost: 8.677051e+04
Iteration 16 | Cost: 8.556865e+04
Iteration 17 | Cost: 8.443566e+04
Iteration 18 | Cost: 8.131790e+04
Iteration 19 | Cost: 7.945750e+04
Iteration 20 | Cost: 7.858425e+04
Iteration 21 | Cost: 7.774777e+04
Iteration 22 | Cost: 7.665230e+04
Iteration 23 | Cost: 7.550082e+04
Iteration 24 | Cost: 7.492713e+04
Iteration 25 | Cost: 7.432257e+04
Iteration 26 | Cost: 7.351994e+04
Iteration 27 | Cost: 7.301211e+04
Iteration 28 | Cost: 7.271555e+04
Iteration 29 | Cost: 7.236754e+04
Iteration 30 | Cost: 7.217922e+04
Iteration 31 | Cost: 7.191707e+04
Iteration 32 | Cost: 7.169348e+04
Iteration 33 | Cost: 7.145787e+04
Iteration 34 | Cost: 7.119558e+04
Iteration 35 | Cost: 7.092016e+04
Iteration 36 | Cost: 7.070979e+04
Iteration 37 | Cost: 7.046319e+04
Iteration 38 | Cost: 7.020285e+04
Iteration 39 | Cost: 7.010710e+04
Iteration 40 | Cost: 7.005226e+04
Iteration 41 | Cost: 6.999912e+04
Iteration 42 | Cost: 6.991734e+04
Iteration 43 | Cost: 6.985513e+04
Iteration 44 | Cost: 6.981437e+04
Iteration 45 | Cost: 6.976908e+04
Iteration 46 | Cost: 6.960313e+04
Iteration 47 | Cost: 6.945370e+04
Iteration 48 | Cost: 6.935492e+04
Iteration 49 | Cost: 6.928855e+04
Iteration 50 | Cost: 6.920689e+04
Iteration 51 | Cost: 6.917248e+04
Iteration 52 | Cost: 6.915740e+04
Iteration 53 | Cost: 6.914421e+04
Iteration 54 | Cost: 6.912195e+04
Iteration 55 | Cost: 6.910236e+04
Iteration 56 | Cost: 6.909230e+04
Iteration 57 | Cost: 6.908018e+04
Iteration 58 | Cost: 6.906652e+04
Iteration 59 | Cost: 6.904157e+04
Iteration 60 | Cost: 6.901334e+04
Iteration 61 | Cost: 6.895874e+04
Iteration 62 | Cost: 6.889058e+04
Iteration 63 | Cost: 6.885511e+04
Iteration 64 | Cost: 6.884749e+04
Iteration 65 | Cost: 6.882642e+04
Iteration 66 | Cost: 6.882023e+04
Iteration 67 | Cost: 6.881252e+04
Iteration 68 | Cost: 6.879600e+04
Iteration 69 | Cost: 6.876019e+04
Iteration 70 | Cost: 6.871203e+04
Iteration 71 | Cost: 6.868484e+04
Iteration 72 | Cost: 6.867702e+04
Iteration 73 | Cost: 6.866686e+04
Iteration 74 | Cost: 6.866216e+04
Iteration 75 | Cost: 6.865971e+04
Iteration 76 | Cost: 6.865196e+04
Iteration 77 | Cost: 6.864890e+04
Iteration 78 | Cost: 6.864051e+04
Iteration 79 | Cost: 6.862114e+04
Iteration 80 | Cost: 6.861094e+04
Iteration 81 | Cost: 6.860446e+04
Iteration 82 | Cost: 6.859538e+04
Iteration 83 | Cost: 6.858785e+04
Iteration 84 | Cost: 6.857818e+04
Iteration 85 | Cost: 6.857391e+04
Iteration 86 | Cost: 6.857215e+04
Iteration 87 | Cost: 6.856463e+04
Iteration 88 | Cost: 6.856270e+04
Iteration 89 | Cost: 6.856155e+04
Iteration 90 | Cost: 6.855144e+04
Iteration 91 | Cost: 6.853764e+04
Iteration 92 | Cost: 6.853423e+04
Iteration 93 | Cost: 6.853220e+04
Iteration 94 | Cost: 6.852683e+04
Iteration 95 | Cost: 6.852093e+04
Iteration 96 | Cost: 6.851416e+04
Iteration 97 | Cost: 6.851059e+04
Iteration 98 | Cost: 6.850676e+04
Iteration 99 | Cost: 6.850161e+04
Iteration 100 | Cost: 6.849832e+04
% Unfold the returned theta back into X and Theta
X = reshape(theta(1 : num_movies * num_features), num_movies, num_features);
Theta = reshape(theta(num_movies * num_features + 1 : end), num_users, num_features);
fprintf('Recommender system learning completed.\n');
Recommender system learning completed.
Program paused. Press enter to continue.

================== Part 8: Recommendation for you ====================

After training the model, you can now make recommendations by computing
the predictions matrix.
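The whole prediction-and-ranking step can be sketched in NumPy: take inner products of movie features and user parameters, add back the per-movie mean removed during normalization, then sort descending. Toy parameters stand in for the trained ones:

```python
import numpy as np

# Toy learned parameters (the real run has 1682 movies x 944 users,
# 10 features); the values here are hypothetical.
rng = np.random.default_rng(2)
num_movies, num_users, num_features = 6, 3, 2
X = rng.standard_normal((num_movies, num_features))
Theta = rng.standard_normal((num_users, num_features))
Ymean = rng.uniform(2, 4, num_movies)

# Predictions: inner products, plus the per-movie mean added back
p = X @ Theta.T
my_predictions = p[:, 0] + Ymean   # column 0 is the newly added user

# Top-k recommendations: predicted ratings in descending order
ix = np.argsort(-my_predictions)
for j in ix[:3]:
    print(f"Predicting rating {my_predictions[j]:.1f} for movie {j}")
```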
% Predictions for the new user, like all users, come from the full
% prediction matrix; the per-movie mean is then added back
p = X * Theta';
% Add the mean rating back to the raw predictions
p = 1682×944 matrix (display truncated)
my_predictions = p(:, 1) + Ymean;
movieList = loadMovieList();
[r, ix] = sort(my_predictions, 'descend');
fprintf('\nTop recommendations for you:\n');
Top recommendations for you:
for i = 1 : 10
j = ix(i);
fprintf('Predicting rating %.1f for movie %s\n', my_predictions(j), movieList{j});
end
Predicting rating 4.5 for movie Star Wars (1977)
Predicting rating 4.3 for movie Titanic (1997)
Predicting rating 4.2 for movie Raiders of the Lost Ark (1981)
Predicting rating 4.1 for movie Return of the Jedi (1983)
Predicting rating 4.0 for movie Empire Strikes Back, The (1980)
Predicting rating 4.0 for movie Braveheart (1995)
Predicting rating 3.9 for movie Shawshank Redemption, The (1994)
Predicting rating 3.9 for movie Godfather, The (1972)
Predicting rating 3.8 for movie Schindler's List (1993)
Predicting rating 3.8 for movie Fugitive, The (1993)
fprintf('\n\nOriginal ratings provided:\n');
Original ratings provided:
for i = 1 : length(my_ratings)
if my_ratings(i) > 0
fprintf('Rated %d for %s\n', my_ratings(i), movieList{i});
end
end
Rated 4 for Toy Story (1995)
Rated 3 for Twelve Monkeys (1995)
Rated 5 for Usual Suspects, The (1995)
Rated 4 for Outbreak (1995)
Rated 5 for Shawshank Redemption, The (1994)
Rated 3 for While You Were Sleeping (1995)
Rated 5 for Forrest Gump (1994)
Rated 2 for Silence of the Lambs, The (1991)
Rated 4 for Alien (1979)
Rated 5 for Die Hard 2 (1990)
Rated 5 for Sphere (1998)