Machine Learning Online Class

Exercise 8 | Anomaly Detection and Collaborative Filtering
Instructions
------------
This file contains code that helps you get started on the
exercise. You will need to complete the following functions:
estimateGaussian.m
selectThreshold.m
cofiCostFunc.m
For this exercise, you will not need to change any code in this file,
or in any files other than those mentioned above.

Initialization

clear ; close all; clc;

================== Part 1: Load Example Dataset ===================

We start this exercise by using a small dataset that is easy to
visualize.
Our example case consists of 2 network server statistics across
several machines: the latency and throughput of each machine.
This exercise will help us find possibly faulty (or very fast) machines.
fprintf('Visualizing example dataset for outlier detection.\n\n');
Visualizing example dataset for outlier detection.
% The following command loads the dataset. You should now have the
% variables X, Xval, yval in your environment
load('ex8data1.mat');
% Visualize the example dataset
plot(X(:, 1), X(:, 2), 'bx');
axis([0 30 0 30]);
xlabel('Latency (ms)');
ylabel('Throughput (mb/s)');

================== Part 2: Estimate the dataset statistics ===================

For this exercise, we assume a Gaussian distribution for the dataset.
We first estimate the parameters of our assumed Gaussian distribution,
then compute the probabilities for each of the points and then visualize
both the overall distribution and where each of the points falls in
terms of that distribution.
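For reference, the quantities used in this part can be written out as below. This is a sketch under two assumptions that the script itself does not state: the features are treated as independent, and the variance is normalised by m rather than m - 1. Here m is the number of examples and n the number of features.

```latex
\mu_j = \frac{1}{m}\sum_{i=1}^{m} x_j^{(i)}, \qquad
\sigma_j^2 = \frac{1}{m}\sum_{i=1}^{m}\bigl(x_j^{(i)} - \mu_j\bigr)^2, \qquad
p(x) = \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi\sigma_j^2}}
       \exp\!\left(-\frac{(x_j - \mu_j)^2}{2\sigma_j^2}\right)
```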
fprintf('Visualizing Gaussian fit.\n\n');
Visualizing Gaussian fit.
% Estimate mu and sigma2
[mu, sigma2] = estimateGaussian(X);
% Returns the density of the multivariate normal at each data point (row)
% of X
p = multivariateGaussian(X, mu, sigma2);
% Visualize the fit
visualizeFit(X, mu, sigma2);
xlabel('Latency (ms)');
ylabel('Throughput (mb/s)');
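
The estimation and density steps above can be illustrated on a tiny made-up dataset. This is only a sketch, not the reference solution for estimateGaussian.m: it assumes sigma2 holds one variance per feature (normalised by m) and that the features are independent. The names Xtoy, mu_toy, sigma2_toy and p_toy are hypothetical and chosen so they do not overwrite the variables used by this script.

```matlab
% Toy data (hypothetical): 4 examples (rows), 2 features (columns)
Xtoy = [1 2; 3 4; 5 6; 7 8];

% Per-feature mean and variance; var(..., 1) normalises by m, not m - 1
mu_toy = mean(Xtoy)';          % 2x1 column vector
sigma2_toy = var(Xtoy, 1)';    % 2x1 column vector

% Density of each row under independent per-feature Gaussians
D = bsxfun(@minus, Xtoy, mu_toy');                % centre each row
E = bsxfun(@rdivide, D .^ 2, 2 * sigma2_toy');    % squared, scaled deviations
N = sqrt(2 * pi * sigma2_toy');                   % per-feature normalisers
p_toy = prod(bsxfun(@rdivide, exp(-E), N), 2);    % 4x1 vector of densities
```

Rows that lie far from mu_toy get a smaller p_toy, which is what the threshold in the next part acts on.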

================== Part 3: Find Outliers ===================

Now you will find a good epsilon threshold using the cross-validation set:
the probabilities of its points under the estimated Gaussian distribution
are compared against the known labels in yval.
% Reuse the same mu and sigma2; calling this function on Xval gives the
% densities needed for detection
pval = multivariateGaussian(Xval, mu, sigma2);
[epsilon, F1] = selectThreshold(yval, pval);
fprintf('Best epsilon found using cross-validation: %e\n', epsilon);
Best epsilon found using cross-validation: 8.961568e-05
fprintf('Best F1 on Cross Validation Set: %f\n', F1);
Best F1 on Cross Validation Set: 0.875000
fprintf(' (you should see a value epsilon of about 8.99e-05)\n');
(you should see a value epsilon of about 8.99e-05)
fprintf(' (you should see a Best F1 value of 0.875000)\n\n');
(you should see a Best F1 value of 0.875000)
% Find the outliers in the training set and plot them.
% A logical mask is used below; find(p < epsilon) would give the same points as indices.
outliers = p < epsilon;
% Draw a red circle around those outliers
hold on
plot(X(outliers, 1), X(outliers, 2), 'ro', 'LineWidth', 2, 'MarkerSize', 10);
hold off
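
One way to picture what a threshold search of this kind does is to scan candidate values and keep the one with the best F1 score, where F1 = 2 * prec * rec / (prec + rec). The sketch below is an assumption about the general approach, not the reference solution for selectThreshold.m: the toy data, the grid of 1001 candidates, and the names pval_toy, yval_toy, best_eps and best_F1 are all hypothetical.

```matlab
% Toy cross-validation data (hypothetical): densities and labels (1 = anomaly)
pval_toy = [0.001; 0.002; 0.20; 0.25; 0.30; 0.35];
yval_toy = [1; 1; 0; 0; 0; 0];

best_eps = 0;
best_F1 = 0;
% Assumed grid: 1001 evenly spaced candidates between min and max density
candidates = linspace(min(pval_toy), max(pval_toy), 1001);

for eps_k = candidates
    pred = pval_toy < eps_k;                  % flag low-density points
    tp = sum(pred == 1 & yval_toy == 1);      % true positives
    fp = sum(pred == 1 & yval_toy == 0);      % false positives
    fn = sum(pred == 0 & yval_toy == 1);      % false negatives
    if tp == 0
        continue;                             % F1 is zero or undefined here
    end
    prec = tp / (tp + fp);
    rec = tp / (tp + fn);
    F1_k = 2 * prec * rec / (prec + rec);
    if F1_k > best_F1                         % keep the first best candidate
        best_F1 = F1_k;
        best_eps = eps_k;
    end
end
```

Skipping candidates with no true positives avoids a division by zero when nothing is flagged. On this toy data the two anomalies are cleanly separated, so the scan reaches a perfect F1.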

================== Part 4: Multidimensional Outliers ===================

We will now use the code from the previous part and apply it to a
harder problem in which more features describe each datapoint and only
some features indicate whether a point is an outlier.
% Loads the second dataset. You should now have the
% variables X, Xval, yval in your environment
load('ex8data2.mat');
% Apply the same steps to the larger dataset
[mu, sigma2] = estimateGaussian(X);
% Training set
% The training-set densities p can be used to find outliers in the training set
p = multivariateGaussian(X, mu, sigma2);
% Cross-validation set
pval = multivariateGaussian(Xval, mu, sigma2);
% Find the best threshold
[epsilon, F1] = selectThreshold(yval, pval);
fprintf('Best epsilon found using cross-validation: %e\n', epsilon);
Best epsilon found using cross-validation: 1.371661e-18
fprintf('Best F1 on Cross Validation Set: %f\n', F1);
Best F1 on Cross Validation Set: 0.615385
fprintf(' (you should see a value epsilon of about 1.38e-18)\n');
(you should see a value epsilon of about 1.38e-18)
fprintf(' (you should see a Best F1 value of 0.615385)\n');
(you should see a Best F1 value of 0.615385)
% Use the training-set densities p to count the outliers in the training set
fprintf('# Outliers found: %d\n\n', sum(p < epsilon));
# Outliers found: 117