Machine Learning Online Class

Exercise 7 | Principal Component Analysis and K-Means Clustering
Instructions
------------
This file contains code that helps you get started on the
exercise. You will need to complete the following functions:
pca.m
projectData.m
recoverData.m
computeCentroids.m
findClosestCentroids.m
kMeansInitCentroids.m
For this exercise, you will not need to change any code in this file,
or any other files other than those mentioned above.

Initialization

clear ; close all; clc;

================== Part 1: Load Example Dataset ===================

We start this exercise by using a small dataset that is easy to visualize.
fprintf('Visualizing example dataset for PCA.\n\n');
% The following command loads the dataset. You should now have the
% variable X in your environment
load ('ex7data1.mat');
% Visualize the example dataset
plot(X(:, 1), X(:, 2), 'bo');
axis([0.5 6.5 2 8]);
% "axis style" applies a predefined style to the axis limits and scaling. For
% example, specifying the style "equal" uses equal data unit lengths along
% each axis.
axis square;

=============== Part 2: Principal Component Analysis ===============

You should now implement PCA, a dimension reduction technique. You
should complete the code in pca.m
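For reference, a minimal sketch of pca.m, assuming the covariance-plus-SVD formulation this exercise uses (a sketch, not the graded solution):

function [U, S] = pca(X)
% Sketch: take the SVD of the covariance matrix of the (normalized) data.
[m, ~] = size(X);
Sigma = (1 / m) * (X' * X);   % n x n covariance matrix
[U, S, ~] = svd(Sigma);       % columns of U are the principal directions
end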
fprintf('\nRunning PCA on example dataset.\n\n');
% Before running PCA, it is important to first normalize X
[X_norm, mu, ~] = featureNormalize(X);
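featureNormalize is provided with the exercise; a sketch of the standard zero-mean, unit-variance normalization it performs:

function [X_norm, mu, sigma] = featureNormalize(X)
% Sketch: subtract the per-feature mean, then divide by the std. deviation.
mu = mean(X);
X_norm = bsxfun(@minus, X, mu);
sigma = std(X_norm);
X_norm = bsxfun(@rdivide, X_norm, sigma);
end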
% Run PCA
[U, S] = pca(X_norm);
% mu, the mean of each feature, was computed by featureNormalize above
% Draw the eigenvectors centered at mean of data. These lines show the
% directions of maximum variations in the dataset.
hold on;
% U is the basis obtained by minimizing the mean squared projection error

drawLine(mu, mu + 1.5 * S(1, 1) * U(:, 1)', '-k', 'LineWidth', 2);
drawLine(mu, mu + 1.5 * S(2, 2) * U(:, 2)', '-k', 'LineWidth', 2);
hold off;
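drawLine is a small helper shipped with the exercise; a plausible sketch of what it does (extra arguments are forwarded to plot):

function drawLine(p1, p2, varargin)
% Sketch: draw a line segment from point p1 to point p2.
plot([p1(1) p2(1)], [p1(2) p2(2)], varargin{:});
end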
fprintf('Top eigenvector: \n');
fprintf(' U(:,1) = %f %f \n', U(1,1), U(2,1));
fprintf('\n(you should expect to see -0.707107 -0.707107)\n');

=================== Part 3: Dimension Reduction ===================

You should now implement the projection step to map the data onto the
first k eigenvectors. The code will then plot the data in this reduced
dimensional space. This will show you what the data looks like when
using only the corresponding eigenvectors to reconstruct it.
You should complete the code in projectData.m
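A minimal sketch of projectData, assuming the standard formulation in which each example is projected onto the top K columns of U:

function Z = projectData(X, U, K)
% Sketch: project each example (row of X) onto the top K eigenvectors.
Z = X * U(:, 1:K);   % m x K matrix of projections
end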
fprintf('\nDimension reduction on example dataset.\n\n');
% Plot the normalized dataset (returned from featureNormalize)
plot(X_norm(:, 1), X_norm(:, 2), 'bo');
axis([-4 3 -4 3]); axis square
% Project the data onto K = 1 dimension
K = 1;
Z = projectData(X_norm, U, K);
fprintf('Projection of the first example: %f\n', Z(1));
fprintf('\n(this value should be about 1.481274)\n\n');
X_rec = recoverData(Z, U, K);
fprintf('Approximation of the first example: %f %f\n', X_rec(1, 1), X_rec(1, 2));
fprintf('\n(this value should be about -1.047419 -1.047419)\n\n');
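recoverData approximately inverts the projection; a minimal sketch under the same assumptions:

function X_rec = recoverData(Z, U, K)
% Sketch: map the K-dimensional projections back to the original space.
X_rec = Z * U(:, 1:K)';   % m x n approximate reconstruction of X_norm
end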
% Draw lines connecting the projected points to the original points
hold on;
plot(X_rec(:, 1), X_rec(:, 2), 'ro');
for i = 1:size(X_norm, 1)
drawLine(X_norm(i,:), X_rec(i,:), '--k', 'LineWidth', 1);
end
hold off

=============== Part 4: Loading and Visualizing Face Data =============

We start the exercise by first loading and visualizing the dataset.
The following code will load the dataset into your environment
fprintf('\nLoading face dataset.\n\n');
% Load Face dataset
load ('ex7faces.mat')
% Display the first 100 faces in the dataset
displayData(X(1:100, :));
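displayData is a helper provided with the exercise; a simplified, hypothetical stand-in (displayFacesGrid is not part of the exercise code) that conveys the idea, assuming 32 x 32 grayscale faces stored one per row:

function displayFacesGrid(X)
% Sketch: tile the first 100 rows of X into a 10 x 10 grid of 32 x 32 images.
grid_img = zeros(10 * 32, 10 * 32);
for i = 1:100
    r = floor((i - 1) / 10);
    c = mod(i - 1, 10);
    grid_img(r*32 + (1:32), c*32 + (1:32)) = reshape(X(i, :), 32, 32);
end
imagesc(grid_img); colormap gray; axis image off;
end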

=========== Part 5: PCA on Face Data: Eigenfaces ===================

Run PCA and visualize the eigenvectors, which in this case are eigenfaces.
We display the first 36 eigenfaces.
fprintf(['\nRunning PCA on face dataset.\n' ...
'(this might take a minute or two ...)\n\n']);
% Before running PCA, it is important to first normalize X by subtracting
% the mean value from each feature
[X_norm, ~, ~] = featureNormalize(X);
% Run PCA
[U, ~] = pca(X_norm);
% Visualize the top 36 eigenvectors found

% Question: what do the plotted eigenvectors represent?
% Each column of U has the same length as a face image (1024 pixels), so it
% can be rendered as a 32 x 32 image: an "eigenface", a principal direction
% of variation across the face dataset.

displayData(U(:, 1:36)');

============= Part 6: Dimension Reduction for Faces =================

Project images onto the eigenspace using the top K eigenvectors. If you are
applying a machine learning algorithm afterwards, you can use the projected
data instead of the original images.
fprintf('\nDimension reduction for face dataset.\n\n');
K = 100;
Z = projectData(X_norm, U, K);
fprintf('The projected data Z has a size of: ')
fprintf('%d ', size(Z));
% Expected size: 5000 100
% The projected features Z are no longer the original X: each dimension has a
% different meaning, so displaying Z directly is not meaningful.
displayData(Z(1:100, :));
% Only after mapping back to the original space do the displayed images make sense
X_rec = recoverData(Z, U, K);
displayData(X_rec(1:100, :));

==== Part 7: Visualization of Faces after PCA Dimension Reduction ====

Project images onto the eigenspace using the top K eigenvectors and
visualize them using only those K dimensions.
Compare to the original input, which is also displayed.
fprintf('\nVisualizing the projected (reduced dimension) faces.\n\n');
K = 100;
X_rec = recoverData(Z, U, K);
% Display normalized data
subplot(1, 2, 1);
displayData(X_norm(1:100,:));
title('Original faces');
axis square;
% Display reconstructed data from only k eigenfaces
subplot(1, 2, 2);
displayData(X_rec(1:100,:));
title('Recovered faces');
axis square;

=== Part 8(a): Optional (ungraded) Exercise: PCA for Visualization ===

One useful application of PCA is to use it to visualize high-dimensional
data. In the last K-Means exercise you ran K-Means on 3-dimensional
pixel colors of an image. We first visualize this output in 3D, and then
apply PCA to obtain a visualization in 2D.
close all; clc;
% Reload the image from the previous exercise and run K-Means on it
% For this to work, you need to complete the K-Means assignment first
A = double(imread('bird_small.png'));
% If imread does not work for you, you can try instead
% load ('bird_small.mat');
A = A / 255; % scale values to the range [0, 1]
img_size = size(A);
X = reshape(A, img_size(1) * img_size(2), 3); % each row is one pixel's (R, G, B) values
K = 16;
max_iters = 10;
initial_centroids = kMeansInitCentroids(X, K);
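kMeansInitCentroids is one of the functions you complete; a minimal sketch, assuming the random-example initialization described in the exercise:

function centroids = kMeansInitCentroids(X, K)
% Sketch: pick K distinct examples at random as the initial centroids.
randidx = randperm(size(X, 1));
centroids = X(randidx(1:K), :);
end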
[centroids, idx] = runkMeans(X, initial_centroids, max_iters);
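runkMeans (provided) alternates between the other two functions you complete; minimal sketches, assuming the standard K-Means update rules:

function idx = findClosestCentroids(X, centroids)
% Sketch: assign each example to the centroid with the smallest
% squared Euclidean distance.
m = size(X, 1);
idx = zeros(m, 1);
for i = 1:m
    dists = sum(bsxfun(@minus, centroids, X(i, :)).^2, 2);
    [~, idx(i)] = min(dists);
end
end

function centroids = computeCentroids(X, idx, K)
% Sketch: move each centroid to the mean of the examples assigned to it
% (assumes every cluster is non-empty).
n = size(X, 2);
centroids = zeros(K, n);
for k = 1:K
    centroids(k, :) = mean(X(idx == k, :), 1);
end
end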
% Sample 1000 random indexes (since working with all the data is too
% expensive). If you have a fast computer, you may increase this.
sel = floor(rand(1000, 1) * size(X, 1)) + 1;
% rand gives 1000 values in [0, 1); multiplying by the number of samples and
% flooring yields 1000 random sample indices. Adding 1 prevents a zero index,
% since MATLAB/Octave indexing starts at 1.
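The same sampling can be written more directly with the builtin randi:
% sel = randi(size(X, 1), 1000, 1);   % 1000 random indices in 1..size(X, 1)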
% Setup Color Palette
palette = hsv(K);
colors = palette(idx(sel), :);
% Visualize the data and centroid memberships in 3D
figure;
scatter3(X(sel, 1), X(sel, 2), X(sel, 3), 10, colors);
title('Pixel dataset plotted in 3D. Color shows centroid memberships');

=== Part 8(b): Optional (ungraded) Exercise: PCA for Visualization ===

Use PCA to project this cloud to 2D for visualization
% Subtract the mean to use PCA
[X_norm, mu, sigma] = featureNormalize(X);
% PCA and project the data to 2D
[U, S] = pca(X_norm);
Z = projectData(X_norm, U, 2);
% Plot in 2D
figure;
% idx(sel) gives the cluster index of each sampled pixel, used for coloring
plotDataPoints(Z(sel, :), idx(sel), K);
title('Pixel dataset plotted in 2D, using PCA for dimensionality reduction');
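plotDataPoints is another provided helper; a plausible sketch of its behavior (coloring each point by its cluster index):

function plotDataPoints(X, idx, K)
% Sketch: scatter-plot 2-D points, colored by cluster assignment.
palette = hsv(K + 1);
colors = palette(idx, :);
scatter(X(:, 1), X(:, 2), 15, colors);
end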