Of the fast and loose practices in machine learning that I cringe at, throwing one’s favorite tool at high-dimensional data and expecting algorithms to learn to generalize well.

Part of the problem is we don’t always know if the features we chose are meaningful to the learning problem. Another part of the problem is that a useful regression requires that our sample fills the space adequately to regress.  (An n-sphere has a volume that scales as $V \approx R^n$.  This means that if you feel comfortable with 10 samples for a 1-d linear regression, you should have in your mind a million samples for a 6-d regression.)

One strategy for dealing with high-dimensionality is to rotate and scale the data set along axes of high sample variation. Two closely related mathematical tools are available to help with this: Singular Value Decomposition (SVD) and Principal Component Analysis). I won’t add anything to the mathematics described in these articles.  The purpose here is

1. Explore some questions with a simple example
2. Explore the link between SVD and PCA with some simple data to understand how it works, and build a little bit of intuition on what linear transformation can do for machine learning.
3. Demonstrate how this strategy could work to reduce the dimentionality of a problem (This really turns out to be reducing the dimentionality of the representation of the problem.)
4. Show how this strategy can easily go wrong in order to build intuition about when and how this might work for a real machine learning problem.
There is a link to the all of the code at the bottom.  To get this output, install IPython, the matplotlib, scipy, numpy, scikits.learn packages, then paste code into a session.  Or just edit the script to include the output you want and run from the command line.
Here we go…first SVD.

In [23]: ########################
In [24]: # Demo Part 1
In [25]: ########################
In [26]: # SVD: decompose a matrix X into product of unitary matrix (U), diagonal
In [27]: # matrix (S) and vector V, a rotation, (if it helps, think:
In [28]: # basis vectors, scaling, rotation) such that X.T = W x S x V_t.
In [29]: #
In [30]: # In machine learning, X will be a matrix of samples (4, rows)
In [31]: # and features (3, columns) for our learning examples.
In [32]: X_t = mat([ [ 1, 3, -10 ],
....: [ 0, 2, 1 ],
....: [ -1, -3, 9 ],
....: [ 0, -2, 0 ] ])
In [33]: X = X_t.T
In [34]: # The scipy library makes this step easy
In [35]: W, s, V_t = linalg.svd( X )
In [36]: S = linalg.diagsvd(s, len(X), len(V_t))
In [37]: recon = dot( dot( W, S), V_t)
In [38]: # Are these equal (to within rounding)?
In [39]: abs(X - recon)
Out[39]:
matrix([[ 4.44089210e-16, 1.73472348e-16, 6.66133815e-16,
1.59594560e-16],
[ 8.88178420e-16, 2.22044605e-16, 4.44089210e-16,
1.33226763e-15],
[ 1.77635684e-15, 4.44089210e-16, 5.32907052e-15,
6.73940070e-16]])
In [40]: # maximum error
In [41]: np.max(abs(X - recon))
Out[41]: 5.3290705182007514e-15
In [42]: # One key to understanding the link to PCA is
In [43]: # understanding the diagonal matrix, S
In [44]: S
Out[44]:
array([[ 14.19266482, 0. , 0. , 0. ],
[ 0. , 2.9255743 , 0. , 0. ],
[ 0. , 0. , 0.09633518, 0. ]])
In [45]: # Given that the features have zero-mean:
In [46]: [np.mean(i) for i in X]
Out[46]: [0.0, 0.0, 0.0]
In [47]: # s is an ordered vector that tells us the "significance" of each dimension
In [48]: # in the rotated space. (Yes, I arranged it that way. To be clear,
In [49]: # you can do SVD on a matrix without zero-mean features, but the dimension
In [50]: # reduction part we are about to do requires it.)
In [52]: # We can selectively set lower numbers to
In [53]: # zero to reduce dimension.
In [54]: s_red = s
In [55]: s_red[2] = 0
In [56]: # Approximately reconstruct our original matrix, but with
In [57]: # a reduced-dimension representation
In [58]: S_red = linalg.diagsvd(s_red, len(X), len(V_t))
In [59]: S_red
Out[59]:
array([[ 14.19266482, 0. , 0. , 0. ],
[ 0. , 2.9255743 , 0. , 0. ],
[ 0. , 0. , 0. , 0. ]])
In [60]: recon_red = dot( dot( W, S_red), V_t)
In [61]: abs(X - recon_red)
Out[61]:
matrix([[ 0.04297068, 0.04061026, 0.05215936, 0.05451977],
[ 0.00118308, 0.00111809, 0.00143606, 0.00150105],
[ 0.00412864, 0.00390185, 0.00501149, 0.00523828]])
In [62]: # maximum error
In [63]: np.max(abs(X - recon_red))
Out[63]: 0.054519772460470559
In [64]: # ratio of errors
In [65]: np.max(abs(X - recon))/np.max(abs(X - recon_red))
Out[65]: 9.7745648554652029e-14
In [66]:
In [66]: # We "lost" 14 orders of magnitude in precision of the reconstruction, but this turns
In [67]: # out to be okay for some machine learning problems.

So that is a very simple SVD example.  Now use the same representation for a simple machine learning problem.   Classify to groups of data in a two dimensional space using Logistic Regression.   (Logistic regression is very good a solving problems just like this without SVD or PCA, so this is merely to get at how it all works together.)

In [68]: ########################
In [69]: # Demo Part 2
In [70]: ########################
In [71]: from scikits.learn import linear_model
In [72]: import scipy
In [73]: from copy import copy
In [74]: # Classification problem where points are linearly classifiable
In [75]: # in 2 dim.
In [76]: N=200
In [77]: x = np.random.normal(0,4,N)
In [78]: y1 = 3*x + np.random.normal(3,1,N)
In [79]: y2 = 3*x + np.random.normal(-3,1,N)
In [80]: y_avg = np.mean(np.append(y1, y2, 1))
In [81]: figure()
In [82]: plot(x,y1 - y_avg, 'bo');
In [83]: plot(x,y2 - y_avg, 'ro');
In [84]: title("Original data sets (0-average)")

Boundary between groups at $y(x) \approx 3x$

In [85]: # features x and y are rows in this matrix
In [86]: X = np.append([x,y1 - y_avg],[x,y2 - y_avg],1)
In [87]: X_t = X.T
In [88]: # y1 group 0; y2 group 1
In [89]: truth = np.append(scipy.ones([1, N]), scipy.zeros([1, N]), 1)
In [90]: # 2d model works very well
In [91]: lr = linear_model.LogisticRegression()
In [92]: lr.fit(np.asarray(X_t),truth[0])
Out[92]:
LogisticRegression(C=1.0, intercept_scaling=1, dual=False, fit_intercept=True,
penalty='l2', tol=0.0001)
In [93]: lr.score(np.asarray(X_t),truth[0])
Out[93]: 1.0
In [94]: lr.predict(np.asarray(X_t))
Out[94]:
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)
In [95]: # Now try dimension reduciton with SVD
In [96]: W, s, V_t = linalg.svd( X )
In [97]: s
Out[97]: array([ 273.98689602, 19.47579928])
In [98]: # both transformed representations are the same before trimming:
In [99]: S = linalg.diagsvd(s, len(X), len(V_t))
In [100]: np.max(abs(X.T*matrix(W) - matrix(V_t).T*matrix(S).T))
Out[100]: 6.7501559897209518e-14
In [101]: # Now work with the transformed coordinates. It might not have been clear
In [102]: # from above what the transformed coordinate system was. We can get there
In [103]: # by either the product of the first two or last two terms.
In [104]: X_prime = matrix(V_t).T*matrix(S).T
In [105]: x_prime = np.asarray(X_prime.T[0])
In [106]: y_prime = np.asarray(X_prime.T[1])
In [107]: figure()
In [108]: plot(x_prime, y_prime, 'go');
In [109]: title("Features after SVD Transformation")
Out[109]: <matplotlib.text.Text at 0x11b956890>

Boundary between groups at $y(x) \approx 0$

In [110]: # Linearly classifiable in 1-d? Try all new basis directions (extremes of variation)
In [111]: # Min variation - Training along y-dim nearly perfect
In [112]: ypt = np.asarray(y_prime.T)
In [113]: lr.fit(ypt, truth[0])
Out[113]:
LogisticRegression(C=1.0, intercept_scaling=1, dual=False, fit_intercept=True,
penalty='l2', tol=0.0001)
In [114]: lr.score(ypt, truth[0])
Out[114]: 0.99750000000000005
In [115]: lr.predict(ypt)
Out[115]:
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)
In [116]: # Max variation - Nothing here
In [117]: xpt = np.asarray(x_prime.T)
In [118]: lr.fit(xpt, truth[0])
Out[118]:
LogisticRegression(C=1.0, intercept_scaling=1, dual=False, fit_intercept=True,
penalty='l2', tol=0.0001)
In [119]: lr.score(xpt, truth[0])
Out[119]: 0.58250000000000002
In [120]: lr.predict(xpt)
Out[120]:
array([1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0,
1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0,
0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0,
0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1,
0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1,
0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1,
1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1,
1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0,
1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1,
1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1,
0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1,
0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0,
0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0,
0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0,
1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0,
0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1,
1, 0, 0, 0, 1, 1, 0, 1, 0], dtype=int32)

Notice that the transformation made the problem “easier” in that it was solved with 1-d instead of 2-d machine learning and that the two groups appear even more separated after the transformation than before.

Lesson 1: Look at all of the dimensions–in this case the smallest variation axis rather than the largest variation axis solves the problem.  This is going to catch anyone who blindly applies PCA for machine learning.  See Part 3.

In [121]: ########################
In [122]: # Demo Part 3
In [123]: ########################
In [124]: # Use PCA idea to reduce to 1-D
In [125]: s_red = copy(s)
In [126]: s_red[1] = 0
In [127]: S_red = linalg.diagsvd(s_red, len(X), len(V_t))
In [128]: X_prime = matrix(V_t).T*matrix(S_red).T
In [129]: x_prime = np.asarray(X_prime.T[0])
In [130]: y_prime = np.asarray(X_prime.T[1])
In [131]: figure()
Out[131]: <matplotlib.figure.Figure at 0x1193e7450>
In [132]: plot(x_prime, y_prime, 'yo');
In [133]: title("Reduce S by removing s[1] = %2.5f"%s[1])

All original group information lost.

In [134]: # Try all new basis directions (not just greatest variations)
In [135]: # 1-D: Max variation - Training along x-dim performs poorly
In [136]: ypt = np.asarray(y_prime.T)
In [137]: lr.fit(ypt, truth[0])
Out[137]:
LogisticRegression(C=1.0, intercept_scaling=1, dual=False, fit_intercept=True,
penalty='l2', tol=0.0001)
In [138]: lr.score(ypt, truth[0])
Out[138]: 0.5
In [139]: lr.predict(ypt)
Out[139]:
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)
In [140]: # This is the other extrema to "principal" components
In [141]: s_red = copy(s)
In [142]: s_red[0] = 0
In [143]: S_red = linalg.diagsvd(s_red, len(X), len(V_t))
In [144]: X_prime = matrix(V_t).T*matrix(S_red).T
In [145]: x_prime = np.asarray(X_prime.T[0])
In [146]: y_prime = np.asarray(X_prime.T[1])
In [147]: figure()
Out[147]: <matplotlib.figure.Figure at 0x11c74f6d0>
In [148]: plot(x_prime, y_prime, 'mo');
In [149]: title("Reduce S by removing value s[0] = %2.5f"%s[0])

Trivial classification, now in 1-d.

In [150]: # 1-D: Min variation - Training along y-dim nearly perfect
In [151]: ypt = np.asarray(y_prime.T)
In [152]: lr.fit(ypt, truth[0])
Out[152]:
LogisticRegression(C=1.0, intercept_scaling=1, dual=False, fit_intercept=True,
penalty='l2', tol=0.0001)
In [153]: lr.score(ypt, truth[0])
Out[153]: 0.99750000000000005
In [154]: lr.predict(ypt)
Out[154]:
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)

So our 1-d model performs great. Quoting from Wikipedia,

“PCA essentially rotates the set of points around their mean in order to align with the principal components. This moves as much of the variance as possible (using an orthogonal transformation) into the first few dimensions. The values in the remaining dimensions, therefore, tend to be small and may be dropped with minimal loss of information.”

The problem with following this without skepticism is that, according to the guidelines of PCA, our problem depends on using the “wrong” dimension.  (The properties are not quite as arbitrary as they may seem. For example, race data for long distances that includes both men and women have this quality because the variation within gender of total race times can be greater than the variation between gender times.)

Code on Github