This Strata (NY, 2012) talk caught my attention more than any talk at the conference. Ms. Ravich made a request for developers to create better decision tools. (Did she confuse this group for a mythical Software Engineer/Game Theory conference?)

Ms. Ravich started with “I am not a big fan of the information revolution.” That’s a gutsy start given the crowd. But fortunately we were all drowsy, no one reacted. Technically, she was one of the best speakers–she spoke clearly and slowly, her argument was logically organized, she told a good story, and used a powerful myth as a supporting metaphor for her point.

The form of the request was shaped by the idea of fast and slow thinking. Fast thinking at its best synthesizes and sorts quickly. You need fast thinking to sort out what to think slowly about. Then she delivered a couple of assertions. “I think strategic decision makers are in real danger of the information revolution swamping our ability to do fast thinking. And that’s the very attribute we need to do to make the hard policy choices.”

What does “information revolution” mean? Apparently it is a movement or -ism or evolution or situation that can change basic human psychology and erodes the ability to do fast thinking. And what is the case for more fast thinking in policy making? Heuristics for decision making are so natural we barely realize we are using them. They are great because they are fast and we feel certain about them. Also, they can be create huge liabilities when used to make decisions about long-term policy. That feeling of certainty is associated with confirmation bias, attention bias, willful framing naivete, unconscious anchoring biases, …

Ravich goes on to explain the assertions above with an example from the Bush (43) administration dealing with the challenges of nation building in Afghanistan. Afghan was growing a lot (most) of World’s opium poppies. I am sure this caused many economic, border, organized crime, monetary, etc problems. But Ms. Ravich’s explanation for why this was bad was that it offended our national pride. So, we decided to destroy the poppies. This did not endear us with the farmers nor stop them from growing poppies.

Ravich explains that the poor process of making the decision was due to the inability of decision makers to “rack and stack the importance of each bit of information to see how it aligned with our goal.”

Following this explanation was the request: “If strategic decision makers in the situation room are going to win the information revolution, developers need a better insight into the thought process of how the policy decision makers reason and think, how we assemble and prioritize information.”

I am afraid I heard something a little like this… Look, we are good at making gut decisions. We can make them fast. We feel and act confidently about them. But you guys didn’t make the proper context for our heuristics and biases so they didn’t reflect reality. Do better next time.

On one hand, fair enough. That’s the job I signed up for. But it also seems there is room here for more responsible accounting for biases on the part of the decision makers? And that sometimes means wading through boring data and trying to understand something you don’t already understand.

tags: , ,

At Gnip, we parse about half a billion JSON activities from our firehoses of social media every day. Until recently, I believed that the time I would save parsing social activities with C++ command line tool would more than justify additional time it takes to develop in C++. This turns out to be wrong.

Comparing the native JSON parser in Python2.7 and the UltraJSON parser to a C++ implementation linked to jsoncpp indicates that UltraJSON is by far the best choice, achieveing about twice the parsing rate of C++ for Gnip’s normalized JSON Activity Stream format. UltraJSON parsed Twitter activities at near 20MB/second.

Plot of elapsed time to parse increasingly large JSON files.  (Lower numbers are better.)

Additional details, scripts, data and code is available on github.

In my last post I compared dp-means and k-means error functions and run times.  John Myles White pointed to some opportunities that come from $\lambda$ being a continuous variable.

Evolving the test code I posted on github, I developed a quick-and-dirty proof of concept.

First, below is the parameter vs. error graph in its latest incarnation.  There are two important changes from the analogous graph from last post:

• Instead of using the k-means cost function to make the timing, error comparisons as I did before, I am now plotting the traditional k-means cost function for k-means and the cost function for dp-means,

$\text{Cost(K-means)} + \lambda k$

• I am not plotting against $\text{data range}/\lambda$ for comparison
• I am plotting errors for a data set not used in training (called cross-validation in the code).

The cost function for dp-means shows a clear minimum. This graph is slightly confusing because the parameter for k-means, k, the number of clusters, increases left-to-right, while the number of clusters in dp-means goes down with increasing parameter $\lambda$.

I wrote a small script that leverages SciPy to optimize the dp-means cost function in order to determine the optimal value of $\lambda$, and therefore the number of clusters.

Here is an example on one of the data sets included as an example “input” directory.  This code runs slowly, but converges to a minimum at,

lambda: 5.488
with error: 14.2624

Here is a sample training at the optimal value with only the data as input (the code determines everything needed from the data.)

Figure shows training iterations for centers, training
data membership, cross-validation data membership.

The code is rough and inefficient, but the method seems robust enough to proceed to work on smoothing things out and run more tests. Neat.

A recently published paper explains a Bayesian approach to clustering. Revisiting k-means: New Algorithms via Bayesian Nonparametrics motivates and explores the idea of using a scale parameter $\lambda$ to control the creation of new clusters during clustering rather than requiring the Data Scientist to set the number of clusters, $k$, as in k-means.  John Myles White coded this in R and shows some example clusterings with varying $\lambda$, but doesn’t dig into quantitiative comparisons. (BTW, subscribe to his blog.)

After looking at John’s plots, you may ask if there is any better motivation for choosing a scale parameter than the number of clusters–both seem ad hoc and to require experienced judgement to get the “best” result.  (I get fidgety when people say they just used k-means and it worked great–k-means always gives an answer, so “success” in the simple sense doesn’t mean much.)

In what ways could dp-means be an improvement over k-means?

• Parameter choice.  The scale of the upper and lower bounds can be calculated from the data.  In general, we can bound $\lambda$ at the high end with the range of the sample data and at the lower end, some measure of the nearness of the nearest data points, or possible, the smallest expected cluster size.
• Time cost to minimize error.  K-means time cost is approximately linear in k, and at first glace the time-scaling of dp-means with number of clusters (not with $\lambda$) appears to scale linearly as well (with dp-means, smaller $\lambda$ corresponds to more clusters), but it is not clear that this is better or worse than k-means in practical cases.

There are no proofs here, just numerical exploration and intuition building.  I coded dp-means in Python in a way that let me leave as much common code between k-means and dp-means and get lots of diagnostics.  These implementations aren’t optimized.  The code and examples shown here are available on github.

First, a 2-d version with three nicely separated clusters.  Here’s the original data,

A couple of the clusters are spread along one dimension to make it a little more interesting. Here’s an example of training dp-means:

And the mean-squared error per sample point:

In my versions of by k-means and dp-means, the algorithm stops when the change of error between iterations drops below a pre-defined tolerance. Download the code and play with training dp-means with varying scale parameter to see how decreasing  $\lambda$ increases the number of clusters.

Ok, let’s get on with the comparison…in this example, we have 3 features and 5 overlapping clusters.

The error vs. parameter plots for dp-means and k-means are shown below.

The parameter for k-means is the number of clusters while parameter I am plotting for dp-means is $1 /\lambda$ (approximately the reciprocal of the minimum cluster size), so they cannot really be compared. However, in this graph it is easy to see where the number of clusters in dp-means match those in k-means.

For dp-means, there are no changes in the error if we set the parameter $\lambda > \sqrt{3}(x_{max}-x_{min})$ because no cluster will be larger than a cube containing data.  Both k-means and dp-means continue to improve with additional clusters until each sample point is at the center of its own cluster.

The classic “elbow” at  $k = 3$  indicates separated clusters found by k-means.  Around $1 / \lambda \approx 3$ there is a “lap” (continuing with the body metaphors) for dp-means.  Is this a reliable heuristic for training dp-means?  It results in 11 clusters.

The dp-means algorithm achieves 4 clusters (fairly consistently) around  $\lambda \approx 1.7$.

With respect to parameter, dp-means may come out slightly ahead, but it might be something of a matter of taste.

How about the time cost of lower error? I both algorithms we have to search the space for the parameter value and cluster centers that give the minimum error. This means running the algorithm repeatedly, where we achieve a range of local minima on each iteration and choose the lowest value as our best fit.  So one way to look at the time-cost of each algorithm is to compare the minimum error for a range of total search iterations.

The graphs below plot $\log(error)$ vs. $\log(time)$ for dp-means and k-means for 4, 10, and 20 search iterations.  Dp-means is slightly more efficient in each case, but only slightly.

Conclusion. I will continue to explore dp-means because of the parameter advantages, but the time advantages seem negligible.  What do you think?

tags: ,

Maybe this is a counter example.  The outcome of this activity illustrates one of my points for  my introductory “R, For Example” post. Specifically, the format of the data is nearly everything.  Here’s the backstory of the data.  I’ll wait.  Okay, here’s the R problem…

The goal was to create a stacked chart of aligned time series plots like this:

The plots have different y-scales–and this is where the trouble starts. ggplot does a great job of generating every appropriate y-tick marks and giving everything that ggplot prettiness. But if you do this plot naively, you get a plot where the x-ranges are identical, but not aligned.  Here was my original graph,

The arrow shows the alignment problem. This is because printing the scales takes up different space, depending on the magnitude (decade, really) of the tick labels.

I imagined the easy way to fix this was a custom formatter, so I write a quick bit of code to format the y-tick labels with the right number of leading spaces.

# Fix it by manually formatting the y-tick labels
lalb <- function(x) {
lx <- sprintf('%04d', x) # Format the strings as HH:MM:SS
lx <- gsub('^000', ' ', lx) # Remove leading 00: if present
lx <- gsub('^00', ' ', lx) # Remove leading 00: if present
lx <- gsub('^0', ' ', lx) # Remove leading 0 if present
}


Quick and easy, right? Nope. The tick labels use proportionally spaced fonts, so this gets hack gets us closer, but not quite there.  In the end, I had to add some artificial padding to the tick label. I ended up adding “8” to the plot with the largest scale and this pushed everything far enough right to get it pretty close.  The final results is shown in the first graph of the post.

Lesson: A better way to do this is to load the data in column form and make each publisher a type (in a single column) instead of making each publisher its own column. R likes tall data frames. Then a ggplot-generated grid of plots would be aligned perfectly.

Here’s the code:

#!/usr/bin/env Rscript
library(ggplot2)
library(gridExtra)
library(scales)

# data in columns, joined on date field
x$date = as.POSIXct(x$time)
summary(x)

# Make some times series plots
p1 <- ggplot(data=x) +
geom_line(aes(date, twcount), color="red") +
p3 <- ggplot(data=x) +
geom_line(aes(date, wpcccount + wpoccount), color="red") +
p4 <- ggplot(data=x) +
geom_line(aes(date, ngcount), color="red") +
opts(title="Newsgator")+ xlab("Time (UTC)")

png(filename = "./timelines_noalign.png", width = 550, height = 750, units = 'px', res = 100)
print(
grid.arrange(p1, p3, p4 ,ncol = 1,
main=textGrob("Earthquake - Mexico - 20 March 2012", gp=gpar(fontsize=14, lineheight=18)) )
)

# Fix it by manually formatting the y-tick labels
lalb <- function(x) {
lx <- sprintf('%04d', x) # Format the strings as HH:MM:SS
lx <- gsub('^000', ' ', lx) # Remove leading 00: if present
lx <- gsub('^00', ' ', lx) # Remove leading 00: if present
lx <- gsub('^0', ' ', lx) # Remove leading 0 if present
}

# Unfortunately, do to proportional spaced fonts?, spaces dont' quite do ti. Add a manual adjustment.
p1 <- ggplot(data=x) +
geom_line(aes(date, twcount), color="red") +
opts(title="Twitter", axis.title.x = theme_blank(), axis.title.y = theme_text(hjust=8)) +
ylab("") + scale_y_continuous(labels = lalb)
p3 <- ggplot(data=x) +
geom_line(aes(date, wpcccount + wpoccount), color="red") +
opts(title="Wordpress Comments", axis.title.x = theme_blank()) +
ylab("") + scale_y_continuous(labels = lalb)
p4 <- ggplot(data=x) +
geom_line(aes(date, ngcount), color="red") +
opts(title="Newsgator")+ xlab("Time (UTC)") +
ylab("") + scale_y_continuous(labels = lalb)

png(filename = "./timelines_align.png", width = 550, height = 750, units = 'px', res = 100)
print(
grid.arrange(p1, p3, p4 ,ncol = 1,
main=textGrob("Earthquake - Mexico - 20 March 2012", gp=gpar(fontsize=14, lineheight=18)) )
)
dev.off()


Data and code on github.

tags: ,

Social media stories evolve according to a particular pattern when surprise events are experience by many social media users at once.  An example from Twitter was the earthquake in Oaxaca, MX on 20 March 2012.  The Twitter volume for terms “quake” and “terremoto” are shown in the figure.

The Social Media Pulse for simultaneously observed events can be modeled as a Stationary Poisson Point Process.  The model yields a Gamma distribution,

$f_S(t)=\frac{\beta^{-\alpha}(t-t_0)^{\alpha-1}\exp(\frac{-(t-t_0)}{\beta}) }{\Gamma(\alpha)}$

and the rate,

$r(t)=N_{activities}f_S(t)$

(Another, slightly more tractable model analytically for the The Social Media Pulse is to fit with a double exponential, $r(t) = r_0(1-\exp(-\alpha (t - t_0))\exp(-\beta (t-t_0))$)

Doing this enables consistent comparison of a story across social media publishers as well as comparisons of various stories.

A detailed writeup of the pulse parameters and fitting is available at Social Media Pulse and additional discussion of social media firehoses can be found in my Gnip blog series Taming the Social Media Firehose.

Of the fast and loose practices in machine learning that I cringe at, throwing one’s favorite tool at high-dimensional data and expecting algorithms to learn to generalize well.

Part of the problem is we don’t always know if the features we chose are meaningful to the learning problem. Another part of the problem is that a useful regression requires that our sample fills the space adequately to regress.  (An n-sphere has a volume that scales as $V \approx R^n$.  This means that if you feel comfortable with 10 samples for a 1-d linear regression, you should have in your mind a million samples for a 6-d regression.)

One strategy for dealing with high-dimensionality is to rotate and scale the data set along axes of high sample variation. Two closely related mathematical tools are available to help with this: Singular Value Decomposition (SVD) and Principal Component Analysis). I won’t add anything to the mathematics described in these articles.  The purpose here is

1. Explore some questions with a simple example
2. Explore the link between SVD and PCA with some simple data to understand how it works, and build a little bit of intuition on what linear transformation can do for machine learning.
3. Demonstrate how this strategy could work to reduce the dimentionality of a problem (This really turns out to be reducing the dimentionality of the representation of the problem.)
4. Show how this strategy can easily go wrong in order to build intuition about when and how this might work for a real machine learning problem.
There is a link to the all of the code at the bottom.  To get this output, install IPython, the matplotlib, scipy, numpy, scikits.learn packages, then paste code into a session.  Or just edit the script to include the output you want and run from the command line.
Here we go…first SVD.

In [23]: ########################
In [24]: # Demo Part 1
In [25]: ########################
In [26]: # SVD: decompose a matrix X into product of unitary matrix (U), diagonal
In [27]: # matrix (S) and vector V, a rotation, (if it helps, think:
In [28]: # basis vectors, scaling, rotation) such that X.T = W x S x V_t.
In [29]: #
In [30]: # In machine learning, X will be a matrix of samples (4, rows)
In [31]: # and features (3, columns) for our learning examples.
In [32]: X_t = mat([ [ 1, 3, -10 ],
....: [ 0, 2, 1 ],
....: [ -1, -3, 9 ],
....: [ 0, -2, 0 ] ])
In [33]: X = X_t.T
In [34]: # The scipy library makes this step easy
In [35]: W, s, V_t = linalg.svd( X )
In [36]: S = linalg.diagsvd(s, len(X), len(V_t))
In [37]: recon = dot( dot( W, S), V_t)
In [38]: # Are these equal (to within rounding)?
In [39]: abs(X - recon)
Out[39]:
matrix([[ 4.44089210e-16, 1.73472348e-16, 6.66133815e-16,
1.59594560e-16],
[ 8.88178420e-16, 2.22044605e-16, 4.44089210e-16,
1.33226763e-15],
[ 1.77635684e-15, 4.44089210e-16, 5.32907052e-15,
6.73940070e-16]])
In [40]: # maximum error
In [41]: np.max(abs(X - recon))
Out[41]: 5.3290705182007514e-15
In [42]: # One key to understanding the link to PCA is
In [43]: # understanding the diagonal matrix, S
In [44]: S
Out[44]:
array([[ 14.19266482, 0. , 0. , 0. ],
[ 0. , 2.9255743 , 0. , 0. ],
[ 0. , 0. , 0.09633518, 0. ]])
In [45]: # Given that the features have zero-mean:
In [46]: [np.mean(i) for i in X]
Out[46]: [0.0, 0.0, 0.0]
In [47]: # s is an ordered vector that tells us the "significance" of each dimension
In [48]: # in the rotated space. (Yes, I arranged it that way. To be clear,
In [49]: # you can do SVD on a matrix without zero-mean features, but the dimension
In [50]: # reduction part we are about to do requires it.)
In [52]: # We can selectively set lower numbers to
In [53]: # zero to reduce dimension.
In [54]: s_red = s
In [55]: s_red[2] = 0
In [56]: # Approximately reconstruct our original matrix, but with
In [57]: # a reduced-dimension representation
In [58]: S_red = linalg.diagsvd(s_red, len(X), len(V_t))
In [59]: S_red
Out[59]:
array([[ 14.19266482, 0. , 0. , 0. ],
[ 0. , 2.9255743 , 0. , 0. ],
[ 0. , 0. , 0. , 0. ]])
In [60]: recon_red = dot( dot( W, S_red), V_t)
In [61]: abs(X - recon_red)
Out[61]:
matrix([[ 0.04297068, 0.04061026, 0.05215936, 0.05451977],
[ 0.00118308, 0.00111809, 0.00143606, 0.00150105],
[ 0.00412864, 0.00390185, 0.00501149, 0.00523828]])
In [62]: # maximum error
In [63]: np.max(abs(X - recon_red))
Out[63]: 0.054519772460470559
In [64]: # ratio of errors
In [65]: np.max(abs(X - recon))/np.max(abs(X - recon_red))
Out[65]: 9.7745648554652029e-14
In [66]:
In [66]: # We "lost" 14 orders of magnitude in precision of the reconstruction, but this turns
In [67]: # out to be okay for some machine learning problems.


So that is a very simple SVD example.  Now use the same representation for a simple machine learning problem.   Classify to groups of data in a two dimensional space using Logistic Regression.   (Logistic regression is very good a solving problems just like this without SVD or PCA, so this is merely to get at how it all works together.)

 In [68]: ########################
In [69]: # Demo Part 2
In [70]: ########################
In [71]: from scikits.learn import linear_model
In [72]: import scipy
In [73]: from copy import copy
In [74]: # Classification problem where points are linearly classifiable
In [75]: # in 2 dim.
In [76]: N=200
In [77]: x = np.random.normal(0,4,N)
In [78]: y1 = 3*x + np.random.normal(3,1,N)
In [79]: y2 = 3*x + np.random.normal(-3,1,N)
In [80]: y_avg = np.mean(np.append(y1, y2, 1))
In [81]: figure()
In [82]: plot(x,y1 - y_avg, 'bo');
In [83]: plot(x,y2 - y_avg, 'ro');
In [84]: title("Original data sets (0-average)")


Boundary between groups at $y(x) \approx 3x$

In [85]: # features x and y are rows in this matrix
In [86]: X = np.append([x,y1 - y_avg],[x,y2 - y_avg],1)
In [87]: X_t = X.T
In [88]: # y1 group 0; y2 group 1
In [89]: truth = np.append(scipy.ones([1, N]), scipy.zeros([1, N]), 1)
In [90]: # 2d model works very well
In [91]: lr = linear_model.LogisticRegression()
In [92]: lr.fit(np.asarray(X_t),truth[0])
Out[92]:
LogisticRegression(C=1.0, intercept_scaling=1, dual=False, fit_intercept=True,
penalty='l2', tol=0.0001)
In [93]: lr.score(np.asarray(X_t),truth[0])
Out[93]: 1.0
In [94]: lr.predict(np.asarray(X_t))
Out[94]:
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)
In [95]: # Now try dimension reduciton with SVD
In [96]: W, s, V_t = linalg.svd( X )
In [97]: s
Out[97]: array([ 273.98689602, 19.47579928])
In [98]: # both transformed representations are the same before trimming:
In [99]: S = linalg.diagsvd(s, len(X), len(V_t))
In [100]: np.max(abs(X.T*matrix(W) - matrix(V_t).T*matrix(S).T))
Out[100]: 6.7501559897209518e-14
In [101]: # Now work with the transformed coordinates. It might not have been clear
In [102]: # from above what the transformed coordinate system was. We can get there
In [103]: # by either the product of the first two or last two terms.
In [104]: X_prime = matrix(V_t).T*matrix(S).T
In [105]: x_prime = np.asarray(X_prime.T[0])
In [106]: y_prime = np.asarray(X_prime.T[1])
In [107]: figure()
In [108]: plot(x_prime, y_prime, 'go');
In [109]: title("Features after SVD Transformation")
Out[109]: <matplotlib.text.Text at 0x11b956890>


Boundary between groups at $y(x) \approx 0$

In [110]: # Linearly classifiable in 1-d? Try all new basis directions (extremes of variation)
In [111]: # Min variation - Training along y-dim nearly perfect
In [112]: ypt = np.asarray(y_prime.T)
In [113]: lr.fit(ypt, truth[0])
Out[113]:
LogisticRegression(C=1.0, intercept_scaling=1, dual=False, fit_intercept=True,
penalty='l2', tol=0.0001)
In [114]: lr.score(ypt, truth[0])
Out[114]: 0.99750000000000005
In [115]: lr.predict(ypt)
Out[115]:
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)
In [116]: # Max variation - Nothing here
In [117]: xpt = np.asarray(x_prime.T)
In [118]: lr.fit(xpt, truth[0])
Out[118]:
LogisticRegression(C=1.0, intercept_scaling=1, dual=False, fit_intercept=True,
penalty='l2', tol=0.0001)
In [119]: lr.score(xpt, truth[0])
Out[119]: 0.58250000000000002
In [120]: lr.predict(xpt)
Out[120]:
array([1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0,
1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0,
0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0,
0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1,
0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1,
0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1,
1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1,
1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0,
1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1,
1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1,
0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1,
0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0,
0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0,
0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0,
1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0,
0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 1,
1, 0, 0, 0, 1, 1, 0, 1, 0], dtype=int32)


Notice that the transformation made the problem “easier” in that it was solved with 1-d instead of 2-d machine learning and that the two groups appear even more separated after the transformation than before.

Lesson 1: Look at all of the dimensions–in this case the smallest variation axis rather than the largest variation axis solves the problem.  This is going to catch anyone who blindly applies PCA for machine learning.  See Part 3.


In [121]: ########################
In [122]: # Demo Part 3
In [123]: ########################
In [124]: # Use PCA idea to reduce to 1-D
In [125]: s_red = copy(s)
In [126]: s_red[1] = 0
In [127]: S_red = linalg.diagsvd(s_red, len(X), len(V_t))
In [128]: X_prime = matrix(V_t).T*matrix(S_red).T
In [129]: x_prime = np.asarray(X_prime.T[0])
In [130]: y_prime = np.asarray(X_prime.T[1])
In [131]: figure()
Out[131]: <matplotlib.figure.Figure at 0x1193e7450>
In [132]: plot(x_prime, y_prime, 'yo');
In [133]: title("Reduce S by removing s[1] = %2.5f"%s[1])


All original group information lost.

In [134]: # Try all new basis directions (not just greatest variations)
In [135]: # 1-D: Max variation - Training along x-dim performs poorly
In [136]: ypt = np.asarray(y_prime.T)
In [137]: lr.fit(ypt, truth[0])
Out[137]:
LogisticRegression(C=1.0, intercept_scaling=1, dual=False, fit_intercept=True,
penalty='l2', tol=0.0001)
In [138]: lr.score(ypt, truth[0])
Out[138]: 0.5
In [139]: lr.predict(ypt)
Out[139]:
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)
In [140]: # This is the other extrema to "principal" components
In [141]: s_red = copy(s)
In [142]: s_red[0] = 0
In [143]: S_red = linalg.diagsvd(s_red, len(X), len(V_t))
In [144]: X_prime = matrix(V_t).T*matrix(S_red).T
In [145]: x_prime = np.asarray(X_prime.T[0])
In [146]: y_prime = np.asarray(X_prime.T[1])
In [147]: figure()
Out[147]: <matplotlib.figure.Figure at 0x11c74f6d0>
In [148]: plot(x_prime, y_prime, 'mo');
In [149]: title("Reduce S by removing value s[0] = %2.5f"%s[0])


Trivial classification, now in 1-d.

In [150]: # 1-D: Min variation - Training along y-dim nearly perfect
In [151]: ypt = np.asarray(y_prime.T)
In [152]: lr.fit(ypt, truth[0])
Out[152]:
LogisticRegression(C=1.0, intercept_scaling=1, dual=False, fit_intercept=True,
penalty='l2', tol=0.0001)
In [153]: lr.score(ypt, truth[0])
Out[153]: 0.99750000000000005
In [154]: lr.predict(ypt)
Out[154]:
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)


So our 1-d model performs great. Quoting from Wikipedia,

“PCA essentially rotates the set of points around their mean in order to align with the principal components. This moves as much of the variance as possible (using an orthogonal transformation) into the first few dimensions. The values in the remaining dimensions, therefore, tend to be small and may be dropped with minimal loss of information.”

The problem with following this without skepticism is that, according to the guidelines of PCA, our problem depends on using the “wrong” dimension.  (The properties are not quite as arbitrary as they may seem. For example, race data for long distances that includes both men and women have this quality because the variation within gender of total race times can be greater than the variation between gender times.)

Code on Github