A recently published paper explains a Bayesian approach to clustering. Revisiting k-means: New Algorithms via Bayesian Nonparametrics motivates and explores the idea of using a scale parameter $\lambda$ to control the creation of new clusters during clustering rather than requiring the Data Scientist to set the number of clusters, $k$, as in k-means.  John Myles White coded this in R and shows some example clusterings with varying $\lambda$, but doesn’t dig into quantitiative comparisons. (BTW, subscribe to his blog.)

After looking at John’s plots, you may ask if there is any better motivation for choosing a scale parameter than the number of clusters–both seem ad hoc and to require experienced judgement to get the “best” result.  (I get fidgety when people say they just used k-means and it worked great–k-means always gives an answer, so “success” in the simple sense doesn’t mean much.)

In what ways could dp-means be an improvement over k-means?

• Parameter choice.  The scale of the upper and lower bounds can be calculated from the data.  In general, we can bound $\lambda$ at the high end with the range of the sample data and at the lower end, some measure of the nearness of the nearest data points, or possible, the smallest expected cluster size.
• Time cost to minimize error.  K-means time cost is approximately linear in k, and at first glace the time-scaling of dp-means with number of clusters (not with $\lambda$) appears to scale linearly as well (with dp-means, smaller $\lambda$ corresponds to more clusters), but it is not clear that this is better or worse than k-means in practical cases.

There are no proofs here, just numerical exploration and intuition building.  I coded dp-means in Python in a way that let me leave as much common code between k-means and dp-means and get lots of diagnostics.  These implementations aren’t optimized.  The code and examples shown here are available on github.

First, a 2-d version with three nicely separated clusters.  Here’s the original data,

A couple of the clusters are spread along one dimension to make it a little more interesting. Here’s an example of training dp-means:

And the mean-squared error per sample point:

In my versions of by k-means and dp-means, the algorithm stops when the change of error between iterations drops below a pre-defined tolerance. Download the code and play with training dp-means with varying scale parameter to see how decreasing  $\lambda$ increases the number of clusters.

Ok, let’s get on with the comparison…in this example, we have 3 features and 5 overlapping clusters.

The error vs. parameter plots for dp-means and k-means are shown below.

The parameter for k-means is the number of clusters while parameter I am plotting for dp-means is $1 /\lambda$ (approximately the reciprocal of the minimum cluster size), so they cannot really be compared. However, in this graph it is easy to see where the number of clusters in dp-means match those in k-means.

For dp-means, there are no changes in the error if we set the parameter $\lambda > \sqrt{3}(x_{max}-x_{min})$ because no cluster will be larger than a cube containing data.  Both k-means and dp-means continue to improve with additional clusters until each sample point is at the center of its own cluster.

The classic “elbow” at  $k = 3$  indicates separated clusters found by k-means.  Around $1 / \lambda \approx 3$ there is a “lap” (continuing with the body metaphors) for dp-means.  Is this a reliable heuristic for training dp-means?  It results in 11 clusters.

The dp-means algorithm achieves 4 clusters (fairly consistently) around  $\lambda \approx 1.7$.

With respect to parameter, dp-means may come out slightly ahead, but it might be something of a matter of taste.

How about the time cost of lower error? I both algorithms we have to search the space for the parameter value and cluster centers that give the minimum error. This means running the algorithm repeatedly, where we achieve a range of local minima on each iteration and choose the lowest value as our best fit.  So one way to look at the time-cost of each algorithm is to compare the minimum error for a range of total search iterations.

The graphs below plot $\log(error)$ vs. $\log(time)$ for dp-means and k-means for 4, 10, and 20 search iterations.  Dp-means is slightly more efficient in each case, but only slightly.

Conclusion. I will continue to explore dp-means because of the parameter advantages, but the time advantages seem negligible.  What do you think?