Originally Posted By: sivan
how do you determine cluster middles? searching for local density maximums somehow?


I do particular nothing, that is an inherent quality of the general k-means algorithm, since the mean distance to the cluster center is minimized. What I do is that I just say "no more than x" datapoints per cluster and the algorithm creates an approximate solution for it. I create per dataset several cluster instances and do some analysis on the cluster statistics (especially the variance on the size) to decide which cluster-instances are good and which are not, which has a relative balancing effect on the overall quality. If datapoints result in clusters, that are below the lower bound, I mask all datapoints in clusters which are exceeding both bounds and apply the same algorithm on the remaining points and merge the results later on. If I cant even make 2 fitting clusters on the datapoints, I do some sort of brute force solution: sorting the points in terms of distance to the smaller cluster and assigning in ascending distance order until the point swap is even.

So, at the end, I don't know beforehand, how much clusters are "there"; I just have the boundary constraint of the cluster size and I am searching in a recursive fashion and changing focus on datapoints.

Last edited by HeelX; 02/01/12 13:53.