Nice. I guess this must have existed in some form for GPUs before this, and that this is just better with the latest software stack?
This could be generally useful for a lot of algorithms like kmeans(ish) like Cut algorithm that projects points into eigen space — as opposed to the kernel space in Kmeans — need the affinity matrix(distance for K points).
By ghm2199 2 hours ago
Do they mean deterministic k-means, k-means++ ... ? Global optimal k-means is NP-Hard, so linear speedups aren't terribly helpful. It's nice, until you add more input. Standard k-means would be nice, or the k-means++ seed algorithm.
By leecarraher 10 hours ago
Kmeans++ is just a seed, this is the inner loop.
Also analogous to flash attention, a linear speedup in big O sense based on the typical algorithmoc complexity computing model can be a polynomial speedup in measured wall clock time due to memory hierarchy differences.
Still small compared to exponential differences, but for an NP-Hard problem, a linear 100x speedup is the difference between practically computable vs. not. There are a ton of things I'd wait 2 hours for that I wouldn't wait a week for.
By jmalicki 10 hours ago
The abstract suggests they're proposing speed-up techniques for the assignment and centroid update stages of the classic k-means algorithm. Which would therefore also apply to k-means++.
By n4r9 6 hours ago
Does this have corresponding speed ups or memory gains for normal CPUs too? Just thinking about all the cups of coffee that have been made and drunk while scikit-learn kmeans chugs through a notebook :)
By wood_spirit 14 hours ago
For CPU with bigger K you would put the centroids in a search tree, so take advantage of the sparsity, while a GPU would calculate the full NxK distance matrix. So from my understanding the bottleneck they are fixing doesn't show up on CPU.
By snovv_crash 14 hours ago
search trees tend not to scale well to higher dimensions though, right?
from what I've seen I had the impression that Yinyang k-means was the best way to take advantage of the sparsity.
By xavxav 13 hours ago
Most data I've used is for geospatial with D<=4 (xyzt) so for me search trees worked great. But for things like descriptor or embedding clustering yes, trees wouldn't be useful.
By snovv_crash 8 hours ago
Nice one. K-Means is one of those neat little powertools that once you get the hang of it you find more and more applications for, but it can be a bit slow for larger data sets. So this is very nice to have, thank you matt_d for posting.
By jacquesm 10 hours ago
Exact K-Means but actually practical—faster, leaner, and finally scalable without approximation trade-offs.
By QubridAI 3 hours ago
looks like flash attention concepts applied to kmeans, nice speedup results
> Abstract: [...] Flash-kmeans introduces two core kernel-level innovations: (1) FlashAssign, which fuses distance computation with an online argmin to completely bypass intermediate memory materialization;
> (2) sort-inverse update, which explicitly constructs an inverse mapping to transform high-contention atomic scatters into high-bandwidth, segment-level localized reductions.
> Furthermore, we integrate algorithm-system co-designs, including chunked-stream overlap and cache-aware compile heuristics, to ensure practical deployability.
> [...] flash-kmeans achieves up to 17.9X end-to-end speedup over best baselines, while outperforming industry-standard libraries like cuML and FAISS by 33X and over 200X, respectively.
By frakt0x90 12 hours ago
By smusamashah 8 hours ago
By ghm2199 2 hours ago
By leecarraher 10 hours ago
By jmalicki 10 hours ago
By n4r9 6 hours ago
By wood_spirit 14 hours ago
By snovv_crash 14 hours ago
By xavxav 13 hours ago
By snovv_crash 8 hours ago
By jacquesm 10 hours ago
By QubridAI 3 hours ago
By matrix2596 14 hours ago
By westurner 6 hours ago