KMeans in DAAL doesn't follow a "model training" --> "prediction" usage model. There isn't an opaque "model" object in the KMeans result. That said, you can mimic a model object by extracting the centroids from the result and serializing the centroids numeric table (together with other information such as the number of clusters). You can then use these as input when clustering your new data.
OK, but if I initialized the kmeans with the previously calculated centroids (as well as the number of clusters) and recomputed, then the centroids would probably have moved. Is there a way to call compute on the kmeans algorithm without it trying to recalculate the centroids? Would iterations=0 do this? I guess it's essentially nearest neighbour on the centroids?
Surely there has to be a way. For this particular application I may well have several billion rows, so I have to segment on a sample, but I still have to come out with a segmentation for all of them.
Starting with the upcoming release, we've updated the algorithm logic a bit, improved the documentation, and added specific examples for your case. The only thing you'll need to do is set iterations=0.
The simplest way to do it with Intel DAAL 2016 is:
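kmeans::Distributed<step1Local> alg(nClusters, true);  // second argument asks for cluster assignments
// ... set your data and the saved centroids as input, then run the algorithm ...
assignments = alg.getResult()->get(kmeans::assignments);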
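The key here is to use the distributed version of the algorithm even if all the data is local. The centroids will not move in this case, because the local step only assigns points to the centroids you pass in and accumulates partial results; the centroid update belongs to the master step, which you never run.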
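For reference, here is a minimal sketch in plain C++ of what an assignment-only pass computes: each row gets the index of its nearest centroid by squared Euclidean distance, and the centroids are never modified. This is not DAAL code. The function name assignToCentroids and the nested std::vector layout are assumptions made for illustration only, and DAAL's own implementation will differ.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical helper (not part of DAAL): returns, for every row, the index of
// the nearest centroid by squared Euclidean distance. Centroids are read-only.
// Assumes every row and every centroid has the same number of features.
std::vector<int> assignToCentroids(const std::vector<std::vector<double>>& rows,
                                   const std::vector<std::vector<double>>& centroids) {
    std::vector<int> assignments;
    assignments.reserve(rows.size());
    for (const auto& row : rows) {
        int best = -1;  // stays -1 only if there are no centroids
        double bestDist = std::numeric_limits<double>::infinity();
        for (std::size_t c = 0; c < centroids.size(); ++c) {
            double dist = 0.0;
            for (std::size_t j = 0; j < row.size(); ++j) {
                const double d = row[j] - centroids[c][j];
                dist += d * d;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = static_cast<int>(c);
            }
        }
        assignments.push_back(best);
    }
    return assignments;
}
```

You would fit the centroids on your sample once, keep them, and then run something like this over the full data set in chunks. The result for each row depends only on that row and the fixed centroids, so the chunks can be processed independently.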