Intel® Distribution for Python*

scikit-learn K-Means "n_init" option not working

Ryo_A_
Beginner

Hi,

I have been using K-Means clustering from scikit-learn with Intel Python (update 3), and it seems to ignore the "n_init" option. According to the scikit-learn documentation, "n_init" is the "Number of time the k-means algorithm will be run with different centroid seeds." What value of "n_init" is actually being used? (The scikit-learn documentation says the default is 10.)

To test this I ran the following:

import time
from sklearn.cluster import KMeans
from sklearn.datasets import fetch_mldata

# Load MNIST as a (n_samples, n_features) array
data = fetch_mldata('MNIST original').data

# Time the same fit with three different n_init values
start = time.time()
kmeans = KMeans(n_clusters=64, n_init=1).fit(data)
print("n_init=1   : " + str(time.time() - start))

start = time.time()
kmeans = KMeans(n_clusters=64, n_init=10).fit(data)
print("n_init=10  : " + str(time.time() - start))

start = time.time()
kmeans = KMeans(n_clusters=64, n_init=1000).fit(data)
print("n_init=1000: " + str(time.time() - start))

This is the result I get (HW: Intel Xeon Phi processor 7210):

n_init=1   : 6.0663318634
n_init=10  : 4.0244550705
n_init=1000: 4.10864305496

I am guessing the slower first run is due to some sort of one-time initialization. But "n_init"=10 and "n_init"=1000 take about the same time. I know that K-Means run times can vary depending on how "lucky" the initial guess was, but even so I would not expect "n_init"=10 and "n_init"=1000 to perform the same.

I tried creating an "init" array from the original dataset, and the performance I got with that makes me think that the default is "n_init"=1. But I can't seem to figure out what it is exactly (partly because "verbose=1" does not work either).
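
As a possible workaround while "n_init" is being ignored, the restarts can be done by hand: fit once per seed and keep the run with the lowest inertia, which is roughly what "n_init" is documented to do. This is only a minimal sketch. The helper name best_of_n_runs and the synthetic data are made up for illustration, and it assumes each fit with n_init=1 really performs a single run, which holds for stock scikit-learn but may not hold for the optimized version.

```python
import numpy as np
from sklearn.cluster import KMeans


def best_of_n_runs(data, n_clusters, seeds):
    """Hypothetical helper: one single-run fit per seed, keep the best.

    Returns the fitted model with the lowest inertia and the list of
    inertias from every run, in the order of `seeds`.
    """
    best_model = None
    inertias = []
    for seed in seeds:
        # n_init=1 requests exactly one run per fit; the seed changes
        # the initial centroids from one run to the next.
        model = KMeans(n_clusters=n_clusters, n_init=1,
                       random_state=seed).fit(data)
        inertias.append(model.inertia_)
        if best_model is None or model.inertia_ < best_model.inertia_:
            best_model = model
    return best_model, inertias


# Small synthetic dataset (an assumption, used here instead of MNIST)
rng = np.random.RandomState(0)
data = rng.rand(500, 8)

best_model, inertias = best_of_n_runs(data, n_clusters=4, seeds=range(5))
print("inertia per run:", inertias)
print("best inertia   :", best_model.inertia_)
```

With this loop the run time should grow roughly linearly with the number of seeds, so it also gives a baseline for comparing against the timings above.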

Thanks,
Ryo Asai
 

Denis_N_Intel
Employee

Hi, thank you for your interest in our scikit-learn optimizations in the Intel Distribution for Python.
I checked the issue you reported. You are right: we skipped this option by mistake, so DAAL's default value is used instead. According to the documentation, DAAL's default value is 5. This will be fixed in an update as soon as possible.

If you want to disable our optimizations and restore the original scikit-learn behaviour, you can do so with the following:

from sklearn.daal4sklearn import dispatcher
dispatcher.disable('KMeans')
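
If the same script also has to run on a stock scikit-learn install, where the sklearn.daal4sklearn module is not expected to exist, the call above can be guarded. This is a minimal sketch; the wrapper name disable_intel_kmeans is made up for illustration, and it only relies on the import and dispatcher.disable('KMeans') call shown above.

```python
def disable_intel_kmeans():
    """Hypothetical wrapper: turn off the optimized KMeans if available.

    Returns True if the dispatcher was found and disable() was called,
    False if the module could not be imported (for example on a stock
    scikit-learn install).
    """
    try:
        from sklearn.daal4sklearn import dispatcher
    except ImportError:
        # Not the Intel Distribution (or this module is absent), so the
        # original scikit-learn KMeans is already in use.
        return False
    dispatcher.disable('KMeans')
    return True


if disable_intel_kmeans():
    print("Optimized KMeans disabled; using original scikit-learn code")
else:
    print("daal4sklearn dispatcher not found; nothing to disable")
```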

Denis.

Ryo_A_
Beginner

Hi Denis,

Thank you for the fast response!

- Ryo
