Hi,

I have been using the K-Means clustering from scikit-learn with Intel Python (update 3) and I noticed that it seems to ignore the option "n_init". From the scikit-learn documentation, "n_init" is the "Number of time the k-means algorithm will be run with different centroid seeds." I was wondering what the default value of "n_init" is (scikit-learn doc says 10).

To test this I ran the following:

    from sklearn.cluster import KMeans
    import numpy as np
    from sklearn.datasets import fetch_mldata
    import time

    data = fetch_mldata('MNIST original').data

    start = time.time()
    kmeans = KMeans(n_clusters=64,n_init=1).fit(data)
    print("n_init=1 : "+str(time.time()-start))

    start = time.time()
    kmeans = KMeans(n_clusters=64,n_init=10).fit(data)
    print("n_init=10 : "+str(time.time()-start))

    start = time.time()
    kmeans = KMeans(n_clusters=64,n_init=1000).fit(data)
    print("n_init=1000: "+str(time.time()-start))

This is the result I get (HW: Intel Xeon Phi processor 7210):

    n_init=1 : 6.0663318634
    n_init=10 : 4.0244550705
    n_init=1000: 4.10864305496

I am guessing the slower first run is due to some sort of initialization. But the other two, "n_init"=10 and "n_init"=1000, get the same performance. I know that K-Means can have different run-times based on how "lucky" the initial guess was, but I don't think "n_init"=10 and "n_init"=1000 would get the same performance even then.

I tried creating an "init" array from the original dataset, and the performance I got with that makes me think that the default is "n_init"=1. But I can't seem to figure out what it is exactly (partly because "verbose=1" does not work either).
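For reference, the "init" array test can be sketched roughly as below. This is a minimal sketch rather than the exact script: it assumes a small synthetic dataset as a stand-in for MNIST, and the names `init_centers` and `n_clusters` are only illustrative. With an explicit array of starting centroids and "n_init"=1, exactly one K-Means run is requested, so its time gives a baseline to compare the other settings against.

```python
import time

import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the MNIST data (assumption: any dense float array works here).
rng = np.random.RandomState(0)
data = rng.rand(2000, 16)

n_clusters = 8

# Pick distinct rows of the dataset as the starting centroids.
init_centers = data[rng.choice(len(data), n_clusters, replace=False)]

# An explicit init array plus n_init=1 requests a single K-Means run.
start = time.time()
kmeans = KMeans(n_clusters=n_clusters, init=init_centers, n_init=1).fit(data)
elapsed = time.time() - start

print("explicit init, n_init=1: " + str(elapsed))
print("inertia: " + str(kmeans.inertia_))
```

Comparing this time with the runs above only gives a rough indication, since the number of iterations still depends on the starting centroids.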

Thanks,

Ryo Asai

Hi, thank you for your interest in our optimizations for sklearn in Intel Distribution for Python.

I checked the issue you reported. You are right: we skipped this option by mistake, which means that DAAL's default value is used instead. According to the documentation, DAAL's default value is 5. This will be fixed as soon as possible in an update.

If you want to disable our optimizations and restore the original sklearn behaviour, you can do it with the following:

    from sklearn.daal4sklearn import dispatcher
    dispatcher.disable('KMeans')
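If the same script also has to run under a stock scikit-learn install, the call can be guarded. This is only a sketch: it assumes `sklearn.daal4sklearn` is missing outside Intel Distribution for Python, and the helper name `disable_daal_kmeans` is hypothetical.

```python
def disable_daal_kmeans():
    """Disable the Intel KMeans optimization if the dispatcher is available.

    Returns True when the dispatcher was found and called, False otherwise.
    """
    try:
        # Assumed to exist only in Intel Distribution for Python.
        from sklearn.daal4sklearn import dispatcher
    except ImportError:
        # Stock scikit-learn: nothing to disable.
        return False
    dispatcher.disable('KMeans')
    return True


disabled = disable_daal_kmeans()
print("Intel KMeans optimization disabled: " + str(disabled))
```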

Denis.

Hi Denis,

Thank you for the fast response!

- Ryo
