Intel® oneAPI Data Analytics Library
Learn from community members on how to build compute-intensive applications that run efficiently on Intel® architecture.

Kmeans++ initialisation

JohnZhang
New Contributor I
1,005 Views

Can oneDal developer clarify the kmeans++ initialization algorithm. In document, it says, the first centroid is selected randomly. But when I run the algorithm repeatedly and the centroids seem to remain the same every time. I wonder if it is because of the random number generator inside the function is seeded the same way every time?

 

This is the link to the function and I used plus_plus_dense option.

https://oneapi-src.github.io/oneDAL/onedal/algorithms/clustering/kmeans-init.html#kmeans-init-c-math-plus-plus-dense

0 Kudos
10 Replies
AlekhyaV_Intel
Moderator
963 Views

Hi,


Thanks for posting in Intel Communities. We are reproducing the issue from our end. Meanwhile, could you please let us know the oneAPI version & your system information so that we can debug your issue further?


Regards,

Alekhya


0 Kudos
JohnZhang
New Contributor I
947 Views

I have oneDAL version 2023.2.0, running on Visual Studio 2022/Window 10. Thanks

0 Kudos
JohnZhang
New Contributor I
947 Views

Also, as a suggestion, it will be really useful to have an option to allow user to specify only the FIRST centroid and then let the plus_plus algorithm to calculate the rest.

0 Kudos
AlekhyaV_Intel
Moderator
793 Views

Hi,

 

We apologize for the delay. We have contacted the admin team regarding this issue and got an update that centroid initialization is it not always random. Please refer the below link:

https://oneapi-src.github.io/oneDAL/daal/algorithms/kmeans/k-means-clustering.html#centroid-initialization

It is based on the function and not on the randomness. Kindly change the dataset to 750 rather than 1000 rows which will help you to debug, run multiple times and check if the centroid changes or not.

If not then it is the function which is causing the issue not the randomness, kindly debug your code and check where is the issue.

 

If this resolves your issue, make sure to accept this as solution. This helps others with similar issues.

 

Regards,

Alekhya

 

0 Kudos
AlekhyaV_Intel
Moderator
760 Views

Hi,


Has the solution provided helped? Could you please give us an update regarding this issue?


Regards,

Alekhya


0 Kudos
JohnZhang
New Contributor I
756 Views

Many thanks for the rely. Could you elaborate on point 1 referred in the link? What exactly it means by "equal probability", does it mean every time, the same first center will be chosen? Thanks

 

K-Means++ algorithm [Arthur2007], which selects centers with the probability proportional to their contribution to the overall error ??) according to the following scheme:

  1. Chooses one of the feature vectors ?? from ? with equal probability.

  2. Excludes ?? from ? and adds it to the current set of centers ?.

0 Kudos
AlekhyaV_Intel
Moderator
693 Views

Hi,

 

Thanks for getting back to us. Let me elaborate this to you.

 

As per your first query regarding equal probability, equal probability means that each feature vector in the dataset has the same chance of being selected during the random choice process. Regardless of the characteristics of the vectors or the frequency of the vector x being chosen from dataset X, there is no bias towards any particular vector; they all have an identical probability of being chosen. As there's always a single dataset & every time first k feature vectors are considered and first vector x is randomly chosen from dataset X, you might be getting same centroid values everytime you run the sample. You could try changing some values in the dataset to see a different output or pass a different dataset.

 

Regarding the second part of your query, please find the response inline:

 

K-Means++ algorithm [Arthur2007], which selects centers with the probability proportional to their contribution to the overall error ??) according to the following scheme:

  1. Chooses one of the feature vectors xi from dataset X with equal probability - The K-Means++ algorithm starts by selecting the first center. This initial center is chosen from the set of feature vectors (data points) X with equal probability. Essentially, any of the feature vectors has an equal chance of being chosen as the first center.
  2. Excludes xi from X and adds it to the current set of centers C - For each remaining feature vector xi in the dataset X, the algorithm calculates its minimal distance d(xi, C) from the current set of centers C. This distance represents how close or far the feature vector xi is from the existing cluster centers.

 

After these steps, below steps would be followed:

  • For each remaining feature vector xi in the dataset X, the algorithm calculates its minimal distance d(xi, C) from the current set of centers C. This distance represents how close or far the feature vector xi is from the existing cluster centers.
  • The algorithm selects the next center from the remaining feature vectors with a probability that is proportional to the square of the distance between the feature vector xi and its nearest center in C.
  • The algorithm continues steps 2 to 4 until the desired number of centers is reached. Try changing the dataset or few values in the dataset so that you would get different values everytime you run. And, Centroid initialization might not always be random.

 

If this helps with your doubts, make sure to accept this as a solution. This helps others with similar issue. Thank you!

 

Regards,

Alekhya

 

0 Kudos
JohnZhang
New Contributor I
650 Views

Many thanks for the explanation. I appreciate that the first center is chosen randomly with a uniform probability. I was a bit surprised to see that if I run the algorithm repeatedly with the same training data, the first center remains exactly the same every time, and subsequently the rest of the centers. This makes me wonder the true randomness of the first center. I guess internally, it is a pseudo random number generator that leads to repeatability of the centers?

Anyhow, it makes sense if we would like consistency every time. Please confirm if my understanding is correct.

Most importantly, it will really help to allow user to specify the "first" center or some initial centers. In many applications, some obvious initial guess can be calculated rather than randomly chosen.

 

Kind regards

John

0 Kudos
AlekhyaV_Intel
Moderator
540 Views

Hi,

 

Apologies for the delay. We went through the code & couldn't find any pseudo random generator setting a seed by default. This is an expected behaviour. Try changing some values in dataset for different results. You could also change cluster count & number of iterations for different results.

 

And regarding your second question about users setting the centroid value, thanks for your suggestion. As of now, kmeans sample in oneDAL doesn't allow users to choose initial centers or one specific center.

 

As we have addressed all your doubts, Can we go ahead and close this case? And, please accept this as a solution so that it helps others with similar queries. Thank you!

 

Regards,

Alekhya

 

0 Kudos
AlekhyaV_Intel
Moderator
391 Views

Hi,


We assume that your issue is resolved. If you need any further information, please post a new question as this thread will no longer be monitored by Intel.


Regards,

Alekhya


0 Kudos
Reply