- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Can oneDal developer clarify the kmeans++ initialization algorithm. In document, it says, the first centroid is selected randomly. But when I run the algorithm repeatedly and the centroids seem to remain the same every time. I wonder if it is because of the random number generator inside the function is seeded the same way every time?

This is the link to the function and I used *plus_plus_dense* option.

Link Copied

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Hi,

Thanks for posting in Intel Communities. We are reproducing the issue from our end. Meanwhile, could you please let us know the oneAPI version & your system information so that we can debug your issue further?

Regards,

Alekhya

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

I have oneDAL version 2023.2.0, running on Visual Studio 2022/Window 10. Thanks

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Also, as a suggestion, it will be really useful to have an option to allow user to specify only the FIRST centroid and then let the plus_plus algorithm to calculate the rest.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Hi,

We apologize for the delay. We have contacted the admin team regarding this issue and got an update that centroid initialization is it not always random. Please refer the below link:

It is based on the function and not on the randomness. Kindly change the dataset to 750 rather than 1000 rows which will help you to debug, run multiple times and check if the centroid changes or not.

If not then it is the function which is causing the issue not the randomness, kindly debug your code and check where is the issue.

If this resolves your issue, make sure to accept this as solution. This helps others with similar issues.

Regards,

Alekhya

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Hi,

Has the solution provided helped? Could you please give us an update regarding this issue?

Regards,

Alekhya

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Many thanks for the rely. Could you elaborate on point 1 referred in the link? What exactly it means by "equal probability", does it mean every time, the same first center will be chosen? Thanks

K-Means++ algorithm [Arthur2007], which selects centers with the probability proportional to their contribution to the overall error ??) according to the following scheme:

Chooses one of the feature vectors ?? from ? with equal probability.

Excludes ?? from ? and adds it to the current set of centers ?.

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Hi,

Thanks for getting back to us. Let me elaborate this to you.

As per your first query regarding equal probability, **equal probability** means that each feature vector in the dataset has the same chance of being selected during the random choice process. Regardless of the characteristics of the vectors or the frequency of the vector x being chosen from dataset X, there is no bias towards any particular vector; they all have an identical probability of being chosen. As there's always a single dataset & every time first k feature vectors are considered and first vector x is randomly chosen from dataset X, you might be getting same centroid values everytime you run the sample. You could try changing some values in the dataset to see a different output or pass a different dataset.

Regarding the second part of your query, please find the response inline:

K-Means++ algorithm [Arthur2007], which selects centers with the probability proportional to their contribution to the overall error ??) according to the following scheme:

- Chooses one of the feature vectors xi from dataset X with equal probability -
**The K-Means++ algorithm starts by selecting the first center. This initial center is chosen from the set of feature vectors (data points) X with equal probability. Essentially, any of the feature vectors has an equal chance of being chosen as the first center.** - Excludes xi from X and adds it to the current set of centers C -
**For each remaining feature vector xi in the dataset X, the algorithm calculates its minimal distance d(xi, C) from the current set of centers C. This distance represents how close or far the feature vector xi is from the existing cluster centers.**

After these steps, below steps would be followed:

- For each remaining feature vector xi in the dataset X, the algorithm calculates its minimal distance d(xi, C) from the current set of centers C. This distance represents how close or far the feature vector xi is from the existing cluster centers.
- The algorithm selects the next center from the remaining feature vectors with a probability that is proportional to the square of the distance between the feature vector xi and its nearest center in C.
- The algorithm continues steps 2 to 4 until the desired number of centers is reached. Try changing the dataset or few values in the dataset so that you would get different values everytime you run.
**And, Centroid initialization might not always be random.**

If this helps with your doubts, make sure to accept this as a solution. This helps others with similar issue. Thank you!

Regards,

Alekhya

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Many thanks for the explanation. I appreciate that the first center is chosen randomly with a uniform probability. I was a bit surprised to see that if I run the algorithm repeatedly with the same training data, the first center remains exactly the same every time, and subsequently the rest of the centers. This makes me wonder the true randomness of the first center. I guess internally, it is a pseudo random number generator that leads to repeatability of the centers?

Anyhow, it makes sense if we would like consistency every time. Please confirm if my understanding is correct.

Most importantly, it will really help to allow user to specify the "first" center or some initial centers. In many applications, some obvious initial guess can be calculated rather than randomly chosen.

Kind regards

John

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Hi,

Apologies for the delay. We went through the code & couldn't find any pseudo random generator setting a seed by default. This is an expected behaviour. Try changing some values in dataset for different results. You could also change cluster count & number of iterations for different results.

And regarding your second question about users setting the centroid value, thanks for your suggestion. As of now, kmeans sample in oneDAL doesn't allow users to choose initial centers or one specific center.

As we have addressed all your doubts, Can we go ahead and close this case? And, please accept this as a solution so that it helps others with similar queries. Thank you!

Regards,

Alekhya

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content

Hi,

We assume that your issue is resolved. If you need any further information, please post a new question as this thread will no longer be monitored by Intel.

Regards,

Alekhya

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page