Intel® Distribution for Python*
Engage in discussions with community peers related to Python* applications and core computational packages.

parallel random forest (scikit-learn)

Steena_M_Intel
Employee
2,947 Views

Hello,

I am trying to evaluate the performance of a few machine learning classifiers using the recent beta version of Intel Distribution for Python. The classifier is the random forest algorithm from scikit-learn, and I am interested in training the model in parallel. So far, setting the number of parallel jobs via n_jobs does not seem to work: running top shows no activity on the other cores. Is there something else that needs to be set to enable actual parallel training with scikit-learn? Any pointers or advice on getting this working?

Thanks in advance,

Steena

#####Code snippet#####

from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np
import time

df = pd.read_csv('datasets/Sensorless_drive_diagnosis.txt', sep=" ", header=None)
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75 #Random split, roughly 75% train / 25% test
df['class'] = pd.Categorical(df[48]) #Column 48 is the class label

train, test = df[df['is_train']], df[~df['is_train']] #Separating train and test subsets
features = df.columns[:48] #Only X variables: columns 0-47, since column 48 is the label
start = time.time()
clf = RandomForestClassifier(n_jobs=3, verbose=3)
end = time.time()
print (end-start)
y, _ = pd.factorize(train['class'])
clf.fit(train[features], y) #Training the random forest
preds = clf.predict(test[features])
pd.crosstab(test['class'], preds, rownames=['actual'], colnames=['preds'])

2 Replies
Rohit_J_Intel
Employee

Hi Steena,

Thanks for trying our distribution! In the sample code you provided, the timer wraps the creation of the classifier rather than the call to fit, so it measures only the constructor and not the training. I suggest timing the fit instead:

start = time.time()
clf.fit(train[features], y) #Training the random forest
end = time.time()
print(end-start)

or, to include prediction in the measurement:

start = time.time()
clf.fit(train[features], y) #Training the random forest
preds = clf.predict(test[features])
end = time.time()
print(end-start)

Moreover, if the elapsed time still does not change when you vary the n_jobs parameter of RandomForestClassifier, it would really help our investigation if you could share "datasets/Sensorless_drive_diagnosis.txt".

Also, as per the scikit-learn documentation, setting n_jobs to -1 uses all available cores.
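
To check whether n_jobs has any effect, one option is to time only the fit call for a few n_jobs values and compare. The sketch below is a minimal illustration, not a benchmark: it uses a synthetic dataset from make_classification as a stand-in for the real file, and the helper name time_fit, the sample size, and the number of trees are assumptions chosen for the example.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def time_fit(n_jobs, X, y, n_estimators=50):
    """Return the wall-clock seconds spent inside fit() for a given n_jobs."""
    # Creating the classifier is cheap; it is deliberately outside the timed region.
    clf = RandomForestClassifier(
        n_estimators=n_estimators, n_jobs=n_jobs, random_state=0
    )
    start = time.perf_counter()
    clf.fit(X, y)  # the trees are built (in parallel, if n_jobs allows) here
    return time.perf_counter() - start


# Synthetic stand-in for the real dataset (an assumption for illustration only).
X, y = make_classification(
    n_samples=2000, n_features=48, n_informative=20, n_classes=3, random_state=0
)

# n_jobs=-1 asks scikit-learn to use all available cores.
timings = {n_jobs: time_fit(n_jobs, X, y) for n_jobs in (1, 2, -1)}
for n_jobs, seconds in timings.items():
    print(f"n_jobs={n_jobs}: fit took {seconds:.3f} s")
```

On a dataset this small the overhead of starting workers can outweigh the gain, so the difference is likely to be clearer with more rows or more trees.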

Thanks,
Rohit

 
Steena_M_Intel
Employee

Rohit,

Thanks!! After moving the timing calls, I do see a reduction in training time that scales with the number of cores.

Best,

Steena
