parallel random forest (scikit-learn)

Steena_M_Intel · 06-10-2016

Hello,

I am trying to evaluate the performance of a few machine learning classifiers using the recent beta version of Python. The classifier is the random forest algorithm from scikit-learn, and I am interested in training the model in parallel. So far, setting the number of jobs via n_jobs does not seem to work: running top does not show any activity on the other cores. Is there something else that needs to be set to enable parallel training with scikit-learn? Any pointers or advice on getting this working would be appreciated.

Thanks in advance,

Steena

#####Code snippet#####

from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np
import time

sless_drive = pd.read_csv('datasets/Sensorless_drive_diagnosis.txt', sep=" ", header = None)
df = pd.DataFrame(sless_drive)
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75
df['class'] = pd.Categorical(df[48]) #Column 48 is the class label

train, test = df[df['is_train']==True], df[df['is_train']==False] #Separating train and test subsets
features = df.columns[:47] #Only X variables
start = time.time()
clf = RandomForestClassifier(n_jobs=3, verbose=3)
end = time.time()
print (end-start)
y, _ = pd.factorize(train['class'])
clf.fit(train[features], y) #Training the random forest
preds = clf.predict(test[features])
pd.crosstab(test['class'], preds, rownames=['actual'], colnames=['preds'])

Rohit_J_Intel · 06-13-2016

Hi Steena,

Thanks for trying our distribution! Looking at the sample code you've provided, you should be timing the fitting of the model rather than the creation of the classifier. I suggest either:

start = time.time()
clf.fit(train[features], y) #Training the random forest
end = time.time()
print(end-start)

or

start = time.time()
clf.fit(train[features], y) #Training the random forest
preds = clf.predict(test[features])
end = time.time()
print(end-start)

Moreover, if the elapsed time does not change when you vary the n_jobs parameter of RandomForestClassifier, it would really help our investigation if you could provide "datasets/Sensorless_drive_diagnosis.txt".

Also, as per the documentation, you can use all available cores by setting the n_jobs parameter to -1.
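
For illustration, here is a minimal sketch of comparing fit times with n_jobs=1 and n_jobs=-1, with only the fit call placed between the timer reads. It uses a synthetic dataset from make_classification as a stand-in for the Sensorless drive data. The dataset sizes, the number of trees, and the helper name time_fit are assumptions made up for this example, not taken from your code.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real dataset (assumed sizes, for illustration only).
X, y = make_classification(n_samples=2000, n_features=48, n_informative=20,
                           n_classes=5, random_state=0)


def time_fit(n_jobs, n_estimators=50):
    """Return (fitted classifier, seconds spent inside fit) for a given n_jobs.

    Hypothetical helper: the classifier is created outside the timed region
    so that only training is measured.
    """
    clf = RandomForestClassifier(n_estimators=n_estimators, n_jobs=n_jobs,
                                 random_state=0)
    start = time.perf_counter()
    clf.fit(X, y)  # only the training call is timed
    elapsed = time.perf_counter() - start
    return clf, elapsed


timings = {}
for n_jobs in (1, -1):  # -1 asks scikit-learn to use all available cores
    clf, timings[n_jobs] = time_fit(n_jobs)
    print("n_jobs=%d: fit took %.2f s" % (n_jobs, timings[n_jobs]))
```

On a dataset this small, the overhead of starting workers may hide any gain, so the difference is easier to see with more rows or more trees.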

Thanks,
Rohit

Steena_M_Intel · 06-13-2016

Rohit,

Thanks!! After moving the timing calls, I do see a reduction in training time that scales with the number of cores.

Best,

Steena