I am trying to evaluate the performance of a few machine learning classifiers using the recent beta version of Python. The classifier is the random forest algorithm from scikit-learn, and I am interested in training the model in parallel. So far, setting the number of jobs via n_jobs does not seem to work: running top shows no activity on the other cores. Is there something else that needs to be set to enable actual parallel training with scikit-learn? Any pointers or advice on getting this working would be appreciated.
Thanks in advance,
import time
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np
sless_drive = pd.read_csv('datasets/Sensorless_drive_diagnosis.txt', sep=" ", header = None)
df = pd.DataFrame(sless_drive)
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75
df['class'] = pd.Categorical(df[48]) #Column 48 is the class label
train, test = df[df['is_train']==True], df[df['is_train']==False] #Separating train and test subsets
features = df.columns[:47] #Only X variables
start = time.time()
clf = RandomForestClassifier(n_jobs=3, verbose=3)
end = time.time()
y, _ = pd.factorize(train['class'])
clf.fit(train[features], y) #Training the random forest
preds = clf.predict(test[features])
pd.crosstab(test['class'], preds, rownames=['actual'], colnames=['preds'])
Thanks for trying our distribution! Looking at the sample code you provided, you should be timing the fitting of the model rather than the creation of the classifier. I suggest timing the training call:
start = time.time()
clf.fit(train[features], y) #Training the random forest
end = time.time()
print(end-start)
or, to include prediction in the measurement:
start = time.time()
clf.fit(train[features], y) #Training the random forest
preds = clf.predict(test[features])
end = time.time()
print(end-start)
Moreover, if the measured time does not change when you vary the n_jobs parameter of RandomForestClassifier, it would really help our investigation if you could provide "datasets/Sensorless_drive_diagnosis.txt".
Also, as per the documentation, you can use all available cores by setting the n_jobs parameter to -1.
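To check whether n_jobs has any effect on your machine, something along these lines may help. It is only a rough sketch: the data is a synthetic stand-in generated with make_classification (the sample count, class count and tree count are arbitrary assumptions, not properties of your dataset), and time_fit is a hypothetical helper name.

```python
import time

from joblib import effective_n_jobs
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real dataset; all sizes here are assumptions.
X, y = make_classification(n_samples=5000, n_features=48,
                           n_informative=10, n_classes=3, random_state=0)

def time_fit(n_jobs, n_estimators=100):
    """Return the wall-clock seconds spent in fit() for a given n_jobs."""
    clf = RandomForestClassifier(n_estimators=n_estimators,
                                 n_jobs=n_jobs, random_state=0)
    start = time.perf_counter()
    clf.fit(X, y)  # only the training call is timed
    return time.perf_counter() - start

# How many workers joblib would actually use for n_jobs=-1 on this machine.
print("workers for n_jobs=-1:", effective_n_jobs(-1))

timings = {n_jobs: time_fit(n_jobs) for n_jobs in (1, -1)}
for n_jobs, seconds in timings.items():
    print(f"n_jobs={n_jobs}: {seconds:.2f} s")
```

Keep in mind that the forest is parallelized over trees, so with only a handful of estimators (older scikit-learn releases defaulted to 10) or a small dataset, the fit may finish too quickly for the extra cores to show up in top.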