- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hello,
I am trying to evaluate performance of a few machine learning classifiers using the recent beta version of Python. The classifier is the random forest algorithm from sci-kit learn and I am interested in training the model in parallel. So far setting number of tasks via njobs does not seem to work: running top does not show any activity on the rest of the cores. Is there something else that needs to be set to enable actual parallel training using scikit? Any pointers or advice to get this working?
Thanks in advance,
Steena
#####Code snippet#####
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np
import time
sless_drive = pd.read_csv('datasets/Sensorless_drive_diagnosis.txt', sep=" ", header = None)
df = pd.DataFrame(sless_drive)
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75
df['class'] = pd.Categorical(df[48]) #Column 48 is the class label
train, test = df[df['is_train']==True], df[df['is_train']==False] #Separating train and test subsets
features = df.columns[:47] #Only X variables
start = time.time()
clf = RandomForestClassifier(n_jobs=3, verbose=3)
end = time.time()
print (end-start)
y, _ = pd.factorize(train['class'])
clf.fit(train[features], y) #Training the random forest
preds = clf.predict(test[features])
pd.crosstab(test['class'], preds, rownames=['actual'], colnames=['preds'])
Link Copied
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Hi Steena,
Thanks for trying our distribution! Looking at the sample code that you've provided, you should be timing the fitting of prediction model as opposed to creation of classifier. I suggest that you :
start = time.time() clf.fit(train[features], y) #Training the random forest end = time.time() print(end-start)
or
start = time.time() clf.fit(train[features], y) #Training the random forest preds = clf.predict(test[features]) end = time.time() print(end-start)
Moreover, if the time difference does not vary with changing the n_jobs parameter to RandomForestClassifier, it would really help our investigation if you provided "datasets/Sensorless_drive_diagnosis.txt".
Also, as per the documentation, you can scale the computation to the number of cores by setting the n_jobs field to -1.
Thanks,
Rohit
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Report Inappropriate Content
Rohit,
Thanks!! After moving the timing calls, I do see a reduction in training time that scales with number of cores.
Best,
Steena
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Printer Friendly Page