Memory issues while implementing Gridsearch-SVC, NuSVC algorithms using Sklearnex

Swatinairl · 08-18-2022

We are facing memory issues on a D4_v5 machine while running hyperparameter tuning with GridSearchCV for SVC and NuSVC using Sklearnex on a dataset with more than 400k rows. Please suggest a suitable solution.

AthiraM_Intel · 08-19-2022

Hi,

Thank you for posting in Intel Communities.

Could you please share the following details?

Sample reproducer code
Exact steps and the commands used
OS details
Dataset you used

Thanks

Swatinairl · 08-24-2022

Hi,

Reference notebook: Network Intrusion Detection using Python | Kaggle

Below is the code snippet for the grid search where we are facing issues:

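import logging
import time

import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.svm import NuSVC

# start the timer for the data preparation step (reported further down)
start_time_data_prep = time.time()

train_original = pd.read_csv("data.csv")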

train = train_original.head(500000)

# Attack Class Distribution
train['label'].value_counts()

# # SCALING NUMERICAL ATTRIBUTES

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# extract numerical attributes and scale it to have zero mean and unit variance
cols = train.select_dtypes(include=['float64', 'int64']).columns
sc_train = scaler.fit_transform(
    train.select_dtypes(include=['float64', 'int64']))
# sc_test = scaler.fit_transform(
#     test.select_dtypes(include=['float64', 'int64']))

# turn the result back to a dataframe
sc_traindf = pd.DataFrame(sc_train, columns=cols)
#sc_testdf = pd.DataFrame(sc_test, columns=cols)

# # ENCODING CATEGORICAL ATTRIBUTES
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()

# extract categorical attributes from the training set
cattrain = train.select_dtypes(include=['object']).copy()
# encode the categorical attributes
traincat = cattrain.apply(encoder.fit_transform)
# separate target column from encoded data
enctrain = traincat.drop(['label'], axis=1)
cat_Ytrain = traincat[['label']].copy()
train_x = pd.concat([sc_traindf, enctrain], axis=1)
train_y = train['label']
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(
    train_x, train_y, train_size=0.70, random_state=2)
print("data prep time is ---->", time.time()-start_time_data_prep)

logging.debug("Training with NuSVC")
tuned_parameters = [
    {"kernel": ["rbf", "poly"], "gamma": ["scale"]}]
score = "recall"
clf = GridSearchCV(NuSVC(nu=0.2), tuned_parameters, n_jobs=-1,
                   scoring="%s_macro" % score, cv=5, verbose=10)
start_time_nusvc = time.time()
clf.fit(X_train, Y_train)
print("best params",clf.best_params_)
print("best score ",clf.best_score_)
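
For a rough sense of scale, the arithmetic behind the snippet above can be sketched as follows. This is only an illustration: the row count, train_size and cv values come from the snippet, while n_features=40 is a placeholder (the real column count is not stated in this thread) and the helper name svm_memory_estimate is hypothetical. A kernel SVM solver normally keeps a cache of kernel rows rather than the full matrix, so the kernel figure is an upper bound for intuition, not a measurement.

```python
def svm_memory_estimate(n_rows, train_fraction, cv_folds, n_features,
                        bytes_per_value=8):
    """Back-of-the-envelope sizes for one cross-validation fit.

    Hypothetical helper: returns (rows per fit, data size in GiB,
    dense kernel matrix size in GiB), assuming float64 values.
    """
    # rows left after train_test_split, then after holding out one CV fold
    rows_per_fit = round(n_rows * train_fraction * (cv_folds - 1) / cv_folds)
    # dense feature matrix for one fit
    data_bytes = rows_per_fit * n_features * bytes_per_value
    # full n x n kernel matrix, if it were ever materialized
    kernel_bytes = rows_per_fit ** 2 * bytes_per_value
    gib = 2 ** 30
    return rows_per_fit, data_bytes / gib, kernel_bytes / gib


# n_features=40 is an assumed placeholder, not a value from this thread
rows, data_gib, kernel_gib = svm_memory_estimate(500000, 0.70, 5, 40)
print(f"rows per fit: {rows}")
print(f"feature matrix: {data_gib:.2f} GiB, dense kernel: {kernel_gib:.1f} GiB")
```

With 500,000 rows, train_size=0.70 and cv=5, each fit sees 280,000 rows, and a dense float64 kernel matrix of that size would be roughly 584 GiB. With n_jobs=-1 several such fits run at the same time, which may contribute to the memory pressure.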

OS details:

Distributor ID: Ubuntu
Description: Ubuntu 20.04.4 LTS
Release: 20.04
Codename: focal

The dataset (.csv file) is attached here.
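
One possible way to keep a grid search like the one above within memory, offered as a sketch rather than a confirmed fix for this thread, is to tune on a stratified subsample and limit the number of concurrent fits. The function name tune_on_subsample, the sample size and the synthetic data are assumptions for illustration. patch_sklearn() is the usual way to enable Sklearnex and is skipped here if the package is not installed.

```python
# Optional: enable Intel Extension for Scikit-learn before importing sklearn.
try:
    from sklearnex import patch_sklearn
    patch_sklearn()
except ImportError:
    pass  # fall back to stock scikit-learn

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import NuSVC


def tune_on_subsample(X, y, sample_size, n_jobs=1, cv=3, random_state=2):
    """Grid-search NuSVC on a stratified subsample (hypothetical helper)."""
    # Stratified subsample: keeps class proportions while shrinking the data
    X_small, _, y_small, _ = train_test_split(
        X, y, train_size=sample_size, stratify=y, random_state=random_state)
    param_grid = [{"kernel": ["rbf", "poly"], "gamma": ["scale"]}]
    search = GridSearchCV(
        NuSVC(nu=0.2),
        param_grid,
        scoring="recall_macro",
        cv=cv,
        n_jobs=n_jobs,              # fewer workers, fewer fits held in memory
        pre_dispatch="1*n_jobs",    # do not queue extra tasks ahead of time
    )
    search.fit(X_small, y_small)
    return search


if __name__ == "__main__":
    # Synthetic stand-in for the real dataset, which is not reproduced here
    X, y = make_classification(n_samples=600, n_features=10, random_state=0)
    result = tune_on_subsample(X, y, sample_size=300)
    print("best params", result.best_params_)
    print("best score ", result.best_score_)
```

The best parameters found on the subsample could then be used to fit a single model on the full training set, which avoids running several full-size fits in parallel. Whether the subsample is representative enough would need to be checked on the real data.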

AthiraM_Intel · 08-30-2022

Hi,

We were able to run the sample code you shared without any issues on Ubuntu 18 (Intel DevCloud).

Could you please try running the same code on Intel DevCloud for oneAPI?

You can register for DevCloud using the link below:

https://www.intel.com/content/www/us/en/forms/idz/devcloud-enrollment/oneapi-request.html

Meanwhile, could you please share the details of the hardware you have already tried, so that we can try to reproduce your issue?

Thanks

AthiraM_Intel · 09-07-2022

Hi,

We have not heard back from you. Could you please give us an update?

Thanks

AthiraM_Intel · 09-13-2022

Hi,

We have not heard back from you. This thread will no longer be monitored by Intel. If you need further assistance, please post a new question.

Thanks