Z is a summer intern working on spam classification in your company. The dataset consists of 10 million non-spam emails (class 0) and 10 thousand spam emails (class 1). Z considers the following steps of conducting experiments:

Step 1: Shuffle the dataset and split it into the train, validation, and test sets.
Step 2: Train logistic regression models on the train set with different hyper-parameters.
Step 3: Identify the best hyper-parameter using the validation set and report the results on the test set in accuracy.

Do you agree with the above experimental setup? If No, what is the major issue? Provide your suggestions in one or two sentences.

Answer:

No. Class 0 and class 1 are heavily imbalanced, and the email text should be converted to vectors before it is used.

Explanation:

Class 0 (non-spam) has 10 million emails, while class 1 (spam) has only 10 thousand, a ratio of 1000 to 1. With data this imbalanced, the model can favor the majority class and predict spam poorly. Accuracy will also hide the problem, since a model that labels every email as non-spam already scores about 99.9% accuracy. Either downsample class 0 or upsample class 1 to improve the model's predictions.
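To make the imbalance concrete, here is a minimal Python sketch. It uses the class counts from the question to compute the accuracy of a model that always predicts non-spam, then shows one simple way to downsample class 0. The helper `downsample_majority`, its `ratio` parameter, and the toy label list are assumptions made for illustration, not part of the original question.

```python
import random

# Class counts taken from the question.
n_non_spam = 10_000_000  # class 0
n_spam = 10_000          # class 1

# A trivial model that labels every email as non-spam is right on all
# class-0 emails and wrong on all class-1 emails.
baseline_accuracy = n_non_spam / (n_non_spam + n_spam)
baseline_spam_recall = 0 / n_spam
print(f"always-non-spam accuracy: {baseline_accuracy:.4%}")
print(f"always-non-spam spam recall: {baseline_spam_recall:.0%}")


def downsample_majority(labels, ratio=1.0, seed=0):
    """Return sorted row indices: all class-1 rows plus a random subset of class-0 rows.

    ratio (hypothetical parameter) is the number of class-0 rows kept
    per class-1 row.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    keep = min(len(majority), int(len(minority) * ratio))
    return sorted(minority + rng.sample(majority, keep))


# Toy training labels with the same 1000:1 imbalance, scaled down.
train_labels = [0] * 10_000 + [1] * 10
kept = downsample_majority(train_labels, ratio=1.0)
balanced = [train_labels[i] for i in kept]
print(balanced.count(0), balanced.count(1))
```

In this sketch the resampling is applied to training labels only. Leaving the validation and test sets at their original class ratio is a common choice, so that reported results reflect the real data.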

The email text should also be converted to numeric vectors, using natural language processing (NLP) techniques, before it is fed to the model.
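As a rough illustration of turning text into vectors, the sketch below builds a simple bag-of-words count vector in plain Python. The function names (`tokenize`, `build_vocabulary`, `vectorize`) and the example emails are made up for this illustration. The answer does not say which NLP technique to use, so bag-of-words is only one possible choice.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(texts):
    """Map each distinct token in the training texts to a column index."""
    tokens = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def vectorize(text, vocabulary):
    """Return a count vector; tokens not in the vocabulary are ignored."""
    counts = Counter(tokenize(text))
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = n
    return vector


# Made-up example emails, for illustration only.
train_texts = ["Win a free prize now", "Meeting moved to Monday"]
vocab = build_vocabulary(train_texts)
x = vectorize("Free free prize for you", vocab)
print(x)
```

The vocabulary is built from the training texts only, so words that first appear in validation or test emails are dropped instead of leaking into the features.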
