Problem.
Given a set of 32,000 headlines, each labeled as clickbait (label 1) or
not (label 0), you are asked to build a model that detects clickbait headlines.
Solution.
Read data: …
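The loading code is elided above; here is a minimal sketch, assuming the data sits in a CSV file with a headline and a label column (both the file name and the column names are hypothetical):

```python
import pandas as pd

# Hypothetical file and column names; adjust to the actual dataset.
df = pd.read_csv("clickbait_headlines.csv")  # columns: "headline", "label"

print(df.shape)                    # expect (32000, 2)
print(df["label"].value_counts())  # class balance between 0 and 1
```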
Split into train/validation/test sets: …
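Again the original code is elided; a sketch using sklearn's train_test_split, assuming a 70/15/15 split (the actual proportions are not stated):

```python
from sklearn.model_selection import train_test_split

# Carve out the test set first, then split the rest into train/validation.
train_val_df, test_df = train_test_split(
    df, test_size=0.15, stratify=df["label"], random_state=42
)
train_df, val_df = train_test_split(
    train_val_df,
    test_size=0.15 / 0.85,  # 15% of the full data
    stratify=train_val_df["label"],
    random_state=42,
)
```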
And finally, we build and train/tune a pipeline model, cfr_pipeline,
that uses a LogisticRegression classifier. The model’s macro precision on
the test set is 0.9650.
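The pipeline itself is not shown; here is a sketch of what cfr_pipeline might look like, assuming a TF-IDF feature step in front of the classifier (only the LogisticRegression step is confirmed by the text):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score

# The TF-IDF step and its parameters are assumptions; the source only
# states that a LogisticRegression classifier is used.
cfr_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2)),
    ("clf", LogisticRegression(max_iter=1000)),
])

cfr_pipeline.fit(train_df["headline"], train_df["label"])

test_preds = cfr_pipeline.predict(test_df["headline"])
print(precision_score(test_df["label"], test_preds, average="macro"))  # 0.9650 in the original run
```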
Manual Boosting.
First, find the mislabeled samples in the training set:
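A sketch of how those samples can be found, i.e. the training samples whose predicted label disagrees with their given label:

```python
# Predict on the training set and keep the disagreements.
train_preds = cfr_pipeline.predict(train_df["headline"])
mislabeled_df = train_df[train_preds != train_df["label"]]

print(len(mislabeled_df))  # 432 in the original run
```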
There are 432 such samples. Let’s prepare and add them to the training set; in other words,
let’s manually boost our training set:
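A sketch of the boosting step: the misclassified samples are duplicated into the training set and the model is retrained (the call to train_measure_model assumes the signature sketched further below):

```python
# Duplicate the misclassified samples into the training set, retrain,
# and re-measure on the test set.
boosted_train_df = pd.concat([train_df, mislabeled_df], ignore_index=True)
cfr_pipeline = train_measure_model(boosted_train_df, test_df)
```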
Remember, train_measure_model was defined as:
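The original definition is not reproduced here; a hypothetical reconstruction, consistent with the pipeline sketched above:

```python
def train_measure_model(train_df, test_df):
    """Train a fresh pipeline and report macro precision on the test set."""
    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    pipeline.fit(train_df["headline"], train_df["label"])
    preds = pipeline.predict(test_df["headline"])
    print("macro precision:", precision_score(test_df["label"], preds, average="macro"))
    return pipeline
```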
Now the macro precision on the test set is 0.9661.
Let’s boost it one more time:
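That is, repeat the same find-and-duplicate step against the already-boosted training set; a sketch:

```python
# Find the samples the retrained model still gets wrong, duplicate them,
# and retrain once more.
train_preds = cfr_pipeline.predict(boosted_train_df["headline"])
mislabeled_df = boosted_train_df[train_preds != boosted_train_df["label"]]
boosted_train_df = pd.concat([boosted_train_df, mislabeled_df], ignore_index=True)
cfr_pipeline = train_measure_model(boosted_train_df, test_df)
```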
Now the macro precision on the test set is 0.9664. We stop here, since further
boosting will not improve the test metrics and will instead start to overfit.
Important Points.
In this example we see a slight (<1%) improvement; however, as with other
techniques, depending on the problem it might improve the metrics more, or worsen them.
This technique can be applied to other models, whether conventional, transfer learning (TL), or deep learning (DL).
It can be applied to other classification types as well, e.g. multi-label or multi-class
classification.
This technique is not restricted to NLP.
Overall, it’s a good idea to check the quality of the mislabeled samples
in terms of their labels, since the rate of bad or inconsistent labeling is higher in this set.
(I will explain this in another post.)