For some NLP models, when we pickle the trained model, its size on disk is surprisingly large, regardless of the limited number of features used in it, and it consumes a lot of memory at inference time.
Before jumping to the solution, let's see an example: Detecting Clickbaits (3/4) - Logistic Regression. There, we use a set of 32,000 headlines, each labeled as clickbait or not, to build a model that detects clickbait headlines. Let's remind ourselves of the main steps:
Split into train/validation/test sets:
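As a sketch of that split (the data here is a toy stand-in, and the variable names and split ratios are assumptions, not the post's exact values), a two-stage `train_test_split` produces the three sets:

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the real data; the actual post loads ~32,000 headlines.
headlines = ["you won't believe this trick", "senate passes budget bill"] * 100
labels = [1, 0] * 100

# First carve out the test set, then split the remainder into train/validation.
X_temp, X_test, y_temp, y_test = train_test_split(
    headlines, labels, test_size=0.2, random_state=42, stratify=labels)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp)
```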
Define a function that builds a pipeline consisting of CountVectorizer, TfidfTransformer (note that you can combine these two and use TfidfVectorizer instead), and LogisticRegression stages, so that you can pass different parameters to each stage, and finally train a pipeline model with the following parameters:
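A minimal sketch of such a builder, trained here on toy data (the stage names, parameter names, and values are illustrative assumptions; the post passes its own):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def build_pipeline(ngram_range=(1, 3), max_features=5000, C=1.0):
    # Stage names ("vect", "tfidf", "clf") are illustrative.
    return Pipeline([
        ("vect", CountVectorizer(ngram_range=ngram_range,
                                 max_features=max_features)),
        ("tfidf", TfidfTransformer()),
        ("clf", LogisticRegression(C=C, max_iter=1000)),
    ])

# Train on toy data; the real model is trained on the headline corpus.
pipeline = build_pipeline()
pipeline.fit(["you won't believe this"] * 20 + ["senate passes budget"] * 20,
             [1] * 20 + [0] * 20)
```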
If we just pickle the trained pipeline, its size on disk is about 5.9 MB, although we are only using 5,000 features (1-3 word n-grams).
The reason is the stop_words_ attribute of CountVectorizer. Looking at the scikit-learn documentation for this attribute:
stop_words_ : set
Terms that were ignored because they either:
- occurred in too many documents (max_df)
- occurred in too few documents (min_df)
- were cut off by feature selection (max_features).
This is only available if no vocabulary was given.
The stop_words_ attribute can get large and increase the model size
when pickling. This attribute is provided only for introspection and can
be safely removed using delattr or set to None before pickling.
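To see this attribute in action on a toy corpus (everything here is illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["red apple pie", "green apple tart", "red grape juice"]

# Keep only the 3 most frequent terms; everything cut off by this
# feature selection ends up in stop_words_.
vect = CountVectorizer(max_features=3).fit(docs)
print(len(vect.vocabulary_), len(vect.stop_words_))
```

The 7 distinct terms are split between the retained vocabulary and the discarded stop_words_ set, which is kept around purely for introspection.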
So to resolve the size issue, we can set the stop_words_ attribute to None (or remove it with delattr) before pickling. The size of the pickled model then drops dramatically.
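A self-contained sketch of the fix (the toy data and parameters are assumptions): train a small pipeline, then compare the pickled size before and after clearing the attribute:

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("vect", CountVectorizer(ngram_range=(1, 3), max_features=10)),
    ("tfidf", TfidfTransformer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(["you will never guess what happened next"] * 20 +
             ["government announces new budget plan today"] * 20,
             [1] * 20 + [0] * 20)

size_before = len(pickle.dumps(pipeline))

# stop_words_ is provided only for introspection, so clearing it is safe.
pipeline.named_steps["vect"].stop_words_ = None

size_after = len(pickle.dumps(pipeline))
print(size_before, size_after)
```

On the real headline model, the same two lines before `pickle.dump` are all that is needed; predictions are unaffected because inference never reads stop_words_.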
Note that in this case, our initial pickled model was not that large (only 5.9 MB) because our training texts are short (headlines are only a few words each), and hence the set of all 1-3 n-grams is not too large.
However, with larger text bodies in the training set (reviews/tweets/news/etc.), the pickled model can easily exceed 1-2 GB even if the number of features is limited to, say, 1000, which makes the model slow to load and costs multiple gigabytes of RAM at the inference stage. Whereas by using the above solution, its size shrinks to a small fraction of that.