
Problem. Given a set of 32,000 headlines, each labeled as clickbait (label 1) or not (label 0), you’re asked to build a model that detects clickbait headlines.

Solution.

Read the data:

import pandas as pd

df = pd.read_csv("https://raw.github.com/hminooei/DSbyHadi/master/data/clickbait_data.csv.zip")
df.tail(2)


df.head(2)

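Before splitting, it is worth confirming the class balance (a quick check, assuming the clickbait column holds the 0/1 labels):

# Count how many headlines fall into each class (0 = not clickbait, 1 = clickbait).
df["clickbait"].value_counts()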

Split into train/validation/test sets:

from sklearn.model_selection import train_test_split

text_train_val, text_test, label_train_val, label_test = train_test_split(
    df["headline"],
    df["clickbait"],
    test_size=0.25,
    stratify=df["clickbait"],
    random_state=9)

# Split the train_val dataset into separate train and validation portions.
text_train, text_val, label_train, label_val = train_test_split(
    text_train_val,
    label_train_val, 
    test_size=0.2, 
    random_state=9)
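A quick sanity check on the resulting sizes: with the parameters above, roughly 60%/15%/25% of the 32,000 headlines land in train/validation/test.

# Expect roughly 19200 / 4800 / 8000 records for train / validation / test.
print(len(text_train), len(text_val), len(text_test))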

Define a function that builds a pipeline consisting of CountVectorizer, TfidfTransformer (note that you can combine these two and use TfidfVectorizer instead; a sketch of that variant appears after the function), and LogisticRegression stages, so that you can pass different parameters to it for tuning:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def train_measure_model(text_train, label_train, text_val, label_val,
                        cv_binary, cv_analyzer, cv_ngram, cv_max_features,
                        cv_have_tfidf, cv_use_idf, cfr_penalty, cfr_C, stop_words=None,
                        text_column_name="headline"):
    cv = CountVectorizer(binary=cv_binary, stop_words=stop_words,
                         analyzer=cv_analyzer,
                         ngram_range=cv_ngram,
                         max_features=cv_max_features)
    # Optionally insert a TF-IDF stage between vectorization and classification.
    steps = [("vectorizer", cv)]
    if cv_have_tfidf:
        steps.append(("tfidf", TfidfTransformer(use_idf=cv_use_idf)))
    steps.append(("classifier", LogisticRegression(penalty=cfr_penalty,
                                                   C=cfr_C,
                                                   random_state=9,
                                                   max_iter=100,
                                                   n_jobs=None)))
    pipeline = Pipeline(steps=steps)

    pipeline.fit(text_train, label_train)
    
    print_metrics(pipeline, text_train, label_train, text_val, label_val)

    return pipeline
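As noted above, the CountVectorizer + TfidfTransformer pair can be collapsed into a single TfidfVectorizer stage. A minimal sketch of that variant, reusing the parameter names from train_measure_model (equivalent to the cv_have_tfidf=True branch):

from sklearn.feature_extraction.text import TfidfVectorizer

# One stage instead of two: TfidfVectorizer = CountVectorizer + TfidfTransformer.
# Sketch only; parameter names are as in train_measure_model above.
pipeline = Pipeline(steps=[
    ("vectorizer", TfidfVectorizer(binary=cv_binary, stop_words=stop_words,
                                   analyzer=cv_analyzer, ngram_range=cv_ngram,
                                   max_features=cv_max_features, use_idf=cv_use_idf)),
    ("classifier", LogisticRegression(penalty=cfr_penalty, C=cfr_C,
                                      random_state=9, max_iter=100))])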

where the evaluation section is refactored into print_metrics:

from sklearn import metrics

def print_metrics(pipeline, text_train, label_train, text_val, label_val):
    train_preds = pipeline.predict(text_train)
    val_preds = pipeline.predict(text_val)
    
    print("train:")
    print(metrics.classification_report(label_train, train_preds, labels=[0, 1], digits=4))
    print(metrics.confusion_matrix(label_train, train_preds))
    print("validation:")
    print(metrics.classification_report(label_val, val_preds, labels=[0, 1], digits=4))
    print(metrics.confusion_matrix(label_val, val_preds))

Now we run the function with a few different parameter sets (we tried four) to arrive at the trained model below:

cfr_pipeline = train_measure_model(text_train, label_train, text_val, label_val,
                                   cv_binary=True, cv_analyzer="word", cv_ngram=(1, 3),
                                   cv_max_features=5000, cv_have_tfidf=True, cv_use_idf=True,
                                   cfr_penalty="l2", cfr_C=1.0, stop_words=None)

which can then be evaluated on the test set:

measure_model_on_test(cfr_pipeline, text_test, label_test)

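measure_model_on_test is not defined in this post (the full code is linked below); a minimal version mirroring print_metrics might look like this:

def measure_model_on_test(pipeline, text_test, label_test):
    test_preds = pipeline.predict(text_test)
    print("test:")
    print(metrics.classification_report(label_test, test_preds, labels=[0, 1], digits=4))
    print(metrics.confusion_matrix(label_test, test_preds))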

Please see the next post, Detecting Clickbaits (4/4) - Manual Boosting, for further improvements to this model.

Important Points.

  • Training time: ~1.3 s per run (on my laptop); with the 4 parameter sets I tried, the whole search took about a minute overall.
  • Macro-averaged precision on the test set: 0.9650
  • Inference time per record: ~1 ms on my laptop (MacBook Pro: 2.3 GHz 8-Core Intel Core i9, 32 GB 2667 MHz DDR4); a timing sketch follows this list.
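One rough way to reproduce the per-record timing (a sketch; numbers vary by machine and by whether you predict records one at a time or in a batch):

import time

# Time a batch prediction over the test set and average per record.
start = time.perf_counter()
cfr_pipeline.predict(text_test)
elapsed = time.perf_counter() - start
print(f"~{1000 * elapsed / len(text_test):.3f} ms per record")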

Note.

  • The complete code for this post can be found on GitHub.
  • Note that this is only one solution; please refer to the next posts for other possible solutions!
  • The dataset was originally taken from Kaggle.