
Problem. Given a set of 32,000 headlines, each labeled as clickbait (label 1) or not (label 0), you’re asked to build a model that detects clickbait headlines.

Solution.

Read the data:

import pandas as pd

df = pd.read_csv("https://raw.github.com/hminooei/DSbyHadi/master/data/clickbait_data.csv.zip")
df.tail(2)


df.head(2)

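Before splitting, it is worth confirming the class balance (a quick check, assuming the clickbait column holds the 0/1 labels):

# Count how many headlines fall into each class (0 = not clickbait, 1 = clickbait).
df["clickbait"].value_counts()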

Split into train/validation/test sets:

from sklearn.model_selection import train_test_split

text_train_val, text_test, label_train_val, label_test = train_test_split(
    df["headline"],
    df["clickbait"],
    test_size=0.25,
    stratify=df["clickbait"],
    random_state=9)

# Split the train_val dataset into separate train and validation portions.
text_train, text_val, label_train, label_val = train_test_split(
    text_train_val,
    label_train_val, 
    test_size=0.2, 
    random_state=9)
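A quick sanity check on the resulting sizes: with the parameters above, roughly 60%/15%/25% of the 32,000 headlines land in train/validation/test.

# Expect roughly 19200 / 4800 / 8000 records for train / validation / test.
print(len(text_train), len(text_val), len(text_test))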

Define a function that builds a pipeline consisting of CountVectorizer, TfidfTransformer (note that you can combine these two and use TfidfVectorizer instead; a sketch of that variant appears after the function), and LogisticRegression stages, so that you can pass different parameters to it for tuning:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

def train_measure_model(text_train, label_train, text_val, label_val,
                        cv_binary, cv_analyzer, cv_ngram, cv_max_features,
                        cv_have_tfidf, cv_use_idf, cfr_penalty, cfr_C, stop_words=None,
                        text_column_name="headline"):
    cv = CountVectorizer(binary=cv_binary, stop_words=stop_words,
                         analyzer=cv_analyzer,
                         ngram_range=cv_ngram,
                         max_features=cv_max_features)
    # Optionally insert a TF-IDF stage between vectorization and classification.
    steps = [("vectorizer", cv)]
    if cv_have_tfidf:
        steps.append(("tfidf", TfidfTransformer(use_idf=cv_use_idf)))
    steps.append(("classifier", LogisticRegression(penalty=cfr_penalty,
                                                   C=cfr_C,
                                                   random_state=9,
                                                   max_iter=100,
                                                   n_jobs=None)))
    pipeline = Pipeline(steps=steps)

    pipeline.fit(text_train, label_train)
    
    print_metrics(pipeline, text_train, label_train, text_val, label_val)

    return pipeline
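As noted above, the CountVectorizer + TfidfTransformer pair can be collapsed into a single TfidfVectorizer stage. A minimal sketch of that variant, reusing the parameter names from train_measure_model (equivalent to the cv_have_tfidf=True branch):

from sklearn.feature_extraction.text import TfidfVectorizer

# One stage instead of two: TfidfVectorizer = CountVectorizer + TfidfTransformer.
# Sketch only; parameter names are as in train_measure_model above.
pipeline = Pipeline(steps=[
    ("vectorizer", TfidfVectorizer(binary=cv_binary, stop_words=stop_words,
                                   analyzer=cv_analyzer, ngram_range=cv_ngram,
                                   max_features=cv_max_features, use_idf=cv_use_idf)),
    ("classifier", LogisticRegression(penalty=cfr_penalty, C=cfr_C,
                                      random_state=9, max_iter=100))])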

where the evaluation section is refactored into print_metrics:

from sklearn import metrics

def print_metrics(pipeline, text_train, label_train, text_val, label_val):
    train_preds = pipeline.predict(text_train)
    val_preds = pipeline.predict(text_val)
    
    print("train:")
    print(metrics.classification_report(label_train, train_preds, labels=[0, 1], digits=4))
    print(metrics.confusion_matrix(label_train, train_preds))
    print("validation:")
    print(metrics.classification_report(label_val, val_preds, labels=[0, 1], digits=4))
    print(metrics.confusion_matrix(label_val, val_preds))

Now we run the function with a few different parameter sets (we tried four) to arrive at the trained model below:

cfr_pipeline = train_measure_model(text_train, label_train, text_val, label_val,
                                   cv_binary=True, cv_analyzer="word", cv_ngram=(1, 3),
                                   cv_max_features=5000, cv_have_tfidf=True, cv_use_idf=True,
                                   cfr_penalty="l2", cfr_C=1.0, stop_words=None)

which can then be evaluated on the test set:

measure_model_on_test(cfr_pipeline, text_test, label_test)

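measure_model_on_test is not defined in this post (the full code is linked below); a minimal version mirroring print_metrics might look like this:

def measure_model_on_test(pipeline, text_test, label_test):
    test_preds = pipeline.predict(text_test)
    print("test:")
    print(metrics.classification_report(label_test, test_preds, labels=[0, 1], digits=4))
    print(metrics.confusion_matrix(label_test, test_preds))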

Please see the next post, Detecting Clickbaits (4/4) - Manual Boosting, for further improvements to this model.

Important Points.

  • Training time: ~1.3 s per run (on my laptop); with the 4 parameter sets I tried, the whole search took about a minute overall.
  • Macro-averaged precision on the test set: 0.9650
  • Inference time per record: ~1 ms on my laptop (MacBook Pro: 2.3 GHz 8-Core Intel Core i9, 32 GB 2667 MHz DDR4); a timing sketch follows this list.
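One rough way to reproduce the per-record timing (a sketch; numbers vary by machine and by whether you predict records one at a time or in a batch):

import time

# Time a batch prediction over the test set and average per record.
start = time.perf_counter()
cfr_pipeline.predict(text_test)
elapsed = time.perf_counter() - start
print(f"~{1000 * elapsed / len(text_test):.3f} ms per record")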

Note.

  • The complete code for this post can be found on GitHub.
  • Note that this is only one solution; please refer to the next posts for other possible solutions!
  • The dataset was originally taken from Kaggle.