BERT Text Classification Deep Dive

Explore the power of BERT for sentiment analysis with an interactive demo and detailed code walkthrough.

🌟 What is BERT?

BERT (Bidirectional Encoder Representations from Transformers) is a revolutionary NLP model introduced by Google in 2018. Unlike traditional models that process text sequentially, BERT is bidirectional, capturing context from both left and right directions simultaneously. This makes it exceptionally powerful for tasks like text classification, question answering, and more.

Key features of BERT:

  • Bidirectional context: every token attends to both its left and right context in every layer.
  • Pre-training: trained on large text corpora with masked language modeling and next-sentence prediction objectives.
  • Transfer learning: the pre-trained weights can be fine-tuned for downstream tasks with only a small task-specific head.

🔑 Why BERT?

BERT’s ability to understand nuanced language patterns makes it ideal for tasks requiring deep contextual understanding, such as classifying movie reviews as positive or negative.

🏗️ BERT Architecture

BERT is built on the Transformer’s encoder architecture, stacking multiple identical encoder layers. Each layer includes:

  • Multi-head self-attention, which lets every token attend to every other token in the sequence.
  • A position-wise feed-forward network.
  • Residual connections and layer normalization around each sub-layer.

BERT’s input is tokenized text augmented with special tokens:

  • [CLS]: prepended to every sequence; its final hidden state serves as a sequence-level representation for classification.
  • [SEP]: marks the end of a sequence and separates sentence pairs.
  • [PAD]: pads shorter sequences to a fixed length; the attention mask tells the model to ignore these positions.
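A quick way to see these special tokens is to run a sentence through the bert-base-uncased tokenizer (a minimal sketch, assuming the transformers library is installed):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
encoded = tokenizer("This movie is great and I loved it!")
print(tokenizer.convert_ids_to_tokens(encoded['input_ids']))
# ['[CLS]', 'this', 'movie', 'is', 'great', 'and', 'i', 'loved', 'it', '!', '[SEP]']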

🔑 BERT’s Power

The bidirectional nature and pre-training allow BERT to capture deep semantic relationships, making it highly effective for downstream tasks.

🔍 BERT for Text Classification

BERT excels at text classification by leveraging its contextual embeddings. For sentiment analysis, we fine-tune BERT on a dataset of movie reviews labeled as positive or negative. Below is a detailed breakdown of the process:

1. Data Preparation

The dataset consists of movie reviews with binary labels (1 for positive, 0 for negative). We tokenize the text using BERT’s tokenizer, which converts words into subword tokens and adds special tokens like [CLS] and [SEP].

from transformers import BertTokenizer
from sklearn.model_selection import train_test_split

# Movie reviews with binary sentiment labels (1 = positive, 0 = negative)
texts = [
    "This movie is great and I loved it!",
    "Terrible film, very boring.",
    ...
]
labels = [1, 0, ...]

# Hold out 20% of the examples for validation
train_texts, val_texts, train_labels, val_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)

# WordPiece tokenizer matching the pre-trained checkpoint
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

The tokenizer outputs input_ids (token indices) and attention_mask (indicating valid tokens vs. padding).
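For illustration, a minimal sketch of what those outputs look like (max_length=8 is chosen only to keep the example small; the full script uses 128):

enc = tokenizer(
    ["Terrible film, very boring."],
    max_length=8, padding='max_length', truncation=True, return_tensors='tf'
)
print(enc['input_ids'].shape)       # (1, 8) token IDs; 101 is [CLS], 102 is [SEP]
print(enc['attention_mask'].shape)  # (1, 8) 1 for real tokens, 0 for padding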

2. Model Architecture

We use the pre-trained bert-base-uncased model and add a classification head. The pooled [CLS] representation (pooler_output) is passed through a dropout layer and a dense layer with softmax activation for binary classification.

# Pre-trained BERT encoder
bert_model = TFBertModel.from_pretrained('bert-base-uncased')

# Symbolic inputs for token IDs and the attention mask
input_ids = tf.keras.layers.Input(shape=(None,), dtype=tf.int32, name="input_ids")
attention_mask = tf.keras.layers.Input(shape=(None,), dtype=tf.int32, name="attention_mask")

# pooler_output is the transformed [CLS] representation
bert_outputs = bert_model(input_ids, attention_mask=attention_mask)
pooled_output = bert_outputs.pooler_output

# Classification head: dropout followed by a 2-way softmax
dropout = tf.keras.layers.Dropout(0.3)(pooled_output)
output = tf.keras.layers.Dense(2, activation='softmax')(dropout)
model = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=output)

The softmax layer outputs probabilities for positive and negative classes.
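As a quick sanity check (a sketch assuming the tokenizer and model above are already built), the model maps a batch of encoded text to one probability per class:

dummy = tokenizer(
    ["A quick sanity check."],
    max_length=16, padding='max_length', truncation=True, return_tensors='tf'
)
probs = model({'input_ids': dummy['input_ids'],
               'attention_mask': dummy['attention_mask']})
print(probs.shape)  # (1, 2): one row per example, one probability per class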

3. Training

The model is fine-tuned with a small learning rate (e.g., 2e-5) using the Adam optimizer and sparse categorical crossentropy loss. We train for 10 epochs on batched data.

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)
model.fit(
    train_dataset,
    validation_data=val_dataset,
    epochs=10,
    verbose=2
)

⚠️ Training Tips

Small Learning Rate: BERT requires small learning rates (e.g., 2e-5) to avoid catastrophic forgetting of the pre-trained weights.

Batch Size: A batch size of 2 is used here only because the toy dataset is tiny; 16-32 is more typical, with the upper limit set by BERT’s memory demands.

Epochs: 3-10 epochs are usually sufficient for fine-tuning on small datasets; monitoring validation loss (see the sketch below) helps avoid overfitting.
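A minimal sketch of that monitoring, using Keras’s built-in EarlyStopping callback (not part of the original script):

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',         # stop when validation loss stops improving
    patience=2,                 # tolerate two epochs without improvement
    restore_best_weights=True   # roll back to the best weights seen
)
model.fit(
    train_dataset,
    validation_data=val_dataset,
    epochs=10,
    callbacks=[early_stop],
    verbose=2
)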

4. Inference

For inference, we tokenize a new text, pass it through the model, and predict the sentiment based on the highest probability class.

def predict_sentiment(text, model, tokenizer, max_len=128):
    # Tokenize the input exactly as the training data was tokenized
    encodings = tokenizer(
        [text],
        max_length=max_len,
        padding='max_length',
        truncation=True,
        return_tensors='tf'
    )
    # Forward pass: one probability per class
    probs = model({'input_ids': encodings['input_ids'], 'attention_mask': encodings['attention_mask']}).numpy()
    prediction = np.argmax(probs, axis=-1)[0]
    print(f"Probabilities: {probs}")
    return "Positive" if prediction == 1 else "Negative"
5. Interactive Demo

Test the sentiment analysis model with your own text:

🔄 Sentiment Analysis Demo

Note: This demo uses a simple keyword-based simulation. The actual model uses BERT’s contextual understanding for more accurate predictions.
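For reference, a keyword-based fallback of that kind might look roughly like this (a hypothetical sketch, not the demo’s actual code; it ignores context entirely):

POSITIVE_WORDS = {"great", "loved", "amazing", "fantastic", "brilliant"}
NEGATIVE_WORDS = {"terrible", "boring", "bad", "worst", "dull"}

def keyword_sentiment(text):
    # Count keyword hits; no negation handling, word order, or subword awareness
    words = set(text.lower().replace(",", " ").replace(".", " ").split())
    score = len(words & POSITIVE_WORDS) - len(words & NEGATIVE_WORDS)
    return "Positive" if score >= 0 else "Negative"

print(keyword_sentiment("Terrible film, very boring."))  # Negative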

🚀 Complete BERT Classification Code

This section provides the complete Python script for fine-tuning BERT for sentiment analysis on movie reviews.

import tensorflow as tf
from transformers import TFBertModel, BertTokenizer
from sklearn.model_selection import train_test_split
import numpy as np

# Dataset
texts = [
    "This movie is great and I loved it!",
    "Terrible film, very boring.",
    "Amazing storyline and acting!",
    "I didn't enjoy this at all.",
    "Fantastic experience, highly recommend!",
    "Really bad, waste of time.",
    "One of the best movies I've seen this year!",
    "Completely disappointing and predictable.",
    "Brilliant direction and stunning visuals.",
    "I fell asleep halfway through, so dull.",
    "Heartwarming and beautifully shot.",
    "Poor acting and weak script.",
    "Absolutely loved the plot twists!",
    "Not worth the hype at all.",
    "Engaging from start to finish!",
    "The worst film I’ve ever watched.",
    "Incredible performances by the cast!",
    "Script was a mess and pacing was off."
]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Train/Val Split
train_texts, val_texts, train_labels, val_labels = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)

# Load tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenization function
def tokenize_texts(texts, max_len=128):
    encodings = tokenizer(
        texts,
        max_length=max_len,
        padding='max_length',
        truncation=True,
        return_tensors='tf'
    )
    return encodings['input_ids'], encodings['attention_mask']

# Tokenize data
train_input_ids, train_attention_mask = tokenize_texts(train_texts)
val_input_ids, val_attention_mask = tokenize_texts(val_texts)

# Convert labels to tensors
train_labels = tf.convert_to_tensor(train_labels, dtype=tf.int32)
val_labels = tf.convert_to_tensor(val_labels, dtype=tf.int32)

# Prepare datasets
train_dataset = tf.data.Dataset.from_tensor_slices((
    {'input_ids': train_input_ids, 'attention_mask': train_attention_mask},
    train_labels
)).batch(2).prefetch(tf.data.AUTOTUNE)

val_dataset = tf.data.Dataset.from_tensor_slices((
    {'input_ids': val_input_ids, 'attention_mask': val_attention_mask},
    val_labels
)).batch(2).prefetch(tf.data.AUTOTUNE)

# Load base BERT model
bert_model = TFBertModel.from_pretrained('bert-base-uncased')

# Input layers
input_ids = tf.keras.layers.Input(shape=(None,), dtype=tf.int32, name="input_ids")
attention_mask = tf.keras.layers.Input(shape=(None,), dtype=tf.int32, name="attention_mask")

# Get pooled output from BERT
bert_outputs = bert_model(input_ids, attention_mask=attention_mask)
pooled_output = bert_outputs.pooler_output

# Classification head
dropout = tf.keras.layers.Dropout(0.3)(pooled_output)
output = tf.keras.layers.Dense(2, activation='softmax')(dropout)

# Build model
model = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=output)

# Compile model
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)

# Train the model
model.fit(
    train_dataset,
    validation_data=val_dataset,
    epochs=10,
    verbose=2
)

# Inference function
def predict_sentiment(text, model, tokenizer, max_len=128):
    encodings = tokenizer(
        [text],
        max_length=max_len,
        padding='max_length',
        truncation=True,
        return_tensors='tf'
    )
    probs = model({'input_ids': encodings['input_ids'], 'attention_mask': encodings['attention_mask']}).numpy()
    prediction = np.argmax(probs, axis=-1)[0]
    print(f"Probabilities: {probs}")
    return "Positive" if prediction == 1 else "Negative"

# Test inference
test_text = "This is an awesome movie!"
result = predict_sentiment(test_text, model, tokenizer)
print(f"\nText: {test_text}")
print(f"Predicted Sentiment: {result}")
Running the Code

Follow these steps to run the BERT model locally:

  1. Install Dependencies: Ensure you have TensorFlow, Transformers, and scikit-learn installed:
    pip install tensorflow transformers scikit-learn numpy
    
  2. Save the Code: Copy the complete code into a file named bert_sentiment.py.
  3. Run the Script: Execute the script using Python:
    python bert_sentiment.py
    
  4. Expected Output: The script will fine-tune BERT, display training progress, and output the sentiment for a test sentence.

🔑 Code Features

  • Pre-trained BERT: Leverages bert-base-uncased for robust embeddings.
  • Fine-tuning: Adapts BERT to sentiment analysis with minimal code.
  • Efficient Data Pipeline: Uses TensorFlow datasets with batching and prefetching.
  • Interactive Inference: Allows testing custom sentences for sentiment prediction.