In this script, we deal with a text classification task in PyTorch. Using the YelpReviewPolarity dataset, we train a classifier with an EmbeddingBag hidden layer and a Linear layer that predicts "bad" and "good" reviews. In terms of process, we carry out the following steps:
creating a stream for reading the data
creating a vocabulary
pre-processing reviews
data loading
model creation
model training and evaluation
In the first step, we load the data from torchtext.datasets. The training dataset contains 560K reviews, while the test dataset has 38K reviews. For further information about the data, please see: https://huggingface.co/datasets/yelp_polarity
Python
Copy
>>> from torchtext.datasets import YelpReviewPolarity
When it comes to iterating over the data, we can use Python's built-in iter(), which creates an iterator object, i.e., a stream for reading the data.
Python
Copy
>>> train_iter = iter(YelpReviewPolarity(split='train'))
We can then use this iterator to loop over the data, but also to inspect individual records. For example:
Python
Copy
>>> next(train_iter)
Shell
Copy
(1,
"Unfortunately, the frustration of being Dr. Goldberg's patient is a
repeat of the experience I've had with so many other doctors in NYC --
good doctor, terrible staff. It seems that his staff simply never answers
the phone. It usually takes 2 hours of repeated calling to get an answer.
Who has time for that or wants to deal with it? I have run into this
problem with many other doctors and I just don't get it. You have office
workers, you have patients with medical needs, why isn't anyone answering
the phone? It's incomprehensible and not work the aggravation. It's with
regret that I feel that I have to give Dr. Goldberg 2 stars.")
In the second step, we create the vocabulary of the corpus. To do so, we rely on build_vocab_from_iterator from torchtext.vocab.
Python
Copy
>>> from torchtext.data.utils import get_tokenizer
>>> from torchtext.vocab import build_vocab_from_iterator
In the interest of efficiency, we build the vocabulary out of the tokenized text.
Python
Copy
>>> tokenizer = get_tokenizer('basic_english')
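As a quick illustration, the basic_english tokenizer lowercases the input and separates punctuation into its own tokens (the output below is indicative):
Python
Copy
>>> tokenizer("Good doctor, terrible staff!")
Shell
Copy
['good', 'doctor', ',', 'terrible', 'staff', '!']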
We then wrap the tokenizer in a function that iterates over the tuples in the corpus and tokenizes the second element of each tuple (namely, the review text).
Python
Copy
>>> def yield_tokens(data_iter):
for _, text in data_iter:
yield tokenizer(text)
Passing the tokenized documents to build_vocab_from_iterator produces a Vocab class object.
Python
Copy
>>> train_iter = YelpReviewPolarity(split='train')  # re-create the stream: the call to next() above consumed its first record
>>> vocab = build_vocab_from_iterator(
        yield_tokens(train_iter), specials=["<unk>"]
    )
>>> vocab.set_default_index(vocab["<unk>"])
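As a hedged sanity check of the resulting Vocab object (the integer indices depend on the corpus, so the values in the comments are only placeholders): it maps lists of tokens to lists of indices, and unseen tokens fall back to the default index of "<unk>".
Python
Copy
>>> len(vocab)                 # size of the vocabulary
>>> vocab(['good', 'doctor'])  # a list of integer indices, e.g. [61, 775]
>>> vocab(['xyzzy'])           # an unseen token maps to vocab["<unk>"], i.e. 0, since specials come first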
The third step consists of pre-processing the reviews included in the corpus. Mainly, we create two simple pipelines: one tokenizes the text, the other encodes the class of the review (bad or good).
Python
Copy
>>> text_pipeline = lambda x: vocab(tokenizer(x))
>>> label_pipeline = lambda x: int(x) - 1
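For illustration, the two pipelines can be applied to a toy record (the review text is made up; the resulting indices depend on the vocabulary):
Python
Copy
>>> text_pipeline("good doctor , terrible staff")  # a list of token indices
>>> label_pipeline('1')                            # 0, i.e. the "bad" class
>>> label_pipeline('2')                            # 1, i.e. the "good" class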
Actually, this third step can be merged with the fourth step, wherein data are loaded. The key idea is that the tokenization can be efficiently carried out as we load the data. To do so, it is necessary to create a function to pass to the collate_fn argument of the DataLoader class (imported below).
Python
Copy
>>> import torch
>>> from torch.utils.data import DataLoader
The function displayed below has two logical steps:
three containers are created and populated with the review labels, the tokenized texts, and the lengths of the tokenized texts
the containers are converted into tensors: the labels become an integer tensor, the lengths are cumulated into offsets that mark where each review starts in the concatenated text tensor, and everything is moved to the target device
Python
Copy
>>> def collate_batch(batch):
    label_list, text_list, offsets = [], [], [0]  # step 1
for (_label, _text) in batch:
label_list.append(label_pipeline(_label))
processed_text = torch.tensor(
text_pipeline(_text), dtype=torch.int64
)
text_list.append(processed_text)
offsets.append(processed_text.size(0))
label_list = torch.tensor(label_list, dtype=torch.int64) # step 2
offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
text_list = torch.cat(text_list)
return label_list.to(device), text_list.to(device), offsets.to(device)
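The shifting and cumulating of offsets can be puzzling at first, so here is a toy illustration (the review lengths are made up): with three reviews of 4, 2, and 5 tokens, the list [0, 4, 2, 5] loses its last element and is cumulated, yielding the starting position of each review in the flat text tensor.
Python
Copy
>>> lengths = [0, 4, 2, 5]                    # leading 0 plus the lengths of three hypothetical reviews
>>> torch.tensor(lengths[:-1]).cumsum(dim=0)  # tensor([0, 4, 6]): where each review starts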
A torch.device object can be of type "cuda" (if a GPU is available) or "cpu".
Python
Copy
>>> device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
A DataLoader object is created by passing a stream of data, the required batch size, and other optional arguments, including the above-defined collate_batch function.
Python
Copy
>>> train_iter = YelpReviewPolarity(split='train')  # fresh stream: the previous iterator has been exhausted
>>> dataloader = DataLoader(
        train_iter, batch_size=8, shuffle=False, collate_fn=collate_batch
    )
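As a quick, hedged sanity check (the exact contents depend on the reviews drawn in the batch), we can pull one batch and verify that the labels and offsets have one entry per review, while the text is a single, flat, concatenated tensor:
Python
Copy
>>> label, text, offsets = next(iter(dataloader))
>>> label.shape    # torch.Size([8]): one label per review
>>> offsets.shape  # torch.Size([8]): one starting position per review
>>> text.dim()     # 1: all tokenized reviews concatenated end to end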
In the fifth step, we create an ad hoc class, called TextClassificationModel, which contains:
a hidden, sparse layer (nn.EmbeddingBag) and a Linear layer
self.embedding takes as many inputs as the length of the dictionary, vocab_size; the output has to be fixed by setting a value for embed_dim
self.fc takes the number of inputs defined in embed_dim and it yields as many classes as num_class (in our case we have two labels)
the initialization of the layer weights
the connection between the two layers
self.fc takes the outcome of self.embedding as input
Python
Copy
>>> from torch import nn  # neural-network building blocks (Module, EmbeddingBag, Linear)
>>> class TextClassificationModel(nn.Module):
def __init__(self, vocab_size, embed_dim, num_class): # step 1
super(TextClassificationModel, self).__init__()
self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True)
self.fc = nn.Linear(embed_dim, num_class)
self.init_weights()
def init_weights(self): # step 2
initrange = 0.5
self.embedding.weight.data.uniform_(-initrange, initrange)
self.fc.weight.data.uniform_(-initrange, initrange)
self.fc.bias.data.zero_()
def forward(self, text, offsets):
embedded = self.embedding(text, offsets)
return self.fc(embedded)
Now, we use the newly created class TextClassificationModel. We start by setting up the parameters and initializing a new TextClassificationModel named model.
Python
Copy
>>> train_iter = YelpReviewPolarity(split='train')  # fresh stream, since earlier iterators have been consumed
>>> num_class = len(set([label for (label, text) in train_iter]))
>>> vocab_size = len(vocab)
>>> emsize = 64
>>> model = TextClassificationModel(vocab_size, emsize, num_class).to(device)
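As a hedged sanity check (the token indices below are arbitrary placeholders), we can push a dummy flat batch of two "reviews" through the freshly initialized model and confirm that the output has one row per review and one column per class:
Python
Copy
>>> dummy_text = torch.tensor([1, 2, 3, 4, 5], dtype=torch.int64).to(device)  # five arbitrary token indices
>>> dummy_offsets = torch.tensor([0, 3]).to(device)   # review 1 spans tokens 0-2, review 2 spans tokens 3-4
>>> model(dummy_text, dummy_offsets).shape             # torch.Size([2, 2]): two reviews, two classes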
In the final step, we train and evaluate the model. In the following cell, we create two functions for training and evaluation, respectively. The function train takes a dataloader instance as argument. At the core of the function, there are two elements:
optimizer is an instance of torch.optim.SGD, which implements stochastic gradient descent
torch.nn.utils.clip_grad_norm_, which clips the gradient norm of an iterable of parameters (in our case, model.parameters()).
The function evaluate assesses the results of the training. criterion, an instance of torch.nn.CrossEntropyLoss, computes the cross-entropy loss between input and target (i.e., the predicted logits and the actual labels).
Python
Copy
>>> import time  # used for timing the training log
>>> def train(dataloader):
model.train()
total_acc, total_count = 0, 0
log_interval = 500
start_time = time.time()
for idx, (label, text, offsets) in enumerate(dataloader):
optimizer.zero_grad()
predicted_label = model(text, offsets)
loss = criterion(predicted_label, label)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)
optimizer.step()
total_acc += (predicted_label.argmax(1) == label).sum().item()
total_count += label.size(0)
if idx % log_interval == 0 and idx > 0:
elapsed = time.time() - start_time
print(
'| epoch {:3d} | {:5d}/{:5d} batches '
'| accuracy {:8.3f}'.format(epoch, idx, len(dataloader),
total_acc/total_count)
)
total_acc, total_count = 0, 0
start_time = time.time()
>>> def evaluate(dataloader):
model.eval()
total_acc, total_count = 0, 0
with torch.no_grad():
for idx, (label, text, offsets) in enumerate(dataloader):
predicted_label = model(text, offsets)
loss = criterion(predicted_label, label)
total_acc += (
predicted_label.argmax(1) == label
).sum().item()
total_count += label.size(0)
return total_acc/total_count
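For context on criterion, which both functions rely on: torch.nn.CrossEntropyLoss expects the raw logits returned by the model and the integer class labels. A toy call with made-up numbers looks as follows:
Python
Copy
>>> toy_criterion = torch.nn.CrossEntropyLoss()
>>> toy_logits = torch.tensor([[2.0, -1.0], [0.5, 1.5]])  # raw scores for two reviews and two classes
>>> toy_labels = torch.tensor([0, 1])                     # the true classes
>>> toy_criterion(toy_logits, toy_labels)                 # a scalar loss tensor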
The following code box deploys the previous two functions and prints the log to the screen.
Python
Copy
>>> from torch.utils.data.dataset import random_split
>>> from torchtext.data.functional import to_map_style_dataset
>>> EPOCHS = 10
>>> LR = 5
>>> BATCH_SIZE = 64
>>> criterion = torch.nn.CrossEntropyLoss()
>>> optimizer = torch.optim.SGD(model.parameters(), lr=LR)
>>> scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.1)
>>> total_accu = None
>>> train_iter, test_iter = YelpReviewPolarity()
>>> train_dataset = to_map_style_dataset(train_iter)
>>> test_dataset = to_map_style_dataset(test_iter)
>>> num_train = int(len(train_dataset) * 0.95)
>>> split_train_, split_valid_ = \
random_split(
train_dataset, [num_train, len(train_dataset) - num_train]
)
>>> train_dataloader = DataLoader(
split_train_,
batch_size=BATCH_SIZE,
shuffle=True,
collate_fn=collate_batch
)
>>> valid_dataloader = DataLoader(
split_valid_,
batch_size=BATCH_SIZE,
shuffle=True,
collate_fn=collate_batch
)
>>> test_dataloader = DataLoader(
test_dataset,
batch_size=BATCH_SIZE,
shuffle=True,
collate_fn=collate_batch
)
>>> for epoch in range(1, EPOCHS + 1):
epoch_start_time = time.time()
train(train_dataloader)
accu_val = evaluate(valid_dataloader)
if total_accu is not None and total_accu > accu_val:
scheduler.step()
else:
total_accu = accu_val
print('-' * 59)
print('| end of epoch {:3d} | time: {:5.2f}s | '
'valid accuracy {:8.3f} '.format(epoch,
time.time() - epoch_start_time,
accu_val))
print('-' * 59)
Finally, we can use the model to classify unseen documents.
Python
Copy
>>> print('Checking the results of test dataset.')
>>> accu_test = evaluate(test_dataloader)
>>> print('test accuracy {:8.3f}'.format(accu_test))
>>> review_label = {1: "BAD", 2: "GOOD"}
>>> def predict(text, text_pipeline):
with torch.no_grad():
text = torch.tensor(text_pipeline(text))
output = model(text, torch.tensor([0]))
return output.argmax(1).item() + 1
>>> ex_text_str = "."
>>> model = model.to("cpu")
>>> print(
"This is a %s Yelp reviews" %review_label[
predict(ex_text_str, text_pipeline)
]
)
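For a more telling check, the same predict function can be fed a hypothetical review string (the text below is made up for illustration):
Python
Copy
>>> sample_review = "Great food, friendly staff, and the wait was short."  # hypothetical review
>>> print(
        "This is a %s Yelp review" % review_label[
            predict(sample_review, text_pipeline)
        ]
    )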
This snippet comes from the Python script pytorch.py, hosted in the GitHub repo simoneSantoni/NLP-orgs-markets.