Text classification with PyTorch
In this script, we deal with text classification tasks with PyTorch. Using the YelpReviewPolarity dataset, we train a classifier with an nn.EmbeddingBag hidden layer and a Linear layer predicting 'bad' and 'good' reviews. In terms of process, we carry out the following steps:
creating a stream of data reading
creating a vocabulary
pre-processing reviews
data loading
model creation
model training and evaluation
In the first step, we load the data from torchtext.datasets. The training dataset contains 560K reviews, while the test dataset has 38K reviews. For further information about the data, please see: https://huggingface.co/datasets/yelp_polarity
Python
Copy
>>> import time                    # used to log training time in later cells
>>> import torch
>>> from torch import nn           # used to define the model in later cells
>>> from torchtext.datasets import YelpReviewPolarity
When it comes to iterating over the data, we can wrap the dataset in Python's built-in iter(), which creates an iterator object, i.e., a stream of data reading.
Python
Copy
>>> train_iter = iter(YelpReviewPolarity(split='train'))
We can then use such a generator to iterate over the data but also to inspect individual records. For example:
Python
Copy
>>> next(train_iter)
Shell
Copy
(1, "Unfortunately, the frustration of being Dr. Goldberg's patient is a repeat of the experience I've had with so many other doctors in NYC -- good doctor, terrible staff. It seems that his staff simply never answers the phone. It usually takes 2 hours of repeated calling to get an answer. Who has time for that or wants to deal with it? I have run into this problem with many other doctors and I just don't get it. You have office workers, you have patients with medical needs, why isn't anyone answering the phone? It's incomprehensible and not work the aggravation. It's with regret that I feel that I have to give Dr. Goldberg 2 stars.")
​
In the second step, we create the vocabulary of the corpus. To do so, we rely on build_vocab_from_iterator from torchtext.vocab.
Python
Copy
>>> from torchtext.data.utils import get_tokenizer
>>> from torchtext.vocab import build_vocab_from_iterator
In the interest of efficiency, we build the vocabulary out of the tokenized text.
Python
Copy
>>> tokenizer = get_tokenizer('basic_english')
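As an illustrative check (the sentence is made up, not drawn from the corpus), the basic_english tokenizer lower-cases the text and splits punctuation into separate tokens:
Python
Copy
>>> tokenizer("The staff was great, the food less so.")
['the', 'staff', 'was', 'great', ',', 'the', 'food', 'less', 'so', '.']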
Such a tokenizer is included in a function that iterates over the tuples included in the corpus and tokenizes the second element of each tuple (namely, the review text).
Python
Copy
>>> def yield_tokens(data_iter):
...     for _, text in data_iter:
...         yield tokenizer(text)
Passing the tokenized documents to build_vocab_from_iterator produces a Vocab class object.
Python
Copy
>>> train_iter = YelpReviewPolarity(split='train')  # fresh stream: the iterator above was partly consumed
>>> vocab = build_vocab_from_iterator(
...     yield_tokens(train_iter),
...     specials=["<unk>"]
... )
>>> vocab.set_default_index(vocab["<unk>"])
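To get a feel for the resulting Vocab object, we can look tokens up. The integer ids shown in the comments are only indicative, as they depend on token frequencies in the corpus; what is guaranteed is that tokens absent from the vocabulary fall back to the default index of "<unk>":
Python
Copy
>>> vocab(['the', 'staff', 'was', 'friendly'])   # e.g., [2, 310, 22, 1604]; actual ids depend on the corpus
>>> vocab['sometokennotinthecorpus']             # unknown tokens fall back to the index of '<unk>', here 0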
The third step consists of pre-processing the reviews included in the corpus. Mainly, we create two simple pipelines: one to tokenize and index the text, and one to re-code the class of the review (1 = bad, 2 = good) as a zero-based integer.
Python
Copy
>>> text_pipeline = lambda x: vocab(tokenizer(x))
>>> label_pipeline = lambda x: int(x) - 1
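As a hedged illustration of the two pipelines (the ids in the comment are, again, indicative only): text_pipeline turns a raw review into a list of vocabulary ids, while label_pipeline re-codes the original labels 1 ('bad') and 2 ('good') as 0 and 1, the zero-based classes expected by the loss function used later on.
Python
Copy
>>> text_pipeline('terrible staff, great doctor')   # a list of vocabulary ids, e.g. [1203, 310, 4, 84, 980]
>>> label_pipeline(2)                               # returns 1, i.e. 'good'
>>> label_pipeline(1)                               # returns 0, i.e. 'bad'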
Actually, this third step can be merged with the fourth step, wherein data are loaded. The key idea is that the tokenization can be efficiently carried out as we load the data. To do so, it is necessary to create a function to pass to the collate_fn argument of the DataLoader class (imported in the next cell).
Python
Copy
>>> from torch.utils.data import DataLoader
The below-displayed function has two logical steps:
three containers are created and populated with the review labels, tokenized text, and length of the tokenized text
then, these containers are transformed into tensors (the lengths becoming the cumulative starting offset of each review), which are eventually allocated to device
Python
Copy
>>> def collate_batch(batch):
...     label_list, text_list, offsets = [], [], [0]  # step 1
...     for (_label, _text) in batch:
...         label_list.append(label_pipeline(_label))
...         processed_text = torch.tensor(
...             text_pipeline(_text), dtype=torch.int64
...         )
...         text_list.append(processed_text)
...         offsets.append(processed_text.size(0))
...     label_list = torch.tensor(label_list, dtype=torch.int64)  # step 2
...     offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)
...     text_list = torch.cat(text_list)
...     return label_list.to(device), text_list.to(device), offsets.to(device)
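To make the offsets logic concrete, suppose (purely hypothetically) a batch of two reviews whose tokenized lengths are 3 and 5. text_list then holds the 8 token ids back to back, and offsets records where each review starts in that flat tensor:
Python
Copy
>>> offsets = [0, 3, 5]                        # the initial [0] plus the two review lengths
>>> torch.tensor(offsets[:-1]).cumsum(dim=0)   # where each review starts in the concatenated tensor
tensor([0, 3])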
A torch.device object can be of type 'cuda' (if a GPU is available) or 'cpu'.
Python
Copy
>>> device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
A DataLoader object is created by passing a stream of data reading, the required batch size, and other optional arguments, including the above-defined collate_batch function.
Python
Copy
>>> dataloader = DataLoader( train_iter, batch_size=8, shuffle=False, collate_fn=collate_batch )
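As a sanity check, we can draw one batch from the loader and inspect it; with batch_size=8 we expect eight labels, eight offsets, and a one-dimensional tensor holding all the token ids of the batch (its length varies from batch to batch):
Python
Copy
>>> labels, texts, offsets = next(iter(dataloader))
>>> labels.shape     # torch.Size([8]): one label per review in the batch
>>> offsets.shape    # torch.Size([8]): the starting position of each review in texts
>>> texts.shape      # one-dimensional: all token ids of the batch concatenated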
In the fifth step, we create an ad hoc class, called TextClassificationModel, which contains:
a hidden, sparse layer (nn.EmbeddingBag) and a Linear layer
self.embedding takes as many inputs as the length of the dictionary, vocab_size; the output has to be fixed by setting a value for embed_dim
self.fc takes the number of inputs defined in embed_dim and it yields as many classes as num_class (in our case we have two labels)
weight-related choices
the connection between the two layers
self.fc takes the outcome of self.embedding as input
Python
Copy
>>> class TextClassificationModel(nn.Module):
...     def __init__(self, vocab_size, embed_dim, num_class):  # step 1
...         super(TextClassificationModel, self).__init__()
...         self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=True)
...         self.fc = nn.Linear(embed_dim, num_class)
...         self.init_weights()
...     def init_weights(self):  # step 2
...         initrange = 0.5
...         self.embedding.weight.data.uniform_(-initrange, initrange)
...         self.fc.weight.data.uniform_(-initrange, initrange)
...         self.fc.bias.data.zero_()
...     def forward(self, text, offsets):
...         embedded = self.embedding(text, offsets)
...         return self.fc(embedded)
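It may help to recall that, with its default mode ('mean'), nn.EmbeddingBag averages the embeddings of all tokens belonging to the same review, which is why no padding is needed. The following minimal sketch, with made-up sizes, checks that equivalence against an ordinary nn.Embedding:
Python
Copy
>>> bag = nn.EmbeddingBag(10, 4)               # 10 'words', 4-dimensional vectors, default mode='mean'
>>> emb = nn.Embedding(10, 4)
>>> emb.weight.data = bag.weight.data.clone()  # share the same weights for the comparison
>>> text = torch.tensor([1, 2, 3, 4, 5])
>>> offsets = torch.tensor([0, 3])             # two 'reviews': tokens [1, 2, 3] and [4, 5]
>>> torch.allclose(bag(text, offsets)[0], emb(torch.tensor([1, 2, 3])).mean(dim=0))
True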
Now, we use the newly created class TextClassificationModel. We start by setting up the parameters and initializing a new TextClassificationModel named model.
Python
Copy
>>> num_class = len(set([label for (label, text) in train_iter]))
>>> vocab_size = len(vocab)
>>> emsize = 64
>>> model = TextClassificationModel(vocab_size, emsize, num_class).to(device)
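Before training, a dummy forward pass (the token ids below are arbitrary) confirms that the model maps a review of any length to two scores, one per class:
Python
Copy
>>> dummy_text = torch.tensor([10, 25, 3], dtype=torch.int64).to(device)   # three arbitrary token ids
>>> dummy_offsets = torch.tensor([0]).to(device)                           # a single review starting at 0
>>> model(dummy_text, dummy_offsets).shape
torch.Size([1, 2])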
In the final step, we train and evaluate the model. In the following cell, we create two functions for training and evaluation respectively. The function train takes a dataloader instance as argument. At the core of the function, there are two elements:
optimizer is an instance of torch.optim.SGD, which implements stochastic gradient descent
torch.nn.utils.clip_grad_norm_, which clips gradient norm of an iterable of parameters (in our case, model.parameters).
The function evaluate assesses the results of the training. criterion, an instance of torch.nn.CrossEntropyLoss, computes the cross entropy loss between input and target (i.e., the predicted scores and the actual labels).
Python
Copy
>>> def train(dataloader):
...     model.train()
...     total_acc, total_count = 0, 0
...     log_interval = 500
...     start_time = time.time()
...     for idx, (label, text, offsets) in enumerate(dataloader):
...         optimizer.zero_grad()
...         predicted_label = model(text, offsets)
...         loss = criterion(predicted_label, label)
...         loss.backward()
...         torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)
...         optimizer.step()
...         total_acc += (predicted_label.argmax(1) == label).sum().item()
...         total_count += label.size(0)
...         if idx % log_interval == 0 and idx > 0:
...             elapsed = time.time() - start_time
...             print(
...                 '| epoch {:3d} | {:5d}/{:5d} batches '
...                 '| accuracy {:8.3f}'.format(epoch, idx, len(dataloader),
...                                             total_acc/total_count)
...             )
...             total_acc, total_count = 0, 0
...             start_time = time.time()
>>> def evaluate(dataloader):
...     model.eval()
...     total_acc, total_count = 0, 0
...     with torch.no_grad():
...         for idx, (label, text, offsets) in enumerate(dataloader):
...             predicted_label = model(text, offsets)
...             loss = criterion(predicted_label, label)
...             total_acc += (
...                 predicted_label.argmax(1) == label
...             ).sum().item()
...             total_count += label.size(0)
...     return total_acc/total_count
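Note that criterion, optimizer, and epoch are globals defined in the next cell. To illustrate what criterion computes, the sketch below feeds it a single made-up pair of raw scores (logits) and a true class index; the loss is small because the higher score already points at the correct class:
Python
Copy
>>> logits = torch.tensor([[2.0, -1.0]])    # made-up raw scores for one review: class 0 vs class 1
>>> target = torch.tensor([0])              # the true class index
>>> torch.nn.CrossEntropyLoss()(logits, target)
tensor(0.0486)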
The following code box deploys the previous two functions and prints the log to the screen.
Python
Copy
>>> from torch.utils.data.dataset import random_split
>>> from torchtext.data.functional import to_map_style_dataset
>>> EPOCHS = 10
>>> LR = 5
>>> BATCH_SIZE = 64
>>> criterion = torch.nn.CrossEntropyLoss()
>>> optimizer = torch.optim.SGD(model.parameters(), lr=LR)
>>> scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.1)
>>> total_accu = None
>>> train_iter, test_iter = YelpReviewPolarity()
>>> train_dataset = to_map_style_dataset(train_iter)
>>> test_dataset = to_map_style_dataset(test_iter)
>>> num_train = int(len(train_dataset) * 0.95)
>>> split_train_, split_valid_ = \
...     random_split(
...         train_dataset,
...         [num_train, len(train_dataset) - num_train]
...     )
>>> train_dataloader = DataLoader(
...     split_train_, batch_size=BATCH_SIZE,
...     shuffle=True, collate_fn=collate_batch
... )
>>> valid_dataloader = DataLoader(
...     split_valid_, batch_size=BATCH_SIZE,
...     shuffle=True, collate_fn=collate_batch
... )
>>> test_dataloader = DataLoader(
...     test_dataset, batch_size=BATCH_SIZE,
...     shuffle=True, collate_fn=collate_batch
... )
>>> for epoch in range(1, EPOCHS + 1):
...     epoch_start_time = time.time()
...     train(train_dataloader)
...     accu_val = evaluate(valid_dataloader)
...     if total_accu is not None and total_accu > accu_val:
...         scheduler.step()
...     else:
...         total_accu = accu_val
...     print('-' * 59)
...     print('| end of epoch {:3d} | time: {:5.2f}s | '
...           'valid accuracy {:8.3f} '.format(epoch,
...                                            time.time() - epoch_start_time,
...                                            accu_val))
...     print('-' * 59)
Finally, we can use the model to classify unseen documents.
Python
Copy
>>> print('Checking the results of test dataset.')
>>> accu_test = evaluate(test_dataloader)
>>> print('test accuracy {:8.3f}'.format(accu_test))
>>> review_label = {1: "BAD", 2: "GOOD"}
>>> def predict(text, text_pipeline):
...     with torch.no_grad():
...         text = torch.tensor(text_pipeline(text))
...         output = model(text, torch.tensor([0]))
...         return output.argmax(1).item() + 1
>>> ex_text_str = "."
>>> model = model.to("cpu")
>>> print(
...     "This is a %s Yelp review" % review_label[
...         predict(ex_text_str, text_pipeline)
...     ]
... )
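Since ex_text_str above is only a placeholder, here is a purely illustrative call on an invented review (the predicted class, of course, depends on the trained weights):
Python
Copy
>>> sample_review = "Great food and a friendly, fast staff. I would absolutely come back."
>>> print(
...     "This is a %s Yelp review" % review_label[
...         predict(sample_review, text_pipeline)
...     ]
... )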
This snippet comes from the Python script pytorch.py, hosted in the GitHub repo simoneSantoni/NLP-orgs-markets.