Deep learning models, being differentiable functions, can only process numeric tensors: they can’t take raw text as input. Vectorizing text is the process of transforming text into numeric tensors. Text vectorization processes come in many shapes and
forms, but they all follow the same template:
- First, you _standardize_ the text to make it easier to process, such as by converting it to lowercase or removing punctuation.
- Then, you split the text into units (called _tokens_), such as characters, words, or groups of words. This is called _tokenization_.
- Finally, you convert each such token into a numerical vector. This will usually involve first _indexing_ all tokens present in the data.
In practice, we will work with the Keras `TextVectorization` layer, which is fast and efficient and can be dropped directly into a `tf.data` pipeline or a Keras model.
This is what the `TextVectorization` layer looks like:
%% Cell type:code id: tags:
``` python
from tensorflow.keras.layers import TextVectorization

text_vectorization = TextVectorization(
    # Configures the layer to return sequences of words encoded
    # as integer indices. There are several other output modes
    # available, which you will see in action in a bit.
    output_mode="int",
)
```
%% Cell type:markdown id: tags:
By default, the `TextVectorization` layer will use the setting “convert to lowercase and remove punctuation” for text standardization, and “split on whitespace” for tokenization.
But importantly, you can provide custom functions for standardization and tokenization, which means the layer is flexible enough to handle any use case. Note that such custom functions should operate on `tf.string` tensors, not regular Python strings! For instance, the default layer behavior is equivalent to the following:
%% Cell type:code id: tags:
``` python
import re
import string
import tensorflow as tf

def custom_standardization_fn(string_tensor):
    # Convert strings to lowercase.
    lowercase_string = tf.strings.lower(string_tensor)
    # Replace punctuation characters with the empty string.
    return tf.strings.regex_replace(
        lowercase_string, f"[{re.escape(string.punctuation)}]", "")

def custom_split_fn(string_tensor):
    # Split strings on whitespace.
    return tf.strings.split(string_tensor)

text_vectorization = TextVectorization(
    output_mode="int",
    standardize=custom_standardization_fn,
    split=custom_split_fn,
)
```
%% Cell type:markdown id: tags:
To index the vocabulary of a text corpus, call the `adapt()` method of the layer with a `Dataset` object that yields strings, or simply with a list of Python strings:
%% Cell type:code id: tags:
``` python
dataset = [
    "I write, erase, rewrite",
    "Erase again, and then",
    "A poppy blooms.",
]
text_vectorization.adapt(dataset)
```
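%% Cell type:markdown id: tags:
As mentioned earlier, the adapted layer can be dropped directly into a `tf.data` pipeline. The following is a minimal sketch (the variable names are illustrative, and `tf` is assumed to be imported as above):
%% Cell type:code id: tags:
``` python
# Map the adapted layer over a tf.data pipeline of raw strings.
string_dataset = tf.data.Dataset.from_tensor_slices(dataset)
int_sequence_dataset = string_dataset.map(text_vectorization)
for vectorized_text in int_sequence_dataset:
    print(vectorized_text)
```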
%% Cell type:markdown id: tags:
Note that you can retrieve the computed vocabulary via
`get_vocabulary()` — this can be useful if you need to convert text encoded as integer sequences back into words.
The first two entries in the vocabulary are the mask token (index 0) and the OOV (“out of vocabulary”) token (index 1). Entries in the vocabulary list are sorted by frequency, so with a real-world dataset, very common words like “the” or “a” would come first.
%% Cell type:code id: tags:
``` python
text_vectorization.get_vocabulary()
```
%%%% Output: execute_result
['',
'[UNK]',
'erase',
'write',
'then',
'rewrite',
'poppy',
'i',
'blooms',
'and',
'again',
'a']
%% Cell type:markdown id: tags:
For a demonstration, let’s try to encode and then decode an example sentence:
%% Cell type:code id: tags:
``` python
vocabulary = text_vectorization.get_vocabulary()
test_sentence = "I write, rewrite, and still rewrite again"
encoded_sentence = text_vectorization(test_sentence)
decoded_sentence = " ".join(vocabulary[int(i)] for i in encoded_sentence)
```
%% Cell type:markdown id: tags:
Next, let's move on to a real dataset: the IMDB movie-review dataset. After downloading and unpacking the archive, you're left with a directory named aclImdb, with the following structure:
%% Cell type:raw id: tags:
aclImdb/
...train/
......pos/
......neg/
...test/
......pos/
......neg/
%% Cell type:markdown id: tags:
For instance, the train/pos/ directory contains a set of 12'500 text files, each of which contains the text body of a positive-sentiment movie review to be used as training data.
The negative-sentiment reviews live in the “neg” directories. In total, there are 25'000 text files for training and another 25'000 for testing.
There’s also a train/unsup subdirectory in there, which we don’t need. Let’s delete it:
%% Cell type:code id: tags:
``` python
!rm -r aclImdb/train/unsup
```
%% Cell type:markdown id: tags:
Take a look at the content of a few of these text files. Whether you’re working with text data or image data, remember to always inspect what your data looks like before you dive into modeling it. It will ground your intuition about what your model is actually doing:
%% Cell type:code id: tags:
``` python
!cat aclImdb/train/pos/4077_10.txt
```
%% Cell type:markdown id: tags:
Next, let’s prepare a validation set by setting apart 20% of the training text files in a new directory, aclImdb/val:
%% Cell type:code id: tags:
``` python
import os, pathlib, shutil, random
base_dir = pathlib.Path("aclImdb")
val_dir = base_dir / "val"
train_dir = base_dir / "train"
for category in ("neg", "pos"):
    os.makedirs(val_dir / category)
    files = os.listdir(train_dir / category)
    # Shuffle with a fixed seed so the split is reproducible across runs.
    random.Random(1337).shuffle(files)
    # Move the last 20% of the shuffled files into the validation directory.
    num_val_samples = int(0.2 * len(files))
    val_files = files[-num_val_samples:]
    for fname in val_files:
        shutil.move(train_dir / category / fname,
                    val_dir / category / fname)
```
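%% Cell type:markdown id: tags:
As a quick sanity check (this cell is a sketch and not part of the original preparation code), we can count how many review files ended up in each split. It reuses the `base_dir` path defined above:
%% Cell type:code id: tags:
``` python
# Count the number of review files per split and category.
for split in ("train", "val", "test"):
    for category in ("neg", "pos"):
        split_dir = base_dir / split / category
        print(split, category, len(list(split_dir.iterdir())))
```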
%% Cell type:markdown id: tags:
# Part II: Language Models and Recurrent Neural Networks
%% Cell type:markdown id: tags:
Many classical texts are no longer protected under copyright.
This means that you can download the full text of these books for free and use them in experiments, such as creating generative models. Perhaps the best place to get access to free books that are no longer protected by copyright is [Project Gutenberg](https://www.gutenberg.org/).
In this tutorial we are going to use [Goethe's Faust I](http://www.gutenberg.org/files/21000/21000-8.txt).
%% Cell type:code id: tags:
``` python
from __future__ import print_function
from tensorflow.keras.callbacks import LambdaCallback
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM
from tensorflow.keras.models import load_model
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.utils import get_file
from tensorflow.keras.layers import Bidirectional
from tensorflow.keras.layers import Input, Embedding, Dropout, Activation
import numpy as np
import random
import sys
import io
import string
# If you prefer Nietzsche in English, you can use that text instead.
```
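%% Cell type:markdown id: tags:
The cell that downloads and loads the raw text is not shown above. Below is a minimal sketch of what it could look like, using `get_file` with the Gutenberg URL mentioned earlier; the file name `faust.txt`, the Latin-1 decoding, and the variable name `doc` are assumptions made for illustration:
%% Cell type:code id: tags:
``` python
# Sketch: fetch Faust I from Project Gutenberg and read it into memory.
# The "-8" file is an 8-bit encoded text, so we assume Latin-1 here.
path = get_file("faust.txt",
                origin="http://www.gutenberg.org/files/21000/21000-8.txt")
with io.open(path, encoding="latin-1") as f:
    doc = f.read()
print("Corpus length: %d characters" % len(doc))
print(doc[:250])
```
%% Cell type:markdown id: tags: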
We need to transform the raw text into a sequence of tokens (words) that we can use as a source
to train the model.
Based on reviewing the raw text (above), these are the specific cleaning operations we will perform (you may want to explore additional cleaning operations yourself as an extension):
- Replace '--' with a white space so we can split words better.
- Split words based on white space.
- Remove all punctuation from words to reduce the vocabulary size (e.g., 'What?' becomes 'What').
- Remove all words that are not alphabetic to remove standalone punctuation tokens.
- Normalize all words to lowercase to reduce the vocabulary size.
Vocabulary size is a big deal with language modeling. A smaller vocabulary results in a smaller model that trains faster.
We can implement each of these cleaning operations in this order in a function. Below is the function `clean_doc()` that takes a loaded document as an argument and returns an array of clean tokens.
%% Cell type:code id: tags:
``` python
# turn a doc into clean tokens
def clean_doc(doc):
# replace '--' with a space ' '
doc = doc.replace('--', ' ')
# split into tokens by white space
tokens = doc.split()
# remove punctuation from each token
table = str.maketrans('', '', string.punctuation)
tokens = [w.translate(table) for w in tokens]
# remove remaining tokens that are not alphabetic
tokens = [word for word in tokens if word.isalpha()]
# make lower case
tokens = [word.lower() for word in tokens]
return tokens
```
%% Cell type:markdown id: tags:
We can run this cleaning operation on our loaded document and print out some of the tokens and statistics as a sanity check. After this, `tokens` is a long list containing the whole corpus, word by word.
We also get some statistics about the clean document.
We can see that there are approximately 33'600 words in the clean text and a vocabulary of just under 7'005 unique words. This is a fairly small corpus, and models fit on this data should be manageable on modest hardware.
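The cell that produces these numbers is not included above; a minimal sketch, assuming the raw text was loaded into `doc` as in the earlier loading sketch:
%% Cell type:code id: tags:
``` python
# Clean the loaded document and inspect the result.
tokens = clean_doc(doc)
print(tokens[:20])
print('Total Tokens: %d' % len(tokens))
print('Unique Tokens: %d' % len(set(tokens)))
```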
%% Cell type:markdown id: tags:
### Save Clean Text
We can organize the long list of tokens into sequences of 50 input words and 1 output word.
That is, sequences of 51 words.
We can do this by iterating over the list of tokens from token 51 onwards and taking the prior 50 tokens as a sequence, then repeating this process to the end of the list of tokens.
We will transform the tokens into space-separated strings for later storage in a file.
The code to split the list of clean tokens into sequences with a length of 51 tokens is listed below.
%% Cell type:code id: tags:
``` python
# organize into sequences of tokens
length = 50 + 1
sequences = list()
for i in range(length, len(tokens)):
# select sequence of tokens
seq = tokens[i-length:i]
# convert into a line
line = ' '.join(seq)
# store
sequences.append(line)
print('Total Sequences: %d' % len(sequences))
print(sequences[0:3])
```
%%%% Output: stream
Total Sequences: 33549
['ihr naht euch wieder schwankende gestalten die früh sich einst dem trüben blick gezeigt versuch ich wohl euch diesmal fest zu halten fühl ich mein herz noch jenem wahn geneigt ihr drängt euch zu nun gut so mögt ihr walten wie ihr aus dunst und nebel um mich steigt mein busen', 'naht euch wieder schwankende gestalten die früh sich einst dem trüben blick gezeigt versuch ich wohl euch diesmal fest zu halten fühl ich mein herz noch jenem wahn geneigt ihr drängt euch zu nun gut so mögt ihr walten wie ihr aus dunst und nebel um mich steigt mein busen fühlt', 'euch wieder schwankende gestalten die früh sich einst dem trüben blick gezeigt versuch ich wohl euch diesmal fest zu halten fühl ich mein herz noch jenem wahn geneigt ihr drängt euch zu nun gut so mögt ihr walten wie ihr aus dunst und nebel um mich steigt mein busen fühlt sich']
%% Cell type:markdown id: tags:
Running this piece creates a long list of lines. Printing statistics
on the list, we can see that we will have exactly 33'549 training
patterns to fit our model.
%% Cell type:markdown id: tags:
Next, we can save the sequences to a new file for later loading.
We can define a new function for saving lines of text to a file.
This new function is called `save_doc()` and is listed below. It
takes as input a list of lines and a filename. The lines are written
to the file, one per line, as plain text.
%% Cell type:code id: tags:
``` python
# save tokens to file, one sequence per line
def save_doc(lines, filename):
    data = '\n'.join(lines)
    with open(filename, 'w') as file:
        file.write(data)
```
%% Cell type:markdown id: tags:
We can call this function and save our training sequences to the file `goethe_sequences.txt`.
%% Cell type:code id: tags:
``` python
# save sequences to file
out_filename = 'goethe_sequences.txt'
save_doc(sequences, out_filename)
```
%% Cell type:markdown id: tags:
Take a look at the file with your text editor.
You will see that each line is shifted along one word, with a new word at the end to be predicted.
%% Cell type:markdown id: tags:
### Train Language Model
We can now train a statistical language model from the prepared data.
The model we will train is a neural language model. It has a few unique characteristics:
- It uses a distributed representation for words so that different words with similar meanings
will have a similar representation.
- It learns the representation at the same time as learning the model.
- It learns to predict the probability for the next word using the context of the last 50 words.
Specifically, we will use an Embedding Layer to learn the representation of words, and a Long Short-Term Memory (LSTM) recurrent neural network to learn to predict words based on their context.
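The model itself is built after the data has been loaded and encoded. As a rough sketch of the kind of architecture described above (the vocabulary size, embedding dimension, and LSTM width below are placeholder values for illustration, not the settings used for the actual model):
%% Cell type:code id: tags:
``` python
# Sketch of the architecture: Embedding -> LSTM -> softmax over the vocabulary.
vocab_size = 7000   # placeholder; in practice this comes from the Tokenizer
seq_length = 50     # 50 context words per input sequence

sketch_model = Sequential()
sketch_model.add(Input(shape=(seq_length,)))
# Distributed word representation, learned jointly with the rest of the model.
sketch_model.add(Embedding(vocab_size, 50))
# LSTM layer that reads the 50-word context.
sketch_model.add(LSTM(100))
# Probability distribution over the vocabulary for the next word.
sketch_model.add(Dense(vocab_size, activation='softmax'))
sketch_model.compile(loss='categorical_crossentropy', optimizer='adam')
sketch_model.summary()
```
%% Cell type:markdown id: tags: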
Let’s start by loading our training data.
%% Cell type:markdown id: tags:
### Load Sequences
We can load our training data using the `load_doc()` function we developed in the previous section.
Once loaded, we can split the data into separate training sequences by splitting based on new lines.
The snippet below will load `goethe_sequences.txt` from the current working directory.
%% Cell type:code id: tags:
``` python
# load doc into memory
def load_doc(filename):
# open the file as read only
file = open(filename, 'r')
# read all text
text = file.read()
# close the file
file.close()
return text
# load
in_filename = 'goethe_sequences.txt'
doc = load_doc(in_filename)
lines = doc.split('\n')
```
%% Cell type:markdown id: tags:
Next, we can encode the training data.
%% Cell type:markdown id: tags:
### Encode Sequences
The word embedding layer expects input sequences to be comprised of integers.
We can map each word in our vocabulary to a unique integer and encode our input
sequences. Later, when we make predictions, we can convert the predicted integers
back into words by looking them up in the same mapping.
To do this encoding, we will use the `Tokenizer` class from the Keras API.
First, the Tokenizer must be fit on the entire training dataset, which means it
finds all of the unique words in the data and assigns each a unique integer.
We can then use the fitted Tokenizer to encode all of the training sequences, converting
each sequence from a list of words to a list of integers.
%% Cell type:code id: tags:
``` python
from tensorflow.keras.preprocessing.text import Tokenizer