Commit 3d6d8743 authored by Mirko Birbaumer

Cells executed

parent a1c48493
%% Cell type:markdown id: tags:
 
# Part I : A Machine Translation Example with RNN Sequence-to-Sequence Models
 
We’ll demonstrate sequence-to-sequence modeling on a machine translation task.
We’ll start with a recurrent sequence model, and we’ll follow up with the full Transformer architecture in Part III.
 
We’ll be working with an English-to-Spanish translation dataset available at
www.manythings.org/anki/. Let’s download it:
 
%% Cell type:code id: tags:
 
``` python
!wget http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip
!unzip -q spa-eng.zip
```
 
%%%% Output: stream
 
--2022-10-26 13:50:24-- http://storage.googleapis.com/download.tensorflow.org/data/spa-eng.zip
Resolving storage.googleapis.com (storage.googleapis.com)... 142.250.203.112, 172.217.168.16, 172.217.168.80, ...
Connecting to storage.googleapis.com (storage.googleapis.com)|142.250.203.112|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2638744 (2.5M) [application/zip]
Saving to: ‘spa-eng.zip’
spa-eng.zip 100%[===================>] 2.52M --.-KB/s in 0.1s
2022-10-26 13:50:24 (25.4 MB/s) - ‘spa-eng.zip’ saved [2638744/2638744]
 
%% Cell type:markdown id: tags:
 
The text file contains one example per line: an English sentence, followed by a tab
character, followed by the corresponding Spanish sentence. Let’s parse this file.
 
%% Cell type:code id: tags:
 
``` python
text_file = "spa-eng/spa.txt"
with open(text_file, encoding="utf-8") as f:
    lines = f.read().split("\n")[:-1]
text_pairs = []
# Iterate over the lines in the file.
for line in lines:
    # Each line contains an English phrase and its
    # Spanish translation, tab-separated.
    english, spanish = line.split("\t")
    # We prepend "[start]" and append "[end]" to the Spanish sentence.
    spanish = "[start] " + spanish + " [end]"
    text_pairs.append((english, spanish))
```
 
%% Cell type:markdown id: tags:
 
Our `text_pairs` look like this:
 
%% Cell type:code id: tags:
 
``` python
import random
print(random.choice(text_pairs))
```
 
%%%% Output: stream
("The time is always right to do what's right.", '[start] Siempre es el momento adecuado para hacer lo que es adecuado. [end]')
%% Cell type:markdown id: tags:
 
Let’s shuffle them and split them into the usual training, validation, and test sets:
 
%% Cell type:code id: tags:
 
``` python
import random
random.shuffle(text_pairs)
num_val_samples = int(0.15 * len(text_pairs))
num_train_samples = len(text_pairs) - 2 * num_val_samples
train_pairs = text_pairs[:num_train_samples]
val_pairs = text_pairs[num_train_samples:num_train_samples + num_val_samples]
test_pairs = text_pairs[num_train_samples + num_val_samples:]
```
 
%% Cell type:markdown id: tags:
 
Next, let’s prepare two separate `TextVectorization` layers: one for English and one
for Spanish. We’re going to need to customize the way strings are preprocessed:
 
- We need to preserve the "[start]" and "[end]" tokens that we’ve inserted. By default, the characters [ and ] would be stripped, but we want to keep them around so we can tell apart the word “start” and the start token "[start]".
- Punctuation is different from language to language! In the Spanish `TextVectorization` layer, if we’re going to strip punctuation characters, we need to
also strip the character ¿.
 
%% Cell type:markdown id: tags:
 
Note that for a non-toy translation model, we would treat punctuation characters as separate
tokens rather than stripping them, since we would want to be able to generate correctly
punctuated sentences. In our case, for simplicity, we’ll get rid of all punctuation.
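
As a rough illustration of the alternative mentioned above, punctuation can be turned into separate tokens by surrounding each mark with spaces before splitting on whitespace. This is only a sketch: `split_punctuation` and the chosen set of punctuation marks are assumptions for illustration, not part of the notebook's pipeline.

```python
import re

# Hypothetical helper (not part of the original notebook): put spaces around
# punctuation so that each mark becomes its own whitespace-separated token.
PUNCT_PATTERN = re.compile(r"([.,!?;:¿¡])")

def split_punctuation(sentence):
    spaced = PUNCT_PATTERN.sub(r" \1 ", sentence.lower())
    # Collapse the runs of whitespace introduced by the substitution.
    return " ".join(spaced.split())

print(split_punctuation("¿Dónde está la biblioteca?"))
```

Square brackets are not in the pattern, so markers such as "[start]" and "[end]" would pass through unchanged.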
 
%% Cell type:code id: tags:
 
``` python
import tensorflow as tf
import string
import re
strip_chars = string.punctuation + "¿"
strip_chars = strip_chars.replace("[", "")
strip_chars = strip_chars.replace("]", "")

# Prepare a custom string standardization function for the
# Spanish TextVectorization layer: it preserves [ and ] but strips ¿
# (as well as all other characters from string.punctuation).
def custom_standardization(input_string):
    lowercase = tf.strings.lower(input_string)
    return tf.strings.regex_replace(
        lowercase, f"[{re.escape(strip_chars)}]", "")
```
 
%% Cell type:code id: tags:
 
``` python
import tensorflow as tf
# To keep things simple, we’ll only look at
# the top 15,000 words in each language,
# and we’ll restrict sentences to 20 words.
vocab_size = 15000
sequence_length = 20
 
# The English layer
source_vectorization = tf.keras.layers.TextVectorization(
    max_tokens=vocab_size,
    output_mode="int",
    output_sequence_length=sequence_length,
)
# The Spanish layer
target_vectorization = tf.keras.layers.TextVectorization(
    max_tokens=vocab_size,
    output_mode="int",
    # Generate Spanish sentences that have one extra token,
    # since we’ll need to offset the sentence by one step
    # during training.
    output_sequence_length=sequence_length + 1,
    standardize=custom_standardization,
)
train_english_texts = [pair[0] for pair in train_pairs]
train_spanish_texts = [pair[1] for pair in train_pairs]
# Learn the vocabulary of each language.
source_vectorization.adapt(train_english_texts)
target_vectorization.adapt(train_spanish_texts)
 
```
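
To make the integer encoding more concrete, here is a simplified pure-Python approximation of what a vectorization layer with `output_mode="int"` does: build a frequency-ordered vocabulary, map unknown words to a reserved index, then truncate or pad to a fixed length. The helpers `build_vocab` and `vectorize` are hypothetical and skip details such as punctuation stripping, so treat this as a sketch of the idea rather than the layer's actual implementation.

```python
from collections import Counter

def build_vocab(texts, max_tokens):
    # Index 0 is reserved for padding and index 1 for out-of-vocabulary tokens,
    # mirroring the convention of TextVectorization with output_mode="int".
    counts = Counter(token for text in texts for token in text.lower().split())
    most_common = [token for token, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def vectorize(text, vocab, output_sequence_length):
    index = {token: i for i, token in enumerate(vocab)}
    ids = [index.get(token, 1) for token in text.lower().split()]
    ids = ids[:output_sequence_length]                       # truncate long sentences
    return ids + [0] * (output_sequence_length - len(ids))   # pad short ones

toy_vocab = build_vocab(["the cat sat", "the dog sat down"], max_tokens=6)
print(toy_vocab)
print(vectorize("the bird sat", toy_vocab, output_sequence_length=5))
```

In this toy run, "bird" is not in the vocabulary, so it is mapped to index 1, and the sentence is padded with zeros up to length 5.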
 
%% Cell type:markdown id: tags:
 
Finally, we can turn our data into a `tf.data` pipeline. We want it to return a tuple
`(inputs, target)` where `inputs` is a dict with two keys, "english" (the English
sentence) and "spanish" (the Spanish sentence without its last token), and `target` is the Spanish
sentence offset by one step ahead.
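
The one-step offset is easiest to see on a tiny example. The sketch below uses made-up token ids (2 for "[start]", 3 for "[end]", 0 for padding) and plain NumPy slicing; the real pipeline applies the same slicing to batches of vectorized sentences.

```python
import numpy as np

# Toy vectorized Spanish sentence: [start]=2, three word ids, [end]=3, padding=0.
# The ids are made up for illustration.
spa = np.array([[2, 17, 42, 8, 3, 0]])

decoder_input = spa[:, :-1]   # everything except the last position
target = spa[:, 1:]           # the same sequence shifted one step ahead

for step, (seen, expected) in enumerate(zip(decoder_input[0], target[0])):
    print(f"step {step}: decoder sees {seen}, should predict {expected}")
```

At every position the target is simply the token that follows the one the decoder has just received, and both arrays end up with the same length.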
 
%% Cell type:code id: tags:
 
``` python
batch_size = 64

def format_dataset(eng, spa):
    eng = source_vectorization(eng)
    spa = target_vectorization(spa)
    return ({
        "english": eng,
        # The input Spanish sentence doesn’t include the last token
        # to keep inputs and targets at the same length.
        "spanish": spa[:, :-1],
    # The target Spanish sentence is one step ahead. Both are still
    # the same length (20 words).
    }, spa[:, 1:])

def make_dataset(pairs):
    eng_texts, spa_texts = zip(*pairs)
    eng_texts = list(eng_texts)
    spa_texts = list(spa_texts)
    dataset = tf.data.Dataset.from_tensor_slices((eng_texts, spa_texts))
    dataset = dataset.batch(batch_size)
    dataset = dataset.map(format_dataset, num_parallel_calls=4)
    # Use in-memory caching to speed up preprocessing.
    return dataset.shuffle(2048).prefetch(16).cache()

train_ds = make_dataset(train_pairs)
val_ds = make_dataset(val_pairs)
```
 
%% Cell type:code id: tags:
 
``` python
for inputs, targets in train_ds.take(1):
    print(f"inputs['english'].shape: {inputs['english'].shape}")
    print(f"inputs['spanish'].shape: {inputs['spanish'].shape}")
    print(f"targets.shape: {targets.shape}")
```
 
%% Cell type:markdown id: tags:
 
## Sequence-to-Sequence Learning with RNNs
 
Recurrent neural networks dominated sequence-to-sequence learning from 2015 to 2017,
before being overtaken by the Transformer. They were the basis for many real-world
machine-translation systems: Google Translate
circa 2017 was powered by a stack of seven large LSTM layers. It’s still worth learning
about this approach today, as it provides an easy entry point to understanding
sequence-to-sequence models.
 
The simplest, naive way to use RNNs to turn a sequence into another sequence is
to keep the output of the RNN at each time step. In Keras, it would look like this:
 
%% Cell type:code id: tags:
 
``` python
inputs = tf.keras.Input(shape=(sequence_length,), dtype="int64")
x = tf.keras.layers.Embedding(input_dim=vocab_size, output_dim=128)(inputs)
x = tf.keras.layers.LSTM(32, return_sequences=True)(x)
outputs = tf.keras.layers.Dense(vocab_size, activation="softmax")(x)
model = tf.keras.Model(inputs, outputs)
```
 
%% Cell type:markdown id: tags:
 
However, there are two major issues with this approach:
 
- The target sequence must always be the same length as the source sequence. In practice, this is rarely the case. Technically, this isn’t critical, as you could always pad either the source sequence or the target sequence to make their lengths match.
 
- Due to the step-by-step nature of RNNs, the model will only be looking at tokens $0\ldots N$ in the source sequence in order to predict token $N$ in the target sequence. This constraint makes this setup unsuitable for most tasks, and particularly translation. Consider translating “The weather is nice today” to French — that would be “Il fait beau aujourd’hui.” You’d need to be able to predict “Il” from just “The,” “Il fait” from just “The weather,” etc., which is simply impossible.
 
%% Cell type:markdown id: tags:
 
If you’re a human translator, you’d start by reading the entire source sentence before
starting to translate it. This is especially important if you’re dealing with languages
that have wildly different word ordering, like English and Japanese. And that’s exactly
what standard sequence-to-sequence models do.
 
In a proper sequence-to-sequence setup, you would first use an
RNN (the encoder) to turn the entire source sequence into a single vector (or set of
vectors). This could be the last output of the RNN, or alternatively, its final internal
state vectors. Then you would use this vector (or vectors) as the initial state of another
RNN (the decoder), which would look at elements $0\ldots N$ in the target sequence, and
try to predict step $N+1$ in the target sequence.
 
Let’s implement this in Keras with GRU-based encoders and decoders. The choice
of GRU rather than LSTM makes things a bit simpler, since GRU only has a single
state vector, whereas LSTM has multiple. Let’s start with the encoder.
 
%% Cell type:markdown id: tags:
 
### GRU-based encoder
 
%% Cell type:code id: tags:
 
``` python
from tensorflow import keras
from tensorflow.keras import layers
embed_dim = 256
latent_dim = 1024
 
# The English source sentence goes here. Specifying the name of the input enables
# us to fit() the model with a dict of inputs.
source = keras.Input(shape=(None,), dtype="int64", name="english")
# Don’t forget masking: it’s critical in this setup.
x = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(source)
# Our encoded source sentence is the last output of a bidirectional GRU.
encoded_source = layers.Bidirectional(
    layers.GRU(latent_dim), merge_mode="sum")(x)
```
 
%% Cell type:markdown id: tags:
 
Next, let’s add the decoder — a simple GRU layer that takes as its initial state the
encoded source sentence. On top of it, we add a Dense layer that produces for each
output step a probability distribution over the Spanish vocabulary.
 
%% Cell type:markdown id: tags:
 
#### GRU-based decoder and the end-to-end model
 
%% Cell type:code id: tags:
 
``` python
# The Spanish target sentence goes here.
past_target = keras.Input(shape=(None,), dtype="int64", name="spanish")
# Don’t forget masking.
x = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(past_target)
decoder_gru = layers.GRU(latent_dim, return_sequences=True)
# The encoded source sentence serves as the initial state of
# the decoder GRU.
x = decoder_gru(x, initial_state=encoded_source)
x = layers.Dropout(0.5)(x)
# Predicts the next token
target_next_step = layers.Dense(vocab_size, activation="softmax")(x)
# End-to-end model: maps the source sentence and the target sentence to the
# target sentence one step in the future
seq2seq_rnn = keras.Model([source, past_target], target_next_step)
```
 
%% Cell type:markdown id: tags:
 
During training, the decoder takes as input the entire target sequence, but thanks to
the step-by-step nature of RNNs, it only looks at tokens $0\ldots N$ in the input to predict
token $N$ in the output (which corresponds to the next token in the sequence, since
the output is intended to be offset by one step). This means we only use information
from the past to predict the future, as we should; otherwise we’d be cheating, and our
model would not work at inference time.
Let’s start training.
 
%% Cell type:code id: tags:
 
``` python
seq2seq_rnn.compile(
    optimizer="rmsprop",
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"])
seq2seq_rnn.fit(train_ds, epochs=15, validation_data=val_ds)
```
 
%% Cell type:markdown id: tags:
 
We picked accuracy as a crude way to monitor validation-set performance during
training. We get to 64% accuracy: on average, the model predicts the next word in the
Spanish sentence correctly 64% of the time. However, in practice, next-token accuracy
isn’t a great metric for machine translation models, in particular because it makes the
assumption that the correct target tokens from $0$ to $N$ are already known when predicting
token $N+1$. In reality, during inference, you’re generating the target sentence
from scratch, and you can’t rely on previously generated tokens being 100% correct.
If you work on a real-world machine translation system, you will likely use “BLEU
scores” to evaluate your models — a metric that looks at entire generated sequences
and that seems to correlate well with human perception of translation quality.
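
To give a feel for what a sequence-level metric looks at, the sketch below computes the clipped ("modified") n-gram precision that BLEU is built on. It is only one ingredient: the full BLEU score combines these precisions for several n-gram orders and adds a brevity penalty, neither of which is shown here. The function names and the example sentences are illustrative assumptions.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    # Clipped n-gram precision: each candidate n-gram is credited at most as
    # many times as it occurs in the reference.
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return overlap / total if total else 0.0

reference = "el gato está en la alfombra".split()
candidate = "el gato está sobre la alfombra".split()
print(modified_precision(candidate, reference, 1))
print(modified_precision(candidate, reference, 2))
```

Unlike next-token accuracy, this compares a fully generated sentence against a reference, so a single wrong word also lowers the score of the n-grams around it.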
 
 
At last, let’s use our model for inference. We’ll pick a few sentences in the test set
and check how our model translates them. We’ll start from the seed token, "[start]",
and feed it into the decoder model, together with the encoded English source sentence.
We’ll retrieve a next-token prediction, and we’ll re-inject it into the decoder
repeatedly, sampling one new target token at each iteration, until we get to "[end]"
or reach the maximum sentence length.
 
%% Cell type:markdown id: tags:
 
#### Translating new sentences with our RNN encoder and decoder
 
%% Cell type:code id: tags:
 
``` python
import numpy as np

spa_vocab = target_vectorization.get_vocabulary()
# Prepare a dict to convert token index predictions to string tokens.
spa_index_lookup = dict(zip(range(len(spa_vocab)), spa_vocab))
max_decoded_sentence_length = 20

def decode_sequence(input_sentence):
    tokenized_input_sentence = source_vectorization([input_sentence])
    # Seed token
    decoded_sentence = "[start]"
    for i in range(max_decoded_sentence_length):
        tokenized_target_sentence = target_vectorization([decoded_sentence])
        # Sample the next token.
        next_token_predictions = seq2seq_rnn.predict(
            [tokenized_input_sentence, tokenized_target_sentence])
        # Convert the next token prediction to
        # a string and append it to the generated sentence.
        sampled_token_index = np.argmax(next_token_predictions[0, i, :])
        sampled_token = spa_index_lookup[sampled_token_index]
        decoded_sentence += " " + sampled_token
        # Exit condition: either hit max
        # length or sample a stop token.
        if sampled_token == "[end]":
            break
    return decoded_sentence
```
 
%% Cell type:code id: tags:
 
``` python
test_eng_texts = [pair[0] for pair in test_pairs]
for _ in range(20):
    input_sentence = random.choice(test_eng_texts)
    print("-")
    print(input_sentence)
    print(decode_sequence(input_sentence))
```
 
%% Cell type:markdown id: tags:
 
Note that this inference setup, while very simple, is rather inefficient, since we reprocess
the entire source sentence and the entire generated target sentence every time
we sample a new word. In a practical application, you’d factor the encoder and the
decoder as two separate models, and your decoder would only run a single step at
each token-sampling iteration, reusing its previous internal state.
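
The more efficient loop described above can be sketched independently of Keras. In the sketch below, `step_fn` is a hypothetical stand-in for a single-step decoder that takes the previous token and the previous state and returns next-token probabilities plus the updated state; `toy_step` is a dummy used only to make the example runnable. The encoder would run once, before the loop, to produce `initial_state`.

```python
import numpy as np

def greedy_decode(step_fn, initial_state, start_id, end_id, max_length):
    # step_fn(token_id, state) -> (next_token_probabilities, new_state)
    # It stands in for a single-step decoder; the encoder runs only once,
    # before this loop, to produce initial_state.
    token, state, output = start_id, initial_state, []
    for _ in range(max_length):
        probs, state = step_fn(token, state)   # one decoder step, state reused
        token = int(np.argmax(probs))
        if token == end_id:
            break
        output.append(token)
    return output

# Toy stand-in for a trained decoder: the "state" is a countdown, and the
# step function emits token 5 until the countdown reaches zero, then [end]=1.
def toy_step(token, state):
    probs = np.zeros(6)
    probs[5 if state > 0 else 1] = 1.0
    return probs, state - 1

print(greedy_decode(toy_step, initial_state=3, start_id=0, end_id=1, max_length=20))
```

Each iteration does a constant amount of work, whereas re-running the full model on the growing sentence makes the cost of every step grow with the sentence length.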
 
Here are our translation results. Our model works decently well for a toy model,
though it still makes many basic mistakes.
 
%% Cell type:markdown id: tags:
 
There are many ways this toy model could be improved: We could use a deep stack of
recurrent layers for both the encoder and the decoder (note that for the decoder, this
makes state management a bit more involved). We could use an LSTM instead of a GRU.
And so on. Beyond such tweaks, however, the RNN approach to sequence-to-sequence
learning has a few fundamental limitations:
 
- The source sequence representation has to be held entirely in the encoder state vector(s), which puts significant limitations on the size and complexity of the sentences you can translate. It’s a bit as if a human were translating a sentence entirely from memory, without looking twice at the source sentence while producing the translation.
 
- RNNs have trouble dealing with very long sequences, since they tend to progressively forget about the past—by the time you’ve reached the 100th token in either sequence, little information remains about the start of the sequence. That means RNN-based models can’t hold onto long-term context, which can be essential for translating long documents.
 
%% Cell type:markdown id: tags:
 
These limitations are what has led the machine learning community to embrace the
Transformer architecture for sequence-to-sequence problems. Let’s take a look.
 
%% Cell type:markdown id: tags:
 
# Part II : Attention Basics
In this notebook, we look at how attention is implemented. We will focus on implementing attention in isolation from a larger model. That's because when implementing attention in a real-world model, a lot of the focus goes into piping the data and juggling the various vectors rather than the concepts of attention themselves.
 
We will implement attention scoring as well as calculating an attention context vector.
 
## Attention Scoring
### Inputs to the scoring function
Let's start by looking at the inputs we'll give to the scoring function. We will assume we're in the first step of the decoding phase. The first input to the scoring function is the hidden state of the decoder (assuming a toy RNN with three hidden nodes -- not usable in real life, but easier to illustrate):
 
%% Cell type:code id: tags:
 
``` python
dec_hidden_state = [5,1,20]
```
 
%% Cell type:markdown id: tags:
 
Let's visualize this vector:
 
%% Cell type:code id: tags:
 
``` python
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
 
# Let's visualize our decoder hidden state
plt.figure(figsize=(1.5, 4.5))
sns.heatmap(np.transpose(np.matrix(dec_hidden_state)), annot=True, cmap=sns.light_palette("purple", as_cmap=True), linewidths=1)
```
 
 
%% Cell type:markdown id: tags:
 
Our first scoring function will score a single annotation (encoder hidden state), which looks like this:
 
%% Cell type:code id: tags:
 
``` python
annotation = [3,12,45] #e.g. Encoder hidden state
```
 
%% Cell type:code id: tags:
 
``` python
# Let's visualize the single annotation
plt.figure(figsize=(1.5, 4.5))
sns.heatmap(np.transpose(np.matrix(annotation)), annot=True, cmap=sns.light_palette("orange", as_cmap=True), linewidths=1)
```
 
%%%% Output: execute_result
 
<AxesSubplot:>
 
%%%% Output: display_data
 
![]()
 
%% Cell type:markdown id: tags:
 
### IMPLEMENT: Scoring a Single Annotation
Let's calculate the dot product of the decoder hidden state and a single annotation. NumPy's [dot()](https://docs.scipy.org/doc/numpy/reference/generated/numpy.dot.html) is a good candidate for this operation.
 
%% Cell type:code id: tags:
 
``` python
def single_dot_attention_score(dec_hidden_state, enc_hidden_state):
    # TODO: return the dot product of the two vectors
    return np.dot(dec_hidden_state, enc_hidden_state)

single_dot_attention_score(dec_hidden_state, annotation)
```
 
%%%% Output: execute_result
 
927
 
%% Cell type:markdown id: tags:
 
### Annotations Matrix
Let's now look at scoring all the annotations at once. To do that, here's our annotation matrix:
 
%% Cell type:code id: tags:
 
``` python
annotations = np.transpose([[3,12,45], [59,2,5], [1,43,5], [4,3,45.3]])
```
 
%% Cell type:markdown id: tags:
 
And it can be visualized like this (each column is a hidden state of an encoder time step):
 
%% Cell type:code id: tags:
 
``` python
# Let's visualize our annotation (each column is an annotation)
ax = sns.heatmap(annotations, annot=True, cmap=sns.light_palette("orange", as_cmap=True), linewidths=1)
```
 
%%%% Output: display_data
 
![]()
 
%% Cell type:markdown id: tags:
 
### Scoring All Annotations at Once
Let's calculate the scores of all the annotations in one step using matrix multiplication. Let's continue to use the dot scoring method:
 
 
$$\text{score}\left(h_t, \overline{h}_s\right) =
\begin{cases}
h^T_t\overline{h}_s & \quad \text{dot}\\
h^T_t W_a\overline{h}_s & \quad \text{general}\\
v_a^T\tanh\left(W_a [h_t; \overline{h}_s] \right) & \quad \text{concat}
\end{cases}$$
 
 
 
To do that, we'll have to transpose `dec_hidden_state` and [matrix multiply](https://docs.scipy.org/doc/numpy/reference/generated/numpy.matmul.html) it with `annotations`.
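
For comparison, here is a sketch of how the "dot" and "general" variants from the formula above could be computed for all annotations at once. The matrix `W_a` would normally be learned during training; here it is filled with random numbers purely for illustration, and the variable names `h_t` and `h_s` are chosen to match the formula rather than the notebook's code.

```python
import numpy as np

rng = np.random.default_rng(0)

h_t = np.array([5.0, 1.0, 20.0])               # decoder hidden state
h_s = np.array([[3, 12, 45], [59, 2, 5],
                [1, 43, 5], [4, 3, 45.3]]).T   # one annotation per column

# W_a would be learned during training; here it is random, purely for illustration.
W_a = rng.normal(size=(3, 3))

dot_scores = h_t @ h_s              # h_t^T h_s for every column
general_scores = h_t @ W_a @ h_s    # h_t^T W_a h_s for every column

print(dot_scores)
print(general_scores.shape)
```

With `W_a` set to the identity matrix, the "general" score reduces to the "dot" score, which is a quick way to sanity-check an implementation.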
 
%% Cell type:code id: tags:
 
``` python
def dot_attention_score(dec_hidden_state, annotations):
    # TODO: return the product of dec_hidden_state transpose and annotations
    return np.matmul(np.transpose(dec_hidden_state), annotations)

attention_weights_raw = dot_attention_score(dec_hidden_state, annotations)
attention_weights_raw
```
 
%%%% Output: execute_result
 
array([927., 397., 148., 929.])
 
%% Cell type:markdown id: tags:
 
Looking at these scores, can you guess which of the four vectors will get the most attention from the decoder at this time step?
 
## Softmax
Now that we have our scores, let's apply softmax:
 
 
$$ \begin{align} \alpha_t(s) & = \text{align}\left(h_t, \overline{h}_s\right) \\ & = \frac{\exp\left(\text{score}(h_t, \overline{h}_s)\right)}{\sum_{s'} \exp\left(\text{score}(h_t, \overline{h}_{s'})\right)} \end{align} $$
 
%% Cell type:code id: tags:
 
``` python
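def softmax(x):
    # Extended precision keeps exp() of scores in the 900s from overflowing.
    # Note: np.float128 is not available on every platform.
    x = np.array(x, dtype=np.float128)
    e_x = np.exp(x)
    return e_x / e_x.sum(axis=0)

attention_weights = softmax(attention_weights_raw)
attention_weights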
```
 
%%%% Output: execute_result
 
array([1.19202922e-001, 7.94715151e-232, 5.76614420e-340, 8.80797078e-001],
dtype=float128)
 
%% Cell type:markdown id: tags:
 
Even when you already know which annotation will get the most focus, it's interesting to see how drastically softmax sharpens the final weights. The first and last annotations had scores of 927 and 929, respectively, but after softmax the attention they get is about 0.119 and 0.881.
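
Scores this large are awkward numerically: in standard 64-bit floats, `exp(929)` overflows. A common workaround, sketched below, is to subtract the maximum score before exponentiating. This leaves the result mathematically unchanged, because the common factor cancels in the ratio. The function name `stable_softmax` is an assumption for illustration.

```python
import numpy as np

def stable_softmax(scores):
    scores = np.asarray(scores, dtype=np.float64)
    shifted = scores - scores.max()    # largest exponent becomes exp(0) = 1
    exps = np.exp(shifted)
    return exps / exps.sum()

weights = stable_softmax([927.0, 397.0, 148.0, 929.0])
print(weights)
```

The two dominant weights match the extended-precision result, while the vanishingly small ones may simply underflow to zero in 64-bit arithmetic.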
 
## Applying the Scores Back on the Annotations
Now that we have our scores, let's multiply each annotation by its score to move closer to the attention context vector. This is the multiplication part of this formula (we'll tackle the summation part in later cells):
 
$$ c_i = \sum_{j=1}^T \alpha_{ij} h_j $$
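
As a compact check of the whole formula, the sketch below performs both the multiplication and the summation with NumPy, using rounded weights in place of the exact softmax output. The variable names are assumptions for illustration; the cells that follow build up the same computation step by step.

```python
import numpy as np

annotations = np.array([[3, 12, 45], [59, 2, 5],
                        [1, 43, 5], [4, 3, 45.3]]).T   # shape (3, 4)
weights = np.array([0.119, 0.0, 0.0, 0.881])           # rounded softmax output

weighted = weights * annotations      # broadcasting scales each column j by alpha_j
context = weighted.sum(axis=1)        # sum over encoder time steps
print(context)
```

The same result can be obtained in one step as the matrix-vector product `annotations @ weights`.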
 
%% Cell type:code id: tags:
 
``` python
def apply_attention_scores(attention_weights, annotations):
    # TODO: Multiply the annotations by their weights
    return attention_weights * annotations