Next, let’s prepare two separate `TextVectorization` layers: one for English and one
for Spanish. We’re going to need to customize the way strings are preprocessed:
- We need to preserve the "[start]" and "[end]" tokens that we’ve inserted. By default, the characters [ and ] would be stripped, but we want to keep them around so we can tell apart the word “start” and the start token "[start]".
- Punctuation is different from language to language! In the Spanish `TextVectorization` layer, if we’re going to strip punctuation characters, we need to
also strip the character ¿.
%% Cell type:markdown id: tags:
Note that for a non-toy translation model, we would treat punctuation characters as separate
tokens rather than stripping them, since we would want to be able to generate correctly
punctuated sentences. In our case, for simplicity, we’ll get rid of all punctuation.
%% Cell type:code id: tags:
``` python
import tensorflow as tf
import string
import re

# Prepare a custom string standardization function for the Spanish
# TextVectorization layer: it preserves [ and ] (so the [start] and [end]
# tokens survive) but strips ¿ along with the rest of the punctuation.
strip_chars = string.punctuation + "¿"
strip_chars = strip_chars.replace("[", "")
strip_chars = strip_chars.replace("]", "")

def custom_standardization(input_string):
    lowercase = tf.strings.lower(input_string)
    return tf.strings.regex_replace(
        lowercase, f"[{re.escape(strip_chars)}]", "")
```
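%% Cell type:markdown id: tags:
With the standardization function in place, we can create the two `TextVectorization` layers. Below is a minimal sketch of what they could look like; the `vocab_size` and `sequence_length` values are illustrative choices, and the Spanish layer outputs sequences one token longer because the target sentence is offset by one step during training.
%% Cell type:code id: tags:
``` python
from tensorflow.keras.layers import TextVectorization

vocab_size = 15000        # illustrative vocabulary size
sequence_length = 20      # illustrative maximum sequence length

source_vectorization = TextVectorization(
    max_tokens=vocab_size,
    output_mode="int",
    output_sequence_length=sequence_length,
)
target_vectorization = TextVectorization(
    max_tokens=vocab_size,
    output_mode="int",
    # One extra token, since the target is offset by one step during training.
    output_sequence_length=sequence_length + 1,
    standardize=custom_standardization,
)
# Each layer still needs to be adapt()-ed on the corresponding text corpus.
```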
However, there are two major issues with this approach:
- The target sequence must always be the same length as the source sequence. In practice, this is rarely the case. Technically, this isn’t critical, as you could always pad either the source sequence or the target sequence to make their lengths match.
- Due to the step-by-step nature of RNNs, the model will only be looking at tokens $0\ldots N$ in the source sequence in order to predict token $N$ in the target sequence. This constraint makes this setup unsuitable for most tasks, and particularly translation. Consider translating “The weather is nice today” to French — that would be “Il fait beau aujourd’hui.” You’d need to be able to predict “Il” from just “The,” “Il fait” from just “The weather,” etc., which is simply impossible.
%% Cell type:markdown id: tags:
If you’re a human translator, you’d start by reading the entire source sentence before
starting to translate it. This is especially important if you’re dealing with languages
that have wildly different word ordering, like English and Japanese. And that’s exactly
what standard sequence-to-sequence models do.
In a proper sequence-to-sequence setup, you would first use an
RNN (the encoder) to turn the entire source sequence into a single vector (or set of
vectors). This could be the last output of the RNN, or alternatively, its final internal
state vectors. Then you would use this vector (or vectors) as the initial state of another
RNN (the decoder), which would look at elements $0\ldots N$ in the target sequence, and
try to predict step $N+1$ in the target sequence.
Let’s implement this in Keras with GRU-based encoders and decoders. The choice
of GRU rather than LSTM makes things a bit simpler, since GRU only has a single
state vector, whereas LSTM has multiple. Let’s start with the encoder.
%% Cell type:markdown id: tags:
### GRU-based encoder
%% Cell type:code id: tags:
``` python
from tensorflow import keras
from tensorflow.keras import layers

embed_dim = 256
latent_dim = 1024
# The English source sentence goes here. Specifying the name of the input
# enables us to fit() the model with a dict of inputs.
source = keras.Input(shape=(None,), dtype="int64", name="english")
x = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(source)
encoded_source = layers.Bidirectional(
    layers.GRU(latent_dim), merge_mode="sum")(x)
```
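%% Cell type:markdown id: tags:
### GRU-based decoder and the end-to-end model
The decoder is another GRU that takes the encoded source as its initial state and predicts, at each step, the next token of the target sequence. The following is a minimal sketch consistent with the setup above; the 0.5 dropout rate is an illustrative choice.
%% Cell type:code id: tags:
``` python
# The Spanish target sentence so far goes here.
past_target = keras.Input(shape=(None,), dtype="int64", name="spanish")
x = layers.Embedding(vocab_size, embed_dim, mask_zero=True)(past_target)
# The encoded source sentence becomes the initial state of the decoder GRU.
decoder_gru = layers.GRU(latent_dim, return_sequences=True)
x = decoder_gru(x, initial_state=encoded_source)
x = layers.Dropout(0.5)(x)
# Predict a probability distribution over the vocabulary for the next token.
target_next_step = layers.Dense(vocab_size, activation="softmax")(x)
seq2seq_rnn = keras.Model([source, past_target], target_next_step)
```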
Note that this inference setup, while very simple, is rather inefficient, since we reprocess
the entire source sentence and the entire generated target sentence every time
we sample a new word. In a practical application, you’d factor the encoder and the
decoder as two separate models, and your decoder would only run a single step at
each token-sampling iteration, reusing its previous internal state.
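%% Cell type:markdown id: tags:
To make that inefficiency concrete, here is a sketch of what such a naive decoding loop could look like. It assumes the `source_vectorization`, `target_vectorization`, and `seq2seq_rnn` objects defined above, plus a hypothetical cap of 20 generated tokens; every iteration re-runs the full model on the entire source sentence and everything decoded so far.
%% Cell type:code id: tags:
``` python
import numpy as np

spa_vocab = target_vectorization.get_vocabulary()
spa_index_lookup = dict(zip(range(len(spa_vocab)), spa_vocab))
max_decoded_sentence_length = 20  # illustrative cap on output length

def decode_sequence(input_sentence):
    tokenized_input_sentence = source_vectorization([input_sentence])
    decoded_sentence = "[start]"
    for i in range(max_decoded_sentence_length):
        # Re-vectorize and re-process everything generated so far:
        # simple, but wasteful, as discussed above.
        tokenized_target_sentence = target_vectorization([decoded_sentence])
        next_token_predictions = seq2seq_rnn.predict(
            [tokenized_input_sentence, tokenized_target_sentence])
        # Greedily pick the most likely next token.
        sampled_token_index = np.argmax(next_token_predictions[0, i, :])
        sampled_token = spa_index_lookup[sampled_token_index]
        decoded_sentence += " " + sampled_token
        if sampled_token == "[end]":
            break
    return decoded_sentence
```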
Here are our translation results. Our model works decently well for a toy model,
though it still makes many basic mistakes.
%% Cell type:markdown id: tags:
There are many ways this toy model could be improved: We could use a deep stack of
recurrent layers for both the encoder and the decoder (note that for the decoder, this
makes state management a bit more involved). We could use an LSTM instead of a GRU.
And so on. Beyond such tweaks, however, the RNN approach to sequence-to-sequence
learning has a few fundamental limitations:
- The source sequence representation has to be held entirely in the encoder state vector(s), which puts significant limitations on the size and complexity of the sentences you can translate. It’s a bit as if a human were translating a sentence entirely from memory, without looking twice at the source sentence while producing the translation.
- RNNs have trouble dealing with very long sequences, since they tend to progressively forget about the past—by the time you’ve reached the 100th token in either sequence, little information remains about the start of the sequence. That means RNN-based models can’t hold onto long-term context, which can be essential for translating long documents.
%% Cell type:markdown id: tags:
These limitations are what has led the machine learning community to embrace the
Transformer architecture for sequence-to-sequence problems. Let’s take a look.
%% Cell type:markdown id: tags:
# Part II: Attention Basics
In this notebook, we look at how attention is implemented. We will focus on implementing attention in isolation from a larger model, because when you implement attention in a real-world model, much of the effort goes into piping the data and juggling the various vectors rather than into the attention mechanism itself.
We will implement attention scoring as well as the calculation of an attention context vector.
## Attention Scoring
### Inputs to the scoring function
Let's start by looking at the inputs we'll give to the scoring function. We will assume we're at the first step of the decoding phase. The first input to the scoring function is the hidden state of the decoder (assuming a toy RNN with three hidden nodes -- not usable in real life, but easier to illustrate):
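Here is a small sketch of both inputs: the decoder hidden state, and the matrix of encoder annotations (one annotation per source-sequence step) that it will be scored against. The numbers are illustrative values, chosen so that the dot-product scores come out to the figures quoted below.
``` python
import numpy as np

# Toy decoder hidden state with three hidden nodes.
dec_hidden_state = np.array([5, 1, 20])

# The encoder's output vectors ("annotations"), one per source step,
# laid out as the columns of a 3x4 matrix.
annotations = np.transpose([[3, 12, 45], [59, 2, 5], [1, 43, 5], [4, 3, 45.3]])
```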
Let's calculate the dot product of the decoder hidden state with a single annotation. Numpy's [dot()](https://docs.scipy.org/doc/numpy/reference/generated/numpy.dot.html) is a good candidate for this operation.
To score all the annotations in a single step, we'll have to transpose `dec_hidden_state` and [matrix multiply](https://docs.scipy.org/doc/numpy/reference/generated/numpy.matmul.html) it with `annotations`.
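A sketch of both steps, using the variables defined above:
``` python
# Dot product of the decoder hidden state with a single annotation.
def single_dot_attention_score(dec_hidden_state, enc_hidden_state):
    return np.dot(dec_hidden_state, enc_hidden_state)

print(single_dot_attention_score(dec_hidden_state, annotations[:, 0]))  # 927.0

# Score every annotation at once with a matrix multiplication.
attention_scores = np.matmul(np.transpose(dec_hidden_state), annotations)
print(attention_scores)  # [927. 397. 148. 929.]
```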
Even knowing which annotation will get the most focus, it's interesting to see how drastically softmax amplifies small differences in the raw scores. The first and last annotations had scores of 927 and 929, respectively, yet after softmax the attention weights they receive are about 0.119 and 0.880.
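Here is a sketch of a numerically stable softmax for these scores; the maximum is subtracted before exponentiating, since `np.exp(929)` would overflow in float64:
``` python
def softmax(x):
    x = np.asarray(x, dtype=np.float64)
    # Subtract the max for numerical stability (exp(929) overflows float64).
    e_x = np.exp(x - x.max())
    return e_x / e_x.sum()

attention_weights = softmax(attention_scores)
print(attention_weights)  # approximately [0.119, 0.000, 0.000, 0.881]
```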
## Applying the scores back on the annotations
Now that we have our scores, let's multiply each annotation by its attention weight to move closer to the attention context vector. This is the multiplication part of the context-vector formula $c_i = \sum_j \alpha_{ij} h_j$, where the $h_j$ are the annotations and the $\alpha_{ij}$ are the softmaxed scores (the summation part comes right after).
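A sketch of the multiplication, followed by the summation that yields the context vector, using the arrays defined above:
``` python
# Scale each annotation (column) by its attention weight.
applied_attention = attention_weights * annotations
print(applied_attention)

# Summation part: add up the weighted annotations to obtain the
# attention context vector for this decoding step.
attention_vector = applied_attention.sum(axis=1)
print(attention_vector)
```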