Commit fcd81614 authored by Mirko Birbaumer
Formulation of questions

parent 1079c388
%% Cell type:markdown id: tags:
# Exercise 1 - Classifying newswires: A multiclass classification example
In this exercise, you will build a model to classify Reuters newswires into 46 mutually exclusive topics. Because we have many classes, this problem is an instance of multiclass classification, and because each data point should be classified into only one category, the problem is more specifically an instance of single-label multiclass classification.
If each data point could belong to multiple categories (in this case, topics), we’d be facing a multilabel multiclass classification problem.
%% Cell type:markdown id: tags:
### The Reuters dataset
You’ll work with the _Reuters_ dataset, a set of short newswires and their topics, published by Reuters in 1986. It’s a simple, widely used toy dataset for text classification. There are 46 different topics; some topics are more represented than others, but each topic has at least 10 examples in the training set.
Like IMDB and MNIST, the Reuters dataset comes packaged as part of Keras. Let’s take a look.
%% Cell type:code id: tags:
``` python
from tensorflow.keras.datasets import reuters
(train_data, train_labels), (test_data, test_labels) = reuters.load_data(num_words=10000)
```
%% Cell type:markdown id: tags:
As with the IMDB dataset, the argument `num_words=10000` restricts the data to the 10'000 most frequently occurring words found in the data.
You have 8'982 training examples and 2'246 test examples:
%% Cell type:code id: tags:
``` python
print(len(train_data))
print(len(test_data))
```
%%%% Output: stream
8982
2246
%% Cell type:markdown id: tags:
As with the IMDB reviews, each example is a list of integers (word indices):
%% Cell type:code id: tags:
``` python
train_data[10]
```
%%%% Output: execute_result
[1,
245,
273,
207,
156,
53,
74,
160,
26,
14,
46,
296,
26,
39,
74,
2979,
3554,
14,
46,
4689,
4329,
86,
61,
3499,
4795,
14,
61,
451,
4329,
17,
12]
%% Cell type:markdown id: tags:
Here’s how you can decode it back to words, in case you’re curious.
%% Cell type:code id: tags:
``` python
word_index = reuters.get_word_index()
reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])
decoded_newswire = " ".join(
    # Note that the indices are offset by 3 because 0, 1,
    # and 2 are reserved indices for “padding,”
    # “start of sequence,” and “unknown.”
    [reverse_word_index.get(i - 3, "?") for i in train_data[10]])
```
%% Cell type:code id: tags:
``` python
decoded_newswire
```
%%%% Output: execute_result
'? period ended december 31 shr profit 11 cts vs loss 24 cts net profit 224 271 vs loss 511 349 revs 7 258 688 vs 7 200 349 reuter 3'
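%% Cell type:markdown id: tags:
The offset of 3 used in the decoding step can be illustrated on a toy vocabulary. This is a minimal sketch: `toy_word_index` and `encoded` are made-up values for illustration and are not taken from the Reuters data.
%% Cell type:code id: tags:
```python
# Hypothetical miniature word index (word -> rank), standing in for
# the dictionary returned by reuters.get_word_index().
toy_word_index = {"profit": 1, "loss": 2, "vs": 3}
toy_reverse_index = {rank: word for word, rank in toy_word_index.items()}

# Indices 0, 1 and 2 are reserved, so a word of rank r is stored as r + 3.
# Here: start-of-sequence marker, "profit", "vs", "loss", unknown word.
encoded = [1, 4, 6, 5, 2]

# Subtract the offset before the lookup; reserved indices fall back to "?".
decoded = " ".join(toy_reverse_index.get(i - 3, "?") for i in encoded)
print(decoded)
```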
%% Cell type:markdown id: tags:
The label associated with an example is an integer between 0 and 45 — a _topic_ index:
%% Cell type:code id: tags:
``` python
train_labels[10]
```
%%%% Output: execute_result
3
%% Cell type:code id: tags:
``` python
dataset = [decoded_newswire[1:-1]]
print(dataset)
```
%%%% Output: stream
[' period ended december 31 shr profit 11 cts vs loss 24 cts net profit 224 271 vs loss 511 349 revs 7 258 688 vs 7 200 349 reuter ']
%% Cell type:code id: tags:
``` python
from tensorflow.keras.layers import TextVectorization
text_vectorization = TextVectorization(
    # Configures the layer to return sequences of words encoded
    # as integer indices. There are several other output modes
    # available, which you will see in action in a bit.
    output_mode="int",
)
text_vectorization.adapt(dataset)
text_vectorization.get_vocabulary()
```
%%%% Output: execute_result
['',
'[UNK]',
'vs',
'profit',
'loss',
'cts',
'7',
'349',
'shr',
'revs',
'reuter',
'period',
'net',
'ended',
'december',
'688',
'511',
'31',
'271',
'258',
'24',
'224',
'200',
'11']
%% Cell type:markdown id: tags:
### TODO: Preparing the Data
Multi-hot encode your lists to turn them into vectors of 0s and 1s. This would
mean, for instance, turning the sequence $[8, 5]$ into a 10'000-dimensional vector
that would be all 0s except for indices 8 and 5, which would be 1s.
%% Cell type:code id: tags:
``` python
import numpy as np
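def vectorize_sequences(sequences, dimension=10000):
    # Start from an all-zero matrix of shape (number of sequences, dimension).
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        for j in sequence:
            # Set the entry for every word index that occurs in sequence i to 1.
            results[i, j] = 1.
    return results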
```
%% Cell type:code id: tags:
``` python
X_train = vectorize_sequences(train_data)
X_test = vectorize_sequences(test_data)
```
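%% Cell type:markdown id: tags:
As a quick sanity check, multi-hot encoding can be tried on a toy input with a small dimension. The sketch below uses a hypothetical helper `multi_hot` (not part of the exercise code) that relies on NumPy fancy indexing instead of an inner loop; it encodes the sequence $[8, 5]$ from the description above into a 10-dimensional vector.
%% Cell type:code id: tags:
```python
import numpy as np

def multi_hot(sequences, dimension):
    """Return a (len(sequences), dimension) array with 1.0 at every listed index."""
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        # Fancy indexing sets all listed positions of row i at once.
        results[i, sequence] = 1.0
    return results

# Repeated indices (the two 0s in the second sequence) still yield a single 1.
toy = multi_hot([[8, 5], [0, 0, 3]], dimension=10)
print(toy)
```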
%% Cell type:markdown id: tags:
### TODO: One-Hot Encode the Labels
To vectorize the labels, there are two possibilities:
1. you can cast the label list as an integer tensor, or
2. you can use one-hot encoding.
One-hot encoding is a widely used format for categorical data, also called categorical encoding. In this case, one-hot encoding of
the labels consists of embedding each label as an all-zero vector with a 1 in the place of the label index.
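The two possibilities can be compared on a few toy labels. This is a minimal NumPy sketch; `one_hot` and `toy_labels` are illustrative names and are not part of the exercise.
%% Cell type:code id: tags:
```python
import numpy as np

def one_hot(labels, num_classes):
    """Encode integer labels as rows that are all zero except for a 1 at the label index."""
    labels = np.asarray(labels, dtype="int64")
    encoded = np.zeros((len(labels), num_classes))
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

toy_labels = [3, 0, 45]

as_integers = np.asarray(toy_labels)   # possibility 1: integer tensor
as_one_hot = one_hot(toy_labels, 46)   # possibility 2: one-hot encoding

print(as_integers.shape, as_one_hot.shape)
```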
%% Cell type:code id: tags:
``` python
from tensorflow.keras.utils import to_categorical
# One-hot encode the integer topic labels into vectors of length 46.
y_train = to_categorical(train_labels)
y_test = to_categorical(test_labels)
```
%% Cell type:markdown id: tags:
### TODO: Building your Fully Connected Neural Network Model
In this topic-classification problem, we are trying to classify short snippets of text into one of 46 output classes.
In a stack of Dense layers like those we’ve been using, each layer can only access information present in the output of the previous layer.
If one layer drops some information relevant to the classification problem, this information can never be recovered by later layers: each layer can potentially become an information bottleneck.
When learning to separate 46 different classes, layers that are too small may act as such bottlenecks, permanently dropping relevant information. For this reason we’ll use larger layers. Let’s go with 2 hidden layers, each consisting of 64 units.
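To make these layer sizes concrete, the following sketch counts the weights and biases of such a network. It assumes the 10'000-dimensional multi-hot input from the data preparation step; the variable names are illustrative.
%% Cell type:code id: tags:
```python
# Input dimension, two hidden layers of 64 units, and 46 output classes.
layer_sizes = [10000, 64, 64, 46]

# A Dense layer with n_in inputs and n_out units has n_in * n_out weights
# plus n_out biases.
params_per_layer = [
    n_in * n_out + n_out
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
total_params = sum(params_per_layer)

print(params_per_layer)  # [640064, 4160, 2990]
print(total_params)      # 647214
```
%% Cell type:markdown id: tags:
Note that both hidden representations (64 values) are wider than the 46-dimensional output, so no intermediate layer is narrower than the number of classes.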
%% Cell type:markdown id: tags:
#### 1. Define Model
%% Cell type:code id: tags:
``` python
import tensorflow as tf
model = tf.keras.Sequential([
    # Two hidden layers with 64 units each.
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    # Output layer: a probability distribution over the 46 topics.
    tf.keras.layers.Dense(46, activation="softmax"),
])
```
%% Cell type:markdown id: tags:
#### 2. Compile Model
%% Cell type:code id: tags:
``` python
model.compile(optimizer="rmsprop",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```
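%% Cell type:markdown id: tags:
The `categorical_crossentropy` loss compares the predicted probability distribution with the one-hot label. The sketch below is a simplified NumPy version for illustration, not Keras's exact implementation; the three-class examples are made up.
%% Cell type:code id: tags:
```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Return -sum(y_true * log(y_pred)) per sample, clipping to avoid log(0)."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.sum(y_true * np.log(y_pred), axis=-1)

y_true = np.array([[0.0, 1.0, 0.0]])          # true class is index 1
confident = np.array([[0.05, 0.90, 0.05]])    # most probability on the true class
unsure = np.array([[0.40, 0.20, 0.40]])       # little probability on the true class

loss_confident = categorical_crossentropy(y_true, confident)[0]
loss_unsure = categorical_crossentropy(y_true, unsure)[0]
print(loss_confident, loss_unsure)
```
%% Cell type:markdown id: tags:
The loss is small when the model assigns high probability to the true class and grows as that probability shrinks.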
%% Cell type:markdown id: tags:
#### 3. Fit Model
%% Cell type:code id: tags:
``` python
X_val = X_train[:1000]
partial_X_train = X_train[1000:]
y_val = y_train[:1000]
partial_y_train = y_train[1000:]
history = model.fit(partial_X_train,
                    partial_y_train,
                    epochs=20,
                    batch_size=512,
                    validation_data=(X_val, y_val))
```
%%%% Output: stream
Epoch 1/20
16/16 [==============================] - 5s 243ms/step - loss: 2.6391 - accuracy: 0.5238 - val_loss: 1.7219 - val_accuracy: 0.6400
Epoch 2/20
16/16 [==============================] - 2s 155ms/step - loss: 1.4167 - accuracy: 0.7075 - val_loss: 1.3007 - val_accuracy: 0.7210
Epoch 3/20
16/16 [==============================] - 2s 141ms/step - loss: 1.0560 - accuracy: 0.7781 - val_loss: 1.1373 - val_accuracy: 0.7530
Epoch 4/20
16/16 [==============================] - 2s 121ms/step - loss: 0.8347 - accuracy: 0.8251 - val_loss: 1.0406 - val_accuracy: 0.7780
Epoch 5/20
16/16 [==============================] - 2s 152ms/step - loss: 0.6695 - accuracy: 0.8628 - val_loss: 1.0090 - val_accuracy: 0.7800
Epoch 6/20
16/16 [==============================] - 2s 112ms/step - loss: 0.5355 - accuracy: 0.8923 - val_loss: 0.9347 - val_accuracy: 0.8130
Epoch 7/20
16/16 [==============================] - 2s 114ms/step - loss: 0.4362 - accuracy: 0.9094 - val_loss: 0.9231 - val_accuracy: 0.8070
Epoch 8/20
16/16 [==============================] - 2s 133ms/step - loss: 0.3563 - accuracy: 0.9250 - val_loss: 0.9006 - val_accuracy: 0.8130
Epoch 9/20
16/16 [==============================] - 2s 111ms/step - loss: 0.2967 - accuracy: 0.9367 - val_loss: 1.0024 - val_accuracy: 0.7920
Epoch 10/20
16/16 [==============================] - 2s 136ms/step - loss: 0.2489 - accuracy: 0.9429 - val_loss: 0.9413 - val_accuracy: 0.8070
Epoch 11/20
16/16 [==============================] - 2s 117ms/step - loss: 0.2144 - accuracy: 0.9485 - val_loss: 0.9109 - val_accuracy: 0.8180
Epoch 12/20
16/16 [==============================] - 2s 106ms/step - loss: 0.1908 - accuracy: 0.9516 - val_loss: 0.9332 - val_accuracy: 0.8090
Epoch 13/20
16/16 [==============================] - 2s 111ms/step - loss: 0.1693 - accuracy: 0.9534 - val_loss: 0.9461 - val_accuracy: 0.8090
Epoch 14/20
16/16 [==============================] - 2s 112ms/step - loss: 0.1577 - accuracy: 0.9538 - val_loss: 0.9792 - val_accuracy: 0.8040
Epoch 15/20
16/16 [==============================] - 2s 107ms/step - loss: 0.1422 - accuracy: 0.9558 - val_loss: 1.0086 - val_accuracy: 0.8080
Epoch 16/20
16/16 [==============================] - 2s 113ms/step - loss: 0.1331 - accuracy: 0.9575 - val_loss: 1.0155 - val_accuracy: 0.8070
Epoch 17/20
16/16 [==============================] - 2s 106ms/step - loss: 0.1319 - accuracy: 0.9578 - val_loss: 1.0134 - val_accuracy: 0.8060
Epoch 18/20
16/16 [==============================] - 2s 107ms/step - loss: 0.1206 - accuracy: 0.9568 - val_loss: 1.0325 - val_accuracy: 0.8140
Epoch 19/20
16/16 [==============================] - 1s 93ms/step - loss: 0.1208 - accuracy: 0.9578 - val_loss: 1.0760 - val_accuracy: 0.8020
Epoch 20/20
16/16 [==============================] - 2s 106ms/step - loss: 0.1142 - accuracy: 0.9580 - val_loss: 1.1040 - val_accuracy: 0.8020
%% Cell type:code id: tags:
``` python
import matplotlib.pyplot as plt
loss = history.history["loss"]
val_loss = history.history["val_loss"]
epochs = range(1, len(loss) + 1)
plt.figure(figsize=(8, 8))
plt.plot(epochs, loss, label='Training Loss')
plt.plot(epochs, val_loss, label='Validation Loss')
plt.legend(loc='upper right')
plt.title('Training and Validation Loss')
```
%%%% Output: execute_result
Text(0.5, 1.0, 'Training and Validation Loss')
%%%% Output: display_data
![]()
%% Cell type:code id: tags:
``` python
acc = history.history["accuracy"]
val_acc = history.history["val_accuracy"]
plt.figure(figsize=(8, 8))
plt.plot(epochs, acc, label='Training Accuracy')
plt.plot(epochs, val_acc, label='Validation Accuracy')
plt.legend(loc='upper right')
plt.title('Training and Validation Accuracy')
plt.show()
```
%%%% Output: display_data
![]()
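%% Cell type:markdown id: tags:
In the training log above, the training loss keeps falling while the validation loss stops improving after a few epochs, which is the usual sign of overfitting. The sketch below locates the epoch with the lowest validation loss; the values in `val_loss_log` are copied from the log above, and the variable names are illustrative.
%% Cell type:code id: tags:
```python
import numpy as np

# Validation losses for epochs 1 to 20, copied from the training log.
val_loss_log = [
    1.7219, 1.3007, 1.1373, 1.0406, 1.0090,
    0.9347, 0.9231, 0.9006, 1.0024, 0.9413,
    0.9109, 0.9332, 0.9461, 0.9792, 1.0086,
    1.0155, 1.0134, 1.0325, 1.0760, 1.1040,
]

# Epochs are numbered from 1, so add 1 to the zero-based index.
best_epoch = int(np.argmin(val_loss_log)) + 1
best_val_loss = val_loss_log[best_epoch - 1]
print(best_epoch, best_val_loss)  # 8 0.9006
```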
%% Cell type:markdown id: tags:
#### 4. Model Evaluation
%% Cell type:code id: tags:
``` python
results = model.evaluate(X_test, y_test)
print(results)
```
%%%% Output: stream
71/71 [==============================] - 1s 7ms/step - loss: 1.2564 - accuracy: 0.7845
[1.2563728094100952, 0.7845057845115662]
%% Cell type:markdown id: tags:
#### 5. Prediction
%% Cell type:code id: tags:
``` python
predictions = model.predict(X_test)
np.argmax(predictions[0])
```
%%%% Output: execute_result
3
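%% Cell type:markdown id: tags:
Each row returned by `model.predict` is the output of the softmax layer, that is, a probability distribution over the 46 topics. The sketch below reproduces this on random toy scores with a hand-written `softmax`; it does not use the trained model, and all names are illustrative.
%% Cell type:code id: tags:
```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax along the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
toy_logits = rng.normal(size=(2, 46))   # two fake samples, 46 class scores each

toy_predictions = softmax(toy_logits)
# The predicted topic is the index of the largest probability in each row.
predicted_classes = np.argmax(toy_predictions, axis=-1)

print(toy_predictions.shape)
print(toy_predictions.sum(axis=-1))
print(predicted_classes)
```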
%% Cell type:markdown id: tags:
# Exercise 2 - Recurrent Neural Networks for Prediction of Temperature Time Series
%% Cell type:markdown id: tags:
Throughout this exercise, all of our code examples will target a single problem: predicting the temperature 24 hours in the future, given a timeseries of hourly measurements of quantities such as atmospheric pressure and humidity, recorded over the recent past by a set of sensors on the roof of a building. As you will see, it’s a fairly challenging problem!
We’ll use this temperature-forecasting task to highlight what makes timeseries data fundamentally different from the kinds of datasets you’ve encountered so far. You’ll see that densely connected networks and convolutional networks aren’t well-equipped to deal with this kind of dataset, while recurrent neural networks (RNNs) really shine on this type of problem.
We’ll work with a weather timeseries dataset recorded at the weather station at the Max Planck Institute for Biogeochemistry in Jena, Germany. In this dataset, 14 different quantities (such as temperature, pressure, humidity, wind direction, and so on) were recorded every 10 minutes over several years. The original data goes back to 2003, but the subset of the data we’ll download is limited to 2009–2016.
Let’s start by downloading and uncompressing the data:
%% Cell type:code id: tags:
``` python
!wget https://s3.amazonaws.com/keras-datasets/jena_climate_2009_2016.csv.zip
!unzip jena_climate_2009_2016.csv.zip
```
%% Cell type:markdown id: tags:
Now let’s inspect the data of the Jena weather dataset:
%% Cell type:code id: tags:
``` python
import os
fname = os.path.join("jena_climate_2009_2016.csv")
print(fname)
```
%%%% Output: stream
jena_climate_2009_2016.csv
%% Cell type:code id: tags:
``` python
with open(fname) as f:
    data = f.read()
```
%% Cell type:code id: tags:
``` python
lines = data.split("\n")
print(lines[:2])
```
%%%% Output: stream
['"Date Time","p (mbar)","T (degC)","Tpot (K)","Tdew (degC)","rh (%)","VPmax (mbar)","VPact (mbar)","VPdef (mbar)","sh (g/kg)","H2OC (mmol/mol)","rho (g/m**3)","wv (m/s)","max. wv (m/s)","wd (deg)"', '01.01.2009 00:10:00,996.52,-8.02,265.40,-8.90,93.30,3.33,3.11,0.22,1.94,3.12,1307.75,1.03,1.75,152.30']
%% Cell type:markdown id: tags:
The next cell prints the header and a count of 420'451 lines of data (each line is a timestep: a record of a
date and 14 weather-related values):
%% Cell type:code id: tags:
``` python
header = lines[0].split(",")
lines = lines[1:]
print(header)
print(len(lines))
```
%%%% Output: stream
['"Date Time"', '"p (mbar)"', '"T (degC)"', '"Tpot (K)"', '"Tdew (degC)"', '"rh (%)"', '"VPmax (mbar)"', '"VPact (mbar)"', '"VPdef (mbar)"', '"sh (g/kg)"', '"H2OC (mmol/mol)"', '"rho (g/m**3)"', '"wv (m/s)"', '"max. wv (m/s)"', '"wd (deg)"']
420451
%% Cell type:markdown id: tags:
Now, convert all 420'451 lines of data into NumPy arrays: one array for the temperature
(in degrees Celsius), and another one for the rest of the data — the features we
will use to predict future temperatures. Note that we discard the `Date Time` column.
%% Cell type:code id: tags:
``` python
import numpy as np
temperature = np.zeros((len(lines),))
raw_data = np.zeros((len(lines), len(header) - 1))
print(temperature.shape)
print(raw_data.shape)
```
%%%% Output: stream
(420451,)
(420451, 14)
%% Cell type:code id: tags:
``` python
for i, line in enumerate(lines):
    # Skip the first field ("Date Time") and parse the remaining 14 values.
    values = [float(x) for x in line.split(",")[1:]]
    # We store column 1 in the “temperature” array.
    temperature[i] = values[1]
    # We store all columns (including the temperature) in the “raw_data” array.
    raw_data[i, :] = values[:]
```
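%% Cell type:markdown id: tags:
As an alternative to the explicit loop, the same parsing can be sketched with `np.loadtxt`. The example below reads a two-row excerpt from an in-memory string rather than the full file; the header and values are taken from the output above, while the second timestamp is an assumption based on the 10-minute recording interval.
%% Cell type:code id: tags:
```python
import io
import numpy as np

# Two-row excerpt of the CSV (the second timestamp is assumed, see above).
sample_csv = io.StringIO(
    '"Date Time","p (mbar)","T (degC)","Tpot (K)","Tdew (degC)","rh (%)",'
    '"VPmax (mbar)","VPact (mbar)","VPdef (mbar)","sh (g/kg)",'
    '"H2OC (mmol/mol)","rho (g/m**3)","wv (m/s)","max. wv (m/s)","wd (deg)"\n'
    "01.01.2009 00:10:00,996.52,-8.02,265.40,-8.90,93.30,3.33,3.11,0.22,"
    "1.94,3.12,1307.75,1.03,1.75,152.30\n"
    "01.01.2009 00:20:00,996.57,-8.41,265.01,-9.28,93.40,3.23,3.02,0.21,"
    "1.89,3.03,1309.80,0.72,1.50,136.10\n"
)

# Skip the header row and the "Date Time" column (column 0).
sample_raw = np.loadtxt(
    sample_csv, delimiter=",", skiprows=1, usecols=list(range(1, 15))
)
# Column 1 of the remaining 14 columns is the temperature in degrees Celsius.
sample_temperature = sample_raw[:, 1]

print(sample_raw.shape)
print(sample_temperature)
```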
%% Cell type:code id: tags:
``` python
temperature[:2]
print(raw_data[:3])
```
%%%% Output: stream
[[ 9.96520e+02 -8.02000e+00 2.65400e+02 -8.90000e+00 9.33000e+01
3.33000e+00 3.11000e+00 2.20000e-01 1.94000e+00 3.12000e+00
1.30775e+03 1.03000e+00 1.75000e+00 1.52300e+02]
[ 9.96570e+02 -8.41000e+00 2.65010e+02 -9.28000e+00 9.34000e+01
3.23000e+00 3.02000e+00 2.10000e-01 1.89000e+00 3.03000e+00
1.30980e+03 7.20000e-01 1.50000e+00 1.36100e+02]
[ 9.96530e+02 -8.51000e+00 2.64910e+02 -9.31000e+00 9.39000e+01
3.21000e+00 3.01000e+00 2.00000e-01 1.88000e+00 3.02000e+00
1.31024e+03 1.90000e-01 6.30000e-01 1.71600e+02]]
%% Cell type:markdown id: tags:
The figure below shows the plot of temperature (in degrees Celsius) over time. On this plot,
you can clearly see the yearly periodicity of temperature — the data spans 8 years.
%% Cell type:code id: tags:
``` python
%matplotlib inline
from matplotlib import pyplot as plt
plt.plot(range(len(temperature)), temperature)
```
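%% Cell type:markdown id: tags:
The time span of the data can be checked with a little arithmetic. This sketch assumes one record every 10 minutes, as stated in the dataset description, and uses the record count printed earlier; the variable names are illustrative.
%% Cell type:code id: tags:
```python
# One record every 10 minutes.
samples_per_hour = 60 // 10
samples_per_day = 24 * samples_per_hour

# Number of data lines printed when the file was loaded.
num_records = 420451

days_covered = num_records / samples_per_day
years_covered = days_covered / 365.25

print(samples_per_day)           # 144
print(round(years_covered, 2))
```
%% Cell type:markdown id: tags:
The same figure gives the forecasting horizon in raw timesteps: predicting 24 hours ahead corresponds to 144 records at the original 10-minute resolution.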