Showing posts with label tensorflow. Show all posts

Sunday, February 07, 2021

Comparison of Text Augmentation Strategies for Spam Detection

Some time back, I found myself thinking of different data augmentation strategies for unbalanced datasets, i.e. datasets in which one or more classes are over-represented compared to the others, and wondering how these strategies stack up to one another. So I decided to set up a simple experiment to compare them. This post describes the experiment and its results.

The dataset I chose for this experiment was the SMS Spam Collection Dataset from Kaggle, a collection of almost 5600 text messages, consisting of 4825 (87%) ham and 747 (13%) spam messages. The network is a simple 3 layer fully connected network (FCN), whose input is a 512 element vector generated using the Google Universal Sentence Encoder (GUSE) against the text message, and outputs the argmax of a 2 element vector (representing "ham" or "spam"). The text augmentation strategies I considered in my experiment are as follows:

  • Baseline -- this is a baseline for result comparison. Since the task is binary classification, the metric we chose is Accuracy. We train the network for 10 epochs using Cross Entropy and the AdamW Optimizer with a learning rate of 1e-3.
  • Class Weights -- Class Weights attempt to address data imbalance by giving more weight to the minority class. Here we assign class weights to our loss function proportional to the inverse of the class counts in the training data.
  • Undersampling Majority Class -- in this scenario, we sample from the majority class the number of records in the minority class, and only use the sampled subset of the majority class plus the minority class for our training.
  • Oversampling Minority Class -- this is the opposite scenario, where we sample (with replacement) from the minority class a number of records that are equal to the number in the majority class. The sampled set will contain repetitions. We then use the sampled set plus the majority class for training.
  • SMOTE -- this is a variant on the previous strategy of oversampling the minority class. SMOTE (Synthetic Minority Oversampling TEchnique) ensures more heterogeneity in the oversampled minority class by creating synthetic records by interpolating between real records. SMOTE needs the input data to be vectorized.
  • Text Augmentation -- like the two previous approaches, this is another oversampling strategy. Heuristics and ontologies are used to make changes to the input text while preserving its meaning as far as possible. I used TextAttack, a Python library for text augmentation (and for generating examples for adversarial attacks).
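The class-weight and naive resampling strategies in the list above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the toy label list, the seed, and the helper names (inverse_frequency_weights, split_by_class, undersample_majority, oversample_minority) are hypothetical and are not taken from the notebooks.

```python
import random
from collections import Counter

def inverse_frequency_weights(labels):
    """One weight per class, proportional to 1 / class count, normalized to sum to 1."""
    counts = Counter(labels)
    raw = {cls: 1.0 / n for cls, n in counts.items()}
    total = sum(raw.values())
    return {cls: w / total for cls, w in raw.items()}

def split_by_class(labels):
    """Return (majority indices, minority indices) for a two-class label list."""
    counts = Counter(labels)
    majority = max(counts, key=counts.get)
    maj_idx = [i for i, y in enumerate(labels) if y == majority]
    min_idx = [i for i, y in enumerate(labels) if y != majority]
    return maj_idx, min_idx

def undersample_majority(labels, rng):
    """Keep every minority record plus an equally sized majority sample."""
    maj_idx, min_idx = split_by_class(labels)
    return rng.sample(maj_idx, len(min_idx)) + min_idx

def oversample_minority(labels, rng):
    """Keep every majority record plus a with-replacement minority sample of equal size."""
    maj_idx, min_idx = split_by_class(labels)
    return maj_idx + rng.choices(min_idx, k=len(maj_idx))

# toy labels (hypothetical): 0 = ham (majority), 1 = spam (minority)
labels = [0] * 8 + [1] * 2
rng = random.Random(42)
weights = inverse_frequency_weights(labels)
under_idx = undersample_majority(labels, rng)
over_idx = oversample_minority(labels, rng)
```

With 8 ham and 2 spam records, the spam class ends up with four times the weight of the ham class, the undersampled set has 4 records, and the oversampled set has 16.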

A few points to note here.

First, all the sampling methods, i.e., all the strategies listed above except for the Baseline and Class Weights, require you to split your data into training, validation, and test splits before they are applied. Also, the sampling should be done only on the training split. Otherwise, you risk data leakage, where the augmented data leaks into the validation and test splits, giving you very optimistic results during model development that will invariably not hold as you move your model into production.
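A minimal sketch of the "split first, then resample only the training split" rule follows. The helper name train_val_test_split, the toy labels, and the seed are assumptions made for illustration; the 70/10/20 proportions match the split used in the first notebook.

```python
import random

def train_val_test_split(n_records, rng, val_frac=0.1, test_frac=0.2):
    """Shuffle record indices and cut them into train/validation/test lists."""
    idx = list(range(n_records))
    rng.shuffle(idx)
    n_test = int(n_records * test_frac)
    n_val = int(n_records * val_frac)
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]
    return train_idx, val_idx, test_idx

rng = random.Random(0)
labels = [0] * 80 + [1] * 20  # toy labels (hypothetical): 0 = ham, 1 = spam
train_idx, val_idx, test_idx = train_val_test_split(len(labels), rng)

# Oversample spam using records from the training split only, so that no
# copy of a validation or test record can end up in the training data.
train_spam = [i for i in train_idx if labels[i] == 1]
train_ham = [i for i in train_idx if labels[i] == 0]
n_extra = max(0, len(train_ham) - len(train_spam))
balanced_train_idx = train_idx + rng.choices(train_spam, k=n_extra)
```

The validation and test index lists are left exactly as the split produced them, which is what keeps the evaluation honest.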

Second, augmenting your data using SMOTE can only be done on vectorized data, since the idea is to find and use points in feature hyperspace that are "in-between" your existing data. Because of this, I decided to pre-vectorize my text inputs using GUSE. Other augmentation approaches considered here don't need the input to be pre-vectorized.
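To make the "in-between points" idea concrete, here is a deliberately simplified sketch of SMOTE-style interpolation. It is not the algorithm as implemented in any library: real SMOTE picks at random among the k nearest minority neighbors, whereas this sketch always uses the single nearest one, and the function names and toy vectors are hypothetical.

```python
import random

def interpolate(x, neighbor, lam):
    """Point on the segment from x to neighbor; lam=0 gives x, lam=1 gives neighbor."""
    return [a + lam * (b - a) for a, b in zip(x, neighbor)]

def nearest_neighbor(x, others):
    """Vector in others with the smallest squared Euclidean distance to x."""
    return min(others, key=lambda o: sum((a - b) ** 2 for a, b in zip(x, o)))

def smote_like_sample(minority, rng):
    """Create one synthetic minority vector between a random record and its nearest neighbor."""
    i = rng.randrange(len(minority))
    x = minority[i]
    others = minority[:i] + minority[i + 1:]
    neighbor = nearest_neighbor(x, others)
    return interpolate(x, neighbor, rng.random())

# toy 2-d "embeddings" (hypothetical) standing in for the 512-d GUSE vectors
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
midpoint = interpolate([0.0, 0.0], [2.0, 4.0], 0.5)
synthetic = smote_like_sample(minority, random.Random(7))
```

Because every synthetic point is a convex combination of two real vectors, this only makes sense for numeric feature vectors, which is why the text has to be vectorized first.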

The code for this experiment is divided into two notebooks.

  • blog_text_augment_01.ipynb -- In this notebook, I split the dataset into a train/validation/test split of 70/10/20, and generate vector representations for each text message using GUSE. I also oversample the minority class (spam) by generating approximately 5 augmentations for each record, and generate their vector representations as well.
  • blog_text_augment_02.ipynb -- I define a common network, which I retrain using Pytorch for each of the 6 augmentation scenarios listed above, and compare their accuracies.

Results are shown below, and seem to indicate that oversampling strategies tend to work the best, both the naive one and the one based on SMOTE. The next best choice seems to be class weights. This seems understandable because oversampling gives the network the most data to train with. That is probably also why undersampling doesn't work well. I was a bit surprised also that text augmentation strategies did not perform as well as the other oversampling strategies.

However, the differences here are quite small and possibly not really significant (note that the y-axis in the bar chart is exaggerated, spanning 0.95 to 1.0, to highlight this difference). I also found that the results varied across multiple runs, probably because of different random initializations. But overall the pattern shown above was the most common.

Edit 2021-02-13: @Yorko suggested using confidence intervals in order to address my above concern (see comments below), so I collected the results from 10 runs and computed the mean and standard deviation for each approach across all the runs. The updated bar chart above shows the mean value and has error bars of +/- 2 standard deviations off the mean result. Thanks to the error bars, we can now draw a few additional conclusions. First, we observe that SMOTE oversampling can indeed give better results than naive oversampling. It also shows that undersampling results can be very highly variable.
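The error-bar computation described in the edit can be sketched as follows. The accuracy values are placeholders chosen to make the arithmetic easy to follow, not results from the experiment, and using the sample standard deviation (rather than the population one) is an assumption.

```python
from statistics import mean, stdev

def error_bar(accuracies, num_std=2.0):
    """Return (lower, center, upper) for a mean +/- num_std standard deviations bar."""
    center = mean(accuracies)
    spread = stdev(accuracies)  # sample standard deviation (assumption)
    return center - num_std * spread, center, center + num_std * spread

# placeholder accuracies for three runs of one strategy (NOT measured values)
low, center, high = error_bar([0.97, 0.98, 0.99])
```

Two strategies whose bars overlap heavily cannot be confidently ranked from these runs alone, which is the point of plotting the bars in the first place.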

Saturday, December 19, 2020

First steps with Pytorch Lightning

Some time back, Quora routed a "Keras vs. Pytorch" question to me, which I decided to ignore because it seemed too much like flamebait to me. A couple of weeks back, after discussions with colleagues and (professional) acquaintances who had tried out libraries like Catalyst, Ignite, and Lightning, I decided to get on the Pytorch boilerplate elimination train as well, and tried out Pytorch Lightning. As I did so, my thoughts inevitably went back to the Quora question, and I came to the conclusion that, in their current form, the two libraries and their respective ecosystems are more similar than they are different, and that there is no technological reason to choose one over the other. Allow me to explain.

Neural networks learn using Gradient Descent. The central idea behind Gradient Descent can be neatly encapsulated in the equation below (extracted from the same linked Gradient Descent article), and is referred to as the "training loop". Of course, there are other aspects of neural networks, such as model and data definition, but it is the training loop where the differences in the earlier versions of the two libraries and their subsequent coming together are most apparent. So I will mostly talk about the training loop here.
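For reference, the usual textbook form of the gradient descent update, which is what a training loop repeats for every batch, is shown here. The symbols are notation assumed for this sketch and may differ from the linked article: theta are the model weights, eta is the learning rate, and J is the loss.

```latex
\theta_{t+1} = \theta_t - \eta \, \nabla_{\theta} J(\theta_t)
```

Each pass through the loop computes the loss J on a batch, computes its gradient with respect to the weights, and moves the weights a small step in the opposite direction.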

Keras was initially conceived of as a high level API over the low level graph based APIs from Theano and Tensorflow. Graph APIs allow the user to first define the computation graph and then execute it. Once the graph is defined, the library will attempt to build the most efficient representation for the graph before execution. This makes the execution more efficient, but adds a lot of boilerplate to the code, and makes it harder to debug if something goes wrong. The biggest success of Keras in my opinion is its ability to hide the graph API almost completely behind an elegant API. In particular, its "training loop" looks like this:

model.compile(optimizer=optimizer, loss=loss_fn, metrics=[train_acc])
model.fit(Xtrain, ytrain, epochs=epochs, batch_size=batch_size)

Of course, the fit method has many other parameters as well, but at its most complex, it is a single line call. And, this is probably all that is needed for most simple cases. However, as networks get slightly more complex, with maybe multiple models or loss functions, or custom update rules, the only option for Keras used to be to drop down to the underlying Tensorflow or Theano code. In these situations, Pytorch appears really attractive, with the power, simplicity, and readability of its training loop.

from torch.utils.data import DataLoader, TensorDataset

# Xtrain and ytrain are assumed to be tensors; pairing them lets each batch unpack as (X, y)
dataloader = DataLoader(TensorDataset(Xtrain, ytrain), batch_size=batch_size)
for epoch in range(epochs):
    for batch in dataloader:
        X, y = batch
        logits = model(X)
        loss = loss_fn(logits, y)

        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        # aggregate metrics
        train_acc(logits, y)

        # evaluate validation loss, etc.

However, with the release of Tensorflow 2.x, which included Keras as its default API through the tf.keras package, it is now possible to do something identical with Keras and Tensorflow as well.

import tensorflow as tf

# pair features with labels so that each batch unpacks as (X, y)
dataset = tf.data.Dataset.from_tensor_slices((Xtrain, ytrain)).batch(batch_size)
for epoch in range(epochs):
    for batch in dataset:
        X, y = batch
        with tf.GradientTape() as tape:
            logits = model(X)
            loss = loss_fn(y_pred=logits, y_true=y)
        grads = tape.gradient(loss, model.trainable_weights)
        optimizer.apply_gradients(zip(grads, model.trainable_weights))

        # aggregate metrics (Keras metrics take y_true first, then y_pred)
        train_acc(y, logits)

In both cases, developers accept having to deal with some amount of boilerplate in return for additional power and flexibility. The approach taken by each of the three Pytorch add-on libraries I listed earlier, including Pytorch Lightning, is to create a Trainer object. The trainer models the training loop as an event loop with hooks into which specific functionality can be injected as callbacks. Functionality in these callbacks is executed at specific points in the training loop. So a partial LightningModule subclass for our use case would look something like this (see the Pytorch Lightning Documentation or my code examples below for more details).

import pytorch_lightning as pl

class MyLightningModel(pl.LightningModule):
    def __init__(self, args):
        super().__init__()
        # same as Pytorch nn.Module subclass __init__()
        ...

    def forward(self, x):
        # same as Pytorch nn.Module subclass forward()
        ...

    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self.forward(x)
        loss = loss_fn(logits, y)
        acc = self.train_acc(logits, y)
        return loss

    def configure_optimizers(self):
        return self.optimizer

model = MyLightningModel(args)
trainer = pl.Trainer(gpus=1)
trainer.fit(model, dataloader)

If you think about it, this event loop strategy used by Lightning's trainer.fit() is pretty much how Keras manages to convert its training loop to a single line model.fit() call as well, its many parameters acting as the callbacks that control the training behavior. Pytorch Lightning is just a bit more explicit (and okay, a bit more verbose) about it. In effect, both libraries have solutions that address the other's pain points, so the only reason you would choose one or the other is personal or corporate preference.
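The event-loop idea can be illustrated with a toy trainer in plain Python. This is a sketch of the pattern only, not how Lightning or Keras are implemented; the class names MiniTrainer and RecordingModule and the hook names used here are hypothetical.

```python
class MiniTrainer:
    """Toy event loop: owns the epoch/batch loops and calls hooks at fixed points."""

    def __init__(self, max_epochs):
        self.max_epochs = max_epochs

    def fit(self, module, batches):
        module.on_fit_start()
        for epoch in range(self.max_epochs):
            for batch_idx, batch in enumerate(batches):
                module.training_step(batch, batch_idx)
            module.on_epoch_end(epoch)


class RecordingModule:
    """Stand-in for a model: records which hooks were called, in order."""

    def __init__(self):
        self.events = []

    def on_fit_start(self):
        self.events.append("fit_start")

    def training_step(self, batch, batch_idx):
        self.events.append(f"step:{batch_idx}")

    def on_epoch_end(self, epoch):
        self.events.append(f"epoch_end:{epoch}")


module = RecordingModule()
MiniTrainer(max_epochs=2).fit(module, batches=[10, 20])
```

The user code never writes a loop. It only fills in what should happen at each hook, which is what makes the single-line fit() call possible.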

In addition to callbacks for each of training, validation, and test steps, there are additional callbacks for each of these steps that will be called at the end of each step and epoch, for example: training_epoch_end() and training_step_end(). Another nice side effect of adopting something like Pytorch Lightning is that you get some of the default functionality of the event loop for free. For example, logging is done to Tensorboard by default, and progress bars are controlled using TQDM. Finally, (and that is the raison d'etre for Pytorch Lightning from the point of view of its developers) it helps you organize your Pytorch code.

To get familiar with Pytorch Lightning, I took three of my old notebooks, each dealing with training one major type of Neural Network architecture (from the old days) -- a fully connected, a convolutional, and a recurrent network -- and converted them to use Pytorch Lightning. You may find them useful to look at, in addition to Pytorch Lightning's extensive documentation, so I am including links to them below.

Saturday, November 14, 2020

ODSC trip report and Keras Tutorial

I attended ODSC (Open Data Science Conference) West 2020 at the end of last month. I also presented Keras from Soup to Nuts -- an example driven tutorial there, a 3-hour tutorial on Keras. Like other conferences this year, the event was all-virtual. Having attended one other all-virtual conference this year (Knowledge Discovery and Data Mining (KDD) 2020) and been part of organizing another (an in-house conference), I can appreciate how much work it took to pull it off. As with the other conferences, I continue to be impressed at how effortless it all appears to be from the point of view of both speaker and attendee, so kudos to the ODSC organizers and volunteers for a job well done!

In this post, I want to cover my general impressions about the conference for readers of this blog. Content seems similar to PyData, except that not all talks here are Python (or Julia or R) related. As with PyData, the content is mostly targeted at data scientists in industry, with a few talks that are more academic, based on the presenter's own research. I think there is also more coverage of the career related aspects of Data Science than at PyData. I also thought that there was more content here than in typical PyData conferences -- the conference was 4 days long (Monday to Friday) and multi-track, with workshops and presentations. The variety of content feels a bit like KDD but with less academic rigor. Overall, the content is high-quality, and if you enjoy attending PyData conferences, you will find more than enough talks and workshops here to hold your interest through the duration of the conference.

Pricing is also a bit steep compared to KDD and PyData, although there seem to be deep discounts available if you qualify. You have to contact the organizers for details about the discounts. Fortunately I didn't have to worry about that since I was presenting and my ticket was complimentary.

Like KDD and unlike PyData, ODSC also does not share talk recordings with the public after the conference. Speakers sometimes do share their slides and github repositories, so hopefully you will find these resources for the talks I list below. Because my internal conference (the one I was part of the organizing team for) was scheduled the very next week, I could not spend as much time at ODSC as I would have liked, so there were many talks that I would have liked to attend but didn't. Here is the full schedule (until the link is repurposed for the 2021 conference).

As I mentioned earlier, I also presented a 3 hour tutorial on Keras, so I wanted to cover that in slightly greater detail for readers here as well. As implied by the name and the talk abstract, the tutorial tries to teach participants enough Keras to become advanced Keras programmers, and assumes only some Python programming experience as a pre-requisite. Clearly 3 hours is not enough time, so the notebooks are deliberately short on theory and heavy on examples. I organized the tutorial into three 45-minute sessions, with exercises at the end of the first two, but we ended up just running through the exercise solutions instead because of time constraints.

The tutorial materials are just a collection of Colab notebooks that are available at my sujitpal/keras-tutorial-odsc2020 github repository. The project README provides additional information about what each notebook contains. Each notebook is numbered with the session and sequence within each session. There are two notebooks called exercise 1 and 2, and corresponding solution notebooks titled exercise_1_solved and exercise_2_solved.

Keras started life as an easy to use high level API to Theano and Tensorflow, but has since been subsumed into Tensorflow 2.x as its default API. I was among those who learned Keras in its first incarnation, when certain things were just impossible to do in Keras, and the only option was to drop down to Tensorflow 1.x's two-step model (create compute graph and then run it with data). In many cases, Pytorch provided simpler ways to do the same thing, so for complex models I found myself increasingly gravitating towards Pytorch. I did briefly look at Keras (now tf.keras) and Tensorflow 2.0-alpha while co-authoring the Deep Learning with Tensorflow 2 and Keras book, but the software was new and there was not a whole lot of information available at the time.

My point of mentioning all this is to acknowledge that I ended up learning a bit of advanced Keras myself as well when building the last few notebooks. Depending on where you are with Keras, you might find them interesting as well. Some of the interesting examples covered (according to me) are Sequence to Sequence models with and without attention, using transformers from the Huggingface Transformers library in your Keras models, using Cyclic Learning Rates and LR Finder, and distributed training across multiple GPUs and TPU. I am actually quite pleasantly surprised at how much more you can do with tf.keras with respect to the underlying Tensorflow framework, and I think you will be too (if you aren't already).

Monday, June 24, 2019

Understanding LR Finder and Cyclic Learning Rates using Tensorflow 2


A group of us at work are following Jeremy Howard's Practical Deep Learning for Coders, v3. Basically we watch the videos on our own, and come together once a fortnight or so to discuss things that seemed interesting and useful, or to answer questions that someone might have. One thing that's covered fairly early on in the course is how to use the Learning Rate Finder (LR Finder) tool that comes built-in with the fast.ai library (fastai/fastai on github). The fast.ai library builds on top of the Pytorch framework, and provides convenience functions that can make deep learning development simpler. This is a lot like what Keras did for Tensorflow. Keras, incidentally, is also the Deep Learning framework that I started with and confess to being somewhat partial to, although nowadays I use the tf.keras (Tensorflow) port exclusively. So I figured that it would be interesting to see how to do this (LR Finding) with Keras.

The LR Finder is the brainchild of Leslie Smith, who has published two papers on it -- A disciplined approach to neural network hyperparameters: Part 1 -- Learning Rate, Batch Size, Momentum, and Weight Decay, and another jointly with Nicholay Topin, Super-Convergence: Very Fast Training of Neural Networks using Large Learning Rates. The LR Finder approach is able to predict, with only a few iterations of training, a range of learning rates that would be optimal for a given model/dataset combination. It does this by varying the learning rate across the training iterations, and observing the loss for each learning rate. The shape of the plot of loss against learning rate provides clues about the optimal range of learning rates, much like stock charts provide clues to future prices of the stock.

This in itself is a pretty big time savings, compared to running limited epochs of training with different learning rates to find the "best" one. But in addition to that, Smith also proposes using a Learning Rate Schedule he calls the Cyclic Learning Rate (CLR). In its simplest incarnation, the learning rate schedule for CLR looks a bit like an isosceles triangle. Assuming N epochs of training, and an optimum learning rate range (min_lr, max_lr) predicted by the LR Finder plot, the learning rate rises uniformly from min_lr to max_lr for the first N/2 epochs, falls uniformly from max_lr to min_lr for about 90% of the next N/2 epochs, and then falls uniformly from min_lr to 0 for the last 10%. According to his experiments on a wide variety of standard model/dataset combinations, training networks with the CLR schedule results in higher classification accuracy, often with fewer epochs of training, compared to using static learning rates.
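The triangular schedule described above can be written as a small function of the epoch index. This is a sketch under assumptions, not the implementation used later in the post: the name clr_schedule, the tail_frac parameter, and the example values for N, min_lr, and max_lr are all hypothetical.

```python
def clr_schedule(epoch, num_epochs, min_lr, max_lr, tail_frac=0.1):
    """Triangular one-cycle learning rate for a 0-based epoch index."""
    half = num_epochs / 2
    if epoch < half:
        # first N/2 epochs: rise linearly from min_lr to max_lr
        return min_lr + (max_lr - min_lr) * (epoch / half)
    down_len = half * (1.0 - tail_frac)
    pos = epoch - half
    if pos < down_len:
        # about 90% of the second half: fall linearly from max_lr to min_lr
        return max_lr - (max_lr - min_lr) * (pos / down_len)
    # remaining 10%: fall linearly from min_lr towards 0
    tail_len = half * tail_frac
    return max(0.0, min_lr * (1.0 - (pos - down_len) / tail_len))

# example values (hypothetical): 20 epochs, learning rates between 0.01 and 0.1
lrs = [clr_schedule(e, num_epochs=20, min_lr=0.01, max_lr=0.1) for e in range(20)]
```

In this example the rate starts at 0.01, peaks at 0.1 at epoch 10, and is back at 0.01 by epoch 19. A function of this shape can be wrapped in a lambda and handed to the tf.keras LearningRateScheduler callback, which calls its schedule function with the epoch index.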

There is already a LR Finder and CLR schedule implementation for Keras, thanks to Somshubhra Majumdar (titu1994/keras-one-cycle), where both the LR Finder and CLR Schedule (called OneCycleLR here) are implemented as Keras callbacks. There is a decidedly Keras flavor to the implementation. For example, the LR Finder always runs for one epoch of training. However, there are advantages to using this implementation compared to rolling your own. For example, unlike the LearningRateScheduler callback built into Keras, the OneCycleLR callback also optionally allows the caller to schedule a Momentum Schedule along with a Learning Rate Schedule. Overall, it would take some effort to convert over to tf.keras, but probably not a whole lot.

At the other end of the spectrum is a Pytorch implementation from David Silva (davidtvs/pytorch-lr-finder), where the LR Finder is more of a standalone utility, which can predict the optimum range of learning rates given a model/dataset combination. I felt this approach was a bit cleaner in the sense that one can focus on what the LR finder does rather than try to think in terms of callback events.

So I decided to use the pytorch-lr-finder as a model to build a basic LR Finder of my own that works against tf.keras on Tensorflow 2, and try it out against a small standard network to see how it works. For the CLR scheduler, I decided to pass a homegrown version of the CLR schedule into the built in tf.keras LearningRateScheduler. This post will describe that experience. However, since this was mainly a learning exercise, the code has not been tested beyond what I describe here, so for your own work, you should probably stick to using the more robust implementations I referenced above.

The network I decided on was the LeNet network, proposed by Yann LeCun in 1995. The actual implementation is based on Wei Li's Keras implementation available on the BIGBALLON/cifar-10-cnn repository. The class is defined as follows, using the new imperative Chainer like syntax adopted by Pytorch and now Tensorflow 2. I had originally assumed, like many others, that this syntax was one of the features that Tensorflow 2 was adopting from Pytorch, but it turns out that they are both adopting it from Chainer, as this Twitter thread from François Chollet indicates. In any case, convergence is a good thing for framework users like me. Talking of tweets from François Chollet, if you are comfortable with Keras already, here is another Twitter thread which tells you pretty much everything you need to know to get started with Tensorflow 2.

class LeNetModel(tf.keras.Model):

    def __init__(self, **kwargs):
        super(LeNetModel, self).__init__(**kwargs)
        self.conv1 = tf.keras.layers.Conv2D(
            filters=6,
            kernel_size=(5, 5),
            padding="valid",
            activation="relu",
            kernel_initializer="he_normal",
            input_shape=(32, 32, 3))
        self.pool1 = tf.keras.layers.MaxPooling2D(
            pool_size=(2, 2))
        self.conv2 = tf.keras.layers.Conv2D(
            filters=16,
            kernel_size=(5, 5),
            padding="valid",
            activation="relu",
            kernel_initializer="he_normal")
        self.pool2 = tf.keras.layers.MaxPooling2D(
            pool_size=(2, 2),
            strides=(2, 2))
        self.flat = tf.keras.layers.Flatten()
        self.dense1 = tf.keras.layers.Dense(
            units=120,
            activation="relu",
            kernel_initializer="he_normal")
        self.dense2 = tf.keras.layers.Dense(
            units=84,
            activation="relu", 
            kernel_initializer="he_normal")
        self.dense3 = tf.keras.layers.Dense(
            units=10,
            activation="softmax",
            kernel_initializer="he_normal")


    def call(self, x):
        x = self.conv1(x)
        x = self.pool1(x)
        x = self.conv2(x)
        x = self.pool2(x)
        x = self.flat(x)
        x = self.dense1(x)
        x = self.dense2(x)
        x = self.dense3(x)
        return x        

Here is the Keras summary view for those of you who prefer something more visual. If you were wondering how I got actual values in the Output Shape column with the code above, I didn't. As Tensorflow Issue# 25036 indicates, the call() method creates a non-static graph, and so model.summary() is unable to compute the output shapes. To generate the summary below, I rebuilt the model as a static graph using tf.keras.models.Sequential(). The code is fairly trivial so I don't include it here.

Model: "le_net_model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv1 (Conv2D)               (None, 28, 28, 6)         456       
_________________________________________________________________
pool1 (MaxPooling2D)         (None, 14, 14, 6)         0         
_________________________________________________________________
conv2 (Conv2D)               (None, 10, 10, 16)        2416      
_________________________________________________________________
pool2 (MaxPooling2D)         (None, 5, 5, 16)          0         
_________________________________________________________________
flatten (Flatten)            (None, 400)               0         
_________________________________________________________________
dense1 (Dense)               (None, 120)               48120     
_________________________________________________________________
dense2 (Dense)               (None, 84)                10164     
_________________________________________________________________
dense3 (Dense)               (None, 10)                850       
=================================================================
Total params: 62,006
Trainable params: 62,006
Non-trainable params: 0
_________________________________________________________________
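The output shapes and parameter counts in the summary can be checked by hand. The sketch below redoes the arithmetic for the layer settings in the class definition; the helper names are hypothetical and exist only for this check.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Weights plus one bias per filter."""
    return (kernel_h * kernel_w * in_channels + 1) * filters

def dense_params(in_units, out_units):
    """Weights plus one bias per output unit."""
    return (in_units + 1) * out_units

def valid_conv_size(size, kernel):
    """Output side length of a stride-1 convolution with 'valid' padding."""
    return size - kernel + 1

def pool_size(size, pool):
    """Output side length of non-overlapping max pooling."""
    return size // pool

side = valid_conv_size(32, 5)          # conv1 output: 28 x 28 x 6
conv1 = conv2d_params(5, 5, 3, 6)
side = pool_size(side, 2)              # pool1 output: 14 x 14 x 6
side = valid_conv_size(side, 5)        # conv2 output: 10 x 10 x 16
conv2 = conv2d_params(5, 5, 6, 16)
side = pool_size(side, 2)              # pool2 output: 5 x 5 x 16
flat = side * side * 16                # flatten output: 400
dense1 = dense_params(flat, 120)
dense2 = dense_params(120, 84)
dense3 = dense_params(84, 10)
total = conv1 + conv2 + dense1 + dense2 + dense3
```

The individual counts come out to 456, 2416, 48120, 10164, and 850, which sum to the 62,006 total parameters reported in the summary.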

The dataset I used for the experiment was the CIFAR-10 dataset, a collection of 60K (32, 32, 3) color images (tiny images) in 10 different classes. The CIFAR-10 dataset is available via the tf.keras.datasets package. The function below downloads the data, preprocesses it appropriately for use by the network, and converts it into the tf.data.Dataset format that Tensorflow 2 likes. It will return datasets for training, validation, and test, with size 45K, 5K, and 10K images respectively.

def load_cifar10_data(batch_size):
    (xtrain, ytrain), (xtest, ytest) = tf.keras.datasets.cifar10.load_data()

    # scale data using MaxScaling
    xtrain = xtrain.astype(np.float32) / 255
    xtest = xtest.astype(np.float32) / 255

    # convert labels to categorical    
    ytrain = tf.keras.utils.to_categorical(ytrain)
    ytest = tf.keras.utils.to_categorical(ytest)

    train_dataset = tf.data.Dataset.from_tensor_slices((xtrain, ytrain))
    test_dataset = tf.data.Dataset.from_tensor_slices((xtest, ytest))

    # take out 10% of train data as validation data, shuffle, and batch
    val_size = xtrain.shape[0] // 10
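    # shuffle once only; reshuffling on every iteration would let records
    # move between the train and validation subsets from epoch to epoch
    train_dataset = train_dataset.shuffle(
        10000, reshuffle_each_iteration=False)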
    val_dataset = train_dataset.take(val_size).batch(
        batch_size, drop_remainder=True)
    train_dataset = train_dataset.skip(val_size).batch(
        batch_size, drop_remainder=True)
    test_dataset = test_dataset.shuffle(10000).batch(
        batch_size, drop_remainder=True)
    
    return train_dataset, val_dataset, test_dataset

I trained the model first using a learning rate of 0.001, which I picked up from the blog post CIFAR-10 Image Classification in Tensorflow by Park Chansung. The training code is just 5-6 lines of code that is very familiar to Keras developers - declare the model, compile the model with loss function and optimizer, then train it for a fixed number of epochs (10), and finally evaluate it against the held out test dataset.

model = LeNetModel()
model.build(input_shape=(None, 32, 32, 3))

learning_rate = 0.001
optimizer = tf.keras.optimizers.SGD(learning_rate=learning_rate)
loss_fn = tf.keras.losses.CategoricalCrossentropy()

model.compile(loss=loss_fn, optimizer=optimizer, metrics=["accuracy"])
model.fit(train_dataset, epochs=10, validation_data=val_dataset)
model.evaluate(test_dataset)

The output of this run is shown below. The first block is the output from the training run (model.fit()), and the last line is the output of the model.evaluate() call. As you can see, while the final accuracy values are not stellar, they are steadily increasing, so presumably we can expect good results given enough epochs of training. Also, the objective of this run was to create a baseline against which we will measure training runs with learning rates that we infer from our LR finder, described below.

Epoch 1/10
351/351 [==============================] - 12s 35ms/step - loss: 2.3043 - accuracy: 0.1105 - val_loss: 2.2936 - val_accuracy: 0.1170
Epoch 2/10
351/351 [==============================] - 12s 34ms/step - loss: 2.2816 - accuracy: 0.1276 - val_loss: 2.2801 - val_accuracy: 0.1330
Epoch 3/10
351/351 [==============================] - 12s 33ms/step - loss: 2.2682 - accuracy: 0.1464 - val_loss: 2.2668 - val_accuracy: 0.1442
Epoch 4/10
351/351 [==============================] - 11s 33ms/step - loss: 2.2517 - accuracy: 0.1620 - val_loss: 2.2474 - val_accuracy: 0.1621
Epoch 5/10
351/351 [==============================] - 12s 33ms/step - loss: 2.2254 - accuracy: 0.1856 - val_loss: 2.2141 - val_accuracy: 0.1893
Epoch 6/10
351/351 [==============================] - 12s 34ms/step - loss: 2.1810 - accuracy: 0.2117 - val_loss: 2.1601 - val_accuracy: 0.2226
Epoch 7/10
351/351 [==============================] - 12s 34ms/step - loss: 2.1144 - accuracy: 0.2421 - val_loss: 2.0856 - val_accuracy: 0.2526
Epoch 8/10
351/351 [==============================] - 12s 35ms/step - loss: 2.0363 - accuracy: 0.2641 - val_loss: 2.0116 - val_accuracy: 0.2714
Epoch 9/10
351/351 [==============================] - 12s 35ms/step - loss: 1.9704 - accuracy: 0.2841 - val_loss: 1.9583 - val_accuracy: 0.2901
Epoch 10/10
351/351 [==============================] - 13s 36ms/step - loss: 1.9243 - accuracy: 0.2991 - val_loss: 1.9219 - val_accuracy: 0.2985

78/78 [==============================] - 1s 13ms/step - loss: 1.9079 - accuracy: 0.3112
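Two numbers in this log can be sanity-checked with simple arithmetic. The batch size of 128 used below is an inference from the step counts and is not stated in the post; with drop_remainder=True it is the only batch size that yields 351 training steps from 45K images.

```python
import math

train_size, test_size = 45_000, 10_000
batch_size = 128  # assumption: inferred from the 351 and 78 step counts in the log

# drop_remainder=True discards the final partial batch, hence integer division
train_steps = train_size // batch_size
test_steps = test_size // batch_size

# cross-entropy of a uniform guess over 10 classes is -ln(1/10) = ln(10)
chance_loss = math.log(10)
```

The chance-level loss is about 2.3026, so the first-epoch loss of 2.3043 means the network starts out essentially guessing, and the slow decline over 10 epochs is consistent with a learning rate that is too small for plain SGD on this problem.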

My version of the LR Finder presents an API similar to the pytorch-lr-finder, where you pass in the model, optimizer, loss function, and dataset to create an instance of LRFinder. You then call range_test() on the LRFinder with the minimum and maximum boundaries for learning rate, and the number of iterations. This step is similar to the Learner.lr_find() call in fast.ai. The range_test() function will split the learning rate range into the number of steps given by num_iters, train the model on one batch at each learning rate, and record the loss. Finally, the plot() method will plot the losses against the learning rate. Since we are training at the batch level, we need to calculate losses and gradients ourselves, as seen in the train_step() function. The code for the LRFinder class is as follows. The main section (under if __name__ == "__main__") contains calling code using the LeNet model, CIFAR-10 dataset, the SGD optimizer, and the categorical cross-entropy loss function.

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

class LRFinder(object):
    def __init__(self, model, optimizer, loss_fn, dataset):
        super(LRFinder, self).__init__()
        self.model = model
        self.optimizer = optimizer
        self.loss_fn = loss_fn
        self.dataset = dataset
        # placeholders
        self.lrs = None
        self.loss_values = None
        self.min_lr = None
        self.max_lr = None
        self.num_iters = None


    @tf.function
    def train_step(self, x, y, curr_lr):
        tf.keras.backend.set_value(self.optimizer.lr, curr_lr)
        with tf.GradientTape() as tape:
            # forward pass
            y_ = self.model(x)
            # external loss value for this batch
            loss = self.loss_fn(y, y_)
            # add any losses created during forward pass
            loss += sum(self.model.losses)
        # get gradients of the loss wrt the weights (outside the tape context)
        grads = tape.gradient(loss, self.model.trainable_weights)
        # update weights
        self.optimizer.apply_gradients(zip(grads, self.model.trainable_weights))
        return loss


    def range_test(self, min_lr, max_lr, num_iters, debug):
        # create learning rate schedule
        self.min_lr = min_lr
        self.max_lr = max_lr
        self.num_iters = num_iters
        self.lrs =  np.linspace(
            self.min_lr, self.max_lr, num=self.num_iters)
        # initialize loss_values
        self.loss_values = []
        for step, (x, y) in enumerate(self.dataset):
            if step >= self.num_iters:
                break
            # train this batch with the learning rate scheduled for this step,
            # so that loss_values[step] lines up with lrs[step] in plot()
            curr_lr = self.lrs[step]
            loss = self.train_step(x, y, curr_lr)
            self.loss_values.append(loss.numpy())
            if debug:
                print("[DEBUG] Step {:d}, Loss {:.5f}, LR {:.5f}".format(
                    step, loss.numpy(), self.optimizer.learning_rate.numpy()))


    def plot(self):
        plt.plot(self.lrs, self.loss_values)
        plt.xlabel("learning rate")
        plt.ylabel("loss")
        plt.title("Learning Rate vs Loss ({:.2e}, {:.2e}, {:d})"
            .format(self.min_lr, self.max_lr, self.num_iters))
        plt.xscale("log")
        plt.grid()
        plt.show()


if __name__ == "__main__":

    tf.random.set_seed(42)

    # model
    model = LeNetModel()
    model.build(input_shape=(None, 32, 32, 3))
    # optimizer
    optimizer = tf.keras.optimizers.SGD()
    # loss_fn
    loss_fn = tf.keras.losses.CategoricalCrossentropy()
    # dataset
    batch_size = 128
    dataset, _, _ = load_cifar10_data(batch_size)
    # min_lr, max_lr
#    min_lr = 1e-6
#    max_lr = 3
    min_lr = 1e-2
    max_lr = 1
    # compute num_iters (Keras fit-one-cycle used 1 epoch by default)
    dataset_len = 45000
    batch_size = 128
    num_iters = dataset_len // batch_size
    # declare LR Finder
    lr_finder = LRFinder(model, optimizer, loss_fn, dataset)
    lr_finder.range_test(min_lr, max_lr, num_iters, debug=True)
    lr_finder.plot()
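
Note that range_test() spaces the learning rates linearly with np.linspace, while plot() uses a logarithmic x-axis. When the range spans several orders of magnitude (such as 1e-6 to 3), linear spacing puts almost all of the test points in the top decade. The sketch below shows a geometrically spaced alternative. It is not what the code above does, and the helper name make_lr_schedule and its mode parameter are hypothetical.

```python
import numpy as np


def make_lr_schedule(min_lr, max_lr, num_iters, mode="exp"):
    """Return num_iters learning rates between min_lr and max_lr.

    mode="linear" mirrors the np.linspace call used in range_test();
    mode="exp" (a hypothetical option) spaces them evenly on a log scale.
    """
    if mode == "linear":
        return np.linspace(min_lr, max_lr, num=num_iters)
    return np.geomspace(min_lr, max_lr, num=num_iters)


linear = make_lr_schedule(1e-6, 3.0, 351, mode="linear")
expo = make_lr_schedule(1e-6, 3.0, 351, mode="exp")

# count how many test points fall below 1e-2 under each spacing
print("linear:", int((linear < 1e-2).sum()), "exp:", int((expo < 1e-2).sum()))
```

With linear spacing only a couple of the 351 points probe learning rates below 1e-2, whereas geometric spacing spreads the points across every decade of the range.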

We first ran the LRFinder over a relatively large learning rate range, from 1e-6 to 3. This gives us the chart on the left below. For the CLR schedule, the minimum LR for our range is where the loss starts descending, and the maximum LR is where the loss stops descending or becomes ragged. My charts are not as clean as those shown in the two projects referenced, but we can still infer that these boundaries are 1e-6 and about 3e-1. The chart on the right below plots LR vs Loss over a smaller range (1e-2 to 1) to show this region in greater detail. We also see that the LR with minimum loss is about 3e-1.
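
Reading the minimum-loss learning rate off a ragged chart can be hard, so one option is to smooth the recorded losses before taking the minimum. The sketch below is only an illustration: the function names smooth and lr_at_min_loss are hypothetical, the exponential smoothing is an assumption on my part rather than something the LRFinder class does, and the curve is synthetic.

```python
import numpy as np


def smooth(values, beta=0.9):
    """Exponential moving average with bias correction."""
    avg, out = 0.0, []
    for i, v in enumerate(values, start=1):
        avg = beta * avg + (1 - beta) * v
        out.append(avg / (1 - beta ** i))
    return np.array(out)


def lr_at_min_loss(lrs, losses, beta=0.9):
    """Return the learning rate at which the smoothed loss is lowest."""
    smoothed = smooth(losses, beta)
    return float(lrs[int(np.argmin(smoothed))])


# synthetic LR-vs-loss curve whose minimum sits near lr = 1e-1
lrs = np.geomspace(1e-4, 1.0, 50)
losses = (np.log10(lrs) + 1.0) ** 2 + 1.0

print(lr_at_min_loss(lrs, losses, beta=0.0))  # no smoothing
print(lr_at_min_loss(lrs, losses, beta=0.9))  # smoothed, lags slightly
```

With a real LRFinder instance, the inputs would be its lrs and loss_values attributes after range_test() has run.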






Based on this, my first experiment is to train the network with the larger best learning rate (3e-1) we found from the LR Finder, and see if it trains better over 10 epochs than my previous attempt. The only change from the previous training code block above is that learning_rate goes from 0.001 to 0.3. Here are the results of 10 epochs of training, followed by evaluation on the held out test set.

Epoch 1/10
351/351 [==============================] - 12s 36ms/step - loss: 2.1572 - accuracy: 0.1664 - val_loss: 1.9966 - val_accuracy: 0.2590
Epoch 2/10
351/351 [==============================] - 12s 34ms/step - loss: 1.8960 - accuracy: 0.2979 - val_loss: 1.7568 - val_accuracy: 0.3732
Epoch 3/10
351/351 [==============================] - 12s 35ms/step - loss: 1.7456 - accuracy: 0.3642 - val_loss: 1.6556 - val_accuracy: 0.3984
Epoch 4/10
351/351 [==============================] - 12s 35ms/step - loss: 1.6634 - accuracy: 0.4021 - val_loss: 1.6050 - val_accuracy: 0.4331
Epoch 5/10
351/351 [==============================] - 12s 35ms/step - loss: 1.5993 - accuracy: 0.4213 - val_loss: 1.6906 - val_accuracy: 0.3858
Epoch 6/10
351/351 [==============================] - 12s 36ms/step - loss: 1.5244 - accuracy: 0.4484 - val_loss: 1.5754 - val_accuracy: 0.4399
Epoch 7/10
351/351 [==============================] - 13s 36ms/step - loss: 1.4568 - accuracy: 0.4749 - val_loss: 1.4996 - val_accuracy: 0.4712
Epoch 8/10
351/351 [==============================] - 13s 36ms/step - loss: 1.3894 - accuracy: 0.4971 - val_loss: 1.4854 - val_accuracy: 0.4786
Epoch 9/10
351/351 [==============================] - 12s 35ms/step - loss: 1.3323 - accuracy: 0.5207 - val_loss: 1.4527 - val_accuracy: 0.4950
Epoch 10/10
351/351 [==============================] - 12s 36ms/step - loss: 1.2817 - accuracy: 0.5411 - val_loss: 1.4320 - val_accuracy: 0.5068

78/78 [==============================] - 1s 12ms/step - loss: 1.4477 - accuracy: 0.4920

Clearly, the larger learning rate is helping the network achieve better performance, although it does seem (at least around epoch 5, where the validation loss rises and the validation accuracy dips) that it may be slightly too large. Accuracy on the held out test set jumped from 0.3112 to 0.4920, so overall it seems to be helping. Even if we just use the LR Finder to find the "best" learning rate, this is still cheaper than doing multiple training runs of a few epochs each.

Finally, we will try using a Cyclic Learning Rate (CLR) schedule using the learning rate boundaries (1e-6, 3e-1). The code for this is shown below. The clr_schedule() function produces a triangular learning rate schedule which rises linearly from the minimum specified learning rate to the maximum while the epoch index is at most 5 (in our case), then falls back linearly toward the minimum while the index is at most 9, and returns half the minimum for any later index. Note that Keras passes a zero-based epoch index to the schedule function, so a 10-epoch run sees indices 0 through 9. This is analogous to the Learner.fit_one_cycle() call in fast.ai. The clr_schedule function is passed to the LearningRateScheduler callback, which is then called by the model training loop via the callbacks parameter in the fit() function call. Here is the code.

import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf

def clr_schedule(epoch):
    num_epochs = 10
    mid_pt = int(num_epochs * 0.5)
    last_pt = int(num_epochs * 0.9)
    min_lr, max_lr = 1e-6, 3e-1
    if epoch <= mid_pt:
        return min_lr + epoch * (max_lr - min_lr) / mid_pt
    elif epoch > mid_pt and epoch <= last_pt:
        return max_lr - ((epoch - mid_pt) * (max_lr - min_lr)) / mid_pt
    else:
        return min_lr / 2

# plot the points
# epochs = [x+1 for x in np.arange(10)]
# clrs = [clr_schedule(x) for x in epochs]
# plt.plot(epochs, clrs)
# plt.show()

batch_size = 128
train_dataset, val_dataset, test_dataset = load_cifar10_data(batch_size)

model = LeNetModel()

min_lr = 1e-6
max_lr = 3e-1
optimizer = tf.keras.optimizers.SGD(learning_rate=min_lr)
loss_fn = tf.keras.losses.CategoricalCrossentropy()

lr_scheduler = tf.keras.callbacks.LearningRateScheduler(clr_schedule)

model.compile(loss=loss_fn, optimizer=optimizer, metrics=["accuracy"])
model.fit(train_dataset, epochs=10, validation_data=val_dataset,
    callbacks=[lr_scheduler])
model.evaluate(test_dataset)
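
To see the learning rate that each epoch actually receives, the schedule can be evaluated on its own. This sketch copies clr_schedule() from above and assumes the zero-based epoch indices that the Keras LearningRateScheduler callback passes in. The schedule list is a name introduced here for illustration.

```python
def clr_schedule(epoch):
    # same logic as the schedule function in the listing above
    num_epochs = 10
    mid_pt = int(num_epochs * 0.5)
    last_pt = int(num_epochs * 0.9)
    min_lr, max_lr = 1e-6, 3e-1
    if epoch <= mid_pt:
        return min_lr + epoch * (max_lr - min_lr) / mid_pt
    elif epoch <= last_pt:
        return max_lr - ((epoch - mid_pt) * (max_lr - min_lr)) / mid_pt
    else:
        return min_lr / 2


# a 10-epoch run calls the schedule with epoch indices 0 through 9
schedule = [clr_schedule(epoch) for epoch in range(10)]
for epoch, lr in enumerate(schedule):
    print("epoch index {:d}: lr = {:.6f}".format(epoch, lr))
```

With these indices the first epoch trains at 1e-6, the peak of 3e-1 is reached at index 5, the rate is still about 0.06 at index 9, and the min_lr / 2 branch is only reached from index 10 onwards.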

And here are the results of training the LeNet model with the CIFAR-10 dataset for 10 epochs, and evaluating on the held out test set. As you can see, the evaluation accuracy on the held out test set has jumped further from 0.4920 to 0.5469.

Epoch 1/10
351/351 [==============================] - 13s 36ms/step - loss: 2.7753 - accuracy: 0.0993 - val_loss: 2.7627 - val_accuracy: 0.1060
Epoch 2/10
351/351 [==============================] - 12s 34ms/step - loss: 2.0131 - accuracy: 0.2097 - val_loss: 2.0829 - val_accuracy: 0.2634
Epoch 3/10
351/351 [==============================] - 12s 34ms/step - loss: 1.8321 - accuracy: 0.3106 - val_loss: 1.7187 - val_accuracy: 0.3718
Epoch 4/10
351/351 [==============================] - 12s 35ms/step - loss: 1.7100 - accuracy: 0.3648 - val_loss: 1.7484 - val_accuracy: 0.3928
Epoch 5/10
351/351 [==============================] - 12s 36ms/step - loss: 1.5779 - accuracy: 0.4209 - val_loss: 1.6188 - val_accuracy: 0.4087
Epoch 6/10
351/351 [==============================] - 13s 36ms/step - loss: 1.5451 - accuracy: 0.4300 - val_loss: 1.5704 - val_accuracy: 0.4377
Epoch 7/10
351/351 [==============================] - 12s 35ms/step - loss: 1.3597 - accuracy: 0.5063 - val_loss: 1.3742 - val_accuracy: 0.5014
Epoch 8/10
351/351 [==============================] - 12s 36ms/step - loss: 1.2383 - accuracy: 0.5484 - val_loss: 1.3620 - val_accuracy: 0.5204
Epoch 9/10
351/351 [==============================] - 12s 35ms/step - loss: 1.1379 - accuracy: 0.5856 - val_loss: 1.3336 - val_accuracy: 0.5391
Epoch 10/10
351/351 [==============================] - 12s 35ms/step - loss: 1.0564 - accuracy: 0.6130 - val_loss: 1.3008 - val_accuracy: 0.5557

78/78 [==============================] - 1s 13ms/step - loss: 1.3043 - accuracy: 0.5469

This indicates that the LR Finder and CLR schedule seem like good ideas to try when training your models, especially when using non-adaptive optimizers such as SGD.

I tried the same sequence of experiments with the Adam optimizer, and I got better results (test accuracy: 0.5908) using a fixed learning rate of 1e-3 for the first training run. Thereafter, based on the LR Finder reporting a learning rate range of (1e-6, 1e-1), the next two experiments using the best learning rate and the CLR schedule both produced accuracies of about 0.1 (i.e., close to random for a 10-class classifier). I wasn't too surprised, since I figured that the CLR schedule probably interfered with Adam's own per-parameter learning rate adaptation. However, according to this tweet from Jeremy Howard, the LR Finder can be used with the Adam optimizer as well. Given that he has probably conducted many more experiments around this than I have, and the fast.ai Learner.lr_find() code is more robust and heavily tested than my homegrown implementation, he is very likely right, and my results are an anomaly.

That's all I have for today. Thanks for staying with me so far. I learned a lot from implementing this code, and hopefully you learned a few things from reading this as well. Hopefully, this gives you some ideas for building an LR Finder for Tensorflow 2 that can be used easily by end-users -- if you do end up building one, please let me know, and I will be happy to link to your site/repository and recommend it to other readers.

Saturday, April 06, 2019

Matrix Factorization as Gradient Descent using Tensorflow 2.x


Last month, at the Tensorflow Dev Summit 2019, Google announced the release of Tensorflow 2.0-alpha, which I thought was something of a quantum jump in terms of its evolution. The biggest change in my opinion was the switch to eager execution as the default mode. The next big thing is the adoption of Keras as the primary high level (tf.keras) API. The low level session based API still remains, and can be used to build components that interoperate with components built using the tf.keras API.

Other big changes are the introduction of the new tf.data package for building input pipelines, new distribution strategies that allow your code to be run without change on your laptop as well as in multi-GPU or TPU environments, better interop with tf.Estimator for your tf.keras models, and a single SavedModel format for saving models that works across the Tensorflow ecosystem (Tensorflow Hub, Tensorflow Serving, etc).

None of these features are brand new of course. But together, they make Tensorflow more attractive to me. As someone who started with Keras and gradually moved to Tensorflow 1.x because it's part of the same ecosystem, I had a love-hate relationship with it. I couldn't afford not to use it because of its reach and power, but at the same time, I found the API complex and hard to debug, and I didn't particularly enjoy working with it. In comparison, I found Pytorch's API much more intuitive and fun to work with. TF 2.0 code looks a lot like Pytorch to me, and as a user, I like this a lot.

At work, we have formed an interest group to teach each other Deep Learning, similar to the AI & Deep Learning Enthusiasts meetup group I was part of a few years ago. At the meetup, we would come together one day a week to watch videos of various DL lectures and discuss them afterwards. The interest group functions similarly so far, except that we are geographically distributed, so it's not possible to watch the videos together. Instead, we watch them in advance and then get together weekly to discuss what we learned. Currently we are watching Practical Deep Learning for Coders, v3, taught by Jeremy Howard and Rachel Thomas of fast.ai.

Lecture 4 of this course was about Recommender Systems, and one of the examples was how to use Pytorch's optimizers to do Matrix Factorization using Gradient Descent. I had learned a similar technique in the Matrix Factorization and Advanced Techniques mini-course at Coursera, taught by Profs Michael Ekstrand and Joseph Konstan, and part of the Recommendation Systems specialization for which they are better known. At the time I had started to implement some of these techniques and even blogged about it, but ended up never implementing this particular technique, so I figured that this might be a good way to get my hands dirty with a little TF 2.x programming. So that's what I did, and this is what I am going to talk about today.

Matrix Factorization is the process of decomposing a matrix into (in this case) two matrices, which when multiplied back yield an approximation of the original matrix. In the context of a movie recommendation system, the input X is the ratings matrix, a (num_users x num_movies) sparse matrix. It is sparse because most users haven't rated most movies. Matrix Factorization would split it into a pair of matrices M and U of shapes (num_movies x k) and (num_users x k), representing movies and users respectively. Here k is the size of the latent vector that encodes a movie or user -- thus a movie or user can now be represented by a vector of size k.
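
The equations referred to in the next paragraph are not reproduced in this text. A reconstruction that is consistent with the description here and with the call() method of the layer shown later is given below. It is an assumption about what the original figure showed, not a copy of it, and the index names i, j and f are introduced only for this sketch.

```latex
\begin{aligned}
X &\approx U M^{T} \\
X_{ij} &\approx \sum_{f=1}^{k} U_{if} M_{jf} + b^{U}_{i} + b^{M}_{j} + b^{G}
\end{aligned}
```

Here i indexes users, j indexes movies, and f runs over the k latent features.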


As a practical matter, we also want to factor in that people are different and might rate a movie differently, even if they feel the same way about it. Similarly, movies are different, and the same rating given to different movies doesn't necessarily mean the same thing. So we factor out these biases and call them the user bias bU, movie (or item) bias bM and global bias bG. This is shown in the second equation above, and this is the formulation we will implement below.

In order to model this problem as a Gradient Descent problem, we can start with random matrices U and M, random vectors bU and bM, and a random scalar bG. We then attempt to compute an approximation X̂ of the ratings matrix X by composing these random tensors as shown in the second equation, and passing the result through a non-linear activation (sigmoid in our case). We then compute the loss as the mean square error between X and X̂ and then update the random tensors by the gradients of the loss with respect to each tensor. We continue this process until the loss is below some acceptable threshold.
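
To make the update step concrete, here is a minimal NumPy sketch of the same idea with the gradients written out by hand. It is not the implementation from this post, which uses Tensorflow's GradientTape and RMSprop. It uses plain gradient descent, a made-up 4 x 3 ratings matrix already scaled to [0, 1], zero-initialized biases, and treats every entry as observed. The function name factorize and its parameters are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def factorize(X, k=2, lr=1.0, steps=500, seed=0):
    """Fit X ~ sigmoid(U M^T + bu + bm + bg) by plain gradient descent."""
    rng = np.random.default_rng(seed)
    n_users, n_movies = X.shape
    U = 0.1 * rng.standard_normal((n_users, k))
    M = 0.1 * rng.standard_normal((n_movies, k))
    bu, bm, bg = np.zeros(n_users), np.zeros(n_movies), 0.0
    losses = []
    for _ in range(steps):
        # forward pass: compose the tensors and apply the sigmoid
        Xhat = sigmoid(U @ M.T + bu[:, None] + bm[None, :] + bg)
        err = Xhat - X
        losses.append(float(np.mean(err ** 2)))
        # gradient of the MSE wrt the pre-activation, via the chain rule
        G = 2.0 * err * Xhat * (1.0 - Xhat) / X.size
        # gradients wrt each tensor, computed before any update is applied
        dU, dM = G @ M, G.T @ U
        U -= lr * dU
        M -= lr * dM
        bu -= lr * G.sum(axis=1)
        bm -= lr * G.sum(axis=0)
        bg -= lr * G.sum()
    return U, M, bu, bm, bg, losses


# made-up ratings on a 0-5 scale, divided by 5 to match the sigmoid's range
X = np.array([[5, 4, 1], [4, 5, 1], [1, 1, 5], [2, 1, 4]], dtype=float) / 5.0
U, M, bu, bm, bg, losses = factorize(X)
print("initial loss {:.4f}, final loss {:.4f}".format(losses[0], losses[-1]))
```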

Here is the partial code for doing the matrix factorization. I have created two classes - one for the MatrixFactorization layer and another for the MatrixFactorizer network consisting of the custom MatrixFactorization layer and a Sigmoid Activation layer. Notice the similarity between the MatrixFactorizer's call() method and Pytorch's forward(). The loss function is the MeanSquaredError, and the optimizer used is RMSprop. The training loop runs 5000 times. At each step, the network recomputes X̂ and the loss between this and the original X tensor. The GradientTape automatically computes the gradient of the loss w.r.t. each variable and the optimizer updates the variables.

One difference between the implementation suggested in the Matrix Factorization course and the one in the fast.ai course is the Sigmoid activation layer. Since the Sigmoid squeezes its input into the range [0, 1], and our ratings are in the range [0, 5], we need to scale our input data accordingly. This has been done in the load_data() function, which is not shown here in the interests of space. You can find the full code for the matrix factorization and prediction modules on GitHub.

from pathlib import Path

import matplotlib.pyplot as plt
import tensorflow as tf


class MatrixFactorization(tf.keras.layers.Layer):
    def __init__(self, emb_sz, **kwargs):
        super(MatrixFactorization, self).__init__(**kwargs)
        self.emb_sz = emb_sz

    def build(self, input_shape):
        num_users, num_movies = input_shape
        # add_weight replaces the deprecated add_variable alias
        # user and movie embedding matrices, one row of size emb_sz each
        self.U = self.add_weight(name="U",
            shape=[num_users, self.emb_sz],
            dtype=tf.float32,
            initializer=tf.initializers.GlorotUniform())
        self.M = self.add_weight(name="M",
            shape=[num_movies, self.emb_sz],
            dtype=tf.float32,
            initializer=tf.initializers.GlorotUniform())
        # per-user, per-movie and global biases, all starting at zero
        self.bu = self.add_weight(name="bu",
            shape=[num_users],
            dtype=tf.float32,
            initializer=tf.initializers.Zeros())
        self.bm = self.add_weight(name="bm",
            shape=[num_movies],
            dtype=tf.float32,
            initializer=tf.initializers.Zeros())
        self.bg = self.add_weight(name="bg",
            shape=[],
            dtype=tf.float32,
            initializer=tf.initializers.Zeros())

    def call(self, input):
        return (tf.add(
            tf.add(
                tf.matmul(self.U, tf.transpose(self.M)),
                tf.expand_dims(self.bu, axis=1)),
            tf.expand_dims(self.bm, axis=0)) +
            self.bg)


class MatrixFactorizer(tf.keras.Model):
    def __init__(self, embedding_size):
        super(MatrixFactorizer, self).__init__()
        self.matrixFactorization = MatrixFactorization(embedding_size)
        self.sigmoid = tf.keras.layers.Activation("sigmoid")

    def call(self, input):
        output = self.matrixFactorization(input)
        output = self.sigmoid(output)
        return output


def loss_fn(source, target):
    mse = tf.keras.losses.MeanSquaredError()
    loss = mse(source, target)
    return loss


DATA_DIR = Path("../../data")
MOVIES_FILE = DATA_DIR / "movies.csv"
RATINGS_FILE = DATA_DIR / "ratings.csv"
WEIGHTS_FILE = DATA_DIR / "mf-weights.h5"

EMBEDDING_SIZE = 15

X, user_idx2id, movie_idx2id = load_data()

model = MatrixFactorizer(EMBEDDING_SIZE)
model.build(input_shape=X.shape)
model.summary()

optimizer = tf.optimizers.RMSprop(learning_rate=1e-3, momentum=0.9)

losses, steps = [], []
for i in range(5000):
    with tf.GradientTape() as tape:
        Xhat = model(X)
        loss = loss_fn(X, Xhat)
        if i % 100 == 0:
            loss_value = loss.numpy()
            losses.append(loss_value)
            steps.append(i)
            print("step: {:d}, loss: {:.3f}".format(i, loss_value))
    variables = model.trainable_variables
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))

# plot training loss
plt.plot(steps, losses, marker="o")
plt.xlabel("steps")
plt.ylabel("loss")
plt.show()

The chart below shows the loss plotted across 5000 training steps. As can be seen, the loss falls quickly and then flattens out.


On the prediction side, we can now use the factorized matrices M and U as embeddings for movies and users respectively. We have chosen k=15, so effectively, we can now describe a movie or a user in terms of a vector of 15 latent features. So it is now possible to find movies similar to a given movie by simply doing a dot product of its vector with all the other vectors, and reporting the top N movies whose vectors yield the highest dot product.

# content based: find movies similar to given movie
# Batman & Robin (1997) -- movie_id = 1562
print("*** movies similar to given movie ***")
TOP_N = 10
movie_idx = np.argwhere(movie_idx2id == 1562)[0][0]
source_vec = np.expand_dims(M[movie_idx], axis=1)
movie_sims = np.matmul(M, source_vec)
similar_movie_ids = np.argsort(-movie_sims.reshape(-1,))[0:TOP_N]
baseline_movie_sim = None
for smid in similar_movie_ids:
    movie_id = movie_idx2id[smid]
    title, genres = movie_id2title[movie_id]
    genres = genres.replace('|', ', ')
    movie_sim = movie_sims[smid][0]
    if baseline_movie_sim is None:
        baseline_movie_sim = movie_sim
    movie_sim /= baseline_movie_sim
    print("{:.5f} {:s} ({:s})".format(movie_sim, title, genres))

This gives us the following results for the top 10 movies our model thinks are similar to Batman & Robin.

*** movies similar to given movie ***
1.00000 Batman & Robin (1997) (Action, Adventure, Fantasy, Thriller)
0.83674 Apollo 13 (1995) (Adventure, Drama, IMAX)
0.81504 Island, The (2005) (Action, Sci-Fi, Thriller)
0.72535 Rescuers Down Under, The (1990) (Adventure, Animation, Children)
0.69084 Crow: City of Angels, The (1996) (Action, Thriller)
0.68452 Lord of the Rings: The Two Towers, The (2002) (Adventure, Fantasy)
0.65504 Saving Private Ryan (1998) (Action, Drama, War)
0.64059 Cast Away (2000) (Drama)
0.63650 Fugitive, The (1993) (Thriller)
0.61579 My Neighbor Totoro (Tonari no Totoro) (1988) (Animation, Children, Drama, Fantasy)
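
The ranking above uses raw dot products, which also reward vectors with large norms. A common alternative, not used in this post, is to normalize each vector first so that the score becomes a cosine similarity. The sketch below compares the two on a made-up embedding matrix. The function name top_n_similar and the toy values are hypothetical.

```python
import numpy as np


def top_n_similar(M, idx, top_n=3, normalize=True):
    """Return (indices, scores) of the rows of M most similar to row idx."""
    vecs = M
    if normalize:
        # scale each row to unit length so the dot product is a cosine
        norms = np.linalg.norm(M, axis=1, keepdims=True)
        vecs = M / np.clip(norms, 1e-12, None)
    sims = vecs @ vecs[idx]
    order = np.argsort(-sims)[:top_n]
    return order, sims[order]


# toy embedding matrix: 4 movies x 2 latent features (made-up values)
M = np.array([[1.0, 0.0], [10.0, 10.0], [0.9, 0.1], [0.0, 1.0]])

raw_order, _ = top_n_similar(M, 0, normalize=False)
cos_order, _ = top_n_similar(M, 0, normalize=True)
print("dot product ranking:", raw_order.tolist())
print("cosine ranking:     ", cos_order.tolist())
```

In this toy case the large-norm second row tops the raw dot product ranking even though the third row points in nearly the same direction as the query. Cosine similarity puts the query itself first and the third row second.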

Similarly, we can recommend new movies for a given user, or predict the user's ratings for them, by scanning the user's row of the approximate matrix X̂ and finding the entries with the highest values. Note that we need to remove the biases from our computed X̂ matrix and rescale so the predicted ratings for our recommendations are comparable to the original ratings.

import operator

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# collaborative filtering based: find movies for user
# user: 121403 has rated 29 movies, we will identify movie
# recommendations for this user that they haven't rated
print("*** top movie recommendations for user ***")
USER_ID = 121403
user_idx = np.argwhere(user_idx2id == USER_ID)
Xhat = (
    np.add(
        np.add(
            np.matmul(U, M.T), 
            np.expand_dims(-bu, axis=1)
        ),
        np.expand_dims(-bm, axis=0)
    ) - bg)
scaler = MinMaxScaler()
Xhat = scaler.fit_transform(Xhat)
Xhat *= 5

user_preds = Xhat[user_idx].reshape(-1)
pred_movie_idxs = np.argsort(-user_preds)

print("**** already rated (top {:d}) ****".format(TOP_N))
mids_already_rated = set([mid for (mid, rating) in uid2mids[USER_ID]])
ordered_mrs = sorted(uid2mids[USER_ID], key=operator.itemgetter(1), reverse=True)
for mid, rating in ordered_mrs[0:TOP_N]:
    title, genres = movie_id2title[mid]
    genres = genres.replace('|', ', ')
    pred_rating = user_preds[np.argwhere(movie_idx2id == mid)[0][0]]
    print("{:.1f} ({:.1f}) {:s} ({:s})".format(rating, pred_rating, title, genres))
print("...")
print("**** movie recommendations ****")
top_recommendations = []
for movie_idx in pred_movie_idxs:
    movie_id = movie_idx2id[movie_idx]
    if movie_id in mids_already_rated:
        continue
    pred_rating = user_preds[movie_idx]
    top_recommendations.append((movie_id, pred_rating))
    if len(top_recommendations) > TOP_N:
        break
for rec_movie_id, pred_rating in top_recommendations:
    title, genres = movie_id2title[rec_movie_id]
    genres = genres.replace('|', ', ')
    print("{:.1f} {:s} ({:s})".format(pred_rating, title, genres))

This gives us the following top recommendations for this user (11 rather than 10, since the loop above breaks only once the list exceeds TOP_N), along with the predicted rating that the user might give each movie.

*** top movie recommendations for user ***
4.8 Beverly Hills Cop (1984) (Action, Comedy, Crime, Drama)
4.8 Finding Neverland (2004) (Drama)
4.8 Nightmare Before Christmas, The (1993) (Animation, Children, Fantasy, Musical)
4.8 Conan the Barbarian (1982) (Action, Adventure, Fantasy)
4.8 Meet the Parents (2000) (Comedy)
4.8 Stranger than Fiction (2006) (Comedy, Drama, Fantasy, Romance)
4.8 Halloween H20: 20 Years Later (Halloween 7: The Revenge of Laurie Strode) (1998) (Horror, Thriller)
4.8 Forever Young (1992) (Drama, Romance, Sci-Fi)
4.8 Escape to Witch Mountain (1975) (Adventure, Children, Fantasy)
4.8 Blue Lagoon, The (1980) (Adventure, Drama, Romance)
4.8 Needful Things (1993) (Drama, Horror)

As a test, for a subset of movies the user has already rated, we compute the predicted ratings using our X̂ matrix. Here the first column is the actual rating, and the number in parentheses is the predicted rating.

**** already rated (top 10) ****
5.0 (4.8) Matrix, The (1999) (Action, Sci-Fi, Thriller)
5.0 (4.8) Kill Bill: Vol. 1 (2003) (Action, Crime, Thriller)
5.0 (4.8) V for Vendetta (2006) (Action, Sci-Fi, Thriller, IMAX)
5.0 (4.8) Planet Terror (2007) (Action, Horror, Sci-Fi)
4.5 (4.8) Pulp Fiction (1994) (Comedy, Crime, Drama, Thriller)
4.5 (4.8) Schindler's List (1993) (Drama, War)
4.5 (4.8) Silence of the Lambs, The (1991) (Crime, Horror, Thriller)
4.5 (4.8) Reservoir Dogs (1992) (Crime, Mystery, Thriller)
4.5 (4.8) Ghostbusters (a.k.a. Ghost Busters) (1984) (Action, Comedy, Sci-Fi)
4.5 (4.8) Snatch (2000) (Comedy, Crime, Thriller)
...
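
For comparison, another way to map the factorization back to the rating scale is to reuse the model's own forward pass, sigmoid included, and multiply by the maximum rating. This is not what the prediction code above does. It assumes that load_data() (not shown) scaled the ratings by dividing by 5. The function name predict_ratings and the toy arrays are hypothetical.

```python
import numpy as np


def predict_ratings(U, M, bu, bm, bg, max_rating=5.0):
    """Forward pass of the factorization model, rescaled to (0, max_rating)."""
    logits = U @ M.T + bu[:, None] + bm[None, :] + bg
    return max_rating / (1.0 + np.exp(-logits))


# made-up factors: 2 users, 3 movies, 2 latent features
U = np.array([[1.0, 0.0], [0.0, 1.0]])
M = np.array([[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
bu = np.array([0.1, -0.1])
bm = np.zeros(3)
bg = 0.0

ratings = predict_ratings(U, M, bu, bm, bg)
print(np.round(ratings, 2))
```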

This concludes my example of using TF 2.0 to implement Matrix Factorization using Gradient Descent. I wanted to give a quick shout out to Tony Holdroyd for writing the Tensorflow 2.0 Quick Start Guide and to Packt for publishing it at the opportune time. I had the good fortune of reviewing the book at around the same time as I was looking at new features of TF 2.0, and that reduced my learning curve to a great extent.